Cloud Architecture Tutorial - Platform Component Architecture (2of3)


This is the meat of the presentation. It describes in detail how to use anti-architecture to define what gets done, then discusses patterns, type systems, PaaS frameworks, services and components. There is a detailed explanation of Cassandra as a data store, and of the open source components.

Published in: Technology, News & Politics

  1. 1. Cloud Architecture Tutorial: Platform Component Architecture, Part 2 of 3. QCon London, March 5th, 2012. Adrian Cockcroft, @adrianco #netflixcloud http://
  2. 2. Don’t Do That! A Discussion of Anti-Architecture (written as an Ignite talk)
  3. 3. Architecture: Patterns to guide detailed design and construction
  4. 4. Anti-Architecture: Constraints that limit detailed design and construction
  5. 5. Misplaced Enthusiasm
  6. 6. How could that happen?
  7. 7. Anatomy of a Failure
  8. 8. What I Wanted
     • Moving to Cassandra as primary data store
     • We need backups!
     • We are running on AWS…
     I want Cassandra backups to S3. Start with full backup, incremental later. Restore to a different Cassandra cluster.
  9. 9. Additional Goals
     I would like it next week - Keep it simple. No single point of failure. Get once a day full backup working first.
  10. 10. Prototype
     • Created S3 bucket
     • Carefully figured out a good S3 path hierarchy
     • Wrote a simple backup script
     • Added it to cron
     • ….
     • Profit! (total time half a day)
  11. 11. Now comes the hard part!
     Restore is trickier, Cassandra is written in Java, programmer from another team takes over…
     Here’s the S3 bucket, backups are being collected already, please figure out how to restore it. Done by next week perhaps?
  12. 12. Days Pass…
     • Programmer is re-writing backup in Python
     • Installs Python 2.7 on CentOS, breaks yum
     • Backup remotely invoked from a central point
     • Cassandra patched to do incremental backups
  13. 13. Weeks Pass…
     • Python based full backup & restore works!
     • But only to the Cassandra cluster it came from
     • Incremental backup works!
     • Restore not done yet…
  14. 14. Cassandra in Production
     We do have backups running now, right? We’ll get right on it…
     I want the production backup restored in test. Oh, didn’t implement that feature yet…
  15. 15. Whoops!
     Production data trashed while setting up backup. Luckily – it was recoverable from elsewhere.
  16. 16. Months Pass
     • Python prototype re-written in Java (Priam)
     • Integrated with other management functions
     • Decentralized backups again (yay!)
     • Reliable backups
     • Restore to test
     • Not simple
     • Took too long…
  17. 17. Anti-Architecture
     • Define the things you don’t want
     • Constrain the outcome
     • Check that the constraints are being met
     • …
     • Profit!
  18. 18. Anti-Architecture Success: http://…-netflix-learned-from-aws-outage.html
  19. 19. Anti-Architecture: Define the space the thing will inhabit (All pictures in this section were found on Google Images)
  20. 20. Cloud Architecture Patterns: Where do we start?
  21. 21. Goals
     • Faster
       – Lower latency than the equivalent datacenter web pages and API calls
       – Measured as mean and 99th percentile
       – For both first hit (e.g. home page) and in-session hits for the same user
     • Scalable
       – Avoid needing any more datacenter capacity as subscriber count increases
       – No central vertically scaled databases
       – Leverage AWS elastic capacity effectively
     • Available
       – Substantially higher robustness and availability than datacenter services
       – Leverage multiple AWS availability zones
       – No scheduled down time, no central database schema to change
     • Productive
       – Optimize agility of a large development team with automation and tools
       – Leave behind complex tangled datacenter code base (~8 year old architecture)
       – Enforce clean layered interfaces and re-usable components
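The "Faster" goal is measured as mean and 99th percentile latency. A minimal sketch of how those two figures could be computed from a batch of samples follows; the class and method names are hypothetical, and the nearest-rank percentile definition is an assumption, since the slides do not say which definition was used.

```java
import java.util.Arrays;

// Hypothetical helper; names are illustrative and not taken from the slides.
final class LatencyStats {
    // Arithmetic mean of the samples, in milliseconds.
    static double mean(double[] samplesMs) {
        double sum = 0;
        for (double s : samplesMs) {
            sum += s;
        }
        return sum / samplesMs.length;
    }

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static double percentile(double[] samplesMs, double p) {
        double[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

Reporting both numbers matters because a mean can look healthy while a small fraction of requests is very slow; the 99th percentile exposes that tail.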
  22. 22. Datacenter Anti-Patterns: What do we currently do in the datacenter that prevents us from meeting our goals?
  23. 23. Architecture
     • Software Architecture
       – The abstractions and interfaces that developers build against
     • Systems Architecture
       – The service instances that define availability, scalability
     • Compose-ability
       – software architecture that is independent of the systems architecture
       – decoupled flexible building block components
  24. 24. Rewrite from Scratch
     Not everything is cloud specific. Pay down technical debt. Robust patterns.
  25. 25. Netflix Datacenter vs. Cloud Arch
     Datacenter                     Cloud
     Central SQL Database           Distributed Key/Value NoSQL
     Sticky In-Memory Session       Shared Memcached Session
     Chatty Protocols               Latency Tolerant Protocols
     Tangled Service Interfaces     Layered Service Interfaces
     Instrumented Code              Instrumented Service Patterns
     Fat Complex Objects            Lightweight Serializable Objects
     Components as Jar Files        Components as Services
  26. 26. The Central SQL Database
     • Datacenter has a central database
       – Everything in one place is convenient until it fails
       – Customers, movies, history, configuration
     • Schema changes require downtime
     Anti-pattern impacts scalability, availability
  27. 27. The Distributed Key-Value Store
     • Cloud has many key-value data stores
       – More complex to keep track of, do backups etc.
       – Each store is much simpler to administer (no DBA)
       – Joins take place in Java code
       – No schema to change, no scheduled downtime
     • Mean Latency for Simple Key Lookup Queries
       – Memcached is dominated by network latency <1ms
       – Cassandra around one millisecond
       – Oracle for simple queries is a few milliseconds
       – DynamoDB around 5ms
       – SimpleDB replication and REST overheads >10ms
  28. 28. The Sticky Session
     • Datacenter Sticky Load Balancing
       – Efficient caching for low latency
       – Tricky session handling code
     • Encourages concentrated functionality
       – one service that does everything
       – Middle tier load balancer had issues in practice
     Anti-pattern impacts productivity, availability
  29. 29. Shared Session State
     • Elastic Load Balancer
       – We don’t use the cookie based routing option
       – External “session caching” with memcached
     • More flexible fine grain services
       – Any instance can serve any request
       – Works better with auto-scaled instance counts
  30. 30. Chatty Opaque and Brittle Protocols
     • Datacenter service protocols
       – Assumed low latency for many simple requests
     • Based on serializing existing Java objects
       – Inefficient formats
       – Incompatible when definitions change
     Anti-pattern causes productivity, latency and availability issues
  31. 31. Robust and Flexible Protocols
     • Cloud service protocols
       – JSR311/Jersey is used for REST/HTTP service calls
       – Custom client code includes service discovery
       – Support complex data types in a single request
     • Apache Avro
       – Evolved from Protocol Buffers and Thrift
       – Includes JSON header defining key/value protocol
       – Avro serialization is half the size and several times faster than Java serialization, more work to code
  32. 32. Persisted Protocols
     • Persist Avro in Memcached
       – Save space/latency (zigzag encoding, half the size)
       – New keys are ignored
       – Missing keys are handled cleanly
     • Avro protocol definitions
       – Less brittle across versions
       – Can be written in JSON or generated from POJOs
       – It’s hard, needs better tooling
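The compatibility rule on the slide above ("new keys are ignored, missing keys are handled cleanly") can be illustrated without any serialization library. The sketch below uses plain maps and hypothetical names; it does not use the Avro API, and it only mimics the behaviour a reader sees when stored records and reader definitions drift apart.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: mimics "ignore new keys, default missing keys"
// with plain maps. This is not the Avro library.
final class TolerantReader {
    // Fields this reader knows about, each with a default value.
    private final Map<String, Object> defaults;

    TolerantReader(Map<String, Object> defaults) {
        this.defaults = new HashMap<>(defaults);
    }

    // Project a stored record onto the reader's view of the definition.
    Map<String, Object> read(Map<String, Object> stored) {
        Map<String, Object> view = new HashMap<>();
        for (Map.Entry<String, Object> field : defaults.entrySet()) {
            Object value = stored.get(field.getKey());
            // Missing key: fall back to the reader's default.
            view.put(field.getKey(), value != null ? value : field.getValue());
        }
        // Keys in 'stored' that the reader does not know are never copied.
        return view;
    }
}
```

With this rule a cached record written by a newer service version can still be read by an older one, which is what makes persisting serialized objects in a shared cache less brittle.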
  33. 33. Tangled Service Interfaces
     • Datacenter implementation is exposed
       – Oracle SQL queries mixed into business logic
     • Tangled code
       – Deep dependencies, false sharing
     • Data providers with sideways dependencies
       – Everything depends on everything else
     Anti-pattern affects productivity, availability
  34. 34. Untangled Service Interfaces
     • New Cloud Code With Strict Layering
       – Compile against interface jar
       – Can use Spring runtime binding to enforce
       – Fine grain services as components
     • Service interface is the service
       – Implementation is completely hidden
       – Can be implemented locally or remotely
       – Implementation can evolve independently
  35. 35. Untangled Service Interfaces
     Two layers:
     • SAL - Service Access Library
       – Basic serialization and error handling
       – REST or POJOs defined by data provider
     • ESL - Extended Service Library
       – Caching, conveniences, can combine several SALs
       – Exposes faceted type system (described later)
       – Interface defined by data consumer in many cases
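A rough sketch of the two layers follows. Every name in it (`RatingsSal`, `RatingsEsl`, `fetchRating`) is hypothetical, since the slides do not show the real interfaces; the point is only that the SAL is a thin provider-defined interface and the ESL wraps it with consumer-side conveniences such as caching.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// All names are hypothetical; the real SAL and ESL interfaces are not shown in the slides.

// SAL: thin access layer whose shape is defined by the data provider.
interface RatingsSal {
    int fetchRating(long visitorId, long videoId);
}

// ESL: consumer-facing layer that adds caching on top of a SAL.
final class RatingsEsl {
    private final RatingsSal sal;
    private final Map<String, Integer> cache = new ConcurrentHashMap<>();

    RatingsEsl(RatingsSal sal) {
        this.sal = sal;
    }

    int rating(long visitorId, long videoId) {
        String key = visitorId + ":" + videoId;
        // Only call through to the SAL on a cache miss.
        return cache.computeIfAbsent(key, k -> sal.fetchRating(visitorId, videoId));
    }
}
```

Because the ESL depends only on the interface, a stand-in implementation (for example a lambda returning canned data) can be passed to it during development, and the real remote implementation can replace it later without changing callers.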
  36. 36. Service Interaction Pattern: Sample Swimlane Diagram
  37. 37. Service Architecture Patterns
     • Internal Interfaces Between Services
       – Common patterns as templates
       – Highly instrumented, observable, analytics
       – Service Level Agreements – SLAs
     • Library templates for generic features
       – Instrumented Netflix Base Servlet template
       – Instrumented generic client interface template
       – Instrumented S3, SimpleDB, Memcached clients
  38. 38. Service Request Instruments Every Step in the Call
     [Diagram: timestamps recorded on the CLIENT and SERVICE sides of a request]
     • Client: request start and end timestamps
     • Client: outbound serialize start and end timestamps
     • Client: network send and network receive timestamps
     • Client: inbound deserialize start and end timestamps
     • Service: network receive and network send timestamps
     • Service: inbound serialize start and end timestamps
     • Service: execute request start and end timestamps
     • Service: outbound serialize start and end timestamps
  39. 39. Boundary Interfaces
     • Isolate teams from external dependencies
       – Fake SAL built by cloud team
       – Real SAL provided by data provider team later
       – ESL built by cloud team using faceted objects
     • Fake data sources allow development to start
       – e.g. Fake Identity SAL for a test set of customers
       – Development solidifies dependencies early
       – Helps external team provide the right interface
  40. 40. One Object That Does Everything
     • Datacenter uses a few big complex objects
       – Movie and Customer objects are the foundation
       – Good choice for a small team and one instance
       – Problematic for large teams and many instances
     • False sharing causes tangled dependencies
       – Unproductive re-integration work
     Anti-pattern impacting productivity and availability
  41. 41. An Interface For Each Component
     • Cloud uses faceted Video and Visitor
       – Basic types hold only the identifier
       – Facets scope the interface you actually need
       – Each component can define its own facets
     • No false-sharing and dependency chains
       – Type manager converts between facets as needed
       – video.asA(PresentationVideo) for www
       – video.asA(MerchableVideo) for middle tier
  42. 42. Basic Types: Epistemology and Design, by Stan Lanning
  43. 43. Avoiding “Level Confusion” [Catataxis]
     • Business Level Objects (BLO?)
       – Customers, Movies, etc
       – Conceptual: Exist only between the ears
     • Abstract Types
       – Abstractions that try to model aspects of the business level objects
       – Often captured by Java interfaces
     • Implementations
       – Specific coded implementations of the abstract types
       – Java class, or a collection of rows in a database…
  44. 44. Facets
     • No single Abstract Type captures everything about a BLO
       – Different teams see different “facets”
         • Customer: Account status; Billing history; Viewing history; A/B test assignments
         • Movie: Availability; Popularity; Synopsis; Cast
       – Loosely coupled, tightly aligned(!)
     • All facets for a BLO should inherit from one “basic” type that has minimal behavior
  45. 45. Basic Types
     • Module external interfaces deal in basic types; internal calls are free to use more complex facets
     • Generic machinery to switch between facets
     Business Level Object     Java Basic Type
     Movie (TV show…)          Video
     Customer                  Visitor
     Category                  VTag
     Country                   ISOCountry
  46. 46. Type Manager
     • Holds the “factory” objects that manage instances of facets
       – Typically one factory per facet
       – Factories free to implement any instance management policy they want
     • Factories register with the Type Manager
       – callers never interact directly with the factories
       – Mock managers?
  47. 47. Switching Facets
     • Each Basic Type B implements a method that uses the Type Manager to find facet implementations of the same BLO
          <T extends B> T asA(Class<T> c)
     • Example:
          Visitor visitor = xxx;
          ABClient abClient = visitor.asA(ABClient.class);
          assert(visitor.equals(abClient));
     • Look Ma, no cast!
       – Facets are equal, but not necessarily ==.
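One way the `asA` mechanism could be put together is sketched below. This is an assumption-laden illustration, not the Netflix implementation: `FacetRegistry` stands in for the Type Manager, the `testCell` method is invented for the example, and the factory simply creates a new facet instance on every call.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

// Hypothetical sketch of the facet idea; not the actual Netflix classes.

// Basic type: holds only the identifier and knows how to switch facets.
class Visitor {
    private final long id;

    Visitor(long id) {
        this.id = id;
    }

    long getId() {
        return id;
    }

    // Ask the registry for another facet of the same business object.
    <T extends Visitor> T asA(Class<T> facet) {
        return FacetRegistry.find(facet, id);
    }

    // Facets of the same object are equal when their IDs match.
    @Override
    public boolean equals(Object o) {
        return o instanceof Visitor && ((Visitor) o).id == id;
    }

    @Override
    public int hashCode() {
        return Long.hashCode(id);
    }
}

// One facet: the view of a visitor that A/B testing code needs.
class ABClient extends Visitor {
    ABClient(long id) {
        super(id);
    }

    // Invented for illustration: derive a test cell from the ID.
    int testCell() {
        return (int) (getId() % 2);
    }
}

// Stand-in for the Type Manager: maps each facet class to its factory.
final class FacetRegistry {
    private static final Map<Class<?>, LongFunction<? extends Visitor>> FACTORIES = new HashMap<>();

    static <T extends Visitor> void register(Class<T> facet, LongFunction<T> factory) {
        FACTORIES.put(facet, factory);
    }

    static <T extends Visitor> T find(Class<T> facet, long id) {
        LongFunction<? extends Visitor> factory = FACTORIES.get(facet);
        if (factory == null) {
            throw new IllegalArgumentException("No factory registered for " + facet);
        }
        // Class.cast keeps the conversion checked, so callers need no cast.
        return facet.cast(factory.apply(id));
    }
}
```

After `FacetRegistry.register(ABClient.class, ABClient::new)`, a call such as `new Visitor(42L).asA(ABClient.class)` returns a distinct object that is `equals` to the original visitor, matching the "equal, but not necessarily ==" note on the slide.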
  48. 48. IDs! (huh) What are they good for?
     • IDs exist because implementations need to externalize objects and maintain their identity
       – Persist in a DB, or talk to a remote service
       – Different implementations of a type of BLO model the same object iff they have the same ID
       – Basic Types use IDs to manage facets, determine equality, etc
  49. 49. Converting IDs ↔ Objects
          Long id = xx;
          MyVisitor visitor =
              TypeManager.findObject(Visitor.class, id)
                  .asA(MyVisitor.class);
          assert(id.equals(visitor.getId()));
          // Or more efficiently…
          MyVisitor visitor2 =
              TypeManager.findObject(Visitor.class, id, MyVisitor.class);
          // There are also efficient bulk conversion methods
          Collection<Long> ids = xxx;
          List<MyVisitor> visitors =
              TypeManager.findObjects(Visitor.class, ids, MyVisitor.class);
  50. 50. Stan’s Soap Box
     • Don’t pass around IDs when you mean to refer to the BLO; that is Level Confusion
     • Using Basic Types helps the compiler help you; compile time problems are better than run time problems
     • More readable by people, but beware that asA operations may be a lot of work
     • (Is this a way to approximate multiple-inheritance in Java?)
  51. 51. Software Architecture Patterns
     • Object Models
       – Basic and derived types, facets, serializable
       – Pass by reference within a service
       – Pass by value between services
     • Computation and I/O Models
       – Service Execution using Best Effort / Futures
       – Common thread pool management
       – Circuit breakers to manage and contain failures
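The slide above names circuit breakers as the way failures are contained. A minimal sketch of the general pattern follows; the class name, the consecutive-failure threshold and the explicit `nowMillis` parameter are all assumptions made to keep the example small and deterministic, and none of it is taken from the Netflix code.

```java
import java.util.function.Supplier;

// Minimal illustrative circuit breaker; names and policy are assumptions.
final class SimpleCircuitBreaker {
    private final int failureThreshold;  // consecutive failures before tripping
    private final long openMillis;       // how long to fail fast once tripped
    private int consecutiveFailures = 0;
    private long openedAt = 0;
    private boolean open = false;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    // Run the action if the circuit allows it; otherwise return the fallback.
    // The caller supplies the clock value. For simplicity the lock is held
    // while the action runs, which a production breaker would avoid.
    synchronized <T> T call(Supplier<T> action, Supplier<T> fallback, long nowMillis) {
        if (open && nowMillis - openedAt < openMillis) {
            return fallback.get();          // fail fast while the circuit is open
        }
        try {
            T result = action.get();        // closed, or a trial call after cool-down
            consecutiveFailures = 0;
            open = false;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                open = true;                // trip (or re-trip) the breaker
                openedAt = nowMillis;
            }
            return fallback.get();
        }
    }

    synchronized boolean isOpen() {
        return open;
    }
}
```

While the circuit is open the failing dependency is not called at all, so a slow or broken service cannot tie up the caller's threads; the fallback gives the "best effort" answer instead.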
  52. 52. Model Driven Architecture
     • Traditional Datacenter Practices
       – Lots of unique hand-tweaked systems
       – Hard to enforce patterns
       – Some use of Puppet to automate changes
     • Model Driven Cloud Architecture
       – Perforce/Ivy/Jenkins based builds for everything
       – Every production instance is a pre-baked AMI
       – Every application is managed by an Autoscaler
     Every change is a new AMI
  53. 53. Netflix Cloud Platform
     [Diagram: layered stack, top to bottom]
     • Netflix Applications
     • Netflix Cloud Platform / PaaS
     • AWS Specific Code; Partner Interfaces; Netflix Legacy Datacenter
     • AWS Services; Partner Services; Services
  54. 54. Netflix PaaS Principles
     • Maximum Functionality
       – Developer productivity and agility
     • Leverage as much of AWS as possible
       – AWS is making huge investments in features/scale
     • Interfaces that isolate Apps from AWS
       – Avoid lock-in to specific AWS API details
     • Portability is a long term goal
       – Gets easier as other vendors catch up with AWS
  55. 55. Netflix Global PaaS
     • Architecture Features and Overview
     • Portals and Explorers
     • Platform Services
     • Platform APIs
     • Platform Frameworks
     • Persistence
     • Scalability Benchmark
  56. 56. Global PaaS? Toys are nice, but this is the real thing…
     • Supports all AWS Availability Zones and Regions
     • Supports multiple AWS accounts {test, prod, etc.}
     • Cross Region/Acct Data Replication and Archiving
     • Internationalized, Localized and GeoIP routing
     • Security is fine grain, dynamic AWS keys
     • Autoscaling to thousands of instances
     • Monitoring for millions of metrics
     • Productive for 100s of developers on one product
     • 23M+ users USA, Canada, Latin America, UK, Eire
  57. 57. Basic PaaS Entities
     • AWS Based Entities
       – Instances and Machine Images, Elastic IP Addresses
       – Security Groups, Load Balancers, Autoscale Groups
       – Availability Zones and Geographic Regions
     • Netflix PaaS Entities
       – Applications (registered services)
       – Clusters (versioned Autoscale Groups for an App)
       – Properties (dynamic hierarchical configuration)
  58. 58. Core PaaS Services
     • AWS Based Services
       – S3 storage, to 5TB files, parallel multipart writes
       – SQS – Simple Queue Service. Messaging layer.
     • Netflix Based Services
       – EVCache – memcached based ephemeral cache
       – Cassandra – distributed data store
     • External Services
       – GeoIP Lookup interfaced to a vendor
       – Keystore HSM in Netflix Datacenter
  59. 59. Instance Architecture
     [Diagram: contents of a running instance]
     • Linux Base AMI (CentOS or Ubuntu)
     • Optional Apache frontend, memcached, non-java apps
     • Java (JDK 6 or 7)
     • Tomcat
       – Application servlet, base server, platform, interface jars for dependent services
       – Healthcheck, status servlets, JMX interface, Servo autoscale
     • Monitoring: AppDynamics appagent, AppDynamics machineagent, Epic
     • Log rotation to S3
     • GC and thread dump logging
  60. 60. Security Architecture
     • Instance Level Security baked into base AMI
       – Login: ssh only allowed via portal (not between instances)
       – Each app type runs as its own userid app{test|prod}
     • AWS Security, Identity and Access Management
       – Each app has its own security group (firewall ports)
       – Fine grain user roles and resource ACLs
     • Key Management
       – AWS Keys dynamically provisioned, easy updates
       – High grade app specific key management support
  61. 61. Core Platform Frameworks and APIs
  62. 62. Portals and Explorers
     • Netflix Application Console (NAC)
       – Primary AWS provisioning/config interface
     • AWS Usage Analyzer
       – Breaks down costs by application and resource
     • Cassandra Explorer
       – Browse clusters, keyspaces, column families
     • Base Server Explorer
       – Browse service endpoints configuration, perf
  63. 63. AWS Usage for test, carefully omitting any $ numbers…
  64. 64. Cassandra Explorer
  65. 65. Cassandra Explorer
  66. 66. Platform Services
     • Discovery – service registry for “Applications”
     • Introspection – Entrypoints
     • Cryptex – Dynamic security key management
     • Geo – Geographic IP lookup
     • Platformservice – Dynamic property configuration
     • Localization – manage and lookup local translations
     • Evcache – ephemeral volatile cache
     • Cassandra – Cross zone/region distributed data store
     • Zookeeper – Distributed Coordination (Curator)
     • Various proxies – access to old datacenter stuff
  67. 67. Introspection - Entrypoints
     • REST API for tools, apps, explorers, monkeys…
       – E.g. GET /REST/v1/instance/$INSTANCE_ID
     • AWS Resources
       – Autoscaling Groups, EIP Groups, Instances
     • Netflix PaaS Resources
       – Discovery Applications, Clusters of ASGs, History
  68. 68. Entrypoints Queries
     MongoDB used for low traffic complex queries against complex objects
     Description: Range expression
     • Find all active instances: all()
     • Find all instances associated with a group name: %(cloudmonkey)
     • Find all instances associated with a discovery group: /^cloudmonkey$/discovery()
     • Find all auto scale groups with no instances: asg(),-has(INSTANCES;asg())
     • How many instances are not in an auto scale group? count(all(),-info(eval(INSTANCES;asg())))
     • What groups include an instance? *(i-4e108521)
     • What auto scale groups and elastic load balancers include an instance? filter(TYPE;asg,elb;*(i-4e108521))
     • What instance has a given public ip? filter(PUBLIC_IP;174.129.188.{0..255};all())
  69. 69. Metrics Framework
     • System and Application
       – Collection, Aggregation, Querying and Reporting
       – Non-blocking logging, avoids log4j lock contention
       – Honu-Streaming -> S3 -> EMR -> Hive
     • Performance, Robustness, Monitoring, Analysis
       – Tracers, Counters – explicit code instrumentation log
       – Real Time Tracers/Counters
       – SLA – service level response time percentiles
       – Servo annotated JMX extract to Cloudwatch
     • Latency Monkey Infrastructure
       – Inject random delays into service responses
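The last bullet above, injecting random delays into service responses, can be sketched in a few lines. This is a guess at the general shape only, not the Latency Monkey implementation; the class name, the probability parameter and the uniform delay distribution are assumptions.

```java
import java.util.Random;

// Illustrative only; not the actual Latency Monkey implementation.
final class DelayInjector {
    private final Random random;
    private final double probability;  // chance that a response is delayed
    private final int maxDelayMillis;  // upper bound for a delay; must be positive

    DelayInjector(Random random, double probability, int maxDelayMillis) {
        this.random = random;
        this.probability = probability;
        this.maxDelayMillis = maxDelayMillis;
    }

    // Decide how long the next response should be held back (0 means no delay).
    long nextDelayMillis() {
        if (random.nextDouble() >= probability) {
            return 0;
        }
        return 1 + random.nextInt(maxDelayMillis);
    }

    // Sleep for the chosen delay before the caller sends its response.
    void maybeDelay() throws InterruptedException {
        long delay = nextDelayMillis();
        if (delay > 0) {
            Thread.sleep(delay);
        }
    }
}
```

A service wrapper would call `maybeDelay()` just before returning a response, which lets callers be tested for tolerance to slow dependencies.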
  70. 70. Configuration Management
     • NetflixConfiguration
       – Validation Framework
       – Sitewide Properties Explorer
     • PlatformService
     • Mapping Service
     • ZooKeeper (Curator)
     • InstanceIdentity
  71. 71. Interprocess Communication
     • Discovery Service registry for “applications”
       – “here I am” call every 30s, drop after 3 missed
       – “where is everyone” call
       – Redundant, distributed, moving to Zookeeper
     • NIWS – Netflix Internal Web Service client
       – Software Middle Tier Load Balancer
       – Failure retry moves to next instance
       – Many options for encoding, etc.
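The registration rule on the slide above ("here I am" every 30s, drop after 3 missed) amounts to a lease that expires after about 90 seconds without a heartbeat. The sketch below shows that bookkeeping with hypothetical names; the real Discovery service API is not shown in the slides, and treating exactly 90 seconds of silence as still alive is an arbitrary choice made here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative registry; not the real Discovery service.
final class HeartbeatRegistry {
    static final long HEARTBEAT_MILLIS = 30_000;  // "here I am" interval
    static final int MAX_MISSED = 3;              // drop after 3 missed calls

    // Instance ID mapped to the time of its latest heartbeat.
    private final Map<String, Long> lastSeen = new TreeMap<>();

    // "Here I am": record the time of the latest heartbeat.
    synchronized void heartbeat(String instanceId, long nowMillis) {
        lastSeen.put(instanceId, nowMillis);
    }

    // "Where is everyone": expire stale entries, then list the live ones.
    synchronized List<String> liveInstances(long nowMillis) {
        lastSeen.values().removeIf(seen -> nowMillis - seen > MAX_MISSED * HEARTBEAT_MILLIS);
        return new ArrayList<>(lastSeen.keySet());
    }
}
```

Passing the clock value in as a parameter keeps the example deterministic; a real client would use a scheduler to send the heartbeat on a timer.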
  72. 72. Security Key Management
     • AKMS
       – Dynamic Key Management interface
       – Update AWS keys at runtime, no restart
       – All keys stored securely, none on disk or in AMI
     • Cryptex - Flexible key store
       – Low grade keys processed in client
       – Medium grade keys processed by Cryptex service
       – High grade keys processed by hardware (Ingrian)
  73. 73. AWS Persistence Services
     • SimpleDB
       – Got us started, migrated to Cassandra now
       – NFSDB - Instrumented wrapper library
       – Domain and Item sharding (workarounds)
     • S3
       – Upgraded/Instrumented JetS3t based interface
       – Supports multipart upload and 5TB files
       – Global S3 endpoint management
  74. 74. Aside: Adrian’s Rant on CAP Theorem
     Choose Consistency or Availability when Partitioned
     • Instances and Networks will fail
     • Network failure = Partition “P” is a given
     • Distributed Systems: two choices – CP or AP
     • “Vendor claims CA”
       – Usually they mean available when instances fail
     • Master-Slave = Consistent when Partitioned
       – You can’t write unless you can see the master
     • No-Master = Available when Partitioned
       – Writes proceed, conflicts will be patched up later
  75. 75. What Netflix Needed from NoSQL
  76. 76. Basic Requirements
     • Supports running on Amazon EC2
     • Supports Amazon Availability Zones
     • Low latency, low latency variance
     • High and scalable read and write throughput
     • Large and scalable capacity, no external sharding
     • “AP” Eventually Consistent
     • Data integrity checks and repairs
     • Online Snapshot Backup, Restore/Rollback
  77. 77. Scenario – Immediate Read after Write
     Q1: Is routing and replication zone aware?
     [Diagram: TV Device; Round Robin Load Balancer; API (zone A) and API (zone B); Append New Favorite; read New Favorites List; Favorites (zone A) and Favorites (zone B) linked by Replication]
  78. 78. Network Partition
     Q2: What happens next?
     [Diagram: TV Device; Round Robin Load Balancer; API (zone A) and API (zone B); Append New Favorite; read New Favorites List; Favorites (zone A) and Favorites (zone B) with No Replication]
  79. 79. Network Partition
     Q3: Supports Append vs. Read/Modify/Write?
     [Diagram: TV Device; Round Robin Load Balancer; API (zone A) and API (zone B); RMW of Old Favorites List into New Favorites List; Favorites (zone A) and Favorites (zone B) linked by Replication]
  80. 80. Silent Data Corruption
     Q4: How is it detected and corrected?
     [Diagram: TV Device; Round Robin Load Balancer; API (zone A) and API (zone B); Append New Favorite; read New Favorites List; Favorites (zone A) and Favorites (zone B); Replication corrupted on disk or via network]
  81. 81. Netflix Platform Persistence
     • Ephemeral Volatile Cache – evcache
       – Discovery-aware memcached based backend
       – Client abstractions for zone aware replication
       – Option to write to all zones, fast read from local
     • Cassandra
       – Highly available and scalable (more later…)
     • MongoDB
       – Complex object/query model for small scale use
     • MySQL
       – Hard to scale, legacy and small relational models
  82. 82. Why Cassandra?
     • We value Availability over Consistency – AP
       – Cassandra is a Java distributed systems toolkit
     • We have a building full of Java engineers
       – Riak is in Erlang – a blessing and a curse…
     • We want FOSS + Support
       – Voldemort doesn’t have a support model
     • Writes are intrinsically harder than reads
       – HBase is CP, optimized for reads & single namenode issues
     • Cassandra works, running ~55 clusters
       – Step by step into full production over the last year
  83. 83. Priam – Cassandra Automation
     Available at http://…tflix
     • Netflix Platform Tomcat Code
     • Zero touch auto-configuration
     • State management for Cassandra JVM
     • Token allocation and assignment
     • Broken node auto-replacement
     • Full and incremental backup to S3
     • Restore sequencing from S3
     • Grow/Shrink Cassandra “ring”
  84. 84. Astyanax
     Available at http://…tflix
     • Cassandra Java client
     • API abstraction on top of Thrift protocol
     • “Fixed” Connection Pool abstraction (vs. Hector)
       – Round robin with Failover
       – Retry-able operations not tied to a connection
       – Netflix PaaS Discovery service integration
       – Host reconnect (fixed interval or exponential backoff)
       – Token aware to save a network hop – lower latency
       – Latency aware to avoid compacting/repairing nodes – lower variance
     • Batch mutation: set, put, delete, increment
     • Simplified use of serializers via method overloading (vs. Hector)
     • ConnectionPoolMonitor interface for counters and tracers
     • Composite Column Names replacing deprecated SuperColumns
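"Round robin with failover" combined with "retry-able operations not tied to a connection" means an operation is expressed independently of any one host, so a failure can simply be retried on the next host in the rotation. The sketch below illustrates that idea in plain Java. It is not the Astyanax connection pool; `RoundRobinFailover` and its `execute` method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Illustrative only; not the Astyanax connection pool.
final class RoundRobinFailover {
    private final List<String> hosts;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinFailover(List<String> hosts) {
        this.hosts = new ArrayList<>(hosts);
    }

    // Try the operation on successive hosts until one succeeds.
    // The operation receives the host to use, so it is not tied to a connection.
    <T> T execute(Function<String, T> operation) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < hosts.size(); attempt++) {
            int index = Math.floorMod(next.getAndIncrement(), hosts.size());
            String host = hosts.get(index);
            try {
                return operation.apply(host);
            } catch (RuntimeException e) {
                last = e;  // remember the failure and move on to the next host
            }
        }
        throw new IllegalStateException("All hosts failed", last);
    }
}
```

Each call starts where the previous one left off, which spreads load across hosts, and a host that throws is skipped for that call only.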
  85. 85. Initializing Astyanax
          // Configuration either set in code or nfastyanax.properties
          platform.ListOfComponentsToInit=LOGGING,APPINFO,DISCOVERY
          netflix.environment=test
          default.astyanax.readConsistency=CL_QUORUM
          default.astyanax.writeConsistency=CL_QUORUM
          MyCluster.MyKeyspace.astyanax.servers=

          // Must initialize platform for discovery to work
          NFLibraryManager.initLibrary(PlatformManager.class, props, false, true);
          NFLibraryManager.initLibrary(NFAstyanaxManager.class, props, true, false);

          // Open a keyspace instance
          Keyspace keyspace = KeyspaceFactory.openKeyspace("MyCluster", "MyKeyspace");
  86. 86. Astyanax Query Example
     Paginate through all columns in a row
          ColumnList<String> columns;
          int pageSize = 10;
          try {
              RowQuery<String, String> query = keyspace
                  .prepareQuery(CF_STANDARD1)
                  .getKey("A")
                  .setIsPaginating()
                  .withColumnRange(new RangeBuilder().setMaxSize(pageSize).build());

              // Each execute() returns the next page; an empty page ends the loop
              while (!(columns = query.execute().getResult()).isEmpty()) {
                  for (Column<String> c : columns) {
                  }
              }
          } catch (ConnectionException e) {
          }
  87. 87. Data Migration to Cassandra
  88. 88. Distributed Key-Value Stores
     • Cloud has many key-value data stores
       – More complex to keep track of, do backups etc.
       – Each store is much simpler to administer (no DBA)
       – Joins take place in Java code
     • No schema to change, no scheduled downtime
     • Latency for typical queries
       – Memcached is dominated by network latency <1ms
       – Cassandra takes a few milliseconds
       – SimpleDB replication and REST auth overheads >10ms
  89. 89. Multi-Regional Data Replication
     • IR Framework – Datacenter Item Replicator
       – Built in 2009, first step to the cloud
       – Oracle to SimpleDB or Cassandra via poll and push
       – Return updates to Oracle via SQS message queue
     • SimpleDB or S3 to Cassandra
       – Data migration tool for global Netflix
     • Global SimpleDB and S3 Replication
       – Cross region async updates USA to Europe
  90. 90. Transitional Steps
     • Bidirectional Replication
       – Oracle to SimpleDB
       – Queued reverse path using SQS
       – Backups remain in Datacenter via Oracle
     • New Cloud-Only Data Sources
       – Cassandra based
       – No replication to Datacenter
       – Backups performed in the cloud
  91. 91. [Diagram: transitional architecture. AWS EC2: Front End Load Balancer, Discovery Service, API Proxy, Load Balancer, API, Component Services, memcached, Cassandra on EC2 Internal Disks, S3, SimpleDB. Netflix Data Center: API etc., Oracle, memcached. Linked by SQS and Replication.]
  92. 92. Cutting the Umbilical
     • Transition Oracle Data Sources to Cassandra
       – Offload Datacenter Oracle hardware
       – Free up capacity for growth of remaining services
     • Transition SimpleDB+Memcached to Cassandra
       – Primary data sources that need backup
       – Keep simplest small use cases for now
     • New challenges
       – Backup, restore, archive, business continuity
       – Business Intelligence integration
  93. 93. [Diagram: cloud-only architecture. AWS EC2: Front End Load Balancer, Discovery Service, API Proxy, Load Balancer, API, Component Services, memcached, Cassandra on EC2 Internal Disks, Backup to S3, SimpleDB.]
  94. 94. High Availability
     • Cassandra stores 3 local copies, 1 per zone
       – Synchronous access, durable, highly available
       – Read/Write One fastest, least consistent - ~1ms
       – Read/Write Quorum 2 of 3, consistent - ~3ms
     • AWS Availability Zones
       – Separate buildings
       – Separate power etc.
       – Fairly close together
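The "Quorum 2 of 3, consistent" line follows from simple counting: with N replicas, a read of R replicas must overlap a write acknowledged by W replicas whenever R + W > N. The small sketch below states that arithmetic; the class and method names are hypothetical.

```java
// Illustrative arithmetic only; names are hypothetical.
final class QuorumMath {
    // Smallest majority of the given number of replicas.
    static int quorum(int replicas) {
        return replicas / 2 + 1;
    }

    // True when every read set of size r must share at least one replica
    // with every write set of size w, out of n replicas in total.
    static boolean readOverlapsWrite(int n, int r, int w) {
        return r + w > n;
    }
}
```

For three copies the quorum is 2, and 2 + 2 > 3, so a quorum read always includes at least one replica that took the latest quorum write. Reading and writing at level One gives 1 + 1 = 2, which is not greater than 3, so a read can miss the latest write; that is the "fastest, least consistent" option on the slide.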
  95. 95. “Traditional” Cassandra Write Data Flows: Single Region, Multiple Availability Zone, Not Token Aware 1. Client writes to any Cassandra node 2. Coordinator node replicates to nodes and zones 3. Nodes return ack to coordinator 4. Coordinator returns ack to client 5. Data written to internal commit log disk (no more than 10 seconds later) If a node goes offline, hinted handoff completes the write when the node comes back up. Requests can choose to wait for one node, a quorum, or all nodes to ack the write. SSTable disk writes and compactions occur asynchronously. [Diagram: non token aware clients writing to a ring of six Cassandra nodes with disks, two each in Zones A, B and C, annotated with the step numbers]
  96. 96. Astyanax - Cassandra Write Data Flows: Single Region, Multiple Availability Zone, Token Aware 1. Client writes to nodes and zones 2. Nodes return ack to client 3. Data written to internal commit log disks (no more than 10 seconds later) If a node goes offline, hinted handoff completes the write when the node comes back up. Requests can choose to wait for one node, a quorum, or all nodes to ack the write. SSTable disk writes and compactions occur asynchronously. [Diagram: token aware clients writing directly to a ring of six Cassandra nodes with disks, two each in Zones A, B and C, annotated with the step numbers]
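The difference between the two write flows above is that a token aware client works out which nodes own a key and writes to them directly, skipping the coordinator hop. A toy sketch of that lookup follows. It assumes an MD5-derived token, a tiny token space, and a simplified "one replica per zone, walking the ring clockwise" placement; the `Ring` class and its methods are hypothetical and are not the Astyanax API or Cassandra's actual replication strategy code.

```python
import bisect
import hashlib


class Ring:
    """Toy token ring; nodes are (token, name, zone) tuples."""

    def __init__(self, nodes, token_space):
        self.nodes = sorted(nodes)
        self.tokens = [token for token, _, _ in self.nodes]
        self.token_space = token_space

    def token_for(self, key: str) -> int:
        # Assumption: hash the row key with MD5 and fold it into the token space.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % self.token_space

    def replicas_for_token(self, token: int, copies: int = 3):
        # The first node whose token is >= the key's token owns it (wrapping around).
        start = bisect.bisect_left(self.tokens, token) % len(self.nodes)
        chosen, zones = [], set()
        for step in range(len(self.nodes)):
            _, name, zone = self.nodes[(start + step) % len(self.nodes)]
            if zone not in zones:  # keep one copy per availability zone
                chosen.append(name)
                zones.add(zone)
            if len(chosen) == copies:
                break
        return chosen

    def replicas_for_key(self, key: str, copies: int = 3):
        return self.replicas_for_token(self.token_for(key), copies)


ring = Ring(
    [(0, "a1", "A"), (10, "b1", "B"), (20, "c1", "C"),
     (30, "a2", "A"), (40, "b2", "B"), (50, "c2", "C")],
    token_space=60,
)
print(ring.replicas_for_token(15))
print(ring.replicas_for_key("customer:1234"))
```

A non token aware client would send the same write to an arbitrary node and let it forward the data, which is the extra hop shown as steps 2 to 4 in the "traditional" flow.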
  97. 97. Data Flows for Multi-Region Writes: Token Aware, Consistency Level = Local Quorum 1. Client writes to local replicas 2. Local write acks returned to client, which continues when 2 of 3 local nodes are committed 3. Local coordinator writes to remote coordinator 4. When data arrives, remote coordinator node acks and copies to other remote zones 5. Remote nodes ack to local coordinator 6. Data flushed to internal commit log disks (no more than 10 seconds later) If a node or region goes offline, hinted handoff completes the write when the node comes back up. Nightly global compare and repair jobs ensure everything stays consistent. [Diagram: US clients and EU clients, each with a ring of six Cassandra nodes with disks, two each in Zones A, B and C, linked with 100+ms latency and annotated with the step numbers]
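The reason Local Quorum keeps the cross-region link out of the client's write latency can be shown with a little arithmetic over ack arrival times. This is a sketch only: the function is hypothetical, and the millisecond figures are made-up values in the spirit of the ~1ms, ~3ms and 100+ms numbers on the slides.

```python
def client_wait_ms(local_acks_ms, remote_acks_ms, level):
    """Time at which the client may continue, given when each replica acks."""
    local = sorted(local_acks_ms)
    everything = sorted(list(local_acks_ms) + list(remote_acks_ms))
    if level == "ONE":
        return everything[0]            # first ack from any replica
    if level == "LOCAL_QUORUM":
        return local[len(local) // 2]   # majority of local replicas, e.g. 2nd of 3
    if level == "ALL":
        return everything[-1]           # slowest replica in any region
    raise ValueError(f"unknown consistency level: {level}")


local = [1.0, 3.0, 9.0]          # illustrative acks from the three local zones
remote = [110.0, 115.0, 120.0]   # illustrative acks from the remote region
for level in ("ONE", "LOCAL_QUORUM", "ALL"):
    print(level, client_wait_ms(local, remote, level))
```

Only a level that waits on remote replicas pays the inter-region round trip; at Local Quorum the remote copies are filled in asynchronously, as steps 3 to 5 of the slide describe.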
  98. 98. Remote Copies • Cassandra duplicates across AWS regions – Asynchronous write, replicates at destination – Doesn’t directly affect local read/write latency • Global Coverage – Business agility – Follow AWS… • Local Access – Better latency – Fault Isolation
  99. 99. Cassandra Backup • Full Backup – Time based snapshot – SSTable compress -> S3 • Incremental – SSTable write triggers compressed copy to S3 • Archive – Copy cross region [Diagram: ring of Cassandra nodes sending backups to S3, with an archive copy]
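The "SSTable compress -> S3" step above can be sketched as two small pieces: compress the SSTable file, and derive an object key that separates full snapshots from incrementals. Everything here is an assumption made for illustration: gzip as the compression format, the key layout, and both function names. The slides do not specify these details, and the upload itself is left out so that no particular S3 client API is implied.

```python
import gzip
import shutil
from pathlib import Path


def backup_key(cluster: str, token: str, kind: str, stamp: str, sstable: str) -> str:
    """Hypothetical S3 key layout: cluster/token/kind/timestamp/file.gz."""
    return f"{cluster}/{token}/{kind}/{stamp}/{sstable}.gz"


def compress_sstable(src: Path, dest_dir: Path) -> Path:
    """Write a gzip-compressed copy of one SSTable file and return its path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / (src.name + ".gz")
    with src.open("rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # stream, so large SSTables are not held in memory
    return dest
```

A full backup would run this over every file in a snapshot with `kind="full"`, while the incremental path would run it once for each newly written SSTable with `kind="incremental"`.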
  100. 100. Cassandra Restore • Full Restore – Replace previous data • New Ring from Backup – New name old data • Scripted – Create new instances – Parallel load - fast [Diagram: S3 backup restoring to a ring of Cassandra nodes]
  101. 101. Cassandra Online Analytics • Brisk = Hadoop + Cass – “Cassandra Enterprise” – Use split Brisk ring – Size each separately • Direct Access – Keyspaces – Hive/Pig/Map-Reduce – Hdfs as a keyspace – Distributed namenode [Diagram: ring of Cassandra nodes with Brisk nodes and an S3 backup]
  102. 102. ETL for Cassandra • Data is de-normalized over many clusters! • Too many to restore from backups for ETL • Solution – read backup files using Hadoop • Aegisthus – http://$-bulk-data-pipeline-out-of.html – High throughput raw SSTable processing – Re-normalizes many clusters to a consistent view – Extract, Transform, then Load into Teradata
  103. 103. Cassandra Archive: Appropriate level of paranoia needed… • Archive could be un-readable – Restore S3 backups weekly from prod to test, and daily ETL • Archive could be stolen – PGP Encrypt archive • AWS East Region could have a problem – Copy data to AWS West • Production AWS Account could have an issue – Separate Archive account with no-delete S3 ACL • AWS S3 could have a global problem – Create an extra copy on a different cloud vendor…
  104. 104. Extending to Multi-Region: In production for UK/Eire support 1. Create cluster in EU 2. Backup US cluster to S3 3. Restore backup in EU 4. Local repair EU cluster 5. Global repair/join Take a Boeing 737 on a domestic flight, upgrade it to a 747 by adding more engines, fuel and bigger wings and fly it to Europe without landing it on the way… [Diagram: US clients and EU clients, each with a ring of six Cassandra nodes with disks, two each in Zones A, B and C, linked with 100+ms latency, with S3 between the regions and annotated with the step numbers]
  105. 105. Tools and Automation • Developer and Build Tools – Jira, Perforce, Eclipse, Jenkins, Ivy, Artifactory – Builds, creates .war file, .rpm, bakes AMI and launches • Custom Netflix Application Console – AWS Features at Enterprise Scale (hide the AWS security keys!) – Auto Scaler Group is unit of deployment to production • Open Source + Support – Apache, Tomcat, Cassandra, Hadoop – Datastax support for Cassandra, AWS support for Hadoop via EMR • Monitoring Tools – Alert processing gateway into Pagerduty – AppDynamics – Developer focus for cloud http://
  106. 106. NoSQL Developer Migration • Jason Brown @jasobrown – Cassandra from the Trenches –$lix • Mark Atwood, “Guide to NoSQL, redux” – YouTube http://
  107. 107. Open Sourcing the Netflix PaaS
  108. 108. Open Source Strategy • Release PaaS Components git-by-git – Source at$lix – Intros and techniques at$ – Blog post or new code every week or so • Motivations – Give back to Apache licensed OSS community – Motivate, retain, hire top engineers – Create a community that adds features and fixes
  109. 109. Current OSS Projects and Posts [Diagram legend: Github / Techblog, Apache Project, Techblog Post] Priam, Exhibitor, Servo, Astyanax, Curator, Autoscaling scripts, CassJMeter, Zookeeper, Honu, Cassandra, EVCache, Circuit Breaker, Aegisthus
  110. 110. Takeaway: Netflix has built and deployed a scalable global Platform as a Service. Key components of the Netflix PaaS are being released as Open Source projects so you can build your own custom PaaS. http://$lix http://$ http://$lix http:// @adrianco #netflixcloud End of Part 2 of 3