How Parse Built a Mobile Backend as a Service on AWS (MBL307) | AWS re:Invent 2013

  1. How Parse built a mobile backend as a service
     Charity Majors
     November 14, 2013
     © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  2. What is Parse?
     • Platform for mobile developers
     • iOS, Android, WinRT
     • API and native SDKs
     • Scales automatically to handle traffic
     • Analytics, cloud code, file storage, push notifications, hosting
  3. Parse is magic.
  4. Parse is built on AWS
     • Parse has never touched bare metal
     • Recently acquired by Facebook
     • Current plan is to stay on AWS
     • We love AWS!
  5. Parse is growing fast
     • Developers
     • Apps
     • API usage
     • Nodes and compute resources
     • Connected devices
  6–7. [Chart: developer growth by month, June 2011 through September 2013; the second slide marks the point when Parse was acquired]
  8. [Charts] Top left: Parse grid load, last year. Top right: number of hits, last year. Bottom left: active PPNS connections, last year.
  9. 1.5 years ago
  10. Now
  11. Parse ops philosophy
      • Work smarter, not harder
      • Small team, full-stack generalists
      • Automate, automate, automate
      • Our goal: 80% of time working on things we want to do, 20% of time working on things we have to do
  12. Past & Present
      October 2012:
      • 60% of time spent on must-dos, 40% on want-to-dos
      • ~400 event alerts
      • Very sleepy opsen
      October 2013:
      • 20% of time spent on must-dos, 80% on want-to-dos
      • ~100 event alerts (mostly daytime)
      • Infra complexity has 5x'd, but the time to manage it has dropped
      • We have shifted a lot of work from ourselves to AWS
  13. Takeaways
      • ASGs are your best friend
      • Automation should be reusable
      • Choose your source of truth carefully
  14. Parse stack
  15. [diagram]
  16. Infrastructure design choices
      • Chef
      • Amazon Route 53
      • Use real hostnames
      • Distribute evenly across 3 AZs
      • Fail over automatically
      • Single source of truth
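
The "real hostnames" plus "fail over automatically" combination generally means short-TTL DNS records managed programmatically. Here is a minimal sketch using the modern aws-sdk-route53 gem (the 2013-era Ruby SDK had a different API); the zone ID, hostname, and IP are hypothetical.

    # Sketch: register a human-readable hostname in Route 53.
    require 'aws-sdk-route53'

    route53 = Aws::Route53::Client.new(region: 'us-east-1')

    route53.change_resource_record_sets(
      hosted_zone_id: 'Z2EXAMPLE',           # hypothetical zone
      change_batch: {
        changes: [{
          action: 'UPSERT',                  # create or update in one call
          resource_record_set: {
            name: 'api42.example.internal',  # real per-host name
            type: 'A',
            ttl: 60,                         # short TTL for fast failover
            resource_records: [{ value: '10.0.1.42' }]
          }
        }]
      }
    )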
  17. Amazon EC2 design choices
      • Standardize on a few instance types
        • Makes reserved instances more efficient
        • We use m1.large, m1.xlarge, m2.4xlarge (multi-core is a must)
        • Prefer many small disposable instances for stateless services
      • Security groups
        • One group per role
        • Verify the working set against the expected set using git/nagios
      • All inbound requests come through Elastic Load Balancing
        • Nothing talks directly to Amazon EC2 instances
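
A minimal sketch of the "verify working set vs. expected set" check: diff the security groups EC2 actually has against an expected list kept in git, Nagios-style. The file path, group names, and exit-code convention are assumptions; the deck only says this is done with git and nagios.

    # Sketch: Nagios-style check that EC2 security groups match git.
    require 'aws-sdk-ec2'
    require 'json'

    expected = JSON.parse(File.read('security_groups.json'))  # checked into git

    ec2    = Aws::EC2::Client.new(region: 'us-east-1')
    actual = ec2.describe_security_groups.security_groups.map(&:group_name)

    missing = expected - actual
    extra   = actual - expected

    if missing.empty? && extra.empty?
      puts 'OK: security groups match expected set'
      exit 0
    else
      puts "CRITICAL: missing=#{missing.inspect} extra=#{extra.inspect}"
      exit 2   # Nagios critical
    end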
  18. [diagram]
  19. API path
      • Elastic Load Balancing
      • nginx
      • haproxy
      • Ruby app servers (unicorns)
      • Go API servers (go rewrite from the ground up)
      • Go logging servers to FB endpoint
  20. [diagram]
  21. Hosting
      • Elastic Load Balancing
      • Elastic IPs for the apex domain redirect service
      • Go service that wraps cloud code and Amazon S3
  22. [diagram]
  23. Cloud code
      • Server-side JavaScript in a v8 virtual machine
      • Third-party modules for partners (Stripe, Twilio, etc.)
      • Restrictive security groups
      • Scrub IPs with squid
  24. Push
      • Resque on redis
      • Billions of pushes per month
      • 700/sec steady state
      • Spikes to 10k/sec (15x burst)
      • PPNS holds sockets open to all Android devices
      • PDNS to serve Android phone-home IPs
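
A minimal sketch of queueing push work with Resque on redis, as the deck describes. The job class, queue name, payload shape, and the PushSender delivery call are hypothetical.

    # Sketch: a Resque job for push delivery.
    require 'resque'

    class PushJob
      @queue = :push   # the Resque queue workers pull from

      # Workers invoke this with the args passed to Resque.enqueue
      def self.perform(device_token, payload)
        # Hypothetical sender; the real one would speak APNS/PPNS.
        PushSender.deliver(device_token, payload)
      end
    end

    # Enqueue from an app server; a pool of workers drains the queue,
    # which is how short 15x bursts get absorbed.
    Resque.enqueue(PushJob, 'a1b2c3token', 'alert' => 'Hello from Parse!')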
  25. [diagram]
  26. MongoDB
      • 12 replica sets, ~50 nodes, 2–4 TB per rs
      • Over 1M collections
      • Over 170k schemas
      • Autoindexing of keys based on entropy
      • Compute compound indexes from real traffic analysis
      • Implemented our own app-level sharding
      • PIOPS (striped RAID, 2000–8000 PIOPS/vol) totally saved our bacon. Amazon EBS was a killer.
      • Fully instrumented provisioning with chef
  27. Memcache
      • Pool of memcaches with consistent hashing
      • I would use ElastiCache instead next time
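
The deck doesn't name a client, but the standard Ruby approach is the dalli gem, which places servers on a consistent-hash ring. A minimal sketch, with hypothetical hostnames:

    # Sketch: a memcached pool with consistent hashing via dalli.
    require 'dalli'

    servers = %w[
      cache1.example.internal:11211
      cache2.example.internal:11211
      cache3.example.internal:11211
    ]

    cache = Dalli::Client.new(servers)

    # Each key maps to one server on the ring; losing a server only
    # remaps the keys that lived on it, not the whole keyspace.
    cache.set('user:42', 'cached blob', 300)   # 300s TTL
    value = cache.get('user:42')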
  28. Redis
      • Queueing using resque
      • Android outboxes
      • Single-threaded
      • Just started playing with ElastiCache redis
  29. MySQL
      • Trivially tiny, and we would love to get rid of it... but rails
      • Considered Amazon RDS:
        • No chained replication
        • Visibility is challenging
        • Even tiny periodic blips impact the API
        • ... but AZ failover would be sooo nice
  30. Cassandra
      • Powers the front-end Parse Analytics
      • Super fast writes and increments
      • 12-node cluster of m2.4xlarge
      • Ephemeral storage
      • Cheap, and won our benchmarks
  31. Cassandra + Priam
      • Initial token assignments
      • Incremental backups to Amazon S3
      • Uses Auto Scaling groups
      • Amazon SimpleDB for tracking tokens and instance identities
      • Non-trivial to set up, but WORTH IT
  32. Infrastructure
  33–34. First-generation infrastructure
      Characteristics:
      • Ruby on Rails everywhere
      • Chef to build AMIs
      • Chef role per service
      • Capistrano to deploy code
      • Source of truth: git
      Effects:
      • Sooo much hand-editing
      • Make the same change in many places
      • Full deploy and restart any time a single host is added or removed
      • Fine for small static host sets
  35–36. How to deploy 20 new servers:
      • Run 20 knife-ec2 commands to launch 20 hosts
      • Edit the cap deploy file
      • Edit the yml files, push to git
      • Do a cap cold deploy to the new hosts
      • Do a full deploy/restart to all the services that need to talk to the new hosts
      Total time elapsed: 1.5–2.5 hours. OMG, not ok.
  37. PROBLEMS
      • Babysitting
      • Maintaining machine lists by hand
      • No consistent human-readable host naming
      • Requires a full code deploy to add a single node
      • Humans have to know things and make decisions
  38–39. Second-generation infrastructure
      Characteristics:
      • Ruby on Rails everywhere
      • Chef to configure systems
      • Chef to generate host lists
      • Capistrano to deploy code
      • Source of truth: chef
      Effects:
      • YML files, haproxy configs, etc. generated on every chef run
      • No longer need to do full deploys to affected services, just restart
      • Only one set of files to maintain by hand (capistrano)
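
A minimal sketch of "chef to generate host lists": a recipe that searches the Chef server for nodes in a role and renders them into a config file, restarting only the consumers. The role name, template, and output path are hypothetical.

    # Sketch: Chef recipe that turns node search results into a host list.
    service 'haproxy' do
      action :nothing   # only restarted when the list changes
    end

    api_hosts = search(:node, 'role:parse-api').map { |n| n['fqdn'] }.sort

    template '/etc/parse/api_hosts.yml' do
      source 'api_hosts.yml.erb'
      variables(hosts: api_hosts)
      # Restart the consumer instead of doing a full redeploy.
      notifies :restart, 'service[haproxy]', :delayed
    end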
  40–41. How to deploy 20 new servers:
      • Run 20 knife-ec2 commands to launch 20 hosts
      • Edit the cap deploy file
      • Do a cap cold deploy to the new hosts
      • Let chef-client run to generate the YML files
      • Restart services that need to talk to the new hosts
      Total time elapsed: 30–60 minutes. STILL not ok!
  42. what are our primary goals?
      • Scale up any class of service in < 5 minutes
      • Automatically detect new nodes
      • Automatically remove downed nodes from service
      • No hand-maintained lists ANYWHERE (ugh)
      • Deploy fast: no time to build AMIs
      • Option of deploying from master
      • Design a new deploy process for go binaries
  43. putting together a solution
      Auto Scaling Groups:
      • Each service lives in an ASG
      • Same AMI used for most services
      • Base AMI generated by chef
      • System state managed by chef
      • ASG named after chef role
      Jenkins + Amazon S3:
      • Runs unit tests
      • Generates a tarball artifact for each successful build
      • Uploads to Amazon S3, tagged with the build # and role
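
A minimal sketch of the Jenkins post-build step: upload the build tarball to S3, keyed by role and build number so auto-deploy can find it later. The bucket name, key layout, and CHEF_ROLE variable are hypothetical (BUILD_NUMBER is a standard Jenkins variable).

    # Sketch: publish a build artifact to S3, tagged by role and build #.
    require 'aws-sdk-s3'

    role  = ENV.fetch('CHEF_ROLE')      # e.g. "parse-api" (assumption)
    build = ENV.fetch('BUILD_NUMBER')   # set by Jenkins

    s3 = Aws::S3::Client.new(region: 'us-east-1')

    File.open("build-#{build}.tar.gz", 'rb') do |tarball|
      s3.put_object(
        bucket: 'parse-build-artifacts',         # hypothetical bucket
        key:    "#{role}/#{build}/app.tar.gz",   # role + build # in the key
        body:   tarball
      )
    end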
  44. autoification
      auto-bootstrap:
      • Runs on first boot
      • Infers chef role from the ASG name
      • Generates a client.rb and initial runlist
      • Bootstraps chef
      • Registers DNS with Amazon Route 53, holding a lock from zookeeper so DNS registration is atomic
      auto-deploy:
      • Grabs the chef role from the ASG name
      • Pulls the build artifact from Amazon S3
      • Unpacks the tarball, restarts
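
A minimal sketch of "infer chef role from the ASG name" at first boot: ask the instance metadata service who we are, then ask Auto Scaling which group we belong to. The naming convention (ASG name == chef role) is from the deck; the client.rb contents and server URL are assumptions.

    # Sketch: first-boot role inference from the ASG name.
    require 'net/http'
    require 'aws-sdk-autoscaling'

    # EC2 instance metadata endpoint (IMDSv1 style, as in 2013)
    instance_id = Net::HTTP.get(
      URI('http://169.254.169.254/latest/meta-data/instance-id')
    )

    autoscaling = Aws::AutoScaling::Client.new(region: 'us-east-1')
    instance = autoscaling.describe_auto_scaling_instances(
      instance_ids: [instance_id]
    ).auto_scaling_instances.first

    chef_role = instance.auto_scaling_group_name   # e.g. "parse-api"

    # Hypothetical client.rb for the first chef-client run
    File.write('/etc/chef/client.rb', <<~CONF)
      node_name "#{chef_role}-#{instance_id}"
      chef_server_url "https://chef.example.internal"
    CONF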
  45–46. a better source of truth
      zookeeper:
      • We LOVE zookeeper!!
      • Service registration, service discovery
      • Distributed locking
      • Coordinated actions, unique ids
      how it works:
      • zkwatcher detects the service is up and establishes an ephemeral node in zk
      • Or the service registers itself
      • When the ephemeral node goes away, the service gets deregistered
      • Capistrano asks zookeeper for the list of alive servers to deploy to
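
A minimal sketch of ephemeral-node service registration with the zk gem. zkwatcher itself is Parse's own tool, so this only shows the underlying zookeeper mechanics; the path layout and payload are hypothetical.

    # Sketch: register a service in zookeeper with an ephemeral node.
    require 'zk'
    require 'socket'

    zk = ZK.new('zk1.example.internal:2181')
    zk.mkdir_p('/services/parse-api')

    # Ephemeral nodes vanish automatically when this session dies, which
    # deregisters a downed server without any hand-maintained list.
    zk.create("/services/parse-api/#{Socket.gethostname}", 'up',
              mode: :ephemeral)

    # A deployer (e.g. capistrano) asks for the currently alive servers:
    alive = zk.children('/services/parse-api')
    puts "deploying to: #{alive.join(', ')}"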
  47–48. Third-generation infrastructure
      Characteristics:
      • Some go, some ruby
      • Chef to maintain state
      • ASG per chef role
      • Capistrano + zk + jenkins
      • Source of truth: zookeeper + Amazon S3
      Effects:
      • No lists of hosts
      • No manual labor
      • Happy opsen
  49–50. Deploy 20 new servers:
      • Adjust the size of the ASG
      • Have a cocktail
      Total time elapsed: 5–10 minutes. YAY!
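
A minimal sketch of that entire deploy: bump the ASG's desired capacity and let auto-bootstrap and auto-deploy do the rest. The group name and sizes are hypothetical.

    # Sketch: the whole third-generation "deploy 20 servers" step.
    require 'aws-sdk-autoscaling'

    autoscaling = Aws::AutoScaling::Client.new(region: 'us-east-1')

    autoscaling.set_desired_capacity(
      auto_scaling_group_name: 'parse-api',   # ASG named after the chef role
      desired_capacity: 40,                   # was 20; launch 20 more
      honor_cooldown: false
    )
    # Now have a cocktail.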
  51. ASG caveats
      • Amazon CloudWatch triggers are minimally useful for us
        • Our bursts are usually too short and sharp
        • No periodicity to our traffic patterns
        • ... but we are lazy, so we would like to add them anyway
      • Need more tooling around downsizing ASGs gracefully
      • Initial chef run may take 5–7 minutes
        • Could someday optimize this
        • Or eat the overhead of building AMIs with each successful jenkins build
  52. Remaining issues
      • When we get rid of ruby, get rid of cap
        • Just use auto-deploy for everything
        • Trigger a deploy by updating the build version # in zookeeper
      • Automatic failover for mysql and redis
      • Move everything into VPC
        • ASGs will really help with this!
        • Then we can use internal load balancers instead of haproxy. Want badly.
  53. Takeaways
      • Single source of truth, or multiple sources of lies
      • The more real-time your source of truth, the faster your response time can be
      • ASGs are amazing <3 <3
  54. Q&A
  55. Please give us your feedback on this presentation (MBL307). As a thank you, we will select prize winners daily for completed surveys! Thank you.