Scaling and Distributing


Building and scaling distributed infrastructure. Presented at OdessaPy on Dec 7, 2013.



  1. 1. Scaling and Distributing. Dima Nedbaylo, Dima Malenko
  2. 2. So, you want to tell us about distributed systems?
  3. 3. OpenOffice.org on iPad
  4. 4. Systems are distributed for a reason. Our reasons were the following: • low latency for end-user connections • horizontal scalability • availability
  5. 5. Systems are distributed for a reason. Our reasons were the following: • low latency for end-user connections • horizontal scalability • availability
  6. 6. Story number one: Practical
  7. 7. We have to be close to the user. Every 5 ms of ping between the user and the server counts
  8. 8. Computing power provider with multiple locations and great management capabilities?
  9. 9. Computing power provider with multiple locations and great management capabilities?
  10. 10. [Only] 8 locations • [Just] $0.6/hour • c3.2xlarge: 14 ECU, 15 GB RAM, 2*80 GB SSD • $0.6 * 720 = $432/mo
  11. 11. So, you’ve got a new server… • Minimal setup of Ubuntu 12.04 • Magic Fabric script to turn minimal Ubuntu into a rollApp app server • Works great and gets a server up and running within a couple of hours
  12. 12. Now you’ve got 10 servers… • And need to update one of your components • Or run a maintenance procedure • Or install a couple of new packages • Or correct a config
  13. 13. Ansible http://www.ansibleworks.com • Learned in just 1 hour • In 2 hours we had a script to set up a new app server • Way better and more reliable than Fabric for the same purpose • Way easier than Chef
  14. 14. Ansible vs. Fabric • Requires a hosts inventory database • If you need something very custom, you have to code in Python
  15. 15. Ansible Ad-hoc Ansible has a lot of modules for ad-hoc mode: • downloading/uploading files • managing packages, users, services • launching EC2 instances and other AWS stuff • database and DB user operations
  16. 16. Inventory Example
      [appservers_eu]
      appsrv-007.rollapp.com ansible_ssh_user=guess_who ansible_ssh_private_key_file=...
      [appservers:children]
      appservers_eu
      [appservers:vars]
      root_password='password'
  17. 17. Playbooks • Plain YAML file • Contains server configuration • In essence, it is just a set of tasks that invoke Ansible modules
  18. 18. Simple Playbook
      ---
      - include: playbooks/timezone.yml
      - include: playbooks/ntp.yml
      - hosts: appservers
        sudo: yes
        vars:
          prefix_dir: /opt
        tasks:
          # user: dnedbaylo
          - name: create dnedbaylo user
            user: name=dnedbaylo groups=admin shell=/bin/zsh
          - name: authorized keys for dnedbaylo
            authorized_key: user=dnedbaylo key="ssh-rsa …."
  19. 19. Ansible vs. Fabric: playbook mode • Plain language (YAML) for playbooks • A lot of modules ready to use (creating EC2 instances, user management, apt repository management, etc.) • No need to worry about details (does the user already exist?) • Playbooks are “idempotent”: running one again on an already configured server changes nothing
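To make "idempotent" concrete, here is a minimal Python sketch of the check-then-change pattern that a configuration task follows. This is not Ansible code; the function name `ensure_line` and the file it edits are hypothetical, chosen only for illustration.

```python
import os
import tempfile


def ensure_line(path, line):
    """Make sure `line` is present in the file at `path`.

    Returns True if the file had to be changed and False if it was
    already in the desired state, so repeated calls are harmless.
    """
    content = ""
    if os.path.exists(path):
        with open(path) as f:
            content = f.read()
    if line in content.splitlines():
        return False  # desired state already reached, nothing to do
    with open(path, "a") as f:
        # Avoid gluing the new line onto an unterminated last line.
        if content and not content.endswith("\n"):
            f.write("\n")
        f.write(line + "\n")
    return True


if __name__ == "__main__":
    # Hypothetical config file in a throwaway directory.
    target = os.path.join(tempfile.mkdtemp(), "example.conf")
    print(ensure_line(target, "prefix_dir=/opt"))  # first run changes the file
    print(ensure_line(target, "prefix_dir=/opt"))  # second run is a no-op
```

The function describes the desired end state rather than the action, which is why the caller does not need to ask "does it already exist?" first.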
  20. 20. Ansible vs. Chef
  21. 21. Ansible vs. Chef • No need to learn Chef • No need to learn Ruby • No weird Ruby requirements (not so easy to install Chef on Linux Mint) • No need to use additional tools to make life with chef-solo easier (hello knife-solo)
  22. 22. Story number two: Instructive
  23. 23. The difficulty with distributed systems is that they are … distributed
  24. 24. My First Law of Distributed Objects Design: Do not distribute your objects http://martinfowler.com/bliki/FirstLaw.html
  25. 25. At all times protect the integrity of the system… at all costs
  26. 26. Put on your oxygen mask first before helping others
  27. 27. [Diagram: Web; App Server 1, App Server 2, App Server 3, App Server N]
  28. 28. Web keeps track of consecutive errors for each app server [Diagram: Web; App Server 1, App Server 2, App Server 3, App Server N]
  29. 29. Web keeps track of consecutive errors for each app server. App Server monitors its internal state and deactivates itself if bad things happen [Diagram: Web; App Server 1, App Server 2, App Server 3, App Server N]
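The bookkeeping described on this slide can be sketched as follows. This is an illustrative Python sketch, not the rollApp implementation: the class name `BackendPool`, its methods, and the threshold of 3 are all assumptions.

```python
class BackendPool:
    """Tracks consecutive errors per app server and drops bad ones.

    All names and the default threshold are hypothetical.
    """

    def __init__(self, servers, max_consecutive_errors=3):
        self.max_consecutive_errors = max_consecutive_errors
        self.errors = {server: 0 for server in servers}
        self.active = set(servers)

    def record_success(self, server):
        # Any success resets the streak: only errors in a row count.
        self.errors[server] = 0

    def record_error(self, server):
        self.errors[server] += 1
        if self.errors[server] >= self.max_consecutive_errors:
            # Stop sending new requests to this server.
            self.active.discard(server)

    def reactivate(self, server):
        # Called once the server is known to be healthy again.
        self.errors[server] = 0
        self.active.add(server)

    def candidates(self):
        return sorted(self.active)


if __name__ == "__main__":
    pool = BackendPool(["app1", "app2"], max_consecutive_errors=2)
    pool.record_error("app1")
    pool.record_error("app1")
    print(pool.candidates())  # app1 has been taken out of rotation
```

Counting consecutive errors rather than total errors keeps a server with an occasional failure in rotation while still removing one that fails repeatedly.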
  30. 30. [Diagram: Web; App Server 1, App Server 2, App Server 3, App Server N]
  31. 31. [Diagram: Web; App Server 1, App Server 2, App Server 3, App Server N]
  32. 32. [Diagram: Web; App Server 1, App Server 2, App Server 3, App Server N]
  33. 33. [Diagram: Web; App Server 1, App Server 2, App Server 3, App Server N]
  34. 34. What Happened?
  35. 35. [Diagram: Web; App Server 1, App Server 2, App Server 3, App Server N]
  36. 36. OOM killer engaged • pings work • simple status checks work(!) [Diagram: Web; App Server 1, App Server 2, App Server 3, App Server N]
  37. 37. Requests still come in, but never actually get processed [Diagram: Web; App Server 1, App Server 2, App Server 3, App Server N]
  38. 38. DB connection pool got saturated. Old requests hang, new requests fail [Diagram: Web; App Server 1, App Server 2, App Server 3, App Server N]
  39. 39. Irregular errors and failures because of resource starvation [Diagram: Web; App Server 1, App Server 2, App Server 3, App Server N]
  40. 40. Trust no one, sometimes not even yourself… Me, you can trust!
  41. 41. Lessons Learned • Timeouts on all connections to other components • Monitoring beyond just vital signs • Keep track of “in progress” requests to prevent cascading errors
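One way to keep track of "in progress" requests is a small counter that refuses new work once a limit is reached, so a stuck component fails fast instead of piling up requests. The sketch below uses only the standard library; `InFlightLimiter`, `TooManyInProgress`, and the limit value are hypothetical names and numbers.

```python
import threading


class TooManyInProgress(Exception):
    """Raised when the in-progress limit is already reached."""


class InFlightLimiter:
    """Counts requests currently being processed and rejects the excess."""

    def __init__(self, max_in_progress):
        self._max = max_in_progress
        self._count = 0
        self._lock = threading.Lock()

    @property
    def in_progress(self):
        with self._lock:
            return self._count

    def __enter__(self):
        with self._lock:
            if self._count >= self._max:
                # Fail fast: do not queue behind requests that may be hung.
                raise TooManyInProgress("limit of %d reached" % self._max)
            self._count += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._lock:
            self._count -= 1
        return False  # never swallow the handler's own exceptions


if __name__ == "__main__":
    limiter = InFlightLimiter(max_in_progress=1)
    with limiter:
        try:
            with limiter:  # a second concurrent request is rejected
                pass
        except TooManyInProgress as error:
            print("rejected:", error)
```

Wrapping each request handler in `with limiter:` turns "old requests hang, new requests fail" into an explicit, immediate rejection that the caller can count as an error.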
  42. 42. Things to Remember • [Almost] all modern applications are distributed • Never trust any external interface • Be ready to sacrifice part to keep the entire system afloat • Monitor each interface from both sides on different layers (not just pinging)
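The difference between pinging and monitoring on a deeper layer can be illustrated with a toy worker. In this sketch (all names hypothetical, standard library only) the shallow check only asks whether the worker thread exists, while the deep check pushes a no-op job through the same queue that real work uses and waits for it to complete.

```python
import queue
import threading


class Worker:
    """A toy single-threaded worker with two kinds of health check."""

    def __init__(self):
        self.jobs = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            func, done = self.jobs.get()
            try:
                func()
            except Exception:
                pass  # a failing job must not kill the worker loop
            finally:
                done.set()

    def submit(self, func):
        done = threading.Event()
        self.jobs.put((func, done))
        return done

    def shallow_check(self):
        # "Ping"-style check: the thread exists, nothing more.
        return self.thread.is_alive()

    def deep_check(self, timeout=0.5):
        # Send a no-op through the real path; True only if it completes.
        return self.submit(lambda: None).wait(timeout)


if __name__ == "__main__":
    worker = Worker()
    stuck = threading.Event()
    worker.submit(stuck.wait)  # simulate a request that never finishes
    print("shallow:", worker.shallow_check())
    print("deep:", worker.deep_check(timeout=0.1))
    stuck.set()
```

With the worker stuck on a hung job, the shallow check still reports healthy while the deep check times out, which mirrors the "simple status checks work(!)" situation from the incident.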
  43. 43. Never Trusting Is Not Easy • No (!) out of the box solutions for controlling response timeouts – requests only has connection timeouts – if you are on gevent or the like – you are good to go with greenlet timeouts • [Almost] always opt for aggressive health control parameters – request processing time – max address space – max number of queued requests
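Without gevent, one hedged way to bound the total time of a call is to run it in a worker thread and stop waiting after a deadline. The sketch below uses only `concurrent.futures` from the standard library; `call_with_deadline`, `DeadlineExceeded`, and the pool size are hypothetical. Note the limitation: the abandoned call keeps running in its thread, so this protects the caller but does not free the resource.

```python
import concurrent.futures
import time


class DeadlineExceeded(Exception):
    """Raised when a call does not finish within its deadline."""


# Shared pool; the size is an arbitrary assumption for this sketch.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=8)


def call_with_deadline(func, deadline_seconds, *args, **kwargs):
    """Run func(*args, **kwargs), waiting at most deadline_seconds."""
    future = _executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=deadline_seconds)
    except concurrent.futures.TimeoutError:
        # cancel() only helps if the call has not started running yet;
        # a call that is already running continues in the background.
        future.cancel()
        raise DeadlineExceeded(
            "no result within %.2f s" % deadline_seconds
        )


if __name__ == "__main__":
    print(call_with_deadline(lambda: "fast", 1.0))
    try:
        call_with_deadline(time.sleep, 0.05, 0.5)  # sleeps past the deadline
    except DeadlineExceeded as error:
        print("gave up:", error)
```

On gevent the same intent is usually expressed with greenlet timeouts, as the slide says; the thread-based version is a fallback for plain blocking code.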
  44. 44. Story number three: Unexpected
  45. 45. Optimizing application startup time
  46. 46. Here ought to be a screenshot but it is not here
  47. 47. Here ought to be a screenshot but it is not here. Always (I mean ALWAYS) take screenshots when you come across something interesting
  48. 48. Application Startup [Diagram: User Browser, Web, App Server; labels: preparing to launch, launch application, connect to application]
  49. 49. Application Startup [Diagram: User Browser, Web, App Server; labels: close to one another, faster, slower]
  50. 50. Application Startup [Diagram: User Browser, Web, App Server]
  51. 51. Do not take anything for granted
  52. 52. Any questions? Now: and later: dnedbaylo@rollapp.com @dmalenko dmalenko@rollapp.com www.dmalenko.org
