Automation at Brainly
… or how to enter the world of automation in a “different way”.
About Brainly
World’s largest homework-help social network, connecting over 40 million users monthly.
OPS stack:
● ~80 servers, heavy usage of LXC containers (~1000)
● 99.9% Debian, 1 Ubuntu host :)
● Nginx / Apache2, 2k reqs per sec
● 200 million page views monthly
● 700 Mbps peak traffic
● Python is dominant
DEV stack:
● PHP
- Symfony 2
- SOA projects
- 200 reqs per sec on the Russian version
● Erlang
- 55k concurrent users
- 22k events per sec
● Native Apps
- iOS
- Android
Starting point
● Puppet was not feasible for us
- *lots* of dependencies, which make containers bigger/heavier
- problems with Puppet's declarative language
- seemed incoherent, lacking integrated orchestration
- steep learning curve
- YMMV
● "packaging as automation" as an intermediate solution
- dependency hell: installing one package could result in uninstalling others
- inflexible, lots of code duplication in debian/rules files
- LOTS of custom bash and PHP scripts, usually very hard to reuse
and not standardized
- this was a dead end :(
● Ansible
- initially used only for orchestration
- maintaining it required keeping an up-to-date inventory, which later
turned out to simplify and help with lots of things
First steps with Ansible
● we decided to move forward with Ansible and use it for setting up machines as
well
● the first project was the Nagios monitoring plugins setup
● it turned out to be ideal for containers and our needs in general
- very few dependencies to begin with (python2, python-apt),
and a small footprint: "configured" Python modules are transferred
directly to the machine, no need for local repositories
- very light, no compilation on the destination host is needed
- easy to understand: tasks/playbooks map directly to the actions
an ops/devops engineer would have performed by hand
- compatible with "automation by packages", so we were able to
migrate from the old system in small steps
Avoiding regressions
● all policies, rules, and good practices are written down in the automation
repo's main directory
● helps with introducing new people to the team and with the devops approach
- newbies are able to start committing to the repo quickly
- what's in GUIDELINES.md is law; changing it requires wider
consensus
- gives examples of how to deal with certain problems in a standardized way
● a few examples:
- limit the number of tags; each of them should be self-contained,
with no cross-dependencies
- do not include roles/tasks inside other roles;
this creates hard-to-follow dependencies
- NEVER subset the list of hosts inside a role, do it in site.yml;
otherwise debugging roles/hosts will become difficult
- think twice before adding a new role, and especially new groups; as the
infrastructure grows, it becomes hard to manage and/or creates "dead" code/roles
Ugly-hacks reusability
● one of the policies introduced was storing one-off scripts in a
separate directory in our automation repo
● most of them are Ansible playbooks used for just one particular
task (e.g. the Squeeze->Wheezy migration)
● version-control everything!
● this turned out to be very useful; some scripts proved useful
enough to be rewritten into a proper role or tool
Apache2 automation
● available on GitHub and Ansible Galaxy:
https://galaxy.ansible.com/list#/roles/940
https://galaxy.ansible.com/list#/roles/941
● "base" role:
- reused across the 8 different production roles we have ATM
- contains basic monitoring, log rotation, package installation, etc.
- includes PHP setup in the modphp/prefork configuration
- controls PHP's disabled functions
- basic security setup
- does not include any site-specific stuff
● "site" role:
- contains all site-specific stuff and dependencies
(vhosts, additional packages, etc.)
- usually very simple
- more than one site role is possible, but only one base role
● an example of how we make our roles reusable
Icinga
● automatically sets up monitoring based on the inventory and host groups
● implements the devops approach: if a dev has root on a machine, they also
have access to all monitoring related to that system
● automatic host dependencies based on host groups
● provisioning new hosts is no longer so painful ("auto-discovery")
● all service configuration is stored as YAML files and used in templates
● the role uses DNS data directly from the inventory in order to make monitoring
independent of DNS failures
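As an illustration of how such a role can render monitoring config from inventory data, here is a hypothetical helper (not the role's actual template) that emits an Icinga 1.x host definition, with parent dependencies derived from host groups and the address taken straight from `ansible_ssh_host`:

```python
def icinga_host(name, address, parents=()):
    """Render a minimal Icinga host definition. `address` comes straight
    from the inventory (ansible_ssh_host), so checks survive DNS failures."""
    lines = [
        "define host {",
        "    use        generic-host",
        "    host_name  %s" % name,
        "    address    %s" % address,
    ]
    if parents:  # host dependencies derived from host groups
        lines.append("    parents    %s" % ",".join(parents))
    lines.append("}")
    return "\n".join(lines)

cfg = icinga_host("web-01", "10.0.3.17", parents=("router-01",))
```

The host names and addresses here are made up; the point is that the template consumes only inventory data, never DNS lookups.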
DNS migration
● at the beginning:
- dozens of authoritative name servers, each with a
customized configuration, running ~100 zones, all created by hand
- the main reason for that was using DNS for switching between
primary/secondary servers/services
● three phases:
- slurping the configuration into Ansible
- normalizing the configuration
- improving the setup
● a Python script uses the Ansible API to fetch the normalized zone configuration from
each server
- results available in a neat hash, with per-host, per-zone keys!
- normalization using the named-checkconf tool
● the slurped configuration is used to re-generate all configs, this time using only the data
available to Ansible
● "push-button" migration, once all the recipes were ready :)
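The slurping step can be sketched as follows. This is a simplified illustration, not the actual script: `parse_zones` is a hypothetical helper that only understands flat `zone` statements (no nested blocks) in `named-checkconf -p` style output fetched from each server.

```python
import re

# Matches flat `zone "name" { ... };` statements, as emitted by
# `named-checkconf -p`; nested blocks are deliberately not handled here.
ZONE_RE = re.compile(r'zone\s+"([^"]+)"\s*\{([^}]*)\}')

def parse_zones(named_conf_text):
    """Turn normalized named.conf text into a {zone: {option: value}} dict."""
    zones = {}
    for name, body in ZONE_RE.findall(named_conf_text):
        opts = {}
        for stmt in body.split(";"):
            stmt = stmt.strip()
            if stmt:
                key, _, value = stmt.partition(" ")
                opts[key] = value.strip()
        zones[name] = opts
    return zones

# The per-host, per-zone "neat hash"; `fetched` would come from running
# named-checkconf on every name server via Ansible.
fetched = {"ns1": 'zone "example.com" {\n  type slave;\n  file "db.example.com";\n};'}
configs = {host: parse_zones(text) for host, text in fetched.items()}
```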
DNS automation
● secure: all zone transfers are signed with individual keys, ACLs are tight
● playbooks use DNS data directly from the inventory
● changing/migrating slaves/masters is easy; NS records are auto-generated
● updates to zones automatically bump the serial, while still preserving the
YYYYMMDDxx format
● CRM records are auto-generated as well (see the next slide about CRM automation)
● DNS entries are always up to date thanks to some custom action modules
- ansible_ssh_host variables are harvested and processed into zones
- only custom entries and zone primary/secondary server names are
now stored in YAML
- new hosts are automatically added to zones, decommissioned
ones are removed
- auto-generation of reverse zones
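The serial-bump logic fits in a few lines of Python. This is a sketch of the idea, not our actual action module (`bump_serial` is a hypothetical name):

```python
from datetime import date

def bump_serial(old_serial, today=None):
    """Bump a zone serial while preserving the YYYYMMDDxx format: the first
    change on a given day resets the counter, later ones increment it."""
    today = today or date.today()
    date_prefix = int(today.strftime("%Y%m%d"))
    old_prefix = old_serial // 100
    if old_prefix < date_prefix:
        return date_prefix * 100   # first update today -> YYYYMMDD00
    return old_serial + 1          # same day (or clock skew): bump the counter
```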
Corosync & Pacemaker
● we have ~130 CRM clusters
● setting them up by hand would be "difficult" at best, impossible at worst
● available on Ansible Galaxy:
- https://galaxy.ansible.com/list#/roles/956
- https://galaxy.ansible.com/list#/roles/979
● follows the pattern from apache2_base
- the "base" role is suitable for manually set up clusters
- the "cluster" role provides a service on top of base, with a few reusable
snippets and the possibility of more complex configurations
● automatic membership based on the Ansible inventory (no multicast!)
● the most difficult part was providing synchronous handlers
● a few simple configurations are provided, like single service / single VIP
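Ansible handlers are fire-and-forget, so "synchronous" behaviour has to be built on top. One generic way to fake synchrony (a sketch under our assumptions, not our implementation) is to poll cluster state until it converges:

```python
import time

def wait_until(check, timeout=60, interval=2):
    """Block until check() returns truthy, e.g. until crm_mon reports all
    nodes online; raise if the cluster does not converge in time."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if check():
            return True
        time.sleep(interval)
    raise TimeoutError("cluster did not converge within %ss" % timeout)
```

In a playbook this would sit behind a script or action module that the handler invokes, so the play does not continue until every node has rejoined.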
User management automation
● initially we had neither the time nor the resources to set up a full-fledged LDAP
● we needed:
- users should be able to log in even during a network outage
- removal/addition of users, ssh keys, custom settings, etc.
all had to be supported
- it had to be reusable/accessible in other roles
(e.g. Icinga/monitoring)
- different privileges for dev, production, and other environments
- UID/GID unification
● it turned out to be simpler than we thought: users are managed using a few
simple tasks and group_vars data; the rest is handled via variable precedence
● migration/standardization required some effort though
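The variable-precedence trick can be illustrated like this (illustrative only; in Ansible itself host_vars override group_vars, which override `all`, and the removal-by-None convention is our invention for this sketch):

```python
def effective_users(all_vars, group_vars, host_vars):
    """Merge user definitions scope by scope; later (more specific) scopes win.
    By convention here, a value of None marks a user for removal."""
    users = {}
    for scope in (all_vars, group_vars, host_vars):
        for name, attrs in scope.get("users", {}).items():
            if attrs is None:
                users.pop(name, None)
            else:
                users.setdefault(name, {}).update(attrs)
    return users

merged = effective_users(
    {"users": {"alice": {"uid": 1001, "shell": "/bin/bash"}}},  # group_vars/all
    {"users": {"alice": {"shell": "/bin/zsh"}}},                # group_vars/dev
    {"users": {"alice": None}},                                 # host_vars: drop her here
)
```

A few loops like this over group_vars data are enough to get per-environment privileges and UID/GID unification without running an LDAP server.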
Inventory management
● standard Ansible inventory management becomes a bit cumbersome with 100s of
hosts:
- each host has to have ansible_ssh_host defined
- adding/removing a large number of hosts/groups required editing lots of files
and/or one-off scripts
- IP address management using Google Docs does not scale ;)
● Ansible has a well-defined dynamic inventory API, with scripts available for AWS,
Cobbler, Rackspace, Docker, and many others
● we wrote our own, based on a YAML file version-controlled with git:
- a Python API allowing easy manipulation of the inventory
- logic and syntax checking of the inventory
● available as open source: https://github.com/brainly/inventory_tool
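A dynamic inventory script just prints JSON when called with `--list`. Here is a stripped-down sketch of the idea behind inventory_tool; the real tool reads a git-versioned YAML file, while this sketch inlines the data, and all host names are made up:

```python
import json

# In inventory_tool this data comes from a version-controlled YAML file.
DATA = {
    "hosts": {
        "web-01": {"ansible_ssh_host": "10.0.3.17"},
        "db-01":  {"ansible_ssh_host": "10.0.4.2"},
    },
    "groups": {"webservers": ["web-01"], "databases": ["db-01"]},
}

def to_ansible_inventory(data):
    """Build the JSON structure Ansible expects from `script --list`."""
    inventory = {"_meta": {"hostvars": dict(data["hosts"])}}
    for group, members in data["groups"].items():
        inventory[group] = {"hosts": members}
    return inventory

if __name__ == "__main__":
    import sys
    if "--list" in sys.argv:
        print(json.dumps(to_ansible_inventory(DATA)))
```

Because `_meta.hostvars` is included, Ansible does not have to call the script once per host with `--host`.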
Networking
● we lease our servers from Hetzner, with no direct Layer 2 connectivity
● all tunnel setup is done using Ansible; a new server
is automatically added to our network
● firewalls are set up by Ansible as well:
- OPS contribute the base firewall, DEVs can open
the ports their application needs
- ferm at its base, for easy rule-making and for keeping the in-kernel firewall
in sync with the on-disk rules
- rules are auto-generated from the inventory; adding/removing hosts
automatically reconfigures the firewall
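Rule generation from the inventory can be sketched like this. The helper and the addresses are hypothetical and the ferm fragment is heavily simplified; the point is that the rule file is a pure function of the inventory:

```python
def ferm_admin_snippet(hostvars):
    """Emit a ferm fragment that allows SSH only from hosts known to the
    inventory; regenerated whenever hosts are added or removed."""
    ips = sorted(h["ansible_ssh_host"] for h in hostvars.values())
    return "\n".join([
        "# auto-generated from the Ansible inventory -- do not edit by hand",
        "@def $INVENTORY_HOSTS = (%s);" % " ".join(ips),
        "table filter chain INPUT {",
        "    saddr $INVENTORY_HOSTS proto tcp dport ssh ACCEPT;",
        "}",
    ])

snippet = ferm_admin_snippet({
    "web-01": {"ansible_ssh_host": "10.0.3.17"},
    "db-01":  {"ansible_ssh_host": "10.0.4.2"},
})
```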
Backups
● based on Bareos, an open-source Bacula fork
● new hosts are automatically set up for backup, and
extending storage space is no longer a problem
● authentication using certificates, a PITA without Ansible
Deployments
● deployment is done by a Python script calling the Ansible API
● simple tasks are implemented as Ansible playbooks
● complex logic is implemented in Python
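Driving Ansible from Python can look roughly like this. The Python API of that era changed between releases, so this sketch shells out to ansible-playbook instead; the flags shown are real CLI options, while `build_cmd` and `deploy` are hypothetical wrapper names:

```python
import json
import subprocess

def build_cmd(playbook, limit=None, tags=None, extra_vars=None):
    """Assemble an ansible-playbook command line for one deployment step."""
    cmd = ["ansible-playbook", playbook]
    if limit:                      # restrict to a host/group subset
        cmd += ["--limit", limit]
    if tags:                       # run only the selected tags
        cmd += ["--tags", ",".join(tags)]
    if extra_vars:                 # e.g. the application version to ship
        cmd += ["--extra-vars", json.dumps(extra_vars)]
    return cmd

def deploy(playbook, **kwargs):
    """The complex logic (ordering, rollbacks, ...) lives around calls like this."""
    subprocess.check_call(build_cmd(playbook, **kwargs))
```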
Not everything is perfect
● Jinja2 template error messages are "difficult" to interpret
● templates sometimes grow hugely complex
● Jinja2 is designed for speed, but with tradeoffs: some Python operators are
missing, and creating custom plugins/filters poses some problems
● multiple inheritance: problems with 2-headed trees
● speed: improved with "pipelining=True", containerization in the long run
● some useful functionality requires a paid subscription (Ansible Tower)
- a RESTful API, useful if you want to push a new application version
to production via e.g. Jenkins
- schedules: currently we need to push the changes ourselves
Dev, DevOps, Ops
● developers by default have RO access to the repo, RW on a case-by-case basis
● changes to systems owned by developers are done by developers;
OPS only provide the platform and tools
● all non-trivial changes require a pull request and a review from Ops
● mission-critical data is encrypted with Ansible Vault and pushed directly to the repo
- *strong* encryption
- available to Ansible without prior decryption
(a password is still required though)
- all security-sensitive stuff can be skipped by developers with
the "--skip-tags" option to ansible-playbook
Opensource! Opensource! Opensource!
● some of the things we mentioned can be found on our GitHub account
● we are working on open-sourcing more stuff
https://github.com/brainly
Conclusions
● the time needed to deploy new markets dropped considerably
● increased productivity
● better cooperation with developers
● more workpower: Devs are no longer blocked so much, and we can push
tasks to them
● infrastructure as code
● versioning
● code reuse, less copy-pasting
We are hiring!
http://brainly.co/jobs/
Questions?
Thank you!

PLNOG Automation@Brainly
