Automation at Brainly
… or how to enter the world of automation in a “different way”.
About Brainly
World’s largest homework-help social network, connecting over 40 million users monthly.
OPS stack:
● ~80 servers, heavy usage of LXC containers (~1000)
● 99.9% Debian, 1 Ubuntu host :)
● Nginx / Apache2, 2k reqs per sec
● 200 million page views monthly
● 700 Mbps peak traffic
● Python is dominant
DEV stack:
● PHP
- Symfony 2
- SOA projects
- 200 reqs per sec on the Russian version
● Erlang
- 55k concurrent users
- 22k events per sec
● Native Apps
- iOS
- Android
Starting point
● Puppet was not feasible for us
- *lots* of dependencies, which make containers bigger/heavier
- problems with Puppet's declarative language
- seemed incoherent, lacking integrated orchestration
- steep learning curve
- YMMV
● "packaging as automation" as an intermediate solution
- dependency hell: installing one package could result in uninstalling others
- inflexible, lots of code duplication in debian/rules files
- LOTS of custom bash and PHP scripts, usually very hard to reuse
and not standardized
- this was a dead end :(
● Ansible
- initially used only for orchestration
- maintaining it required keeping an up-to-date inventory, which later
turned out to simplify and help with lots of things
First steps with Ansible
● we decided to move forward with Ansible and use it for setting up machines as
well
● the first project was the Nagios monitoring plugins setup
● it turned out to be ideal for containers and our needs in general
- very few dependencies to begin with (python2, python-apt),
and a small footprint: "configured" Python modules are transferred
directly to the machine, no need for local repositories
- very light, no compilation on the destination host is needed
- easy to understand: tasks/playbooks map directly to the actions
an ops/devops engineer would have performed by hand
- compatible with "automation by packages", so we were able to
migrate from the old system in small steps
Avoiding regressions
● all policies, rules, and good practices are written down in the automation
repo's main directory
● helps with introducing new people to the team and with the devops approach
- newbies are able to start committing to the repo quickly
- what's in GUIDELINES.md is law; changing it requires wider
consensus
- gives examples of how to deal with certain problems in a standardized way
● a few examples:
- limit the number of tags; each of them should be self-contained,
with no cross-dependencies
- do not include roles/tasks inside other roles;
this creates hard-to-follow dependencies
- NEVER subset the list of hosts inside a role, do it in site.yml;
otherwise debugging roles/hosts will become difficult
- think twice before adding a new role, and especially new groups; as the
infrastructure grows, it becomes hard to manage and/or creates "dead" code/roles
Ugly-hacks reusability
● one of the policies introduced was storing one-off scripts in a
separate directory in our automation repo
● most of them are Ansible playbooks used for just one particular
task (e.g. the Squeeze->Wheezy migration)
● version-control everything!
● this turned out to be very useful; some scripts proved useful
enough to be rewritten into a proper role or tool
Apache2 automation
● available on GitHub and Ansible Galaxy:
https://galaxy.ansible.com/list#/roles/940
https://galaxy.ansible.com/list#/roles/941
● "base" role:
- reused across the 8 different production roles we have ATM
- contains basic monitoring, log rotation, package installation, etc.
- includes PHP setup in the modphp/prefork configuration
- controls PHP's disabled functions
- basic security setup
- does not include any site-specific stuff
● "site" role:
- contains all site-specific stuff and dependencies
(vhosts, additional packages, etc.)
- usually very simple
- more than one site role is possible, but only one base role
● an example of how we make our roles reusable
Icinga
● automatically sets up monitoring based on the inventory and host groups
● implements the devops approach: if a dev has root on a machine, they also
have access to all monitoring related to that system
● automatic host dependencies based on host groups
● provisioning new hosts is no longer so painful ("auto-discovery")
● all service configuration is stored as YAML files and used in templates
● the role uses DNS data directly from the inventory in order to make monitoring
independent of DNS failures
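As an illustration of how such a role can render monitoring config from inventory data, here is a hypothetical helper (not the role's actual template) that emits an Icinga 1.x host definition, with parent dependencies derived from host groups and the address taken straight from `ansible_ssh_host`:

```python
def icinga_host(name, address, parents=()):
    """Render a minimal Icinga host definition. `address` comes straight
    from the inventory (ansible_ssh_host), so checks survive DNS failures."""
    lines = [
        "define host {",
        "    use        generic-host",
        "    host_name  %s" % name,
        "    address    %s" % address,
    ]
    if parents:  # host dependencies derived from host groups
        lines.append("    parents    %s" % ",".join(parents))
    lines.append("}")
    return "\n".join(lines)

cfg = icinga_host("web-01", "10.0.3.17", parents=("router-01",))
```

The host names and addresses here are made up; the point is that the template consumes only inventory data, never DNS lookups.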
DNS migration
● at the beginning:
- dozens of authoritative name servers, each with a
customized configuration, running ~100 zones, all created by hand
- the main reason for that was using DNS for switching between
primary/secondary servers/services
● three phases:
- slurping the configuration into Ansible
- normalizing the configuration
- improving the setup
● a Python script uses the Ansible API to fetch the normalized zone configuration from
each server
- results available in a neat hash, with per-host, per-zone keys!
- normalization using the named-checkconf tool
● the slurped configuration is used to re-generate all configs, this time using only the data
available to Ansible
● "push-button" migration, once all the recipes were ready :)
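The slurping step can be sketched as follows. This is a simplified illustration, not the actual script: `parse_zones` is a hypothetical helper that only understands flat `zone` statements (no nested blocks) in `named-checkconf -p` style output fetched from each server.

```python
import re

# Matches flat `zone "name" { ... };` statements, as emitted by
# `named-checkconf -p`; nested blocks are deliberately not handled here.
ZONE_RE = re.compile(r'zone\s+"([^"]+)"\s*\{([^}]*)\}')

def parse_zones(named_conf_text):
    """Turn normalized named.conf text into a {zone: {option: value}} dict."""
    zones = {}
    for name, body in ZONE_RE.findall(named_conf_text):
        opts = {}
        for stmt in body.split(";"):
            stmt = stmt.strip()
            if stmt:
                key, _, value = stmt.partition(" ")
                opts[key] = value.strip()
        zones[name] = opts
    return zones

# The per-host, per-zone "neat hash"; `fetched` would come from running
# named-checkconf on every name server via Ansible.
fetched = {"ns1": 'zone "example.com" {\n  type slave;\n  file "db.example.com";\n};'}
configs = {host: parse_zones(text) for host, text in fetched.items()}
```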
DNS automation
● secure: all zone transfers are signed with individual keys, ACLs are tight
● playbooks use DNS data directly from the inventory
● changing/migrating slaves/masters is easy; NS records are auto-generated
● updates to zones automatically bump the serial, while still preserving the
YYYYMMDDxx format
● CRM records are auto-generated as well (see the next slide about CRM automation)
● DNS entries are always up to date thanks to some custom action modules
- ansible_ssh_host variables are harvested and processed into zones
- only custom entries and zone primary/secondary server names are
now stored in YAML
- new hosts are automatically added to zones, decommissioned
ones are removed
- auto-generation of reverse zones
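The serial-bump logic fits in a few lines of Python. This is a sketch of the idea, not our actual action module (`bump_serial` is a hypothetical name):

```python
from datetime import date

def bump_serial(old_serial, today=None):
    """Bump a zone serial while preserving the YYYYMMDDxx format: the first
    change on a given day resets the counter, later ones increment it."""
    today = today or date.today()
    date_prefix = int(today.strftime("%Y%m%d"))
    old_prefix = old_serial // 100
    if old_prefix < date_prefix:
        return date_prefix * 100   # first update today -> YYYYMMDD00
    return old_serial + 1          # same day (or clock skew): bump the counter
```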
Corosync & Pacemaker
● we have ~130 CRM clusters
● setting them up by hand would be "difficult" at best, impossible at worst
● available on Ansible Galaxy:
- https://galaxy.ansible.com/list#/roles/956
- https://galaxy.ansible.com/list#/roles/979
● follows the pattern from apache2_base
- the "base" role is suitable for manually set up clusters
- the "cluster" role provides a service on top of base, with a few reusable
snippets and the possibility of more complex configurations
● automatic membership based on the Ansible inventory (no multicast!)
● the most difficult part was providing synchronous handlers
● a few simple configurations are provided, like single service / single VIP
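Ansible handlers are fire-and-forget, so "synchronous" behaviour has to be built on top. One generic way to fake synchrony (a sketch under our assumptions, not our implementation) is to poll cluster state until it converges:

```python
import time

def wait_until(check, timeout=60, interval=2):
    """Block until check() returns truthy, e.g. until crm_mon reports all
    nodes online; raise if the cluster does not converge in time."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if check():
            return True
        time.sleep(interval)
    raise TimeoutError("cluster did not converge within %ss" % timeout)
```

In a playbook this would sit behind a script or action module that the handler invokes, so the play does not continue until every node has rejoined.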
User management automation
● initially we had neither the time nor the resources to set up a full-fledged LDAP
● we needed:
- users should be able to log in even during a network outage
- removal/addition of users, ssh keys, custom settings, etc.
all had to be supported
- it had to be reusable/accessible in other roles
(e.g. Icinga/monitoring)
- different privileges for dev, production, and other environments
- UID/GID unification
● it turned out to be simpler than we thought: users are managed using a few
simple tasks and group_vars data; the rest is handled via variable precedence
● migration/standardization required some effort though
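The variable-precedence trick can be illustrated like this (illustrative only; in Ansible itself host_vars override group_vars, which override `all`, and the removal-by-None convention is our invention for this sketch):

```python
def effective_users(all_vars, group_vars, host_vars):
    """Merge user definitions scope by scope; later (more specific) scopes win.
    By convention here, a value of None marks a user for removal."""
    users = {}
    for scope in (all_vars, group_vars, host_vars):
        for name, attrs in scope.get("users", {}).items():
            if attrs is None:
                users.pop(name, None)
            else:
                users.setdefault(name, {}).update(attrs)
    return users

merged = effective_users(
    {"users": {"alice": {"uid": 1001, "shell": "/bin/bash"}}},  # group_vars/all
    {"users": {"alice": {"shell": "/bin/zsh"}}},                # group_vars/dev
    {"users": {"alice": None}},                                 # host_vars: drop her here
)
```

A few loops like this over group_vars data are enough to get per-environment privileges and UID/GID unification without running an LDAP server.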
Inventory management
● standard Ansible inventory management becomes a bit cumbersome with 100s of
hosts:
- each host has to have ansible_ssh_host defined
- adding/removing a large number of hosts/groups required editing lots of files
and/or one-off scripts
- IP address management using Google Docs does not scale ;)
● Ansible has a well-defined dynamic inventory API, with scripts available for AWS,
Cobbler, Rackspace, Docker, and many others
● we wrote our own, based on a YAML file version-controlled with git:
- a Python API allowing easy manipulation of the inventory
- logic and syntax checking of the inventory
● available as open source: https://github.com/brainly/inventory_tool
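A dynamic inventory script just prints JSON when called with `--list`. Here is a stripped-down sketch of the idea behind inventory_tool; the real tool reads a git-versioned YAML file, while this sketch inlines the data, and all host names are made up:

```python
import json

# In inventory_tool this data comes from a version-controlled YAML file.
DATA = {
    "hosts": {
        "web-01": {"ansible_ssh_host": "10.0.3.17"},
        "db-01":  {"ansible_ssh_host": "10.0.4.2"},
    },
    "groups": {"webservers": ["web-01"], "databases": ["db-01"]},
}

def to_ansible_inventory(data):
    """Build the JSON structure Ansible expects from `script --list`."""
    inventory = {"_meta": {"hostvars": dict(data["hosts"])}}
    for group, members in data["groups"].items():
        inventory[group] = {"hosts": members}
    return inventory

if __name__ == "__main__":
    import sys
    if "--list" in sys.argv:
        print(json.dumps(to_ansible_inventory(DATA)))
```

Because `_meta.hostvars` is included, Ansible does not have to call the script once per host with `--host`.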
Networking
● we lease our servers from Hetzner, with no direct Layer 2 connectivity
● all tunnel setup is done using Ansible; a new server
is automatically added to our network
● firewalls are set up by Ansible as well:
- OPS contribute the base firewall, DEVs can open
the ports their application needs
- ferm at its base, for easy rule-making and for keeping the in-kernel firewall
in sync with the on-disk rules
- rules are auto-generated from the inventory; adding/removing hosts
automatically reconfigures the firewall
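Rule generation from the inventory can be sketched like this. The helper and the addresses are hypothetical and the ferm fragment is heavily simplified; the point is that the rule file is a pure function of the inventory:

```python
def ferm_admin_snippet(hostvars):
    """Emit a ferm fragment that allows SSH only from hosts known to the
    inventory; regenerated whenever hosts are added or removed."""
    ips = sorted(h["ansible_ssh_host"] for h in hostvars.values())
    return "\n".join([
        "# auto-generated from the Ansible inventory -- do not edit by hand",
        "@def $INVENTORY_HOSTS = (%s);" % " ".join(ips),
        "table filter chain INPUT {",
        "    saddr $INVENTORY_HOSTS proto tcp dport ssh ACCEPT;",
        "}",
    ])

snippet = ferm_admin_snippet({
    "web-01": {"ansible_ssh_host": "10.0.3.17"},
    "db-01":  {"ansible_ssh_host": "10.0.4.2"},
})
```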
Backups
● based on Bareos, an open-source Bacula fork
● new hosts are automatically set up for backup, and
extending storage space is no longer a problem
● authentication using certificates, a PITA without Ansible
Deployments
● deployment is done by a Python script calling the Ansible API
● simple tasks are implemented as Ansible playbooks
● complex logic is implemented in Python
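Driving Ansible from Python can look roughly like this. The Python API of that era changed between releases, so this sketch shells out to ansible-playbook instead; the flags shown are real CLI options, while `build_cmd` and `deploy` are hypothetical wrapper names:

```python
import json
import subprocess

def build_cmd(playbook, limit=None, tags=None, extra_vars=None):
    """Assemble an ansible-playbook command line for one deployment step."""
    cmd = ["ansible-playbook", playbook]
    if limit:                      # restrict to a host/group subset
        cmd += ["--limit", limit]
    if tags:                       # run only the selected tags
        cmd += ["--tags", ",".join(tags)]
    if extra_vars:                 # e.g. the application version to ship
        cmd += ["--extra-vars", json.dumps(extra_vars)]
    return cmd

def deploy(playbook, **kwargs):
    """The complex logic (ordering, rollbacks, ...) lives around calls like this."""
    subprocess.check_call(build_cmd(playbook, **kwargs))
```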
Not everything is perfect
● Jinja2 template error messages are "difficult" to interpret
● templates sometimes grow hugely complex
● Jinja2 is designed for speed, but with tradeoffs: some Python operators are
missing, and creating custom plugins/filters poses some problems
● multiple inheritance: problems with 2-headed trees
● speed: improved with "pipelining=True", containerization in the long run
● some useful functionality requires a paid subscription (Ansible Tower)
- a RESTful API, useful if you want to push a new application version
to production via e.g. Jenkins
- schedules: currently we need to push the changes ourselves
Dev, DevOps, Ops
● developers by default have RO access to the repo, RW on a case-by-case basis
● changes to systems owned by developers are done by developers;
OPS only provide the platform and tools
● all non-trivial changes require a pull request and a review from Ops
● mission-critical data is encrypted with Ansible Vault and pushed directly to the repo
- *strong* encryption
- available to Ansible without prior decryption
(a password is still required though)
- all security-sensitive stuff can be skipped by developers with
the "--skip-tags" option to ansible-playbook
Opensource! Opensource! Opensource!
● some of the things we mentioned can be found on our GitHub account
● we are working on open-sourcing more stuff
https://github.com/brainly
Conclusions
● the time needed to deploy new markets dropped considerably
● increased productivity
● better cooperation with developers
● more workpower: Devs are no longer blocked so much, and we can push
tasks to them
● infrastructure as code
● versioning
● code reuse, less copy-pasting
We are hiring!
http://brainly.co/jobs/
Questions?
Thank you!

PLNOG Automation@Brainly
