2. Who Am I?
❖ Former appserver developer
❖ Started with Java, some Python, some Go
❖ Working with applications and operations on AWS since 2007
➢ (Dev)Ops fascination from around the same time
➢ Led the engineering team for a SaaS product
➢ Had the good fortune to work with some extremely smart people
❖ Interests lie in distributed systems and scalability
❖ DevOps/Cloud practice lead @ImagineaTech
❖ DevOps editor at InfoQ.com
❖ Elsewhere
➢ https://www.linkedin.com/in/hrishikeshbarua
➢ https://twitter.com/talonx
3. The Product in Question
❖ Marketing platform for brands to run customer engagement
and loyalty campaigns
❖ SaaS model
4. Technology & Infrastructure
❖ Hosted on Amazon Web Services, initially in one region, later spread over
multiple
❖ EC2, S3, EBS, CloudFront
❖ External DNS (and later CDN)
❖ Mostly Java/JavaScript/MySQL/Kafka/Redis
❖ Integration with multiple third-party APIs and services
❖ Puppet/Vagrant/Jenkins/Graphite/collectd/Nagios
5. To Set Some Context
❖ Roughly covers the period 2010 - 2014, so some things might sound
quaint today
❖ DevOps transformation took place over a period of years
➢ Started with small scale AWS infra, legacy tools, monolithic app architecture.
➢ Ended with multi-region one-click deployment, combination of mono + service oriented
architecture, OSS + custom built ops tools.
➢ The following slides are a summary of some key learnings on AWS Ops.
7. Monitoring
❖ Monitoring-as-a-Service or Self-hosted?
➢ You might need both, if you have a complex/legacy + modern app or want more flexibility.
➢ Monitor the self-hosted monitor using the external one.
➢ Self-hosted monitoring tools and dashboards should have backups. If the AWS AZ in which
you host your monitoring system goes down, you’ll be semi-blind.
❖ Choose the right tools
➢ Get rid of the dinosaur. Convincing your traditional IT folks to jettison Nagios might be the toughest part.
➢ Relational view is important. A single service might depend on others (e.g. a REST API dependent on DNS, LB, backend nodes, database, caching layer) - it’s important to be able to see these relationships in your dashboard.
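One way to make the relational view concrete is to keep the dependency graph as data that the dashboard (and alerting) can query. A minimal sketch - the service names and the dependency map are illustrative, not from the original deck:

```python
# Hypothetical dependency map: service -> the components it depends on.
DEPENDS_ON = {
    "rest-api": ["dns", "lb", "backend", "mysql", "redis"],
    "backend": ["mysql", "redis"],
    "lb": ["backend"],
}

def affected_by(component, deps=DEPENDS_ON):
    """Return every service that directly or transitively depends on `component`."""
    affected = set()
    changed = True
    while changed:
        changed = False
        for svc, reqs in deps.items():
            if svc in affected:
                continue
            # svc is affected if it needs the component itself,
            # or needs something already known to be affected.
            if component in reqs or affected & set(reqs):
                affected.add(svc)
                changed = True
    return affected
```

With this in place, an alert on `mysql` can be annotated on the dashboard with everything it drags down, instead of firing as an isolated host check.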
8. Monitoring
❖ Watch out for AWS specific quirks
➢ Steal time (CPU cycles the hypervisor gives to other tenants) can skew CPU metrics - alerting software needs to take this into account.
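On Linux, steal time is the eighth counter on the `cpu` line of `/proc/stat`. A minimal sketch of computing steal percentage between two samples, the kind of number an alert threshold would be set on (sampling and alerting wiring left out):

```python
# /proc/stat "cpu" fields: user nice system idle iowait irq softirq steal ...
# Each sample is the list of those counters taken at one point in time.
def steal_percent(sample_before, sample_after):
    """Percentage of CPU time stolen by the hypervisor between two samples."""
    delta = [b - a for a, b in zip(sample_before, sample_after)]
    total = sum(delta)
    if total == 0:
        return 0.0
    return 100.0 * delta[7] / total  # index 7 is the 'steal' field
```

A sustained non-trivial steal percentage on EC2 usually means a noisy neighbour or an undersized instance type, not a problem in your own code.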
❖ There’s no such thing as too much monitoring
➢ Monitor the AWS RSS status feed - it can serve as an indicator of potential problems. Caveats
■ AWS Problems are sometimes localized.
■ This can at best serve as an early warning system.
➢ Collect and plot everything
■ Deployment points (Thanks, Etsy)
■ Graphite is a swallow-all, easy-to-use system
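Deployment points are cheap to record: Graphite accepts a plaintext `metric value timestamp\n` line on port 2003. A sketch - the `deploys.*` metric namespace is an assumption, use whatever fits your hierarchy:

```python
import time

def deploy_marker(app, timestamp=None):
    """Format a Graphite plaintext-protocol line marking a deployment of `app`."""
    ts = int(timestamp if timestamp is not None else time.time())
    return "deploys.%s 1 %d\n" % (app, ts)

# In production you would send this over a socket, e.g.:
#   sock = socket.create_connection(("graphite-host", 2003))
#   sock.sendall(deploy_marker("webapp").encode())
```

Overlaying these markers on latency and error graphs is what makes "did the deploy cause this?" answerable at a glance.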
9. Monitoring
❖ Automate
➢ The provisioning process for a server (or a service) should take care of including it in your
monitoring system.
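The shape of this is simple: provisioning emits the monitoring config for each node it brings up, so nothing reaches production unmonitored. A sketch using a plain Nagios host definition as the target format - the host names and template details are illustrative:

```python
# Hypothetical template a provisioning step would render and drop into the
# Nagios config directory before reloading the monitoring service.
NAGIOS_HOST_TEMPLATE = """define host {
    use        generic-host
    host_name  %(name)s
    address    %(ip)s
}
"""

def nagios_host_config(name, ip):
    """Render a Nagios host stanza for a freshly provisioned node."""
    return NAGIOS_HOST_TEMPLATE % {"name": name, "ip": ip}
```

With configuration management (Puppet/Chef/Ansible) the same idea is usually expressed as exported resources or a templated config push rather than hand-rolled strings.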
10. Backups and Disaster Recovery
❖ Specifics usually depend on the app architecture and the level of
automation
❖ Instances
➢ Base AMI + Configuration Management? (Puppet/Chef/Ansible)
➢ Golden images + Immutable Servers?
➢ All of the above?
11. Backups & Disaster Recovery
❖ Databases
➢ Self-hosted vs RDS
■ RDS limitations
➢ Replication, EBS snapshots
➢ Data consistency
■ Freeze/unfreeze
■ Database specific quirks for snapshotting
■ Snapshotting the read-only slave? Ensure that the replication lag is low (and monitored)
■ Cross region backups (but is your app cross-region ready? If not, why bother?)
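The lag check before snapshotting the slave reduces to a small gate in the backup script. A sketch - `lag_seconds` would come from `Seconds_Behind_Master` in `SHOW SLAVE STATUS` (it is `NULL` when replication is broken), and the 60-second threshold is an assumption, not a recommendation from the deck:

```python
def safe_to_snapshot(lag_seconds, max_lag_seconds=60):
    """Decide whether a read-only slave is fresh enough to snapshot.

    lag_seconds: Seconds_Behind_Master, or None if replication is stopped.
    """
    if lag_seconds is None:
        return False  # replication broken - a snapshot now is stale data
    return lag_seconds <= max_lag_seconds
```

The same gate doubles as a monitoring check: if it keeps returning False, the backup job should alert rather than silently skip.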
12. Security
❖ Go with VPC (older AWS accounts have both Classic and VPC)
❖ Amazon provides the first level of defence
➢ Strong network-level protection against DDoS; the rest depends on you
➢ Plan security groups from the beginning
13. Security
❖ ssh keys
➢ Adopt a tool to manage per-user ssh keys
➢ EC2 metadata for an instance will continue to show the original key-pair name it was launched with. The original public key may no longer exist on the instance if it was revoked, but the metadata will still show it - AWS has no way of knowing that you changed the authorized_keys file.
➢ You can upload your own keys via the AWS console and they will be available when launching EC2 instances. Imported keys must be RSA keys of 1024, 2048 or 4096 bits.
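Because the metadata can lie about what is actually on the box, auditing authorized_keys directly is worth scripting. A sketch that lists each key's comment alongside its classic MD5 fingerprint (the colon-separated form `ssh-keygen -l -E md5` prints); the parsing is deliberately minimal and ignores option-prefixed entries:

```python
import base64
import hashlib

def fingerprint(b64_blob):
    """MD5 fingerprint of a base64-encoded SSH public key blob, colon-separated."""
    digest = hashlib.md5(base64.b64decode(b64_blob)).hexdigest()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))

def list_keys(authorized_keys_text):
    """Return (comment, fingerprint) pairs for plain entries in authorized_keys."""
    entries = []
    for line in authorized_keys_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].startswith("ssh-"):
            comment = parts[2] if len(parts) > 2 else ""
            entries.append((comment, fingerprint(parts[1])))
    return entries
```

Run across a fleet, this gives you the ground truth to reconcile against whatever your per-user key management tool believes is deployed.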
14. Security
❖ ssh keys
➢ Are AWS key-pairs confined to a single region? Only by default - you can get around it.
■ For keys that you generate, you can import them to all the regions you want using
the AWS console or the CLI tools.
■ For keys that AWS generates, you can take the public key from an EC2 instance
launched with that key, and import that in a similar manner to all the regions you
want.
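The per-region import is mechanical and worth scripting. A sketch that generates the CLI invocations (modern `aws ec2 import-key-pair` syntax; the key name, file path and region list are illustrative):

```python
def import_commands(key_name, pubkey_file, regions):
    """Build one `aws ec2 import-key-pair` command per target region."""
    return [
        "aws ec2 import-key-pair --region %s --key-name %s "
        "--public-key-material fileb://%s" % (region, key_name, pubkey_file)
        for region in regions
    ]
```

Running these (or the equivalent boto calls) after generating a key gives you the same key-pair name in every region, which keeps launch automation region-agnostic.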
15. Automation
❖ CI
➢ Easy to set up, no excuses. Once set up, have an owner for incremental improvements
➢ Don’t let Broken Windows remain broken
➢ The move to CD may not be so easy - needs buy-in from all quarters
❖ Configuration Management
➢ Again, hard to do if not done from the beginning
➢ Choose one (Ansible/Puppet/Chef) and master it
16. People & Architecture
❖ Have an owner for system architecture
➢ All architecture decisions, however small, matter
➢ And most such decisions need to be taken “urgently”
❖ Buy-in from management
➢ Demonstrate value to the product/business. Visibility is paramount. Don’t expect to be
understood all the time.
➢ “Make more awesome” - Jesse Robbins
17. People & Architecture
❖ Adopt uniform abstractions
➢ E.g. Don’t adopt two different queueing software for two different purposes if one can
handle both (“cool stuff syndrome”).
❖ Cross region failover is hard if not designed early
➢ Specifics will depend on your product