Palomino provides consulting and operational support services for Couchbase databases. The team has a combined 100+ years of experience with distributed, scalable systems. Services include configuration management, change management, availability management, monitoring, backups, and 24x7 operational support. They can help with Couchbase proofs of concept, migrations, cluster sizing, and infrastructure builds.
CouchConf SF 2012 Lightning Talk - Operational Excellence
1. Laine Campbell, Owner/Principal, laine@palominodb.com
Charlie Killian, Director of Engineering, charlie@palominodb.com
Scaling and Performance for
Operational Excellence
2. Who we are
● A boutique consultancy offering custom solutions.
● An operations support team providing a combined
100+ years of experience in distributed, performant
and scalable solutions.
● A team of architects, engineers and operators who
have worked at some of the most trafficked sites,
games and companies since 1999.
3.
4. Operational Excellence
● Configuration management and documentation.
● Change management.
● Availability management.
● Incident and problem management.
● Backup, recovery and business continuity.
● Monitoring and trending.
5. Configuration Management
● Consistent Couchbase configurations.
○ GUIs are great, but don't meet automation needs.
● Self-documenting environments.
● Incorporating your infrastructure into your application
to leverage Couchbase ease of scale.
● Chef, Puppet, Ansible or "roll your own" using the
Couchbase API.
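A "roll your own" consistency check can be as small as a diff between the settings you want and the settings a node reports. The sketch below assumes the actual settings have already been fetched (for example through the Couchbase REST API) into a plain dictionary. The setting names and the DESIRED values are hypothetical and are not taken from Couchbase.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical desired state, as it might be kept in version control.
DESIRED: Dict[str, Any] = {
    "ram_quota_mb": 4096,
    "replicas": 1,
    "auto_failover": True,
}


def config_drift(desired: Dict[str, Any],
                 actual: Dict[str, Any]) -> List[Tuple[str, Any, Any]]:
    """Return (setting, desired_value, actual_value) for every mismatch."""
    drift = []
    for key in sorted(desired):
        found = actual.get(key)  # None when the node does not report the setting
        if found != desired[key]:
            drift.append((key, desired[key], found))
    return drift


def report(node: str, drift: List[Tuple[str, Any, Any]]) -> str:
    """Render the drift for one node as a short, human-readable summary."""
    if not drift:
        return f"{node}: OK"
    details = ", ".join(f"{k} want={want!r} got={got!r}" for k, want, got in drift)
    return f"{node}: DRIFT ({details})"
```

Running the check on every node and failing the build when any drift is reported is one way to keep environments self-documenting: the desired-state file is the documentation.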
6. Change and Release Management
● Schemaless is great, but data governance is key.
● Your code needs to build a data dictionary or
confusion reigns.
● DevOps style relationships build collaboration that
can overcome the wild west mentality of schemaless
environments.
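One way code can build a data dictionary is to walk sample documents and record which types appear at each field path. This is a minimal sketch, not the speakers' tooling. The function names and the path notation ("a.b" for nested fields, "[]" for list items) are choices made for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Set


def _walk(value: Any, path: str, out: Dict[str, Set[str]]) -> None:
    """Record the type name of every leaf value under its field path."""
    if isinstance(value, dict):
        for key, child in value.items():
            _walk(child, f"{path}.{key}" if path else key, out)
    elif isinstance(value, list):
        for child in value:
            _walk(child, f"{path}[]", out)
    else:
        out[path].add(type(value).__name__)


def build_data_dictionary(docs: Iterable[dict]) -> Dict[str, Set[str]]:
    """Map each field path to the set of type names seen across all documents.

    Empty dicts and empty lists contribute no entries.
    """
    out: Dict[str, Set[str]] = defaultdict(set)
    for doc in docs:
        _walk(doc, "", out)
    return dict(out)


def conflicts(dictionary: Dict[str, Set[str]]) -> List[str]:
    """Field paths that were seen with more than one type."""
    return sorted(path for path, types in dictionary.items() if len(types) > 1)
```

Fields returned by conflicts() are good candidates for a governance review, since they usually mean two code paths disagree about the shape of a document.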
7. Availability Management
● Moxi provides availability during node failures,
supporting reads and writes.
● XDCR support in Couchbase 2.0 provides availability
across datacenters and regions in an active/active
topology.
● Cloud environments need special consideration for
AZ and region failovers.
8. Incident and Problem Management
● While not Couchbase-specific, this is crucial to maintaining
any highly available architecture.
● Appropriate alerting, response and communication
processes ensure that isolated issues don't cascade
into massive failures.
● Failing hardware, networks and design issues can all
cause failures that cascade into an entire cluster
being down.
● Tracking recurring problems helps drive continuous
improvement in meeting SLAs.
9.
10. Backup and Recovery
● Define your recovery SLAs.
● Track how long backups take.
● Test restores and track how long they take.
● Recognize all failure scenarios:
○ Node failure
○ Physical data corruption
○ Logical data corruption
○ Audits and forensics
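Tracking how long backups and restores take can start with a small timing wrapper that compares each run against the recovery SLA. The sketch below is an illustration only. The function name, the record layout and the injectable clock are assumptions, and the job passed in stands for whatever backup or restore command you actually run.

```python
import time
from typing import Callable, Dict


def timed_run(name: str,
              job: Callable[[], None],
              sla_seconds: float,
              clock: Callable[[], float] = time.monotonic) -> Dict[str, object]:
    """Run a backup or restore job and report its duration against an SLA.

    `clock` is injectable so the wrapper can be exercised without waiting.
    """
    start = clock()
    job()
    elapsed = clock() - start
    return {
        "job": name,
        "seconds": elapsed,
        "sla_seconds": sla_seconds,
        "within_sla": elapsed <= sla_seconds,
    }
```

Storing each record alongside the backup size gives a trend line, so a restore that is drifting toward the SLA limit is visible before it breaches it.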
11. Backup and Recovery 1.8
● In 1.8, per-node backup is supported. Replica sets
are also backed up, which can cause long or
non-completing backups.
● SQLite3 can be used as a logical dump to ease
backups.
● Cluster-wide consistency cannot be guaranteed.
● No incremental backups available.
12. Backup and Recovery 2.0
● Cluster-wide backups are now available, as are
incremental backups.
● EBS snapshots (or LVM, hardware, etc.) work well
due to log-style writes to disk.
● With incremental backups, it is easier to meet SLAs without
breaking the bank on storage.
13. Monitoring and Alerting
● Use logs! Centralized syslogs, Splunk, custom
scripts to identify and track error types and rates.
● Track your app! Latencies of web pages, forms and
API calls are key indicators.
● Define key alerts, make them actionable and tied to
documentation.
● Palomino builds plugins and templates to provide
proper alerts that are useful and work!
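A custom script for tracking error types and tying alerts to documentation might look like the sketch below. It is not one of Palomino's plugins. The log line format, the error type names, the thresholds and the runbook URL are all hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical log format: "<timestamp> <LEVEL> <error_type>: <message>"
LINE = re.compile(r"^\S+ (?P<level>[A-Z]+) (?P<kind>\w+):")

# Hypothetical rules: error type -> (alert threshold, runbook link).
RULES: Dict[str, Tuple[int, str]] = {
    "timeout": (2, "https://wiki.example.com/runbooks/timeout"),
}


def error_counts(lines: Iterable[str]) -> Counter:
    """Count ERROR-level lines per error type; other lines are ignored."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE.match(line)
        if match and match.group("level") == "ERROR":
            counts[match.group("kind")] += 1
    return counts


def alerts(counts: Counter, rules: Dict[str, Tuple[int, str]]) -> List[str]:
    """Return one actionable message per rule whose threshold is reached."""
    fired = []
    for kind, (threshold, runbook) in sorted(rules.items()):
        if counts.get(kind, 0) >= threshold:
            fired.append(
                f"{kind}: {counts[kind]} errors (threshold {threshold}), see {runbook}"
            )
    return fired
```

Keeping the runbook link inside the rule means every alert that fires already points the on-call engineer at the documented response.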
14. Trending and Diagnostics
● Alerts aren't enough; you must track usage and
internal metrics to understand trends, workloads and
bottlenecks.
● Graph everything! All exposed metrics, trend health
checks.
● Interleave graphs of internal metrics with external
factors: code pushes, application metrics (logins,
purchases, API calls).
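Interleaving internal metrics with external factors amounts to joining two time series: metric samples and events such as code pushes. The sketch below labels each sample with the most recent event at or before it, so a graph can mark where a deploy happened. The function name and the tuple layouts are assumptions made for illustration.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

Sample = Tuple[int, float]  # (unix timestamp, metric value)
Event = Tuple[int, str]     # (unix timestamp, label such as a code push)


def annotate(samples: List[Sample],
             events: List[Event]) -> List[Tuple[int, float, Optional[str]]]:
    """Attach to each sample the most recent event at or before its timestamp."""
    ordered = sorted(events)
    times = [ts for ts, _ in ordered]
    out = []
    for ts, value in sorted(samples):
        # Number of events with a timestamp <= ts.
        i = bisect_right(times, ts)
        label = ordered[i - 1][1] if i else None
        out.append((ts, value, label))
    return out
```

A latency jump whose first annotated sample carries a new deploy label is a strong hint about where to start looking.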
15. Care and Feeding
● Regular performance reviews.
● Defragmentation.
● Incorporate recovery tests into building test and dev
environments.
● Scale-up/Scale-down, preferably via automated
processes.
● Rolling upgrades.
● Coffee, pie, beer.
16.
17. Partnering with Couchbase
Providing remote Architecture, Engineering and DBA
services to clients.
Vendor-neutral operations and scaling expertise for
Couchbase clients in need of operators.
18. Remote Architecture and Engineering Services
● Architecture review and recommendations
● Data modeling
● Data model migration
● Data migration
● Cluster sizing
● Tools development
19. DBA and Operations Services
● Infrastructure builds and management
● Proactive operational support
● 24x7 operational support with a 30-minute SLA
● System health checks
● Backup and recovery
● Tuning for performance and scale
● Query reviews, indexing, benchmarking
● Capacity reviews
20. How we can help
● Support your proof of concept
● Migrate you to Couchbase Server
● Support your Couchbase Server clusters
21. Is Couchbase Server a good fit?
● Architecture review
● Data model review
● Recommendation on moving to Couchbase Server
● Data access best practices
22. Migrating from an RDBMS to Couchbase Server?
● Data model migration from relational to document
● Data migration from SQL Server to Couchbase
Server
● Couchbase Server cluster sizing
● Infrastructure builds
23. Do you need operational experts?
● 24x7 operational support with a 30-minute SLA
● Multiple Couchbase Server 1.8 clusters
● Wanted Couchbase operational experts
● Escalate to Couchbase for software support
24.
25. Contact Info
Laine Campbell, laine@palominodb.com
Charlie Killian, charlie@palominodb.com
www.palominodb.com
@palominodb on Twitter