
Couchbase at LinkedIn: Couchbase Connect 2015

LinkedIn's mission is to connect the world's professionals to make them more productive and successful. LinkedIn members use the company's products to get access to people, jobs, news, updates, and insights that help them be great at what they do. To support its goals on an engineering level, LinkedIn services must sustain high levels of QPS while providing data integrity. In this talk, we will discuss how LinkedIn uses Couchbase to help with read scaling and performance of its high impact services. We will also talk about some tooling we have created to integrate Couchbase into our systems and how we operationally manage our Couchbase clusters. Finally, we will explore some future uses of Couchbase within our environment.

Published in: Technology

  1. Couchbase at LinkedIn (2015)
     Michael Kehoe and Brian Cory Sherwin, LinkedIn
  2. Overview
     • The LinkedIn Story
     • Development & Operations
     • Operational Tooling
     • LinkedIn’s Couchbase as a Database
     • Questions
  3. Michael Kehoe
     • Site Reliability Engineer (SRE) at LinkedIn
     • SRE for Profile & Higher-Education
     • Member of CBVT
     • B.E. (Electrical Engineering) from the University of Queensland, Australia
  4. The LinkedIn Story
     • Founded in 2002, LinkedIn has grown into the world’s largest professional social media network
     • Offices in 24 countries, available in 23 languages
     • Over 360M members
     • Revenue of $638M in Q1 2015
  5. The LinkedIn Story: In-memory storage needs
     • At our scale, it becomes challenging to scale data systems
     • Read scaling becomes important
     • Applicable use cases:
       • Simple cache store
       • Pre-warmed
       • Read-through
       • Temporary data storage for de-duping
       • Potential for Source of Truth (SoT) store
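The read-through and pre-warmed cache use cases on this slide can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `ReadThroughCache` class, the `SOURCE` dictionary standing in for a source-of-truth store, and the key names are hypothetical and are not taken from the talk.

```python
from typing import Callable, Dict, Optional


class ReadThroughCache:
    """Toy read-through cache: on a miss, load from the backing store and keep a copy."""

    def __init__(self, load_from_source: Callable[[str], Optional[str]]) -> None:
        self._load = load_from_source      # hypothetical loader for the source of truth
        self._store: Dict[str, str] = {}   # stands in for the in-memory tier
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = self._load(key)            # fall through to the backing store
        if value is not None:
            self._store[key] = value       # remember it for the next reader
        return value

    def prewarm(self, items: Dict[str, str]) -> None:
        """Bulk-load entries before traffic arrives (the pre-warmed case)."""
        self._store.update(items)


# Hypothetical source-of-truth data, for illustration only.
SOURCE = {"member:1": "Ada", "member:2": "Grace"}

cache = ReadThroughCache(SOURCE.get)
cache.get("member:1")   # first read misses and loads from SOURCE
cache.get("member:1")   # second read is served from memory
```

A pre-warmed cache skips the initial misses by calling `prewarm` with data built ahead of time, which is the behavior the later Operations slide describes.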
  6. The LinkedIn Story: Enter Couchbase
     • Until 2012, we used only Memcached as a non-SoT in-memory store
     • However, it had some drawbacks:
       • Long cache warm-up times
       • No partitioning/sharding; we had to write our own
       • Cold-cache restarts
       • Difficult to move data across hosts/clusters/datacentres
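To make the partitioning drawback concrete, here is a minimal sketch of the kind of client-side key partitioning a Memcached user typically has to write. It is an illustration under assumptions, not LinkedIn's implementation: `pick_host` and the host names are hypothetical.

```python
import zlib
from typing import List


def pick_host(key: str, hosts: List[str]) -> str:
    """Map a key to one host by hashing it and taking the remainder."""
    bucket = zlib.crc32(key.encode("utf-8")) % len(hosts)
    return hosts[bucket]


# Hypothetical cache hosts, for illustration only.
HOSTS = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]

host = pick_host("member:42", HOSTS)   # the same key always maps to the same host
```

With this simple modulo scheme, changing the number of hosts remaps most keys, which is one reason moving data across hosts or clusters is hard without built-in cluster expansion.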
  7. The LinkedIn Story: Enter Couchbase
     • Evaluated systems to replace Memcached: Mongo, Redis, and others
     • Couchbase had advantages:
       • Drop-in replacement for Memcached
       • Built-in replication and cluster expansion
       • Memory latency for operations
       • Asynchronous writes to disk
       • Utilize some of the development infrastructure we’ve built
  8. Development & Operations: Coding
     • Memcached configured with Spring and implements a caching Java interface
     • Implemented with Couchbase Native Client
     • Developer just replaces the Spring
  9. Development & Operations: Operations
     • Hadoop jobs build warm cache data
     • Tools to partition the data and load into Couchbase offline
     • Apply deltas when brought online
     • Clean, warm caches ready when needed
  10. Operational Tooling
      • To use Couchbase efficiently as SREs, we need the following:
        • Provisioning
        • Installation
        • Monitoring & Alerting
        • Infrastructure Visibility
  11. Operational Tooling: Provisioning
      • Provisioning flow:
        • Seek estimated usage statistics on cluster
          • Size of data to be stored
          • QPS
          • Redundancy needs
        • Calculate cluster sizing
          • Currently done via a spreadsheet with a template
          • Moving into an in-house application
        • Request hardware for cluster(s)
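The sizing step in this flow can be sketched as a small calculation over the three inputs the slide lists (data size, QPS, redundancy). This is a rough illustration only: the formula, the `ClusterRequest` type, and the default values for RAM per node, usable fraction, and QPS per node are all assumptions, not the contents of LinkedIn's spreadsheet.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterRequest:
    data_gb: float   # estimated size of data to be stored
    peak_qps: int    # estimated peak queries per second
    replicas: int    # redundancy: number of extra copies of each item


def nodes_needed(
    req: ClusterRequest,
    ram_per_node_gb: float = 64.0,   # assumed hardware profile
    usable_fraction: float = 0.7,    # assumed headroom left for metadata and growth
    qps_per_node: int = 50_000,      # assumed per-node throughput
) -> int:
    """Return the larger of the memory-driven and QPS-driven node counts."""
    total_gb = req.data_gb * (1 + req.replicas)
    by_memory = math.ceil(total_gb / (ram_per_node_gb * usable_fraction))
    by_qps = math.ceil(req.peak_qps / qps_per_node)
    # A cluster also needs at least one node per copy of the data.
    return max(by_memory, by_qps, req.replicas + 1)


request = ClusterRequest(data_gb=200, peak_qps=120_000, replicas=1)
print(nodes_needed(request))
```

Taking the maximum of the constraints reflects that either memory or throughput can be the limiting factor for a given cluster.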
  12. Operational Tooling: Installation
      • Current system
        • Enter cluster metadata into our management system (Yahoo range)
        • Use a Salt module to install and configure the cluster
      • Future system
        • Use the same metadata system
        • Use Salt states to install and configure the cluster
      • Benefits of the new system
        • State enforcement becomes possible
        • Use Salt Pillar to encrypt cluster/bucket passwords
  13. Operational Tooling: Installation
      CLUSTER:
        - ela4.couchbase.30
        - prod-lva1.couchbase.30
        - prod-ltx1.couchbase.30
      NAME: follow-blue
      PORT: 11211
      INSTANCE: 30
      ALERT_ADDRESSES:
        - q(some-sre-team@linkedin.com)
      SRE_GROUPS:
        - sre-team-name
      CLIENT_CONTAINERS:
        - following-services
      EMAIL_ALERTS:
        - HIGHWATER_PERCENT_FULL
        - MEMORY_PERCENT_FULL
        - NOT_MY_VBUCKET
        - PERCENT_IN_MEMORY
        - KEY_USAGE
        - AUTOFAILOVER
  14. Operational Tooling: Monitoring & Alerting
      • We run a daemon on each Couchbase server that collects metrics every minute via a Couchbase library API
      • We use cluster metadata from range to build a dashboard definition file via a Jinja template and Python
  15. Operational Tooling: Monitoring & Alerting
      $ ./couchbase.py -I 30
      [INFO] Generating dashboard file: common-templates/couchbase.follow-blue
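The generation step, cluster metadata in and dashboard definition out, can be sketched as follows. This is not the `couchbase.py` script from the talk: it uses the standard library's `string.Template` as a stand-in for Jinja so the example stays self-contained, and `render_dashboard`, the template text, and the `cluster` dictionary are hypothetical, modeled on the metadata and dashboard fields shown in the slides.

```python
from string import Template

# Stand-in for a Jinja template; the field layout is an assumption.
DASHBOARD_TEMPLATE = Template(
    "- title: couchbase.$name $title\n"
    "  defs:\n"
    '    - range: "%{FABRIC}.couchbase.$instance"\n'
    '      label: "$metric"\n'
    "      rrd: couchbase.$name/$metric.rrd\n"
)


def render_dashboard(cluster: dict, metric: str, title: str) -> str:
    """Fill the template from cluster metadata and one metric name."""
    return DASHBOARD_TEMPLATE.substitute(
        name=cluster["NAME"],
        instance=cluster["INSTANCE"],
        metric=metric,
        title=title,
    )


# Metadata in the shape shown on the installation slide.
cluster = {"NAME": "follow-blue", "INSTANCE": 30}
print(render_dashboard(cluster, "autofailover_enabled", "AutoFailover Enabled"))
```

Driving the dashboards from the same metadata used for installation means a new cluster gets graphs and alerts without hand-written definitions.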
  16. Operational Tooling: Monitoring & Alerting
      - title: couchbase.follow-blue AutoFailover Enabled
        defs:
          - range: "%{FABRIC}.couchbase.30"
            label: "autofailover_enabled"
            rrd: couchbase.follow-blue/autofailover_enabled.rrd
        params:
          vlabel: 'enabled_boolean'
        autoalerts:
          zones: ['COUCHBASE-SLA2']
          enabled-fabrics: ['ela4', 'prod-lva1', 'prod-ltx1']
          processor: 'ingraphs'
          filter-type: 'ingraphs_filter'
          contacts: ['couchbase-team@linkedin.com']
          state-check: threshold
          state-check-args:
            min: 1.0
            consecutive-events: 10
          alert-plugin: emailer
          alert-plugin-args:
            recipients: ['some-sre-team@linkedin.com']
            interval: 3600
            include-definition: True
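The `state-check: threshold` settings above (`min: 1.0`, `consecutive-events: 10`) suggest an alert that fires only after several bad samples in a row. A minimal sketch of that idea follows; the exact semantics of the real alerting system are not given in the slides, so the interpretation (a sample is bad when it falls below `min`) and the `should_alert` function are assumptions.

```python
from typing import Iterable


def should_alert(samples: Iterable[float], minimum: float, consecutive_events: int) -> bool:
    """Return True once `consecutive_events` samples in a row fall below `minimum`."""
    run = 0
    for value in samples:
        # Extend the run on a bad sample; any good sample resets it.
        run = run + 1 if value < minimum else 0
        if run >= consecutive_events:
            return True
    return False


# autofailover_enabled reported as 1.0 (on) and then 0.0 (off) for ten samples.
readings = [1.0] * 5 + [0.0] * 10
print(should_alert(readings, minimum=1.0, consecutive_events=10))
```

Requiring a run of consecutive events keeps a single missed or noisy sample from paging the team.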
  17. Operational Tooling: Monitoring & Alerting
  18. Operational Tooling: Management
      • We want a world-view of all the clusters that we run
      • Bucket-, cluster-, and server-level statistics are useful
      • A view of who owns each cluster/bucket is useful
  19. Operational Tooling: Management
  20. Operational Tooling: Management
  21. Operational Tooling: Management
  22. Operational Tooling: Management
  23. Operational Tooling: Management
  24. Operational Tooling: Management
  25. Conclusions
      • Couchbase fits into our existing infrastructure
      • We have good management and monitoring of the clusters
      • Rich set of tooling we extended for our environment
      • Starting to expand our use from a cache to a store for internal tooling
  26. LinkedIn’s Couchbase as a Database
      Brian Cory Sherwin, Site Reliability Engineer, LinkedIn
  27. The Agenda
      • Our use case and requirements
      • Why we chose Couchbase vs MySQL
      • Pitfalls encountered
  28. Couchbase @ LinkedIn
      • Memcache replacement
      • Data resiliency
      • Maintenance friendly
  29. Using Couchbase as a Workflow Backend
      • AutoRemediation! A job execution platform to remediate operations issues
      • Database backend for state tracking of a workflow engine
  30. Our Requirements
      • Easy JSON documents
      • Rapid iteration
      • Horizontally scalable
  31. Why Couchbase?
      • Couchbase as a database
      • Document store
      • Views for indexing
      • Data resiliency
      • Replication
      • Simplicity
  32. Why not MySQL?
      • Upfront cost in creating the schema
      • Rapidly changing documents
      • Number of columns
      • Consistent incremental updates
  33. Pitfalls using Couchbase
      • ACID implications
      • Durability and Consistency
      • Concurrency
      • Different and new tech
  34. Questions?
      bsherwin@linkedin.com
      If you want to learn more about AutoRemediation: http://www.meetup.com/Auto-Remediation-and-Event-Driven-Automation/
