Couchbase Connect 2016

Michael Kehoe
Staff Site Reliability Engineer
LinkedIn
Going all in:
From single use-case to many

2
Overview
• The LinkedIn Story
• Couchbase Use-Cases
• Development & Operations
• Conclusions
• Questions

$ whoami
3
Michael Kehoe
• Staff Site Reliability Engineer (SRE)
• Production-SRE team
• Funny accent = Australian
• Contact
• linkedin.com/in/michaelkkehoe
• @matrixtek

$ whatis SRE
4
Michael Kehoe
• Site Reliability Engineering
• Operations for the production application environment
• Responsibilities include
• Architecture design
• Capacity planning
• Operations
• Tooling

$ whatis CBVT
5
Michael Kehoe
• Couchbase Virtual Team
• ~10 SREs
• 2 Software Engineers
• Sponsored by SRE Director
• Members spend 5-90% of their time supporting Couchbase
• Encourage as many people to contribute as possible
• What do we do?
• Operational work on Couchbase clusters
• Evangelize the use of Couchbase within LinkedIn
• Develop tools for the Couchbase Ecosystem

6
The LinkedIn Story
• Founded in 2002, LinkedIn has grown into the world’s largest professional social media network
• 30 offices in 24 countries, available in 24 languages
• More than 450 million members worldwide

7
The LinkedIn Story
• Growth in Products
• Profiles
• Groups
• Recruiter
• Sales Navigator
• Growth in Internet Traffic
• Billions of page-hits per day
• 100k+ QPS to production services

In-Memory Storage Needs
8
The LinkedIn Story
• LinkedIn started as an Oracle shop
• Hyper-growth = Scaling challenges
• Read-Scaling becomes important
• Applicable use-cases
• Simple cache store
• Pre-warmed
• Read through
• Potential for Source of Truth (SoT) store

Enter Couchbase
9
The LinkedIn Story
• Until 2012, we were only using Memcached as a non-SoT in-memory store
• Drawbacks
• Difficult to pre-warm
• No partitioning/sharding (had to write our own)
• Cold-cache restarts
• Difficult to move data across hosts, clusters, and data centers
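
To illustrate the "had to write our own" sharding drawback, here is a minimal sketch of hand-rolled client-side sharding. It is not LinkedIn's actual code: the host names, key format, and function names are hypothetical, and simple modulo hashing is assumed only for illustration.

```python
import zlib
from typing import Sequence


def pick_host(key: str, hosts: Sequence[str]) -> str:
    """Map a key to one host using a stable hash (modulo sharding)."""
    # zlib.crc32 is stable across processes, unlike the built-in hash().
    return hosts[zlib.crc32(key.encode("utf-8")) % len(hosts)]


def moved_fraction(keys: Sequence[str], old: Sequence[str], new: Sequence[str]) -> float:
    """Fraction of keys whose owning host changes between two host lists."""
    moved = sum(1 for k in keys if pick_host(k, old) != pick_host(k, new))
    return moved / len(keys)


# Hypothetical cache hosts and keys, for illustration only.
three_hosts = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]
four_hosts = three_hosts + ["cache-d:11211"]
keys = [f"member:{i}" for i in range(1000)]

owner = pick_host("member:42", three_hosts)
fraction = moved_fraction(keys, three_hosts, four_hosts)
```

With modulo sharding, adding a single host changes the owner of most keys, which is one reason resizing a pool or moving data tends to mean a cold cache.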

Enter Couchbase
10
The LinkedIn Story
• Evaluated replacement systems for Memcached: Mongo, Redis, and others
• Couchbase had distinct advantages:
• Simple replacement for Memcached
• Built-in replication and cluster expansion
• Automatic partitioning
• Low latency
• Async writes to disk
• Building tooling is simple

Enter Couchbase
11
The LinkedIn Story
• Today we run Couchbase in our Corporate, Staging and Production environments
• Production/Staging statistics:
• 148 buckets
• 2821 hosts
• 10M+ QPS
• Largest Clusters:
• By Hosts: 72 Hosts
• By Documents: 1.4B Documents
• By QPS: 2.5M QPS

Summary
12
Use-Cases
Today’s use-cases:
• Simple read-through cache
• Ephemeral Counter Store
• Temporary de-duping store
• SoT data-store for internal tooling

Simple read-through cache
13
Use-Cases
• Drop-in replacement for Memcached
• Read-scaling
• Protecting backend database from large amounts of traffic
• E.g. 3rd party ingestion credential cache
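
The read-through pattern can be sketched as follows. This is a minimal illustration under stated assumptions: an in-memory dict stands in for the Couchbase bucket, and the class name, TTL, and loader function are hypothetical rather than anything from the deck.

```python
import time
from typing import Callable, Dict, List, Tuple


class ReadThroughCache:
    """On a miss, load from the backend and keep the value for ttl_seconds."""

    def __init__(self, loader: Callable[[str], str], ttl_seconds: float = 300.0):
        self._loader = loader          # stands in for the backend database call
        self._ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (expiry, value)

    def get(self, key: str) -> str:
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]            # cache hit: the backend is not touched
        value = self._loader(key)      # cache miss: read through to the backend
        self._store[key] = (now + self._ttl, value)
        return value


backend_calls: List[str] = []


def load_credential(key: str) -> str:
    # Hypothetical backend lookup; records each call so hits can be counted.
    backend_calls.append(key)
    return f"credential-for-{key}"


cache = ReadThroughCache(load_credential)
first = cache.get("partner-1")
second = cache.get("partner-1")   # served from the cache
```

Only the first `get` reaches the backend, which is how the cache protects the database from repeated reads.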

Counter Store
14
Use-Cases
• In certain places, we simply need to increment counters from multiple systems and store them
• E.g. Anti-abuse/Anti-scraping systems (Fuse)
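
A counter store boils down to an atomic increment that many writers can call at once. The sketch below uses an in-memory stand-in guarded by a lock; the class and key names are hypothetical, and in a real deployment the store's own atomic counter operation would play the role of `incr`.

```python
import threading
from collections import defaultdict
from typing import DefaultDict


class CounterStore:
    """In-memory stand-in for a shared counter store with atomic increments."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts: DefaultDict[str, int] = defaultdict(int)

    def incr(self, key: str, delta: int = 1) -> int:
        # The lock makes read-modify-write atomic across threads.
        with self._lock:
            self._counts[key] += delta
            return self._counts[key]

    def get(self, key: str) -> int:
        with self._lock:
            return self._counts[key]


store = CounterStore()


def worker() -> None:
    # Each worker simulates one system reporting 1000 requests from one client.
    for _ in range(1000):
        store.incr("requests:client-1")


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = store.get("requests:client-1")
```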

Temporary De-duping store
15
Use-Cases
• Need to de-dup data over a large application cluster
• E.g. Email systems – Ensure we don’t send the same email twice
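
De-duping across a large cluster can be sketched as "insert only if absent, with an expiry". The code below is an in-memory illustration, not the email system's implementation; the class name, key format, and TTL are assumptions, and the clock is injectable only so the expiry behaviour can be exercised.

```python
import threading
import time
from typing import Callable, Dict


class DedupStore:
    """Remembers keys for ttl_seconds; first_time() is True only for new keys."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._seen: Dict[str, float] = {}  # key -> expiry time

    def first_time(self, key: str) -> bool:
        now = self._clock()
        with self._lock:
            expiry = self._seen.get(key)
            if expiry is not None and expiry > now:
                return False               # already seen and not yet expired
            self._seen[key] = now + self._ttl
            return True


fake_now = [0.0]                           # hypothetical clock for the example
dedup = DedupStore(ttl_seconds=10.0, clock=lambda: fake_now[0])

send_first = dedup.first_time("email:member-42:digest")   # new key
send_again = dedup.first_time("email:member-42:digest")   # duplicate
fake_now[0] = 11.0                                        # past the TTL
send_later = dedup.first_time("email:member-42:digest")   # key has expired
```

A sender would only deliver the email when `first_time` returns True; the TTL keeps the store temporary.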

SoT Store for Internal Tools
16
Use-Cases
• For non-member-facing tools, we use Couchbase as a SoT store.
• Benefits:
• Schema-less
• Short setup time
• Couchbase Python Client works easily in our environment
• Use views for simple map-reduce
• Example Uses:
• Nurse – Autoremediation system
• TrafficshiftIn – Global traffic automation system
• Availability – Storing and tracking LinkedIn availability data
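
The "simple map-reduce" idea behind views can be sketched in plain Python. Couchbase views themselves are defined with JavaScript map and reduce functions; this is only an illustration of the emit-then-aggregate flow, and the document fields and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical availability documents, for illustration only.
docs = [
    {"type": "availability", "service": "profile", "ok": True},
    {"type": "availability", "service": "profile", "ok": False},
    {"type": "availability", "service": "search", "ok": True},
    {"type": "note", "text": "ignored by the map step"},
]


def map_doc(doc: dict) -> Iterator[Tuple[str, int]]:
    """Map step: emit (service, 1) for every failed availability check."""
    if doc.get("type") == "availability" and not doc["ok"]:
        yield doc["service"], 1


def reduce_sum(values: List[int]) -> int:
    """Reduce step: add up the emitted values for one key."""
    return sum(values)


def run_view(documents: Iterable[dict]) -> Dict[str, int]:
    grouped: Dict[str, List[int]] = defaultdict(list)
    for doc in documents:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return {key: reduce_sum(values) for key, values in grouped.items()}


failures_by_service = run_view(docs)
```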

Couchbase Ecosystem
17
The LinkedIn Story

18
Developing around Couchbase
• Java – li-couchbase-client
• Wrapper around standard Java Couchbase Client
• Custom metrics emission
• Using Spring interface
• Storing data as Java serialized objects
• Python – couchbase-python-client

19
Operational Tooling
In order to efficiently use Couchbase as SREs, we need the following:
• Provisioning
• Installation
• Monitoring & Alerting
• Infrastructure Visibility

Provisioning
20
Operational Tooling
• Provisioning Flow
• Seek estimated usage statistics for cluster
• Size of data to be stored
• QPS
• Redundancy Needs
• Calculate cluster sizing
• Currently done with a template
• Couchbase has a simple calculator available online: http://docs.couchbase.com/prebuilt/calculators/sizing-calc.html
• Request hardware for cluster(s)
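
The sizing step can be sketched as back-of-the-envelope arithmetic. This is neither the template mentioned above nor Couchbase's calculator: the formula, the usable-RAM fraction, the per-node throughput, and all input numbers are assumptions chosen for illustration.

```python
import math


def nodes_for_memory(data_gb: float, replicas: int,
                     ram_per_node_gb: float, usable_fraction: float) -> int:
    """Nodes needed to hold the data set plus its replica copies in RAM."""
    total_gb = data_gb * (1 + replicas)
    usable_per_node_gb = ram_per_node_gb * usable_fraction
    return math.ceil(total_gb / usable_per_node_gb)


def nodes_for_qps(peak_qps: float, qps_per_node: float) -> int:
    """Nodes needed to serve the expected peak request rate."""
    return math.ceil(peak_qps / qps_per_node)


def cluster_size(data_gb: float, replicas: int, ram_per_node_gb: float,
                 usable_fraction: float, peak_qps: float, qps_per_node: float) -> int:
    # The cluster must satisfy both constraints, so take the larger count.
    return max(
        nodes_for_memory(data_gb, replicas, ram_per_node_gb, usable_fraction),
        nodes_for_qps(peak_qps, qps_per_node),
    )


# Hypothetical inputs: 500 GB of data, 1 replica, 64 GB nodes with 70% usable,
# 400k peak QPS, and an assumed 50k QPS per node.
memory_nodes = nodes_for_memory(500, 1, 64, 0.7)
qps_nodes = nodes_for_qps(400_000, 50_000)
size = cluster_size(500, 1, 64, 0.7, 400_000, 50_000)
```

In this made-up example memory is the binding constraint, so the data size and redundancy needs drive the hardware request more than QPS does.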

Installation
21
Operational Tooling
• Process
• Enter cluster metadata into our management system (Range)
• Use Salt States to install and configure cluster
• See Issa Fattah’s post for more information:
• https://engineering.linkedin.com/blog/2016/04/leveraging-saltstack-to-scale-couchbase
• Benefits
• Ability to perform ‘state enforcement’
• Using Salt Pillars to encrypt cluster/bucket passwords end-to-end

Monitoring & Alerting
22
Operational Tooling
• We run a daemon on each Couchbase Server that collects metrics every minute via Couchbase APIs
• Use cluster metadata from Range to build dashboards with our own system, InGraphs
• See: ‘Monitoring production deployments’: 4pm - Great America 1
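
The parsing half of such a collector can be sketched as below. It is not the daemon described above. The payload shape (lists of samples under `op.samples`, as returned by a bucket stats call) is an assumption to verify against the Couchbase REST documentation, and the cluster name, bucket name, and metric line format are hypothetical.

```python
from typing import Dict, List


def latest_samples(payload: Dict, wanted: List[str]) -> Dict[str, float]:
    """Pick the most recent value of each wanted stat from a stats payload."""
    # Assumed shape: {"op": {"samples": {"ops": [...], "cmd_get": [...]}}}
    samples = payload.get("op", {}).get("samples", {})
    latest: Dict[str, float] = {}
    for name in wanted:
        series = samples.get(name)
        if series:                      # skip stats that are missing or empty
            latest[name] = float(series[-1])
    return latest


def to_metric_lines(cluster: str, bucket: str,
                    latest: Dict[str, float], timestamp: int) -> List[str]:
    """Format values as 'name value timestamp' lines (hypothetical format)."""
    return [
        f"couchbase.{cluster}.{bucket}.{name} {value} {timestamp}"
        for name, value in sorted(latest.items())
    ]


# Hypothetical payload standing in for one API response.
sample_payload = {"op": {"samples": {"ops": [1000, 1100, 1200],
                                     "cmd_get": [800, 850, 900]}}}

latest = latest_samples(sample_payload, ["ops", "cmd_get", "missing_stat"])
lines = to_metric_lines("demo", "sessions", latest, 1700000000)
```

A daemon would wrap these two functions in a loop that fetches the payload once a minute and ships the resulting lines to the metrics system.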

Monitoring & Alerting
23
Operational Tooling

Management
24
Operational Tooling
• We want to see a world-view of all the clusters we run
• Having bucket-, cluster-, and server-level statistics is useful
• Having a global view of who owns and operates each cluster/bucket is useful

Management
25
Operational Tooling

26
Conclusions
• Couchbase was a natural fit into our existing infrastructure
• Building an ecosystem around Couchbase was important to us and has helped Couchbase be successful at LinkedIn
• Expanding use of Couchbase
• In the past year we’ve grown the number of buckets by over 50%
• Starting to use Views in production
• Moving Couchbase into LinkedIn standard deployment infrastructure

Couchbase Connect 2016
