Couchbase Connect 2016

  1. Michael Kehoe, Staff Site Reliability Engineer, LinkedIn – Going all in: From single use-case to many
  2. Overview • The LinkedIn Story • Couchbase Use-Cases • Development & Operations • Conclusions • Questions
  3. $ whoami • Michael Kehoe • Staff Site Reliability Engineer (SRE) • Production-SRE team • Funny accent = Australian • Contact • linkedin.com/in/michaelkkehoe • @matrixtek
  4. $ whatis SRE • Site Reliability Engineering • Operations for the production application environment • Responsibilities include • Architecture design • Capacity planning • Operations • Tooling
  5. $ whatis CBVT • Couchbase Virtual Team • ~10 SREs • 2 Software Engineers • Sponsored by an SRE Director • 5–90% of their time to support Couchbase • Encourage as many people to contribute as possible • What do we do? • Operational work on Couchbase clusters • Evangelize the use of Couchbase within LinkedIn • Develop tools for the Couchbase ecosystem
  6. The LinkedIn Story • Founded in 2002, LinkedIn has grown into the world’s largest professional social media network • 30 offices in 24 countries, available in 24 languages • More than 450 million members worldwide
  7. The LinkedIn Story • Growth in products • Profiles • Groups • Recruiter • Sales Navigator • Growth in internet traffic • Billions of page-hits per day • 100k+ QPS to production services
  8. In-Memory Storage Needs – The LinkedIn Story • LinkedIn started as an Oracle shop • Hyper-growth = scaling challenges • Read-scaling becomes important • Applicable use-cases • Simple cache store • Pre-warmed • Read-through • Potential for Source of Truth (SoT) store
  9. Enter Couchbase – The LinkedIn Story • Until 2012, we were only using Memcache as a non-SoT in-memory store • Drawbacks • Difficult to pre-warm • No partitioning/sharding (had to write our own) • Cold-cache restarts • Difficult to move data across hosts, clusters, and data centers
  10. Enter Couchbase – The LinkedIn Story • Evaluated replacement systems for Memcached: Mongo, Redis, and others • Couchbase had distinct advantages: • Simple replacement for Memcached • Built-in replication and cluster expansion • Automatic partitioning • Low latency • Async writes to disk • Building tooling is simple
  11. Enter Couchbase – The LinkedIn Story • Today we run Couchbase in our corporate, staging, and production environments • Production/staging statistics: • 148 buckets • 2,821 hosts • 10M+ QPS • Largest clusters: • By hosts: 72 hosts • By documents: 1.4B documents • By QPS: 2.5M QPS
  12. Summary – Use-Cases • Today’s use-cases: • Simple read-through cache • Ephemeral counter store • Temporary de-duping store • SoT data-store for internal tooling
  13. Simple read-through cache – Use-Cases • Drop-in replacement for memcache • Read-scaling • Protecting the backend database from large amounts of traffic • E.g. 3rd-party ingestion credential cache
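To make the read-through pattern above concrete, here is a minimal Python sketch using couchbase-python-client (the library named later in the deck), written against the 2.x-era SDK. The connection string, bucket name, TTL, and fetch_from_db helper are all hypothetical stand-ins, not LinkedIn's actual implementation:

```python
from couchbase.bucket import Bucket
from couchbase.exceptions import NotFoundError

# Hypothetical connection string and bucket name.
cache = Bucket('couchbase://cb.example.com/credential-cache')

def fetch_from_db(key):
    """Placeholder for the expensive backend lookup being protected."""
    raise NotImplementedError

def read_through(key, ttl=300):
    """Return the cached value, falling back to the backend on a miss."""
    try:
        return cache.get(key).value
    except NotFoundError:
        value = fetch_from_db(key)         # cache miss: hit the backend once
        cache.upsert(key, value, ttl=ttl)  # repopulate with an expiry
        return value
```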
  14. Counter Store – Use-Cases • In certain places we simply need to increment counters from multiple systems and store them • E.g. anti-abuse/anti-scraping systems (Fuse)
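Couchbase exposes atomic increment natively, which is what makes it a good ephemeral counter store: many application hosts can bump the same counter without read-modify-write races. A minimal sketch with the Python SDK; the bucket, key scheme, and threshold are hypothetical:

```python
from couchbase.bucket import Bucket

# Hypothetical bucket for anti-abuse counters.
counters = Bucket('couchbase://cb.example.com/abuse-counters')

# Atomically increment; 'initial' seeds the counter when the key is new,
# and 'ttl' expires the count so the store stays ephemeral.
result = counters.counter('scrape:profile-views:10.1.2.3', delta=1, initial=1, ttl=3600)
if result.value > 1000:
    pass  # e.g. hand this client off to the anti-abuse system
```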
  15. Temporary De-duping store – Use-Cases • Need to de-dup data over a large application cluster • E.g. email systems – ensure we don’t send the same email twice
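One common way to build such a de-duping store on Couchbase is an add-only write: insert() fails if the key already exists, so whichever application host loses the race skips sending. A hedged sketch (the bucket name, key format, and de-dupe window are hypothetical):

```python
from couchbase.bucket import Bucket
from couchbase.exceptions import KeyExistsError

# Hypothetical de-dupe bucket shared by the whole application cluster.
dedupe = Bucket('couchbase://cb.example.com/email-dedupe')

def should_send(email_id, window_secs=86400):
    """True only for the first host that claims this email ID."""
    try:
        # insert() is add-only: it raises if another host already wrote the key.
        dedupe.insert(email_id, True, ttl=window_secs)
        return True
    except KeyExistsError:
        return False  # a duplicate: another host already sent it
```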
  16. SoT Store for Internal Tools – Use-Cases • For non-member-facing tools, we use Couchbase as a SoT store • Benefits: • Schema-less • Short setup time • The Couchbase Python client works easily in our environment • Use views for simple map-reduce • Example uses: • Nurse – auto-remediation system • TrafficShift – global traffic automation system • Availability – storing and tracking LinkedIn availability data
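The "views for simple map-reduce" bullet can be illustrated with the Python client's view query API. The design document, view name, bucket, and key below are hypothetical and are not taken from Nurse or the other tools named above:

```python
from couchbase.bucket import Bucket

# Hypothetical bucket backing an internal tool.
tooling = Bucket('couchbase://cb.example.com/nurse')

# Query a (hypothetical) map-reduce view that indexes remediation events by host.
for row in tooling.query('events', 'by_host', key='esv4-app-01', limit=10):
    print(row.key, row.value)
```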
  17. Couchbase Ecosystem – The LinkedIn Story
  18. Developing around Couchbase • Java – li-couchbase-client • Wrapper around the standard Java Couchbase client • Custom metrics emission • Uses the Spring interface • Stores data as Java serialized objects • Python – couchbase-python-client
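li-couchbase-client itself is internal and Java-based; purely to illustrate the idea of a thin wrapper that adds custom metrics emission around the stock client, here is a hypothetical Python analogue (not LinkedIn code) that times each call before delegating:

```python
import time
from couchbase.bucket import Bucket

class InstrumentedBucket:
    """Hypothetical sketch of the wrapper idea: delegate to the stock client
    and emit per-call latency metrics."""

    def __init__(self, connstr, emit):
        self._bucket = Bucket(connstr)
        self._emit = emit  # e.g. a callable pushing (metric, millis) to a dashboard

    def _timed(self, name, fn, *args, **kwargs):
        start = time.time()
        try:
            return fn(*args, **kwargs)
        finally:
            self._emit('couchbase.%s.latency_ms' % name, (time.time() - start) * 1000)

    def get(self, key):
        return self._timed('get', self._bucket.get, key)

    def upsert(self, key, value, **kwargs):
        return self._timed('upsert', self._bucket.upsert, key, value, **kwargs)
```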
  19. Operational Tooling • To use Couchbase efficiently as SREs, we need the following: • Provisioning • Installation • Monitoring & Alerting • Infrastructure Visibility
  20. Provisioning – Operational Tooling • Provisioning flow • Seek estimated usage statistics for the cluster • Size of data to be stored • QPS • Redundancy needs • Calculate cluster sizing • Currently done with a template • Couchbase has a simple calculator available online: http://docs.couchbase.com/prebuilt/calculators/sizing-calc.html • Request hardware for the cluster(s)
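The sizing template reduces to back-of-envelope arithmetic along these lines. The constants here (roughly 56 bytes of per-document metadata, an 85% high-water mark, a 20% resident working set) follow Couchbase's published sizing guidance of that era and are assumptions, not LinkedIn's actual template:

```python
def estimate_cluster_ram_gb(num_docs, avg_key_bytes, avg_value_bytes,
                            replicas=1, working_set_pct=0.20,
                            meta_bytes=56, high_water_mark=0.85):
    """Rough RAM estimate: all key metadata stays resident, plus the working set."""
    copies = 1 + replicas  # active copy plus replicas
    metadata = num_docs * copies * (meta_bytes + avg_key_bytes)
    working_set = num_docs * copies * avg_value_bytes * working_set_pct
    return (metadata + working_set) / high_water_mark / 1024 ** 3

# Example: 1.4B documents, 40-byte keys, 1 KB values, 1 replica.
print('%.0f GB' % estimate_cluster_ram_gb(1.4e9, 40, 1024))
```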
  21. Installation – Operational Tooling • Process • Enter cluster metadata into our management system (Range) • Use Salt states to install and configure the cluster • See Issa Fattah’s post for more information: https://engineering.linkedin.com/blog/2016/04/leveraging-saltstack-to-scale-couchbase • Benefits • Ability to perform ‘state enforcement’ • Using Salt Pillars to encrypt cluster/bucket passwords end-to-end
  22. Monitoring & Alerting – Operational Tooling • We run a daemon on each Couchbase server that collects metrics every minute via the Couchbase APIs • Use cluster metadata from Range to build dashboards with our own system, inGraphs • See ‘Monitoring production deployments’: 4pm – Great America 1
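The collection daemon is internal, but the REST endpoint such a daemon would poll is public Couchbase API (/pools/default/buckets/{bucket}/stats). A minimal sketch of a once-a-minute poll; the host, bucket, credentials, and emit_metric sink (standing in for inGraphs) are hypothetical:

```python
import time
import requests

def poll_bucket_stats(host, bucket, auth, emit_metric):
    """Fetch bucket stats from the Couchbase REST API and emit the latest samples."""
    url = 'http://%s:8091/pools/default/buckets/%s/stats' % (host, bucket)
    samples = requests.get(url, auth=auth).json()['op']['samples']
    for metric in ('ops', 'cmd_get', 'cmd_set', 'mem_used'):
        emit_metric('couchbase.%s.%s' % (bucket, metric), samples[metric][-1])

# The real daemon would be supervised; this just shows the shape of the loop.
while True:
    poll_bucket_stats('localhost', 'credential-cache', ('monitor', 'secret'),
                      lambda name, value: print(name, value))
    time.sleep(60)
```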
  23. Monitoring & Alerting – Operational Tooling
  24. Management – Operational Tooling • We want to see a world-view of all the clusters we run • Having bucket-, cluster-, and server-level statistics is useful • Having a global view of who owns and operates each cluster/bucket is useful
  25. Management – Operational Tooling
  26. Conclusions • Couchbase was a natural fit into our existing infrastructure • Building an ecosystem around Couchbase was important to us and has helped Couchbase be successful at LinkedIn • Expanding use of Couchbase • In the past year we’ve grown the number of buckets by over 50% • Starting to use views in production • Moving Couchbase into LinkedIn’s standard deployment infrastructure
  27. Thank You – Questions?
  28. ©2014 LinkedIn Corporation. All Rights Reserved.

Editor's Notes

  • The LinkedIn Story
    Couchbase Use-Cases
    Development & Operations
    Conclusions
    Questions
  • Site Reliability Engineering
    A term coined by Ben Treynor from Google
    Hybrid of:
    Sysadmin
    Network Engineer
    Architect
    Troubleshooter
    Software Engineer
    Ninjas – digital economy
  • 10 SREs, with a tech lead
    Sponsored by an SRE Director
    Input from Software Engineers on development
  • Founded in 2002, LinkedIn has grown into the world’s largest professional social media network
    30 offices in 24 countries, available in 24 languages
    More than 450 million members worldwide
  • LinkedIn started as an Oracle shop
    To date, we still run a significant number of Oracle databases
    Oracle is fine for writes, scaling reads becomes challenging
    Hyper-growth == scaling challenges
    Scaling writes isn’t a common problem in most cases
    Scaling reads to 100k+ QPS, is challenging
    Failures in read-scaling infra can take down back-end systems
    Applicable use-cases
    Simple cache store
    Pre-warmed
    Read-through
    SoT Store




  • Until 2012, we were only using Memcache as a non-SoT in-memory store
    Drawbacks of memcache:
    Difficult to pre-warm, not easy to copy data
    No native sharding for clusters; had to write our own
    Restarting memcache servers caused problems
    Couldn’t copy data across for new DCs, expanding clusters, etc.
    Mid-2012, started testing Couchbase
  • Evaluated replacement systems for Memcached: Mongo, Redis, and others
    Couchbase had distinct advantages
    Simple replacement for memcache – Java Spring made this simpler
    Built-in replication and cluster expansion, significantly reduces ops-workload
    Automatic partitioning, doesn’t become a concern anymore
    Low latency – reads from disk are still very fast
    Async writes to disk – can write a lot of data at once without it being a problem
    Lots of APIs that make tooling relatively simple


  • Insert Fuse architecture
  • We have a deduplication filter in Stork that you can take advantage of to make sure we don't send duplicates of your email. This is highly recommended for any email using Kafka (Kafka can potentially deliver your email to our system twice).

  • Don’t use as a SoT store, as Espresso is our primary key-value store