It is common practice for social networking services like LinkedIn to introduce a new feature to a small subset of users before deploying it to all members. Even with rigorous testing and tight processes, a new deploy inevitably introduces bugs: unit and integration tests in a development environment cannot cover all use cases. The result is an inferior experience for the users subject to this treatment. In addition, these bugs cause service outages, hurting availability and health metrics and increasing the on-call burden.
Although frameworks exist for testing new features in development clusters, it is not possible to completely simulate every aspect of a production environment. Read requests to the Feed service have enough variation to trigger large classes of unforeseen errors. So we introduced a production instance of the service, called the 'Dark Canary', which receives live production traffic tee'd from a production source node but never returns responses upstream. New experimental code is deployed to the dark canary and compared against the current version on the production node for health failures. This helps us reduce the number of bad sessions served to users and also gives us a better understanding of the service's capacity requirements.
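The tee described above can be sketched as follows. This is a minimal illustration, not LinkedIn's actual implementation: the handler names and request shape are assumptions. The essential properties are that the dark canary copy is fire-and-forget, its response is discarded, and its failures never reach the live path.

```python
import copy
import threading

def tee_request(request, production_handler, dark_handler):
    """Serve the request from production; mirror a copy to the dark
    canary in the background and discard the dark canary's response."""
    # Fire-and-forget: the dark canary never blocks the live path,
    # and its response is never returned upstream.
    mirror = copy.deepcopy(request)
    t = threading.Thread(target=_safe_call, args=(dark_handler, mirror))
    t.daemon = True
    t.start()
    return production_handler(request)  # only this response goes upstream

def _safe_call(handler, request):
    try:
        handler(request)  # response is intentionally discarded
    except Exception:
        pass              # dark canary failures never propagate
```

Note that even a dark handler that crashes on every request leaves the live response untouched, which is exactly the isolation property the dark canary relies on.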
4. LinkedIn in 2016
● 467 million member accounts
● 107 million active users, 40 million+ messaging conversations per week
● Kafka: 1.4 trillion messages/day across 1400 brokers
● 2-3x increase in referral traffic YoY
9. Monitoring
How do we monitor
● inGraphs ----------> frontend to a round-robin database (RRD)
● AutoMetrics ----------> self-service metrics collection
● Nurse ----------> auto-remediation
● Iris ----------> similar to PagerDuty
10. High Availability for Feed
● Data partitioning ---------------> 720 partitions
● Replication ---------------> 3 replicas
● Backup and restore ---------------> separate backup nodes
● Quotas ---------------> throttling
● Fail-over
● Live load testing
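The partitioning and replication bullets above can be sketched as a simple deterministic mapping. The hash choice and the modular replica placement are assumptions for illustration; only the counts (720 partitions, 3 replicas) come from the slide, and real placement logic is more sophisticated.

```python
import zlib

NUM_PARTITIONS = 720   # from the slide
NUM_REPLICAS = 3       # from the slide

def partition_for(member_id: int) -> int:
    """Deterministically map a member to one of 720 data partitions.
    CRC32 is used here only as a stable, illustrative hash."""
    return zlib.crc32(str(member_id).encode()) % NUM_PARTITIONS

def replicas_for(partition: int, num_nodes: int) -> list:
    """Place 3 replicas of a partition on distinct nodes (assumed
    simple modular placement for illustration)."""
    return [(partition + i) % num_nodes for i in range(NUM_REPLICAS)]
```

Because the mapping is deterministic, any frontend can route a member's read to the right partition without coordination, and losing one node still leaves two replicas of each affected partition.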
12. Dark Canary
● Mirror server in production which receives a copy of production traffic
● Responses are never sent to clients
● Source server and dark canary
14. Dark Canary
● Read-Only Requests
● Scalability using traffic multiplier
● Dark Canary failures have no impact on production
● Differentiation from production traffic
○ Page key: dark canary requests use a separate page key.
○ Tree ID: dark canary requests carry a separate tree ID.
Specifics for Feed
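The traffic multiplier and the tagging bullets above can be sketched together: each mirrored read request is duplicated a configurable number of times, and every copy is re-tagged so downstream metrics can separate dark traffic from live traffic. The field names and the page-key value here are assumptions, not LinkedIn's actual request schema.

```python
import copy

DARK_PAGE_KEY = "feed-dark-canary"  # assumed value for illustration

def fan_out(request, multiplier=2):
    """Produce `multiplier` tagged copies of a read request for the
    dark canary. Raising the multiplier scales dark traffic beyond
    the live rate without touching production."""
    copies = []
    for i in range(multiplier):
        r = copy.deepcopy(request)
        r["page_key"] = DARK_PAGE_KEY                # separate page key
        r["tree_id"] = f"dark-{r['tree_id']}-{i}"    # separate tree ID
        copies.append(r)
    return copies
```

A multiplier of 2 means the dark canary sees twice the live request rate, which is how a single node can be pushed past normal production load.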
15. Dark Canary Usage
● New build verification
○ Detection of schema changes, code issues, misconfigurations
● New colo certification
● Warming up caches in a multi-colo environment
● Capacity planning
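The cache-warming use case can be sketched as replaying mirrored read traffic against a new colo before it takes live traffic, so the first real users do not hit cold caches. Everything here is an illustrative assumption: the cache-key field, the loader callback, and the hit/miss accounting.

```python
def warm_caches(mirrored_requests, cache, fetch_from_storage):
    """Replay tee'd read traffic into a cold colo's cache.
    `fetch_from_storage` is a hypothetical loader callback; real
    multi-colo cache warming is more involved than this sketch."""
    hits = misses = 0
    for req in mirrored_requests:
        key = req["viewer_id"]  # assumed cache key for illustration
        if key in cache:
            hits += 1
        else:
            cache[key] = fetch_from_storage(key)  # populate the cold cache
            misses += 1
    return hits, misses
```

After replaying a representative slice of dark traffic, the hit rate approaches what live traffic would see, which is the signal that the colo is warm enough to certify.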
16. Dark Canary
Capacity Measurement via Redliner
● Use live traffic in production environment
● No significant impact on production node
● Minimal maintenance overhead
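The Redliner measurement above can be sketched as a ramp: keep raising the dark canary's load in steps until a health check fails, and report the last healthy level as the capacity estimate. The callback, step size, and bounds are assumptions for illustration, not Redliner's actual interface.

```python
def redline(node_health_ok, base_qps, step=50, max_qps=10000):
    """Ramp load on the dark canary until health degrades.
    `node_health_ok(qps)` is a hypothetical callback that reports
    whether latency/error metrics stay within SLA at that load.
    Returns the highest QPS at which the node was still healthy,
    or None if it was unhealthy even at base_qps."""
    qps = base_qps
    last_healthy = None
    while qps <= max_qps:
        if not node_health_ok(qps):
            break                # first unhealthy step: stop ramping
        last_healthy = qps       # remember the last good level
        qps += step
    return last_healthy
```

Because the ramp runs against the dark canary with real mirrored traffic, the estimate reflects production request patterns, yet pushing the node past its limit affects no user-facing responses.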
21. Advantages and Limitations
Pros
● Proactive measure to catch issues
● Low cost of testing with real production data and traffic patterns
● No test implementation or maintenance is required
● Configurable traffic multiplier
● New code can be deployed and tested in parallel with staging before pushing it to the regular canary
Cons
● Handles read-only requests only
● Recommended for services closer to the bottom of the stack, since mirrored requests still make calls to production downstream services
● Additional hardware for Dark Canary