Improving Tail Latency of Stateful Cloud Services via GC Control and Load Shedding

Most modern cloud web services run on top of runtime environments such as .NET's Common Language Runtime or the Java Runtime Environment. On the one hand, runtime environments provide several off-the-shelf benefits, such as code security and cross-platform execution. On the other hand, runtime features such as just-in-time compilation and automatic memory management add non-deterministic overhead to the overall service time, lengthening the tail of the latency distribution. In this context, the Garbage Collector (GC) is among the leading causes of high tail latency. To tackle this problem, we developed the Garbage Collector Control Interceptor (GCI), a request interceptor algorithm that is agnostic to the cloud service's language, internals, and incoming load. GCI is fully decentralized and improves the tail latency of cloud services by making sure that service instances shed incoming load while cleaning up the runtime heap. We evaluated GCI’s effectiveness on a stateful service prototype, varying the number of available instances. Our results showed that using GCI eliminates the impact of garbage collection on service latency for small (4 nodes) and large (64 nodes) deployments, with no throughput loss.

Improving Tail Latency of
Stateful Cloud Services via
GC Control and Load Shedding
Daniel Fireman danielfireman@gmail.com
João Brunet, Raquel Lopes, David Quaresma, Thiago Emmanuel Pereira

Runtime Environments (RTEs)
Manage/Control/Support the program execution
● Platform independence
● Abstractions
○ Memory model
○ Execution
○ Concurrency
● Safety
○ Checks, checks and more checks
○ Automatic memory management
● ...

Garbage Collection (GC) impact on latency
● Overhead: CPU competition + pauses
● Distribution tail: 0.1% of the requests were impacted

Scaled out view of the latency impact
1 - 0.99^100 ≈ 63%
● 0.99: considered percentile (when does a peak happen?)
● 100: number of endpoints reached to serve one request
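
The 63% figure on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes that each endpoint call independently has a 1% chance of landing beyond the 99th percentile; the function name and the fan-out values are illustrative and not taken from the talk.

```python
def prob_request_hits_tail(percentile: float, fanout: int) -> float:
    """Chance that at least one of `fanout` independent endpoint calls
    is slower than the given latency percentile (e.g. 0.99)."""
    # All calls are fast with probability percentile ** fanout,
    # so the complement is the chance that at least one call is slow.
    return 1.0 - percentile ** fanout


if __name__ == "__main__":
    for fanout in (1, 10, 100):
        p = prob_request_hits_tail(0.99, fanout)
        print(f"fanout={fanout:3d} -> {p:.1%} of requests see a tail-latency call")
```

With a single endpoint only about 1% of requests are affected, but with 100 endpoints per request roughly 63% are, which is why a per-instance tail problem becomes a common-case problem at scale.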

Which ends up harming
● User experience
● Predictability
● SLAs

You could try
● Tune the GC configuration
● Investigate and fix spots of GC pressure
● Go off-heap
● Switch to manual memory management
● Buy commercial implementations and support

If all that is too hard or expensive
Avoid collecting garbage while processing requests

Garbage Collector Control
Interceptor (GCI)

Components: Proxy
● Intercepts all requests
● Service/RTE agnostic
● Decides when to check heap
● Decides which requests to shed
github.com/gcinterceptor/gci-proxy
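
The proxy responsibilities listed above can be sketched roughly as follows. This is a simplified illustration, not the gci-proxy implementation: the class names, the fixed heap threshold, the check interval, and the `FakeAgent` stand-in are all assumptions made for the example, and the use of HTTP-style 200/503 status codes is just one possible load shedding mechanism.

```python
from dataclasses import dataclass


@dataclass
class FakeAgent:
    """Hypothetical stand-in for the RTE-specific agent."""
    heap_bytes: int = 0
    gc_runs: int = 0

    def serve(self, request: str) -> str:
        self.heap_bytes += 100  # pretend every request allocates 100 bytes
        return f"ok:{request}"

    def check_heap(self) -> int:
        return self.heap_bytes

    def collect(self) -> None:
        self.heap_bytes = 0
        self.gc_runs += 1


class Proxy:
    """Intercepts every request, decides when to check the heap and what to shed."""

    def __init__(self, agent: FakeAgent, heap_threshold: int, check_every: int):
        self.agent = agent
        self.heap_threshold = heap_threshold  # assumed fixed; not adaptive here
        self.check_every = check_every
        self.collecting = False
        self._since_check = 0

    def handle(self, request: str) -> tuple[int, str]:
        if self.collecting:
            # Shed the request so that it never waits behind a collection.
            return 503, "shed"
        body = self.agent.serve(request)
        self._since_check += 1
        if self._since_check >= self.check_every:
            self._since_check = 0
            if self.agent.check_heap() >= self.heap_threshold:
                self.collecting = True  # stop accepting until the GC has run
        return 200, body

    def run_pending_gc(self) -> None:
        """Meant to be called outside the request path, e.g. by a worker."""
        if self.collecting:
            self.agent.collect()
            self.collecting = False
```

In this sketch, requests that arrive between the heap check and the end of the collection get a 503 immediately, so a client or load balancer can retry them on another instance instead of waiting for the collector.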

Components: Request Processor (a.k.a. Agent)
● Executes commands from the proxy
● RTE Specific
● Calls RTE’s APIs
○ Check Heap
○ GC
github.com/gcinterceptor/gci-{java,ruby, nodejs, go}
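
To make the agent's two commands concrete, here is a minimal sketch of what such an agent could look like for CPython. It is hypothetical: the listed repositories target Java, Ruby, Node.js and Go, and the class and method names below are invented for illustration. It relies only on the standard `gc` and `sys` modules, and uses the allocated block count as a rough proxy for heap usage.

```python
import gc
import sys


class PythonAgent:
    """Hypothetical RTE-specific agent exposing 'check heap' and 'GC' commands."""

    def __init__(self) -> None:
        # Turn off automatic cyclic collection so that collections only
        # happen when the proxy asks for one. Reference counting in
        # CPython still frees non-cyclic garbage immediately.
        gc.disable()

    def check_heap(self) -> int:
        """Return the number of memory blocks currently allocated by CPython."""
        return sys.getallocatedblocks()

    def collect(self) -> int:
        """Run a full collection and return the number of unreachable objects found."""
        return gc.collect()
```

A proxy would call `check_heap()` periodically and call `collect()` only after it has stopped forwarding requests to this instance.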

And more!
● Plug and Play
● Adaptive
● Fully decentralized
● Runtime agnostic
○ uses available APIs to trigger garbage collection and check the heap
● Transport agnostic
○ uses the available interception and load shedding mechanisms to avoid
receiving requests during collection

Research Question
Does GCI shorten the tail of the latency distribution of
stateful services without significantly penalizing the service
throughput?
Small (4 nodes) and Large (64 nodes) clusters

Simulator
github.com/gcinterceptor/gci-simulator

Results - Large Cluster
Does GCI shorten the tail of the latency distribution of
stateful services without significantly penalizing the service
throughput?

Results - Large Cluster
● 99th → 30%
● 99.999th → 47%
● No throughput loss
Yes!

Conclusions and Next Steps
● Avoiding request processing while GC’ing is effective at improving the tail latency of
stateful cloud services
● GCI is an easy-to-use mechanism to achieve that
● Furthermore, it is adaptive, low overhead and fully distributed!
Next we would like to:
● Decrease the proxy overhead
● Experiment with a more realistic application (e.g. Hazelcast) and load (e.g. YCSB)
● Investigate stateless and DSP cloud applications

Thank you
@daniellfireman @danielfireman
github.com/gcinterceptor

