© 2023 Akamas • All Rights Reserved • Confidential
Kubernetes performance tuning dilemma: How to solve it with AI
Stefano Doni, CTO
Agenda
1 The problem
2 Tuning challenges for modern K8s apps
3 AI-powered optimization
4 Demo

Who Am I
● Obsessed with performance optimization
● 18+ years of capacity & performance work
● CMG speaker since 2014, Best paper on Java performance & efficiency in 2015
● Co-founder and CTO @ Akamas, the software platform for autonomous optimization, powered by AI

Kubernetes has become the operating system of the cloud
96% of organizations are either using or evaluating Kubernetes
Source: Cloud Native Computing Foundation, Annual Survey 2021

The dark side of Kubernetes
Problem areas: cost efficiency, apps reliability, apps performance
Sources: Kubernetes FinOps Report, June 2021; Kubernetes failure stories: k8s.af; youtu.be/watch?v=4CT0cI62YHk; youtu.be/QXApVwRBeys

New challenges for cloud-native apps
100s-1000s microservices
10s-100s inter-dependent configurations
Application runtime resource management
● Memory sizing
● Garbage collection
● Compiler & thread settings
Kubernetes resource management
● Container resource requests & limits
● Number of replicas
● Horizontal auto-scaling settings
Why is K8s so hard?
K8s resource management

Resource requests drive K8s cluster costs
● Requests are the resources a container is guaranteed to get
● Cluster capacity is allocated based on pod resource requests - there is no overcommitment!
● Resource requests != resource utilization: a cluster can be full even if utilization is 10%
Resource requests from the pod manifest (Pod A):
apiVersion: v1
kind: Pod
metadata:
  name: pod-a
spec:
  containers:
  - name: app
    image: nginx:1.1
    resources:
      requests:
        memory: "2Gi"
        cpu: "2"
[Figure: a node (4 CPU, 8 GB memory) hosting Pod A and Pod B; Pod A requests 2 cores and 2 GB of memory; CPU and memory requested compared with resources used]
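The gap between requests and utilization can be sketched in a few lines of Python. This is a simplified model, not the Kubernetes scheduler: it only sums CPU requests against node capacity. Pod B's request, the usage figures and the names `Pod`, `fits` and `utilization` are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    cpu_request: float  # cores requested in the manifest
    cpu_used: float     # cores actually consumed (hypothetical measurement)


def fits(node_cpu: float, placed: list[Pod], candidate: Pod) -> bool:
    """Placement compares the sum of requests, not usage, with node capacity."""
    requested = sum(p.cpu_request for p in placed)
    return requested + candidate.cpu_request <= node_cpu


def utilization(node_cpu: float, placed: list[Pod]) -> float:
    """Fraction of the node's CPU that is actually in use."""
    return sum(p.cpu_used for p in placed) / node_cpu


NODE_CPU = 4.0
pods = [Pod("pod-a", 2.0, 0.2), Pod("pod-b", 2.0, 0.2)]  # pod-b is an assumption

node_full = not fits(NODE_CPU, pods, Pod("pod-c", 0.5, 0.1))
node_utilization = utilization(NODE_CPU, pods)
print(f"node full by requests: {node_full}, CPU utilization: {node_utilization:.0%}")
# prints: node full by requests: True, CPU utilization: 10%
```

The node rejects a new pod although 90% of its CPU is idle, which is why oversized requests translate directly into cluster cost.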

Resource limits may strongly impact application performance and stability
● A container can consume more resources than it has requested
● Resource limits specify the maximum resources a container can use (e.g. CPU = 2)
● When a container hits its resource limits, bad things can happen
When hitting CPU limits: K8s throttles the container CPU -> application performance slowdown
When hitting memory limits: K8s kills the container -> application stability issues
[Figure: CPU usage against the container CPU limit, and memory usage against the container memory limit]

CPU throttling impacts cost & performance in surprising ways
[Figure: significant CPU throttling with CPU < 40%, and the resulting performance impact]
SRE: “Why do I have CPU throttling if I’m using less than 40% of my CPU limit? Must be a K8s issue…”
“The container's CPU use is being throttled, because the container is attempting to use more CPU resources than its limit”
https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource
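One common explanation for throttling at low average CPU use is that CPU limits are enforced per scheduling period rather than on the average. The Python sketch below models this under stated assumptions: the limit is enforced as a quota of CPU time per 100 ms period (the Linux CFS default), the container has enough threads to burn its quota early in a period, and the bursty workload is hypothetical. The function `simulate_cfs` is an illustrative name, not a Kubernetes or kernel API.

```python
def simulate_cfs(limit_cpus: float, demand_ms: list[float], period_ms: float = 100.0):
    """Return (average CPU use as a fraction of the limit, fraction of periods throttled)."""
    quota_ms = limit_cpus * period_ms  # CPU time the container may use per period
    backlog_ms = 0.0                   # work left over from throttled periods
    used_ms = 0.0
    throttled = 0
    for demand in demand_ms:
        wanted = backlog_ms + demand
        used = min(wanted, quota_ms)
        backlog_ms = wanted - used
        if backlog_ms > 0:             # quota exhausted before the work was done
            throttled += 1
        used_ms += used
    periods = len(demand_ms)
    return used_ms / (quota_ms * periods), throttled / periods


# Hypothetical workload: one 150 ms burst every five periods, 10 ms otherwise.
workload = [150.0, 10.0, 10.0, 10.0, 10.0] * 20
avg_use, throttled_share = simulate_cfs(limit_cpus=1.0, demand_ms=workload)
print(f"average CPU use: {avg_use:.0%} of limit, throttled periods: {throttled_share:.0%}")
# prints: average CPU use: 38% of limit, throttled periods: 20%
```

In this model the container averages 38% of its limit, yet one period in five is throttled, so requests that land in a burst see added latency.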

Fact #4: Setting resource requests and limits is required to ensure Kubernetes stability
“While your Kubernetes cluster might work fine without setting resource requests and limits, you will start running into stability issues as your teams and projects grow” (Google, Kubernetes best practices)
https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-resource-requests-and-limits

Why is K8s so hard?
Application runtime resource management

App runtimes are highly configurable engines
“Because Java is so often deployed on servers, this kind of performance tuning is an essential activity for many organizations. The JVM is highly configurable with literally hundreds of command-line options and switches. These switches provide performance engineers a gold mine of possibilities to explore in the pursuit of the optimal configuration for a given workload on a given platform.”
$ docker run eclipse-temurin:11-alpine java -XX:+PrintFlagsFinal

Why is heap size tuning important? The JVM uses all of the available memory
[Figure: JVM max heap, JVM heap used and app response time; memory used goes from 2 GiB to 1.2 GiB (-40%)]
Key Takeaways
● The JVM tends to use all of the memory it has been configured with
● Sizing based on K8s container memory usage is going to miss a lot of savings
● Experiment with JVM max heap size to see how much you can save - while monitoring app performance!

How does the JVM set the max heap size in K8s? JVM container-aware ergonomics
Max heap size is set by default to 25% of the container memory limit:
$ docker run --memory 1G eclipse-temurin:11-alpine java -XX:+PrintFlagsFinal 2>&1 | grep -w MaxHeapSize
size_t MaxHeapSize = 268435456 {product} {ergonomic}
You can tune the 25% via the -XX:MaxRAMPercentage parameter:
$ docker run --memory 1G eclipse-temurin:11-alpine java -XX:MaxRAMPercentage=50 -XX:+PrintFlagsFinal 2>&1 | grep -w MaxHeapSize
size_t MaxHeapSize = 536870912 {product} {ergonomic}
Alternatively, you can always set a fixed max heap size with the -Xmx parameter:
$ docker run --memory 1G eclipse-temurin:11-alpine java -Xmx1024M -XX:+PrintFlagsFinal 2>&1 | grep -w MaxHeapSize
size_t MaxHeapSize = 1073741824 {product} {command line}
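The three MaxHeapSize values above follow from simple arithmetic on the 1 GiB container limit. The Python sketch below reproduces them. The helper name `max_heap_from_percentage` is made up for illustration, and the sketch ignores the heap alignment the JVM applies, which happens not to change these particular values.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def max_heap_from_percentage(container_limit_bytes: int, max_ram_percentage: float = 25.0) -> int:
    """Max heap as a percentage of the container memory limit (25% by default)."""
    return int(container_limit_bytes * max_ram_percentage / 100)


default_heap = max_heap_from_percentage(1 * GIB)      # no flags
half_heap = max_heap_from_percentage(1 * GIB, 50.0)   # -XX:MaxRAMPercentage=50
fixed_heap = 1024 * MIB                               # -Xmx1024M

print(default_heap)  # prints 268435456
print(half_heap)     # prints 536870912
print(fixed_heap)    # prints 1073741824
```

Note that the last setting makes the heap alone as large as the container memory limit, leaving no room for off-heap memory.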

JVM ergonomics in K8s are tricky
Source: Microsoft
Key Takeaways
● JVM ergonomics do a lot of magic, but they are tricky to understand and may do the wrong thing!
● The MaxRAMPercentage default is very conservative: increase it, but watch out for out-of-memory kills by K8s
● Do not trust JVM ergonomics: it’s best to explicitly set JVM flags to avoid surprises

OOM kills your app reliability - but better heap sizing can fix them
Context: Java microservices getting restarted due to out-of-memory kills by K8s
[Figure: container memory used hits the container memory limit, triggering the K8s out-of-memory killer, with an availability impact]
SRE: “My containers keep getting OOM killed… Is this a memory leak or a misconfiguration? Let’s increase the memory limit just in case…”

App runtime memory management
[Figure: JVM memory inside the K8s container memory limit: the heap (initial heap up to the JVM max heap size) plus the JVM off-heap areas (threads, classes, compiler, garbage collector)]
Key Takeaways
● Max heap size is the main memory tuning parameter (e.g. JVM -Xmx or -XX:MaxRAMPercentage)
● Off-heap memory cannot be sized via configuration options - its usage depends on your application (200 MB up to 1 GB is common for the JVM)
● You need to monitor your app in production and take both spaces into account when sizing memory to achieve cost-efficient and reliable microservices
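Taking both spaces into account can be expressed as a small sizing calculation. The Python sketch below is one possible rule of thumb, not a method stated in the deck: it adds the configured max heap to an observed off-heap footprint and applies a safety headroom. The 10% headroom, the 300 MiB off-heap figure and the function names are assumptions for illustration.

```python
def container_limit_mib(max_heap_mib: int, off_heap_mib: int, headroom_pct: int = 10) -> int:
    """Container memory limit covering heap + off-heap, plus a safety headroom."""
    total = max_heap_mib + off_heap_mib
    return -(-total * (100 + headroom_pct) // 100)  # integer ceiling division


def heap_share_pct(max_heap_mib: int, limit_mib: int) -> float:
    """Heap as a percentage of the limit, comparable to -XX:MaxRAMPercentage."""
    return 100 * max_heap_mib / limit_mib


# Hypothetical service: 1200 MiB max heap, 300 MiB off-heap observed in production.
limit = container_limit_mib(max_heap_mib=1200, off_heap_mib=300)
print(limit)                                # prints 1650
print(f"{heap_share_pct(1200, limit):.1f}")  # prints 72.7
```

The resulting heap share is far above the 25% default, but it only holds for this off-heap footprint, which is why production monitoring of both spaces is needed before applying it.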

GC tuning can lead to big cost benefits
[Figure: CPU used and app response time with G1 GC (-XX:+UseG1GC) and Parallel GC (-XX:+UseParallelGC); CPU used goes from 1500 millicores to 600 millicores (-60%)]

JVM default ergonomics in K8s: garbage collector
[Figure: default garbage collector (Serial GC or G1 GC) by number of CPUs and memory (MB), with a memory threshold at 1791 MB]
Key Takeaways
● Default GC selection is based on hard-coded thresholds defined decades ago
● You may end up paying the cost of a suboptimal GC, and you may not even know it!
● Other good collectors like Parallel GC are not considered
● Do not trust JVM ergonomics - always set your JVM options!
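The threshold-based selection can be sketched as a two-condition rule. The Python function below is an approximation under assumptions: it treats a container as "server class" when it sees at least 2 CPUs and at least 1792 MB of memory, which matches the 1791 MB boundary in the figure and the way HotSpot's heuristic is commonly described. The CPU threshold and the function name `default_gc` are not taken from the slide, so verify the actual choice on your JVM version.

```python
def default_gc(cpus: int, memory_mb: int) -> str:
    """Approximate default GC choice when no GC flag is set (assumed thresholds)."""
    server_class = cpus >= 2 and memory_mb >= 1792
    return "G1 GC" if server_class else "Serial GC"


for cpus, memory_mb in [(1, 4096), (2, 1791), (2, 1792), (4, 8192)]:
    print(f"{cpus} CPU, {memory_mb} MB -> {default_gc(cpus, memory_mb)}")
# prints:
# 1 CPU, 4096 MB -> Serial GC
# 2 CPU, 1791 MB -> Serial GC
# 2 CPU, 1792 MB -> G1 GC
# 4 CPU, 8192 MB -> G1 GC
```

Under this rule a small change to the container CPU or memory limit can silently switch collectors, which is one reason to set the GC flag explicitly.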

Golang CPU reduction with GOGC tuning
[Figure: CPU used goes from 400 millicores to 180 millicores (-55%)]
Node.js has a lot of tuning flags as well (flaviocopes.com/node-runtime-v8-options)
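For the GOGC result, the memory-for-CPU trade-off can be sketched with the heap-target rule from the Go GC guide, simplified here by ignoring GC roots: the next collection starts when the heap reaches the live heap plus GOGC percent of it. The 100 MiB live heap and the function names below are hypothetical, and the sketch does not derive the 55% figure shown on the slide.

```python
def target_heap_mib(live_heap_mib: int, gogc: int = 100) -> int:
    """Heap size at which the next GC cycle starts (GC roots ignored)."""
    return live_heap_mib * (100 + gogc) // 100


def allocation_between_cycles_mib(live_heap_mib: int, gogc: int = 100) -> int:
    """New memory the app can allocate before the next GC cycle."""
    return target_heap_mib(live_heap_mib, gogc) - live_heap_mib


for gogc in (50, 100, 200):
    print(gogc, target_heap_mib(100, gogc), allocation_between_cycles_mib(100, gogc))
# prints:
# 50 150 50
# 100 200 100
# 200 300 200
```

Doubling GOGC doubles the allocation budget between cycles, so at a steady allocation rate the collector runs about half as often, at the price of a larger heap.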

How to solve this problem?
Performance Engineering to the rescue!

The industry standard performance tuning process
1 Analyze system performance
2 Identify tuning parameters
3 Change one parameter
4 Test system with new config
It’s manual, slow and error-prone, requires deep skills, doesn’t scale, is not continuous…
Optimizing cloud-native applications requires a better approach!
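The one-parameter-at-a-time process can be sketched as a loop, which also shows why it is slow: every candidate value costs one full test run. Everything in this Python sketch is hypothetical, including the parameter names, the candidate values and the `toy_cost` function that stands in for a real load test. The toy cost is separable, so the loop finds the optimum here, whereas inter-dependent parameters offer no such guarantee.

```python
from math import prod
from typing import Callable

Config = dict[str, int]


def tune_one_at_a_time(
    baseline: Config,
    candidates: dict[str, list[int]],
    run_test: Callable[[Config], float],
) -> tuple[Config, int]:
    """Change one parameter per test and keep it only if the score improves."""
    best = dict(baseline)
    best_score = run_test(best)
    tests = 1
    for name, values in candidates.items():
        for value in values:
            trial = {**best, name: value}
            score = run_test(trial)  # in practice: a full load test
            tests += 1
            if score < best_score:
                best, best_score = trial, score
    return best, tests


# Hypothetical optimum used only to build a deterministic toy cost function.
HYPOTHETICAL_OPTIMUM = {"max_heap_mib": 1024, "cpu_limit_millicores": 1500, "replicas": 2}


def toy_cost(config: Config) -> float:
    return sum(abs(config[k] - HYPOTHETICAL_OPTIMUM[k]) for k in config)


candidates = {
    "max_heap_mib": [512, 1024, 1536, 2048],
    "cpu_limit_millicores": [500, 1000, 1500, 2000],
    "replicas": [1, 2, 3, 4],
}
baseline = {"max_heap_mib": 2048, "cpu_limit_millicores": 2000, "replicas": 4}

best, tests = tune_one_at_a_time(baseline, candidates, toy_cost)
grid_size = prod(len(values) for values in candidates.values())
print(best, tests, grid_size)
```

Even this three-parameter toy needs 13 test runs, and an exhaustive search over the same values would need 64; both numbers grow quickly with 10s-100s of parameters per microservice.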

Enter AI-driven Optimization
Autonomous optimization key capabilities
Autonomous optimization process

The Akamas Platform
Optimization Studies
Live Optimizations

Demo
Reducing the cost of a Kubernetes microservice while preserving app performance & reliability

Key takeaways
1 K8s enables unprecedented scalability & efficiency, but it’s not automatic
2 Tuning is your responsibility - if you don’t tune, you don’t save!
3 The biggest cost & reliability wins lie in the K8s workload and app runtime layers - don’t rely on ergonomics!
4 AI-powered optimization enables you to automate tuning and achieve savings at scale
Q&A
Contacts
info@akamas.io
@AkamasLabs
@akamaslabs
Italy HQ
Via Schiaffino 11
Milan, 20158
+39-02-4951-7001
USA East
211 Congress Street
Boston, MA 02110
+1-617-936-0212
Singapore
5 Temasek Blvd
Singapore 038985
USA West
12130 Millennium Drive
Los Angeles, CA 90094
+1-323-524-0524
LinkedIn Twitter
Email
GDG Cloud Southlake #20: Stefano Doni: Kubernetes performance tuning dilemma: How to solve it with AI

  • 1. © 2023 Akamas • All Rights Reserved • Confidential Kubernetes performance tuning dilemma: How to solve it with AI Stefano Doni, CTO
  • 2. © 2023 Akamas • All Rights Reserved • Confidential Agenda 1 The problem 2 Tuning challenges for modern K8s apps 3 AI-powered optimization 4 Demo
  • 3. © 2023 Akamas • All Rights Reserved • Confidential ● Obsessed with performance optimization ● 18+ years of capacity & performance work ● CMG speaker since 2014, Best paper on Java performance & efficiency in 2015 ● Co-founder and CTO @ Akamas, the software platform for autonomous optimization, powered by AI Who Am I
  • 4. © 2023 Akamas • All Rights Reserved • Confidential Kubernetes has become the operating system of the cloud Cloud Native Computing Foundation, Annual Survey 2021 96% of organizations are either using or evaluating Kubernetes
  • 5. © 2023 Akamas • All Rights Reserved • Confidential The dark side of Kubernetes youtu.be/watch?v=4CT0cI62YHk youtu.be/QXApVwRBeys Cost efficiency Apps reliability Apps performance Kubernetes FinOps Report, 2021 June Kubernetes failure stories: k8s.af
  • 6. © 2023 Akamas • All Rights Reserved • Confidential Application runtime resource management Kubernetes resource management ● Memory sizing ● Garbage collection ● Compiler & thread settings ● Container resource requests & limits ● Number of replicas ● Horizontal auto-scaling settings New challenges for cloud-native apps 100s-1000s microservices 10s-100s inter-dependent configurations
  • 7. © 2023 Akamas • All Rights Reserved • Confidential Why is K8s so hard? K8s resource management
  • 8. © 2023 Akamas • All Rights Reserved • Confidential Pod A Pod B Resource requests drive K8s cluster costs CPU Memory ● Requests are resources the container is guaranteed to get ● Cluster capacity is based on pod resource requests - there is no overcommitment! ● Resource requests != resource utilization: a cluster can be full even if utilization is 10% Node (4 CPU, 8 GB Memory) Resource requests from pod manifest Pod A 2 cores 2GB Memory Pod A apiVersion: v1 kind: Pod metadata: name: Pod A spec: containers: - name: app image: nginx:1.1 resources: requests: memory: “2Gi” cpu: “2” 2 4 2 4 6 8 Pod B Resource used
  • 9. © 2023 Akamas • All Rights Reserved • Confidential Resource limits may strongly impact application performance and stability ● A container can consume more resources than it has requested ● Resource limits allow to specify the maximum resources a container can use (e.g. CPU = 2) ● When a container hits its resource limits bad things can happen Container CPU limit Container Memory limit K8s throttle container CPU -> Application performance slowdown When hitting Memory Limits When hitting CPU Limits K8s kills the container -> Application stability issues X CPU Usage Memory Usage
  • 10. © 2023 Akamas • All Rights Reserved • Confidential CPU throttling impacts cost & performance in surprising ways SRE Significant CPU throttling… … with CPU < 40% “The container's CPU use is being throttled, because the container is attempting to use more CPU resources than its limit” https://kubernetes.io/docs/tasks/configure-pod- container/assign-cpu-resource Why do I have CPU throttling if I’m using less than 40% of my CPU limit? Must be a K8s issue… Perf. impact
  • 11. Fact #4: Setting resource requests and limits is required to ensure Kubernetes stability
      “While your Kubernetes cluster might work fine without setting resource requests and limits, you will start running into stability issues as your teams and projects grow” (Google, Kubernetes best practices)
      https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-resource-requests-and-limits
  • 12. Why is K8s so hard? Application runtime resource management
  • 13. App runtimes are highly configurable engines
      “Because Java is so often deployed on servers, this kind of performance tuning is an essential activity for many organizations. The JVM is highly configurable with literally hundreds of command-line options and switches. These switches provide performance engineers a gold mine of possibilities to explore in the pursuit of the optimal configuration for a given workload on a given platform.”
      $ docker run eclipse-temurin:11-alpine java -XX:+PrintFlagsFinal
  • 14. Why is heap size tuning important? The JVM uses all of the available memory
      [Figure: JVM heap used, JVM max heap and app response time; 2 GiB vs 1.2 GiB, -40% memory used]
      Key Takeaways
      ● The JVM tends to use all of the memory it has been configured with
      ● Sizing based on K8s container memory usage is going to miss a lot of savings
      ● Experiment with the JVM max heap size to see how much you can save, while monitoring app performance!
  • 15. How does the JVM set the max heap size in K8s? JVM container-aware ergonomics
      Max heap size is set by default to 25% of the container memory limit:
        $ docker run --memory 1G eclipse-temurin:11-alpine java -XX:+PrintFlagsFinal 2>&1 | grep -w MaxHeapSize
        size_t MaxHeapSize = 268435456 {product} {ergonomic}
      You can tune the 25% via the -XX:MaxRAMPercentage parameter:
        $ docker run --memory 1G eclipse-temurin:11-alpine java -XX:MaxRAMPercentage=50 -XX:+PrintFlagsFinal 2>&1 | grep -w MaxHeapSize
        size_t MaxHeapSize = 536870912 {product} {ergonomic}
      Alternatively, you can always set a fixed max heap size with the -Xmx parameter:
        $ docker run --memory 1G eclipse-temurin:11-alpine java -Xmx1024M -XX:+PrintFlagsFinal 2>&1 | grep -w MaxHeapSize
        size_t MaxHeapSize = 1073741824 {product} {command line}
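The MaxHeapSize values printed above follow from simple arithmetic on the container memory limit. The sketch below reproduces that arithmetic only; `max_heap_bytes` is a hypothetical helper, and it ignores heap alignment and the other ergonomics the JVM may apply.

```python
GIB = 1024 ** 3


def max_heap_bytes(container_limit_bytes, max_ram_percentage=25):
    """Approximate max heap as a percentage of the container memory limit.

    Integer arithmetic keeps the result exact for whole percentages.
    """
    return container_limit_bytes * max_ram_percentage // 100


print(max_heap_bytes(1 * GIB))      # default 25% of a 1G limit
print(max_heap_bytes(1 * GIB, 50))  # with -XX:MaxRAMPercentage=50
```

For a 1G limit this yields 268435456 and 536870912 bytes, the same values the JVM reports in the slide's output.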
  • 16. JVM ergonomics in K8s are tricky
      [Figure; source: Microsoft]
      Key Takeaways
      ● JVM ergonomics do a lot of magic, but they are tricky to understand and may do the wrong thing!
      ● The MaxRAMPercentage default is very conservative: increase it, but watch out for out-of-memory kills by K8s
      ● Do not trust JVM ergonomics: it’s best to explicitly set JVM flags to avoid surprises
  • 17. OOM kills hurt your app reliability, but better heap sizing can fix them
      ● Context: Java microservices getting restarted due to out-of-memory kills by K8s
      ● Container memory used hits the container memory limit, triggering the K8s out-of-memory killer, with an impact on availability
      ● SRE: “My containers keep getting OOM killed… Is this a memory leak or a misconfiguration? Let’s increase the memory limit just in case…”
  • 18. App runtime memory management
      [Figure: JVM memory inside the K8s container memory limit: heap (initial heap, JVM max heap size) and JVM off-heap (threads, classes, compiler, garbage collector)]
      Key Takeaways
      ● Max heap size is the main memory tuning parameter (e.g. JVM -Xmx or -XX:MaxRAMPercentage)
      ● Off-heap cannot be sized via configuration options: its memory usage depends on your application (200 MB up to 1 GB is common for the JVM)
      ● You need to monitor your app in production and take both spaces into account when sizing memory to achieve cost-efficient and reliable microservices
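Taking both memory spaces into account can be sketched as a small sizing calculation. This is only an illustration under stated assumptions: the 10% headroom, the helper names (`container_memory_limit_mib`, `would_be_oom_killed`) and the example heap and off-heap sizes are hypothetical and would need to come from production monitoring in practice.

```python
def container_memory_limit_mib(max_heap_mib, off_heap_mib, headroom_pct=10):
    """Container limit covering max heap plus observed off-heap, with headroom.

    The headroom percentage is an assumption for this sketch, not a rule.
    """
    needed = max_heap_mib + off_heap_mib
    # Ceiling division keeps the arithmetic in integers.
    return -(-needed * (100 + headroom_pct) // 100)


def would_be_oom_killed(limit_mib, heap_used_mib, off_heap_used_mib):
    """The container is at risk once total memory used reaches the limit."""
    return heap_used_mib + off_heap_used_mib >= limit_mib


# Hypothetical service: 1200 MiB max heap, 300 MiB off-heap seen in production.
limit = container_memory_limit_mib(1200, 300)
print(limit)
print(would_be_oom_killed(limit, 1200, 300))   # heap full, usual off-heap
print(would_be_oom_killed(1200, 1200, 300))    # limit sized on heap alone
```

Sizing the limit on the heap alone leaves no room for off-heap memory, which is how a correctly behaving JVM can still be OOM killed.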
  • 19. GC tuning can lead to big cost benefits
      [Figure: CPU used and app response time, comparing G1 GC (-XX:+UseG1GC) and Parallel GC (-XX:+UseParallelGC); CPU used goes from 1500 to 600 millicores, -60% CPU used]
  • 20. JVM default ergonomics in K8s: garbage collector
      [Figure: default collector (Serial GC or G1 GC) by number of CPUs (1 to 8) and memory (MB), with a threshold at 1791 MB]
      Key Takeaways
      ● Default GC selection is based on hard-coded thresholds defined decades ago
      ● You may end up paying the cost of a suboptimal GC, and you may not even know it!
      ● Other good collectors like Parallel GC are not considered
      ● Do not trust JVM ergonomics: always set your JVM options!
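The threshold-based selection described on this slide can be written down as a tiny decision function. The sketch assumes the rule implied by the chart (Serial GC with fewer than 2 CPUs or with memory at or below 1791 MB, G1 GC otherwise); the exact cut-offs in a given JVM version may differ, and `default_gc` is a hypothetical name.

```python
def default_gc(cpus, memory_mb):
    """Collector the JVM would pick by default, per the thresholds on the slide.

    Assumption: at least 2 CPUs and more than 1791 MB are both required for G1.
    """
    if cpus >= 2 and memory_mb > 1791:
        return "G1 GC"
    return "Serial GC"


# Container sizes below are illustrative only.
for cpus, memory_mb in [(1, 4096), (2, 1024), (2, 2048), (4, 8192)]:
    print(cpus, memory_mb, default_gc(cpus, memory_mb))
```

A one-CPU container gets Serial GC no matter how much memory it has, which is why setting the collector explicitly (for example with -XX:+UseG1GC or -XX:+UseParallelGC) avoids surprises.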
  • 21. Golang CPU reduction with GOGC tuning
      [Figure: CPU used goes from 400 to 180 millicores, -55% CPU used]
      Node.js has a lot of tuning flags as well (flaviocopes.com/node-runtime-v8-options)
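Why GOGC affects CPU can be illustrated with a rough model: GOGC sets how much the heap may grow over the live heap before the next collection, so a higher value means fewer GC cycles for the same allocation volume, at the price of more memory. The sketch below assumes the simplified rule target = live heap × (1 + GOGC/100) and ignores GC roots and pacing details; the function names and workload numbers are hypothetical and are not the figures behind the slide.

```python
def target_heap_mib(live_heap_mib, gogc=100):
    """Heap size at which the next GC cycle starts, in the simplified model."""
    return live_heap_mib * (100 + gogc) // 100


def gc_cycles(allocated_mib, live_heap_mib, gogc=100):
    """Rough number of GC cycles needed to allocate `allocated_mib` of garbage."""
    growth_per_cycle = target_heap_mib(live_heap_mib, gogc) - live_heap_mib
    return allocated_mib // growth_per_cycle


# Hypothetical service: 100 MiB live heap, 10000 MiB allocated over some interval.
for gogc in (100, 200, 400):
    print(gogc, target_heap_mib(100, gogc), gc_cycles(10000, 100, gogc))
```

In this model doubling GOGC halves the number of GC cycles while raising the peak heap, which is the memory-for-CPU trade-off that GOGC tuning explores.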
  • 22. How to solve this problem? Performance Engineering to the rescue!
  • 23. The industry standard performance tuning process
      Analyze system performance -> Identify tuning parameters -> Change one parameter -> Test system with new config
      ● It’s manual, slow and error-prone; it requires deep skills, doesn’t scale, and is not continuous…
      ● Optimizing cloud-native applications requires a better approach!
  • 24. Enter AI-driven Optimization
  • 25. Autonomous optimization key capabilities
  • 26. Autonomous optimization process
  • 27. The Akamas Platform: Optimization Studies, Live Optimizations
  • 28. Demo: Reducing the cost of a Kubernetes microservice, while preserving app performance & reliability
  • 29. Key takeaways
      ● K8s enables unprecedented scalability & efficiency, but it’s not automatic
      ● Tuning is your responsibility: if you don’t tune, you don’t save!
      ● The biggest cost & reliability wins lie in the K8s workload and app runtime layers: don’t rely on ergonomics!
      ● AI-powered optimization enables you to automate tuning and achieve savings at scale
  • 30. Q&A
  • 31. Contacts
      ● Email: info@akamas.io
      ● LinkedIn, Twitter: @AkamasLabs, @akamaslabs
      ● Italy HQ: Via Schiaffino 11, Milan, 20158, +39-02-4951-7001
      ● USA East: 211 Congress Street, Boston, MA 02110, +1-617-936-0212
      ● USA West: 12130 Millennium Drive, Los Angeles, CA 90094, +1-323-524-0524
      ● Singapore: 5 Temasek Blvd, Singapore 038985