@Dynatrace
Application Quality Metrics for your Pipeline
(and why Docker is not the solution to all of your problems)
Andreas (Andi) Grabner - @grabnerandi
Metrics-Driven DevOps
700 deployments / year
10 + deployments / day
50 – 60 deployments / day
Every 11.6 seconds
Example #1: Online Casino
282! Objects on that page
9.68MB Page Size
8.8s Page Load Time
Most objects are images delivered from your main domain
Very long Connect time (1.8s) to your CDN
Example #2: Lawyer Website based on SharePoint
11s! to load the Landing Page
879! SQL Queries
8! Missing CSS & JS Files
340! Calls to GetItemById
• Waterfall → Agile: 3 years
• 220 Apps - 1 deployment per month
“EVERYONE can do Continuous Delivery”
“Every manual tester does AUTOMATION”
“WE DON’T LOG BUGS – WE FIX THEM!”
Measures Built-In, Visible to Everyone
Promote your Wins, Educate your Peers
Challenges
Fail Faster!?
It's not about blind automation of pushing more bad code through a shiny pipeline
Metrics-based Decisions
Availability dropped to 0%
Bad Deployment based on Resource Consumption
With increasing load: Which LAYER doesn't SCALE?
App with Regular Load supported by 10 Containers
Twice the Load but 48 (=4.8x!) Containers!
App doesn't scale!!
Technical Debt!
80% of dev team time spent on bug fixing
$60B annual cost of bad software
Insufficient Focus on Quality
The “War Room”
Facebook – December 2012
20% of problem patterns cause 80% of the problems
I ♥ learning from others
4 use cases
→ WHY did it happen?
→ HOW to avoid it!
→ METRICS to guide you.
#1: Not every Architect makes good decisions
Project: Online Room Reservation System
• Symptoms
• HTML takes between 60 and 120s to render
• High GC Time
• Developer Assumptions
• Bad GC Tuning
• Probably bad Database Performance, as rendering was simple
• Result: 2 years of finger-pointing between Dev and DBA
Developers built their own monitoring:
void roomreservationReport(int officeId)
{
    // measure how long loading the report data takes
    long startTime = System.currentTimeMillis();
    Object data = loadDataForOffice(officeId);
    long dataLoadTime = System.currentTimeMillis() - startTime; // the value that averaged out at 45s
    generateReport(data, officeId);
}
Result:
Avg. Data Load Time: 45s!
DB Tool says:
Avg. SQL Query: <1ms!
#1: Loading too much data
24889! Calls to the Database API!
High CPU and High Memory Usage to keep all data in Memory
#2: 12444! individual Connections
Classical N+1 Query Problem
Individual SQL really <1ms
#3: Putting all data in a temp Hashtable
Lots of time spent in Hashtable.get, called from their Entity Objects
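A minimal sketch of the pattern described above, with hypothetical table and method names rather than the project's actual code: one query and one connection per office, everything cached in a Hashtable, versus a single targeted query.

import java.sql.*;
import java.util.*;

public class RoomRepository {

    // Anti-pattern sketched above: one query (and one connection) per office,
    // all rows cached in a Hashtable -> thousands of DB calls and connections,
    // high memory to keep everything, even though each single SQL is <1ms.
    Map<Integer, List<String>> loadAllRoomsOneByOne(List<Integer> officeIds, String jdbcUrl) throws SQLException {
        Map<Integer, List<String>> cache = new Hashtable<>();
        for (int officeId : officeIds) {
            try (Connection con = DriverManager.getConnection(jdbcUrl); // new connection per query
                 PreparedStatement ps = con.prepareStatement(
                         "SELECT room_name FROM rooms WHERE office_id = ?")) {
                ps.setInt(1, officeId);
                try (ResultSet rs = ps.executeQuery()) {
                    List<String> rooms = new ArrayList<>();
                    while (rs.next()) rooms.add(rs.getString(1));
                    cache.put(officeId, rooms); // everything kept in memory
                }
            }
        }
        return cache;
    }

    // Better: one connection, one query, only the data the report actually needs.
    List<String> loadRoomsForOffice(int officeId, Connection con) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT room_name FROM rooms WHERE office_id = ?")) {
            ps.setInt(1, officeId);
            try (ResultSet rs = ps.executeQuery()) {
                List<String> rooms = new ArrayList<>();
                while (rs.next()) rooms.add(rs.getString(1));
                return rooms;
            }
        }
    }
}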
Lessons Learned – Don't Assume …
• … you know what the code you inherited is doing!!
• … you are not making mistakes like this
• Explore the Right Tools
• Built-In Database Analysis Tools
• "Logging" options of Frameworks such as Hibernate, …
• JMX, Perf Counters, … of your Application Servers
• Performance Tracing Tools: Dynatrace, Ruxit, NewRelic, AppDynamics, Your Profiler of Choice …
Key Metrics
# of SQL Calls
# of same SQL Execs (1+N)
# of Connections
Rows/Data Transferred
#2: There is no easy "Migration" to Micro(Services)
26.7s Execution Time
33! Calls to the same Web Service
171! SQL Queries through LINQ by this Web Service – requesting similar data for each call
Architecture Violation: direct DB access from the frontend logic instead of going through the service layer
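To make the chatty pattern concrete, here is a hedged sketch with an invented interface and names (not the application's real API): one fine-grained service call per search result versus one coarse-grained, batched call.

import java.util.List;
import java.util.Map;

// Invented interface and types – illustration only.
interface ClubService {
    ClubDetails loadDetails(int clubId);                            // one remote call per club
    Map<Integer, ClubDetails> loadDetailsBatch(List<Integer> ids);  // one remote call for all clubs
}

record ClubDetails(int id, String name) {}

class SearchPage {
    // Chatty: 33 search results -> 33 service calls, each triggering its own SQL roundtrips.
    void renderOnePerItem(ClubService service, List<Integer> clubIds) {
        for (int id : clubIds) {
            System.out.println(service.loadDetails(id).name());
        }
    }

    // Coarse-grained: one call; the service can resolve everything with a single query.
    void renderBatched(ClubService service, List<Integer> clubIds) {
        service.loadDetailsBatch(clubIds).values().forEach(d -> System.out.println(d.name()));
    }
}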
Key Metrics
# Service Calls, # Containers
# of Threads, Sync and Wait
# SQL executions
# of SAME SQLs
Payload (kB) of Service Calls
#3: Don't ASSUME you know the environment
Distance calculation issues
480km biking in 1 hour!
Solution: Unit Test in the Live App reports Geo Calc Problems
Finding: Only happens on certain Android versions
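As an illustration of such a "unit test in the live app" (names and thresholds are invented, not the app's actual code), a plausibility check on the computed distance and speed can flag results like 480 km of biking in one hour:

// Names and thresholds are illustrative, not from the app shown in the talk.
public final class GeoSanityCheck {

    private static final double MAX_PLAUSIBLE_BIKING_KMH = 80.0;

    /** Haversine distance between two coordinates, in kilometers. */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        final double earthRadiusKm = 6371.0;
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * earthRadiusKm * Math.asin(Math.sqrt(a));
    }

    /** 480 km of "biking" in one hour is physically implausible – report it as a functional error. */
    static boolean isPlausibleBikingSpeed(double distanceKm, double durationHours) {
        return durationHours > 0 && (distanceKm / durationHours) <= MAX_PLAUSIBLE_BIKING_KMH;
    }
}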
3rd party issues
Impact of bad 3rd party calls
Key Metrics
# of functional errors
# and Status of 3rd party calls
Payload of Calls
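A minimal sketch of capturing those three metrics around a 3rd-party call (hypothetical wrapper; where you ship the numbers – log, JMX, your monitoring tool – depends on your setup):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ThirdPartyCallMonitor {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Records the three metrics from the slide: status, duration, payload size.
    // System.out is only a placeholder for your real metrics sink.
    static String callAndMeasure(String url) throws Exception {
        long start = System.currentTimeMillis();
        HttpResponse<String> response = CLIENT.send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        long durationMs = System.currentTimeMillis() - start;
        System.out.printf("3rd-party call %s -> status=%d, duration=%dms, payload=%dB%n",
                url, response.statusCode(), durationMs, response.body().length());
        return response.body();
    }
}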
#4: Thinking Big? Then Start Small!
Load Spike resulted in Unavailability
Adonair
Alternative: “GoDaddy goes DevOps”
1h before SuperBowl KickOff
1h after the Game ended
Key Metrics
# Domains
Total Size of Content
What have we learned so far?
1. # Resources
2. Size of Resources
3. Page Size
4. # Functional Errors
5. 3rd Party calls
6. # SQL Executions
7. # of SAME SQLs
Metric-Based Decisions Are Cool
We want to get from here …
To here!
Use these application metrics as additional
Quality Gates
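As an illustration of such a gate (metric names and thresholds are invented for the example), a build step could compare the measured values against agreed limits and break the build on a violation:

import java.util.List;
import java.util.Map;

public class QualityGate {

    record Threshold(String metric, long maxValue) {}

    // Compare measured per-test metrics against agreed limits; report every violation.
    static boolean passes(Map<String, Long> measured, List<Threshold> thresholds) {
        boolean ok = true;
        for (Threshold t : thresholds) {
            long value = measured.getOrDefault(t.metric(), 0L);
            if (value > t.maxValue()) {
                System.out.printf("Quality gate violated: %s = %d (allowed <= %d)%n",
                        t.metric(), value, t.maxValue());
                ok = false;
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        Map<String, Long> metrics = Map.of("sqlExecutions", 75L, "exceptions", 0L);
        List<Threshold> gates = List.of(
                new Threshold("sqlExecutions", 15L), // e.g. an agreed limit close to the known baseline
                new Threshold("exceptions", 0L));
        if (!passes(metrics, gates)) System.exit(1); // break the build
    }
}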
What you currently measure
What you should measure
Quality Metrics
in your pipeline
# Test Failures
Overall Duration
Execution Time per test
# calls to API
# executed SQL statements
# Web Service Calls
# JMS Messages
# Objects Allocated
# Exceptions
# Log Messages
# HTTP 4xx/5xx
Request/Response Size
Page Load/Rendering Time
…
Extend your Continuous Integration
Build #  | Test Case    | Status | # SQL | # Excep | CPU
---------|--------------|--------|-------|---------|------
Build 17 | testPurchase | OK     | 12    | 0       | 120ms
         | testSearch   | OK     | 3     | 1       | 68ms
Build 18 | testPurchase | FAILED | 12    | 5       | 60ms    <- We identified a regression; the exceptions are probably the reason for the failed test
         | testSearch   | OK     | 3     | 1       | 68ms
Build 19 | testPurchase | OK     | 75    | 0       | 230ms   <- Problem fixed, but now we have an architectural regression
         | testSearch   | OK     | 3     | 1       | 68ms
Build 20 | testPurchase | OK     | 12    | 0       | 120ms   <- Problem solved; now we have functional and architectural confidence
         | testSearch   | OK     | 3     | 1       | 68ms

Test Case and Status come from the Test & Monitoring Framework Results; # SQL, # Excep and CPU are the Architectural Data.
Let’s look behind the scenes
#1: Analyzing every Unit & Integration test
#2: Metrics for each test
#3: Detecting regressions based on these measures
Unit/Integration Tests are auto-baselined! Regressions auto-detected!
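A deliberately simplified sketch of the baselining idea (real tools use smarter statistics): compare the current build's value for a metric against the average of previous builds and flag a regression if it exceeds a tolerance.

import java.util.List;

public class MetricBaseline {

    // Flag a regression if the current build's value exceeds the average of the
    // previous builds by more than the given tolerance (e.g. 0.2 = 20%).
    static boolean isRegression(List<Long> previousBuilds, long currentValue, double tolerance) {
        double baseline = previousBuilds.stream()
                .mapToLong(Long::longValue)
                .average()
                .orElse(currentValue);
        return currentValue > baseline * (1.0 + tolerance);
    }

    public static void main(String[] args) {
        List<Long> sqlExecutionsHistory = List.of(12L, 12L, 12L); // e.g. testPurchase in earlier builds
        System.out.println(isRegression(sqlExecutionsHistory, 75L, 0.2)); // true  -> architectural regression
        System.out.println(isRegression(sqlExecutionsHistory, 12L, 0.2)); // false -> within baseline
    }
}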
Build-by-Build Quality View
Build Quality Overview in Dynatrace or in your CI server (e.g. Jenkins)
Production Data: Real User & Application Monitoring
Recap!
#1: Pick your App Metrics
# of Service Calls
Bytes Sent & Received
# of Worker Threads
# of SQL Calls, # of Same SQLs
# of DB Connections
#2: Figure out how to monitor them
http://bit.ly/dtpersonal
#3: Automate it into your Pipeline
#4: Also do it in Production
Draw better Unicorns
Questions and/or Demo
Slides: slideshare.net/grabnerandi
Get Tools: bit.ly/dtpersonal
YouTube Tutorials: bit.ly/dttutorials
Contact Me: agrabner@dynatrace.com
Follow Me: @grabnerandi
Read More: blog.dynatrace.com
Andreas Grabner
Dynatrace Developer Advocate
@grabnerandi
http://blog.dynatrace.com

Docker/DevOps Meetup: Metrics-Driven Continuous Performance and Scalability

Editor's Notes

  • #2 Get Dynatrace Free Trial at http://bit.ly/dtpersonal Video Tutorials on YouTube Channel: http://bit.ly/dttutorials Online Webinars every other week: http://bit.ly/onlineperfclinic Share Your PurePath with me: http://bit.ly/sharepurepath More blogs on http://blog.dynatrace.com
  • #3 If you are new to DevOps and Continuous Delivery check out these two books: Continuous Delivery from Jez Humble, David Farley and The Phoenix Project from Gene Kim, Kevin Behr, and George Spafford
  • #4 Many companies that have a "DevOps Strategy" too often just follow the Unicorns
  • #5 Several companies changed the way they develop and deploy software over the years. Here are some examples (numbers from 2011 – 2014): Cars: from 2 deployments to 700; Flickr: 10+ per day; Etsy: lets every new employee, on their first day of employment, make a code change and push it through the pipeline into production – THAT'S the right approach towards the required culture change; Amazon: every 11.6s. Remember: these are very small changes – which is also a key goal of continuous delivery. The smaller the change, the easier it is to deploy, the less risk it has, the easier it is to test, and the easier it is to take out in case it has a problem.
  • #6 If "Being DevOps" just means increasing the number of deployments then you are bound to fail. Here is an example of a bad web application. When deploying this more frequently you will end up in more war rooms
  • #7 Another example from a SharePoint app that allows production deployments by SharePoint Admins. A simple change made directly in production can have very negative impacts, e.g.: deploying a new WebPart with a data-driven performance hotspot
  • #8 Don't just copy the Unicorns – don't be driven just by the number of deployments.
  • #9 The problem is though – when you blindly copy what you read you may end up with a very ugly copy of a Unicorn. It's not about copying everything or thinking that you have to release as frequently as the Unicorns. It is about adapting a lot of their best practices but doing it in a way that makes sense to you. For you it might be enough to release once a month or once a week.
  • #10 Listen to the next generation Unicorns, e.g: those talking at Velocity or other conferences: Target, CapitalOne, IG, ...
  • #11 These are the highlights of these talks for me this year: http://apmblog.dynatrace.com/2015/05/27/velocity-2015-our-conference-highlights/ http://apmblog.dynatrace.com/2015/05/28/velocity-2015-highlights-from-day-2/ http://apmblog.dynatrace.com/2015/05/29/velocity-2015-highlights-from-last-day/
  • #12 Despite all these stories the main Challenge remains ...
  • #13 Don’t’ just try to deploy faster …
  • #14 … as you may just ending up failing faster and more often!
  • #15 Don’t become the next headline on the news as United in the summer of 2015
  • #16 Or the Fifa World Cup App that crashed for 80% of their Android Users caused by a memory leak in an outdated UI Library one week before the WorldCup
  • #18 I love metrics – and I think we should make decisions on deployments based on key metrics. But also monitor deployments in production to learn whether the deployment was really good
  • #19 The BASIC Metric EVERYONE has to have: Synthetic Availability Monitoring -> Clearly something went wrong
  • #20 Even if the deployment seemed good because all features work and response time is the same as before. If your resource consumption goes up like this the deployment is NOT GOOD. As you are now paying a lot of money for that extra compute power: http://apmblog.dynatrace.com/2015/06/30/fighting-technical-debt-memory-leak-detection-in-production/ http://apmblog.dynatrace.com/2014/10/28/hands-tutorial-5-steps-identify-java-net-memory-leaks/
  • #21 Layer Breakdown perfectly shows which layer of your app is not scaling: http://apmblog.dynatrace.com/2015/01/22/key-performance-metrics-load-tests-beyond-response-time-part/
  • #22 Got a marketing campaign? If you roll it out do it smart: Start with a small number – monitor user behavior – fix errors if there are any before rolling out the rest of the campaign: http://apmblog.dynatrace.com/2015/02/26/omni-channel-monitoring-in-real-life/
  • #23 http://apmblog.dynatrace.com/2015/11/30/last-minute-rescue-for-black-friday-business/
  • #24 A lot of people don't look at these metrics and just add new code onto an ever-growing pile of technical debt
  • #25 Based on a recent study: 80% of dev team time overall is spent on bug fixing instead of building cool new features; $60B is the annual cost of bad software – money that could be invested in cool new features to stay ahead of the competition
  • #26 Yes – we are focusing on quality TOO LATE
  • #27 When it's too late we end up here
  • #28 We need to leave that status quo. And there are two numbers that tell us that it is not as hard to do as it may seem
  • #29 Based on my experience 80% of the problems are only caused by 20% problem patterns. And focusing on 20% of potential problems that take away 80% of the pain is a very good starting point
  • #30 Sounds super nice on paper – so – how do we get there?
  • #35 This story is from Joe – a DB guy at a very large telco arguing with his developers over performance problems of an online room reservation system, which had evolved from a small project implemented by an intern into an application now used across their entire organization
  • #36 Devs built custom monitoring to prove their point – contradicting what Joe's DB tools had to say
  • #37 Reading this Transaction Flow showed what the real problem was: loading too much data from the database, causing high memory usage and therefore high CPU to clean up the garbage
  • #38 Every SQL statement was executed on its own connection
  • #39 The intern back then implemented their own OR mapper by loading the full database content into a Hashtable using individual queries
  • #44 This was a monolithic app for searching sports club websites. The executed sample search returned 33 sports clubs. Before this app was "migrated" to microservices everything was in a single monolith taking about 1s to execute. After the "migration" to (micro)services the same call takes 26.7s, including 33 calls to the new microservice and 171 roundtrips to the database
  • #48 A mobile app with a GPS distance calculation problem. It couldn't be reproduced in test – so they moved the test to production to find out which devices actually have the problem http://apmblog.dynatrace.com/2013/07/23/too-fast-for-the-user/
  • #49 As with many mobile apps, you might rely on 3rd party services for your users to log in. Make sure you monitor the response time and success of these calls and how they impact your end users
  • #53 An overloaded Kia website goes down during the Super Bowl: http://apmblog.dynatrace.com/2014/03/05/bloated-web-pages-can-make-or-break-the-day-lessons-learned-from-super-bowl-advertisers/
  • #54 GoDaddy is doing something different: they have a special "bare minimum, static, optimized" website for the spike period -> that's smart: http://apmblog.dynatrace.com/2014/02/19/dns-tcp-and-size-application-performance-best-practices-of-super-bowl-advertisers/
  • #57 So – we have seen a lot of metrics. The goal now is that you start with one metric. Pick a single metric and take it back to your engineering team (Dev, Test, Ops and Business). Sit down and agree on what this metric means for everyone, how to measure it and also how to report it. Also remember that for most of the use cases discussed, and the metrics derived from them, we only need a single-user test. Even though we can identify performance, scalability and architectural issues, in most cases we don't need a load test. Single-user tests or unit tests are good enough
  • #61 If you are already executing tests then that is great – BUT – you are only testing functionality. It is time to look "underneath" the hood and automatically find all these other problems we just talked about by looking at the right metrics
  • #62 Here is how we do this. In addition to looking at functional and unit test results, which only tell us whether functionality is correct, we also look into these backend metrics for every test. With that we can immediately identify whether code changes result in any performance, scalability or architectural regressions. Knowing this allows us to stop that build early
  • #63 This is what this can look like in a real-life example: analyzing key performance, scalability and architectural metrics for every single test
  • #64 Dynatrace can either show the data in our own dashboards or you can integrate this data through our REST APIs with your build server such as Jenkins, Bamboo, ... and even "BREAK THE BUILD" if something is bad!
  • #65 Make sure you do not stop in pre-production. Once you deploy your application you also want to monitor how your application is doing in the wild. The same technical metrics are important to monitor, but also correlate them with business metrics such as conversion rates, bounce rates, revenue, ...
  • #66 Docker fans: Make sure you monitor your Docker environments to identify any bottlenecks – whether caused by Docker or by your app making inefficient use of Docker/container resources! http://apmblog.dynatrace.com/2015/07/21/how-to-get-visibility-into-docker-clusters-running-kubernetes/
  • #67 More screenshots and tips and tricks on docker/container monitoring http://apmblog.dynatrace.com/2015/07/21/how-to-get-visibility-into-docker-clusters-running-kubernetes/
  • #68 A "dockerized" app monitored with Dynatrace http://apmblog.dynatrace.com/2015/07/21/how-to-get-visibility-into-docker-clusters-running-kubernetes/
  • #74 So – our goal is to deploy new features faster, to get them in front of our paying end users or employees
  • #75 Become the next generation Unicorn!