CMG 101 - Understanding performance


Web performance is good, understanding performance is better.

What you need to understand to run IT systems that perform well at a reasonable cost.

Published in: Technology, Design

  1. 1. Performance is good, understanding performance is better. Peter HJ van Eijk, Chairman NLCMG. A non-profit community of professionals. Feb 11, 2012
  2. 2. CMG 101 Computer Cloud Measurement Group
     Understand:
     • Definitions of availability and response time
     • Psychological and business effect of delay/response time. User interfaces, cost of downtime
     • Transactions, and their structure
     • Waterfall diagrams for transactions and web page downloads
     • Performance measures (seconds, bytes, bits per second, IOPS, etc.)
     • Reporting measures / metrics
     • Visualization of quantitative data, how to
     • Resources (CPU, memory, disk, network, software)
     • Elementary queuing theory
     • Phases in development and how to incorporate performance and capacity (analysis, design, etc.), performance engineering
     • Typical free and commercial tools, or at least their functionality: monitoring, reporting, alerting, analysis, modelling
  3. 3. Availability and Response Time
     • Availability: Ability of a Configuration Item or IT Service to perform its agreed Function when required. […] Availability is usually calculated as a percentage.
     • Response Time: A measure of the time taken to complete an Operation or Transaction
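As a rough illustration of "calculated as a percentage", the sketch below uses a common convention: agreed service time minus downtime, divided by agreed service time. The slide itself gives no formula, so this formula is an assumption, and the function name and the sample month are hypothetical.

```python
def availability_pct(agreed_service_hours: float, downtime_hours: float) -> float:
    """Availability as a percentage of the agreed service time.

    Assumed convention (not stated on the slide):
    100 * (agreed service time - downtime) / agreed service time
    """
    if agreed_service_hours <= 0:
        raise ValueError("agreed service time must be positive")
    if not 0 <= downtime_hours <= agreed_service_hours:
        raise ValueError("downtime must lie between 0 and the agreed service time")
    return 100.0 * (agreed_service_hours - downtime_hours) / agreed_service_hours


if __name__ == "__main__":
    # Hypothetical month: 30 days of round-the-clock service, 3 hours down.
    print(f"{availability_pct(30 * 24, 3):.2f}% available")
```

Note that the result depends heavily on what counts as "agreed" service time: the same outage weighs more against a business-hours agreement than against a 24x7 one.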
  4. 4. Graphs of availability and response time
  5. 5. Psychological and business cost of downtime €+$+£
  6. 6. Sudden surges can kill you. [Chart: pageviews over time, 1 Jan 2008 to 19 May 2009; vertical axis 0 to 700,000 pageviews; annotation "IceSave failure". Source: SiteStat]
  7. 7. [Chart: pageviews per hour by hour of day (0 to 23); vertical axis 0 to 180,000; series labelled "Weather alarm day", "30-dec", "31-dec" and "Ordinary day"]
  8. 8. Transactions and their structure: waterfall diagrams. A single user-level transaction decomposes into multiple transactions on components. [Diagram: client/server exchange (Query, Ack, Reply, Ack) showing network latency and server turnaround time, with Yslow detail]
  9. 9. Transactions: from visits to bandwidth
     • Visits: 1.7 visits/sec, 6,380 per hour (SiteStat measurement)
     • 7.42 pageviews per visit (according to SiteStat), but lower during the crisis
     • 79 GET per visit (according to log file and SiteStat)
     • Pageviews: 13 pageviews/sec, 47,338 per hour (SiteStat measurement, server logs; page composition via FireBug)
     • 10.6 (= 79/7.42) GET per pageview effectively; 32 GET for the homepage (according to the browser)
     • GET requests: 140 requests/sec (HTTP server logs)
     • About 6,800 bytes per request on average (HTTP server logs)
     • Bandwidth: 0.95 Mbyte/sec = 7.6 Megabit/sec
     © Digital Infrastructures
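The chain from visits to bandwidth on this slide is plain arithmetic, and can be checked with a short sketch. The four input figures are taken from the slide; the function and key names are hypothetical, and a megabyte is assumed to be 10^6 bytes, which is what the slide's 0.95 Mbyte/sec implies.

```python
def traffic_chain(visits_per_hour, pageviews_per_visit, gets_per_visit, bytes_per_request):
    """Derive per-second rates and bandwidth from hourly visit counts."""
    visits_per_sec = visits_per_hour / 3600
    requests_per_sec = visits_per_sec * gets_per_visit
    bytes_per_sec = requests_per_sec * bytes_per_request
    return {
        "visits_per_sec": visits_per_sec,
        "pageviews_per_sec": visits_per_sec * pageviews_per_visit,
        "gets_per_pageview": gets_per_visit / pageviews_per_visit,
        "requests_per_sec": requests_per_sec,
        "mbyte_per_sec": bytes_per_sec / 1e6,      # assumes 1 Mbyte = 10^6 bytes
        "mbit_per_sec": bytes_per_sec * 8 / 1e6,   # 8 bits per byte
    }


if __name__ == "__main__":
    # Inputs as given on the slide.
    rates = traffic_chain(
        visits_per_hour=6380,       # SiteStat measurement
        pageviews_per_visit=7.42,   # SiteStat
        gets_per_visit=79,          # log file and SiteStat
        bytes_per_request=6800,     # average from HTTP server logs
    )
    for name, value in rates.items():
        print(f"{name:>18}: {value:8.2f}")
```

Working the chain in this direction makes it easy to ask "what if" questions, such as the bandwidth needed when visits per hour multiply during a surge.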
  10. 10. How to diagnose a problem, where to look? Resource = capacity. [Diagram: end-to-end chain with example breakdowns: (test) client, users, WAN link, router, switch (CPE), firewall, proxy, LAN switches, load balancer, HTTP front end, application, server, network, MySQL DB, NAS, network lines, SAN]
  11. 11. Resource contribution to response time, modelling different resource allocations
     Modelling the effect of different network bandwidths on response time. Excessive client/server chatter leads to a user interaction time of more than 7 minutes!
     [Chart: stacked bars for 64K, 256K, 2Mb (ICTRO) and GBO; horizontal axis 0 to 500 sec; components: server time (sec), client time (sec), network time due to delay (sec), network time due to bandwidth (sec). Based on a 50 ms round trip on the WAN]
     How much faster will this be with:
     • Very fast network
     • Very fast client
     • Very fast server
     After the employee numbers have been requested (there are 373 Janssens), the employment details are requested one at a time (612 in total). On the GBO LAN this leads to a turnaround time of 30 sec (measured).
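The breakdown on this slide can be sketched as a simple additive model: response time is server time plus client time plus network delay (round trips times round-trip time) plus transfer time (bytes divided by bandwidth). The 612 detail requests and the 50 ms WAN round trip come from the slide. Everything else is a hypothetical placeholder: one round trip per request, the payload size, the server and client times, and reading "64K", "256K" and "2Mb" as bits per second. The output therefore shows the shape of the effect and does not reproduce the measured figures.

```python
def response_time_breakdown(round_trips, rtt_s, payload_bytes, bandwidth_bps,
                            server_s=0.0, client_s=0.0):
    """Additive response-time model; returns seconds per component."""
    if bandwidth_bps <= 0:
        raise ValueError("bandwidth must be positive")
    parts = {
        "server": server_s,
        "client": client_s,
        "network_delay": round_trips * rtt_s,              # latency-bound part
        "network_bandwidth": payload_bytes * 8 / bandwidth_bps,  # size-bound part
    }
    parts["total"] = sum(parts.values())
    return parts


if __name__ == "__main__":
    # Hypothetical link speeds, loosely following the chart labels.
    links_bps = {"64K": 64_000, "256K": 256_000, "2Mb": 2_000_000}
    for label, bps in links_bps.items():
        b = response_time_breakdown(
            round_trips=612,          # detail requests (slide), one round trip each (assumed)
            rtt_s=0.050,              # 50 ms WAN round trip (slide)
            payload_bytes=2_000_000,  # hypothetical total payload
            bandwidth_bps=bps,
            server_s=10.0,            # hypothetical
            client_s=5.0,             # hypothetical
        )
        print(f"{label:>5}: total {b['total']:6.1f} s, "
              f"delay {b['network_delay']:.1f} s, "
              f"bandwidth {b['network_bandwidth']:.1f} s")
```

In a model like this, the network delay term stays the same however fast the link becomes, which is why reducing the number of round trips can matter more than buying bandwidth for a chatty application.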
  12. 12. Queuing theory: response depends on capacity. At higher loads, congestion can set in. [Charts: delay factor (0 to 12) against utilisation (10% to 90%), with a sweet spot marked; actual throughput against traffic load, showing perfect, sweet spot and congestion regions]
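One elementary way to reproduce a delay-factor curve of this kind is the single-server M/M/1 queue, where response time equals service time divided by (1 - utilisation). The slide does not say which model lies behind its chart, so M/M/1 is an assumption here, and the function name is hypothetical.

```python
def delay_factor(utilisation: float) -> float:
    """Response time as a multiple of service time, assuming an M/M/1 queue."""
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be in [0, 1)")
    return 1.0 / (1.0 - utilisation)


if __name__ == "__main__":
    # Same utilisation range as the horizontal axis on the slide.
    for pct in range(10, 100, 10):
        factor = delay_factor(pct / 100)
        print(f"{pct:3d}% utilisation -> delay factor {factor:5.2f}")
```

Under this assumption the delay factor doubles at 50% utilisation and reaches 10 at 90%, which illustrates why running a resource close to full capacity makes response times climb steeply.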
  13. 13. So what was the bottleneck?
     • KNMI: static page served from database 1000/sec
     • Ministry: very chatty client/server interaction
     • DNB: JSP application server serves static content
     • Anne Frank: many, large digital assets, no use of CDN
     • Hospital information system: client (front-end) code
  14. 14. How to incorporate performance in development and operations
  15. 15. Typical free and commercial tools and their functionality
     Functionality:
     • Monitoring
     • Reporting
     • Alerting
     • Analysis
     • Modelling
     • Etc …
     Example tools:
     • Nagios
     • Cacti
     • WatchMouse
     • PDQ
     • R
     • Yslow
     • …
  16. 16. CMG 101
     • We want to develop a ‘standard’ body of knowledge
       – To educate our people
       – Speak more of the same language
       – Enable tool vendors to more easily express their offerings
     • Note: defining what is in the course is not the same as developing a course
  17. 17. Call for Action
     • Want to know more?
     • Want to collaborate, contribute?
     • Want to get a course?
     • Want to sponsor?
     • Talk to me
     Peter HJ van Eijk @petersgriddle +31 2268 4939
     NLCMG is a chapter of
  18. 18. Some of my performance projects
     • KNMI (Weather service): website meltdown after weather emergency (“weeralarm”)
     • DNB (Dutch Banks Authority): website meltdown during 2008 financial crisis
     • Unnamed Ministry: information system with multi-minute response times
     • ….
     • Anne Frank website: … anticipated surge after major redesign
     • Hospital information system: storage sizing
  19. 19. Achtung alles Lookenspeepers! Nur watchen das Cloud. tch.html
  20. 20. What does a financial IT crisis look like?
  21. 21. Fernando’s office (bank’s capacity planner)