Graphing real-time performance with Graphite
Neal Anders - https://joind.in/650
whoami

Neal Anders
Senior Software Engineer at Infoblox
http://github.com/nanderoo
http://neal-anders.com
@nanderoo
 
shameless plug
Infoblox is working on some cool stuff...
- DNS, DHCP, IPAM, NCCM
- IPv6 Center of Excellence
- IF-Map / DNSSec
- Hiring (sales, services, support, engineering)
disclaimer
These thoughts and opinions are my own, and
not those of my employer, bla bla bla...
whois $USER
Quick poll:
- Designers
- Developers
- Sys-Admins
- Networking
- Management
- Other...?
overview
What will we cover:
- What is Graphite?
- What data to capture
- Chart interpretation
but why
I worked at a place with major scale fail
- boxed vs service
- 100s of servers in multiple datacenters
- manual processes, shell scripts
- no insight into the app, infrastructure
- n-tier architecture
- on-call duties
- needed therapy, got it, didn't help
 
what is graphite
- Scalable real-time graphing system
- 3 main components:
  - Web front-end, graphite
  - Processing backend, carbon
  - Database, whisper
- Python based*

* It's good to learn other languages
what is graphite
Setup / Documentation:
- Easy to set up
- Decent documentation
- API and CLI access
what is graphite
What does it capture?
- Numeric time-series data...

   point       some.data.path
   value       3.2
   timestamp   1337690041 (epoch)
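
A minimal sketch of how that one data point could be sent to carbon. It assumes carbon's plaintext listener on TCP port 2003 (the usual default) and a line of the form "path value timestamp"; the host name and the helper names format_metric and send_metric are placeholders made up for illustration.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one newline-terminated plaintext line: path, value, epoch seconds."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(line, host="localhost", port=2003):
    """Send one line to carbon over TCP. Host and port are assumptions."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# The data point from the slide, as it would travel over the wire.
line = format_metric("some.data.path", 3.2, 1337690041)
```

Calling send_metric(line) would push the point to a carbon instance, if one is listening at that address.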
what is graphite
How much data?
- configurable
- precision
- retention period
- aggregation
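
To get a feel for the precision vs. retention trade-off, here is a rough sketch that turns a retention string such as "10s:6h,1m:7d" (10-second precision kept for 6 hours, then 1-minute precision kept for 7 days) into point counts and an approximate size. The 12 bytes per point figure is my understanding of whisper's on-disk format and should be treated as an assumption; the function names are made up for illustration, and averaging is only one possible aggregation method.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(spec):
    """Convert a spec like '10s' or '7d' into seconds."""
    return int(spec[:-1]) * UNIT_SECONDS[spec[-1]]


def parse_retentions(retentions):
    """Return a list of (seconds_per_point, number_of_points) per archive."""
    archives = []
    for part in retentions.split(","):
        precision, period = part.strip().split(":")
        step = to_seconds(precision)
        archives.append((step, to_seconds(period) // step))
    return archives


def estimate_bytes(archives, bytes_per_point=12):
    """Approximate storage per metric; bytes_per_point is an assumption."""
    return sum(points for _, points in archives) * bytes_per_point


def downsample(values, factor):
    """Average each run of `factor` fine-grained points into one coarse point."""
    chunks = [values[i:i + factor] for i in range(0, len(values), factor)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


archives = parse_retentions("10s:6h,1m:7d")
```

Multiply the estimate by the number of metrics you plan to send and the disk-io concern becomes easier to reason about.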
 
 
what is graphite
Notes / gotchas:
- Scales horizontally
- Heavy on disk-io
- Fault tolerance
- Data loss
- Precision or Storage Space / io
what data to capture
...so what information should we capture?
 
...how detailed do we get?
 
...and does it have historical relevance?
 
...are just a few key metrics enough?
 
what data to capture
Thoughts on maximum vs. minimum:
- What information do you need to capture?
- Application Data (yes!)
- System Data: cpu, disk-io, mem usage
- Network: Connections? Latency? Packet loss?
- Fine-grained vs summary and aggregate?
what data to capture
In your app:
- function / method / calculation time
- template / content generation
- database query execution
- Internal and 3rd-party API calls
- queue sizes, processing times
- A/B testing?
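
One way to capture timings like these without cluttering the app is a small wrapper around the code being measured. The sketch below is an assumption-laden example, not anything from the talk: timed, run_query and the metric path app.db.query_time are hypothetical names, and the results are collected in a plain list instead of being shipped anywhere.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(path, sink):
    """Time the enclosed block and append (path, seconds, epoch) to sink."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        sink.append((path, elapsed, int(time.time())))


def run_query():
    # Stand-in for a real database call (hypothetical).
    time.sleep(0.01)
    return ["row1", "row2"]


metrics = []
with timed("app.db.query_time", metrics):
    rows = run_query()
```

The same wrapper works for template rendering or 3rd-party API calls; only the metric path changes.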
what data to capture
From the systems:
- cpu
- disk usage
- io (disk, network interface)
- memory / paging / swap
- file handles
- log entries
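
A small sketch of pulling a couple of these numbers using only the Python standard library. It assumes a Unix-like host (os.getloadavg is not available everywhere), uses load average as a rough stand-in for cpu pressure, and the prefix servers.web01 is an invented naming scheme.

```python
import os
import shutil
import time


def collect_system_metrics(prefix="servers.web01", mount="/"):
    """Return (path, value, epoch) tuples for load average and disk usage."""
    now = int(time.time())
    load1, load5, _load15 = os.getloadavg()  # Unix-only
    disk = shutil.disk_usage(mount)
    return [
        (f"{prefix}.load.1min", load1, now),
        (f"{prefix}.load.5min", load5, now),
        (f"{prefix}.disk.used_bytes", disk.used, now),
        (f"{prefix}.disk.free_bytes", disk.free, now),
    ]


samples = collect_system_metrics()
```

Run from cron or a small loop, each tuple maps directly onto the path / value / timestamp shape Graphite expects.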
what data to capture
At the network level:
- connection count
- socket state
- qos levels
- firewall stats
- cdn / cache response
- 3rd party status
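
For connection count and socket state, one possible approach on Linux is to count the state column of /proc/net/tcp. The sketch below works on text passed in, so it can be tried without touching a real host; the SAMPLE rows are shortened and invented, and only a few state codes are mapped (everything else is lumped under "other").

```python
from collections import Counter

# Subset of Linux TCP state codes as they appear in /proc/net/tcp.
TCP_STATES = {
    "01": "established",
    "06": "time_wait",
    "08": "close_wait",
    "0A": "listen",
}


def count_socket_states(proc_net_tcp_text):
    """Count sockets per state from /proc/net/tcp style text."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        counts[TCP_STATES.get(fields[3], "other")] += 1
    return counts


# Abbreviated, made-up sample; real rows carry more columns.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue
   0: 00000000:0050 00000000:0000 0A 00000000:00000000
   1: 0100007F:0050 0100007F:C350 01 00000000:00000000
   2: 0100007F:0050 0100007F:C351 01 00000000:00000000
   3: 0100007F:0050 0100007F:C352 06 00000000:00000000
"""

states = count_socket_states(SAMPLE)
```

On a real box the input would come from reading /proc/net/tcp, and each count becomes one metric such as network.tcp.established.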
chart interpretation
...it's like reading tea leaves...
 
...domains of knowledge leave gaps...
 
...that's not my job...
 
...forest for the trees...
chart interpretation
So what are we looking for:
- normality *
- deviations
- jitters
- historical performance
- double rainbows
 
* not present per Cal's keynote
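
Spotting a deviation only makes sense against a baseline. As a rough illustration (a swapped-in technique of mine, not something Graphite does for you), this sketch flags a value that sits more than a few standard deviations away from recent history; the threshold of 3 and the sample numbers are arbitrary assumptions.

```python
from statistics import mean, stdev


def is_deviation(value, baseline, threshold=3.0):
    """True if value is more than `threshold` standard deviations from the mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical response times (ms) from a quiet period.
baseline = [10, 11, 9, 10, 10, 12, 9, 11]
```

With that baseline, is_deviation(30, baseline) is true while is_deviation(10.5, baseline) is not; jittery metrics may need a wider threshold.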
chart interpretation
Because at 3am when you get paged...
 
Wouldn't it be great to correlate the site going
down... due to swapping... because of high
memory usage... thanks to that code that got
pushed... that had that change to how you
processed row results from a large database
query.
chart interpretation
Or that change window that just happened...
 
Where the security folks made some config
changes to one of the firewalls... that is now
blocking your outbound API calls... just from
some app servers in one of the datacenters...
chart interpretation
What about that new kernel that fixes a
memory leak...
 
Can you compare side by side, and with
historical context, what that looks like?
 
What about a physical machine vs a virtual
one?
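
A side-by-side view with historical context can be requested from Graphite's render API by asking for a metric twice, once as-is and once wrapped in timeShift. The sketch below only builds the URL; the host graphite.example.com and the metric name are invented, and the one-week shift is just an example.

```python
from urllib.parse import urlencode


def render_url(base, metric, shift="7d", window="-24h"):
    """Build a render URL overlaying a metric with itself shifted back in time."""
    params = {
        "target": [metric, f'timeShift({metric}, "{shift}")'],
        "from": window,
        "format": "json",
    }
    return f"{base}/render?{urlencode(params, doseq=True)}"


url = render_url("http://graphite.example.com", "servers.web01.memory.used")
```

Dropping format=json from the parameters should give back a rendered image instead, which is handy for pasting into a post-mortem.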
chart interpretation
Do we need to retune our load-balancers, app
servers, or database replication?
 
Does higher site traffic over the past few
weeks show signs of strain?
 
Did that cache layer we added help any?
 
Is historical data choking once-fast pages?
demo
wordpress example
some final thoughts
- come full circle, stats back in
- this is one solution, there are others (statsd)
- part of a larger tool bag
- implement before big changes
- establish a reference / baseline
- suitable for dev, qa, and production
- make implementing data capture easy
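
On making data capture easy: for comparison, statsd takes fire-and-forget UDP packets of the form "bucket:value|type", which keeps instrumentation out of the request path. This is a minimal sketch assuming statsd's usual UDP port 8125 and a local daemon; the helper names and bucket names are hypothetical.

```python
import socket


def statsd_timing(bucket, millis):
    """Format a timer sample, e.g. 'app.page.render:320|ms'."""
    return f"{bucket}:{millis}|ms"


def statsd_count(bucket, n=1):
    """Format a counter increment, e.g. 'app.logins:1|c'."""
    return f"{bucket}:{n}|c"


def send_udp(payload, host="localhost", port=8125):
    """Send one packet; no reply is expected. Host and port are assumptions."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("ascii"), (host, port))
    finally:
        sock.close()


packet = statsd_timing("app.page.render", 320)
```

Because UDP does not wait for an answer, a missing or slow collector should not slow the app down, at the cost of possibly losing samples.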
resources
http://graphite.wikidot.com
http://wordpress.org
http://memgenerator.net
http://www.flickr.com/groups/webopsviz/
 
...more resources available online...
 
 
feedback
joind.in - https://joind.in/6502
email - neal.anders@yahoo.com
 
fin




Thank you.
Bonus
2001:1868:ad01:1::33

Tek12: Graphing real-time performance with Graphite
