The sFlow standard: scalable, unified monitoring of networks, systems and applications
 Dave Mangot (Tagged Inc.) tech.mangot.com
 Peter Phaal (InMon Corp.) blog.sflow.com
Tagged Inc.
   Social Networking
   5 billion page views a month
   4 TB of main memcached
   Heavy use of Apache/PHP and Java
   Ganglia critical to business function
   Puppet for configuration management
InMon Corp.
 Performance management software developer
 Originators of the sFlow standard
 Founding member of sFlow.org
 Initial implementer of, and contributor to, Host sFlow and related projects
 - Memcached sFlow patch
 - Apache mod_sflow
 - NGINX sFlow module
 - sFlow Java Agent
 Contributed sFlow support to Ganglia project
Challenge: Monitoring large, scale-out, multi-tiered sites

 [Diagram: load balancers, web servers, application servers, memcache servers and a database, all connected by the network]

 Large number of servers in each pool
 Servers constantly being added/removed
 Network performance is critical
 - scale-out applications dependent on network performance
 - potential for propagating failures between tiers
sFlow is the industry standard for monitoring switches
Open source sFlow agents for hosts and applications
sFlow exports standard counters
 Network (maintained by hardware in network devices)
  - MIB-2 ifTable: ifInOctets, ifInUcastPkts, ifInMulticastPkts, ifInBroadcastPkts, ifInDiscards, ifInErrors, ifInUnknownProtos,
    ifOutOctets, ifOutUcastPkts, ifOutMulticastPkts, ifOutBroadcastPkts, ifOutDiscards, ifOutErrors
 Host (maintained by operating system kernel)
  - CPU: load_one, load_five, load_fifteen, proc_run, proc_total, cpu_num, cpu_speed, uptime, cpu_user, cpu_nice, cpu_system,
    cpu_idle, cpu_wio, cpu_intr, cpu_sintr, interrupts, contexts
  - Memory: mem_total, mem_free, mem_shared, mem_buffers, mem_cached, swap_total, swap_free, page_in, page_out,
    swap_in, swap_out
  - Disk IO: disk_total, disk_free, part_max_used, reads, bytes_read, read_time, writes, bytes_written, write_time
  - Network IO: bytes_in, packets_in, errs_in, drops_in, bytes_out, packets_out, errs_out, drops_out
 Application (maintained by application)
  - HTTP: method_option_count, method_get_count, method_head_count, method_post_count, method_put_count,
    method_delete_count, method_trace_count, method_connect_count, method_other_count, status_1xx_count, status_2xx_count,
    status_3xx_count, status_4xx_count, status_5xx_count, status_other_count
  - Memcache: cmd_set, cmd_touch, cmd_flush, get_hits, get_misses, delete_hits, delete_misses, incr_hits, incr_misses,
    decr_hits, decr_misses, cas_hits, cas_misses, cas_badval, auth_cmds, auth_errors, threads, con_yields, listen_disabled_num,
    curr_connections, rejected_connections, total_connections, connection_structures, evictions, reclaimed, curr_items, total_items,
    bytes_read, bytes_written, bytes, limit_maxbytes
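
The counters listed above are cumulative totals, so an analyzer has to turn two successive readings into a rate before they can be graphed. A minimal Python sketch of that step, assuming 32-bit counters that wrap at 2^32 (as ifInOctets does in MIB-2); the function name and the sample readings are hypothetical and not part of sFlow:

```python
def counter_rate(prev, curr, interval_s, bits=32):
    """Per-second rate between two readings of a cumulative counter.

    Handles a single wrap of a `bits`-wide counter between readings.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    modulus = 1 << bits
    delta = (curr - prev) % modulus  # modular difference absorbs one wrap
    return delta / interval_s


# Two hypothetical ifInOctets readings taken 30 seconds apart.
print(counter_rate(1_000_000, 4_000_000, 30))   # 100000.0 bytes/s
# The counter wrapped past 2**32 between the two readings.
print(counter_rate(2**32 - 1_000, 2_000, 30))   # 100.0 bytes/s
```

The modular difference cannot tell a wrap from a counter reset (for example after a reboot), so a real analyzer would probably also sanity-check unusually large deltas.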
sFlow’s scalable “push” protocol
 Simple
 - standard structures - densely packed blocks of counters
 - extensible (tag, length, value)
 - RFC 1832: XDR encoded (big endian, quad-aligned, binary) - simple to encode/decode
 - unicast UDP transport
 Minimal configuration
 - collector address
 - polling interval
 Cloud friendly
 - flat, two-tier architecture: many embedded agents → central “smart” collector
 - sFlow agents automatically start sending metrics on startup and are automatically discovered
 - eliminates complexity of maintaining polling daemons (and their associated configurations)
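
To make "XDR encoded" and "unicast UDP transport" concrete, here is a rough Python sketch of packing a block of counters as big-endian, 4-byte-aligned values inside a (tag, length, value) record and pushing it to a collector. It is not the real sFlow datagram layout: the datagram header is omitted, and the tag value, the counter values and the helper names are all hypothetical. UDP port 6343 is the port sFlow collectors commonly listen on; the loopback address is a placeholder.

```python
import socket
import struct


def xdr_uint(value):
    """XDR unsigned int: 4 bytes, big endian."""
    return struct.pack(">I", value)


def encode_record(tag, counters):
    """Hypothetical (tag, length, value) record holding 32-bit counters."""
    body = b"".join(xdr_uint(c) for c in counters)
    return xdr_uint(tag) + xdr_uint(len(body)) + body


def decode_record(buf):
    """Inverse of encode_record: returns (tag, list of counters)."""
    tag, length = struct.unpack_from(">II", buf, 0)
    counters = list(struct.unpack_from(">%dI" % (length // 4), buf, 8))
    return tag, counters


def push(record, collector=("127.0.0.1", 6343)):
    """Send one record as a unicast UDP datagram; no reply is expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(record, collector)


record = encode_record(tag=1, counters=[120, 7, 0])
print(len(record), decode_record(record))  # 20 (1, [120, 7, 0])
```

Calling `push(record)` would send the 20 bytes to the collector address. Because the agent only needs that address and a polling interval, there is no per-device state for the collector to configure in advance.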
Example
 Collect 50 metrics per server
 Every 30 seconds
 From 100,000 servers
 100,000 / 30 ≈ 3,333 sFlow datagrams per second
 Single sFlow analyzer can monitor entire data center!
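
The arithmetic behind this example can be checked with a few lines of Python. The sketch assumes, as the slide does, that all of a server's metrics fit in one datagram per polling interval; the function names are made up for illustration:

```python
def datagrams_per_second(servers, polling_interval_s):
    """Each server sends one counter datagram per polling interval."""
    return servers / polling_interval_s


def metrics_per_second(servers, metrics_per_server, polling_interval_s):
    """Individual metric values arriving at the collector per second."""
    return servers * metrics_per_server / polling_interval_s


print(round(datagrams_per_second(100_000, 30)))    # 3333
print(round(metrics_per_second(100_000, 50, 30)))  # 166667
```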
Counters aren’t enough
 Counters tell you there is a problem, but not why.
 Counters summarize performance by dropping high cardinality attributes:
 - IP addresses
 - URLs
 - Memcache keys
 Need to be able to efficiently disaggregate counters by attributes in order to understand root cause of performance problems.
 How do you get this data when there are millions of transactions per second?
 [Graph: Why the spike in traffic? (100Gbit link carrying 14,000,000 packets/second)]
sFlow also exports random samples
 Random sampling is lightweight
 - critical path roughly cost of maintaining one counter: if(--skip == 0) sample();
 - sampling is easy to distribute among modules, threads, processes without any synchronization
 - minimal resources required to capture attributes of sampled transactions
 Easily identify top keys, connections, clients, servers, URLs etc.
 Unbiased results with known accuracy
 [Graph: Break out traffic by client, server and port (based on samples from 100Gbit link carrying 14,000,000 packets/second)]
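
The countdown-based sampling described on this slide can be sketched in Python as follows. The class and function names, the uniform choice of random gap, and the simulated traffic mix are assumptions made for illustration, not taken from any sFlow agent. The error figure uses the bound commonly quoted for packet sampling, %error ≤ 196 × sqrt(1/c) at 95% confidence, where c is the number of samples in the class of interest.

```python
import math
import random
from collections import Counter


class Sampler:
    """Picks on average 1 in `rate` transactions using a countdown."""

    def __init__(self, rate, seed=None):
        self.rate = rate
        self.rng = random.Random(seed)
        self.skip = self._next_skip()
        self.samples = Counter()

    def _next_skip(self):
        # Random gap with mean `rate`, so sampling cannot lock onto
        # periodic traffic patterns.
        return self.rng.randint(1, 2 * self.rate - 1)

    def observe(self, key):
        # Critical path: one decrement and one comparison per transaction.
        self.skip -= 1
        if self.skip == 0:
            self.samples[key] += 1
            self.skip = self._next_skip()


def estimate(sampler, key):
    """Scale the sample count by the sampling rate; also return % error."""
    c = sampler.samples[key]
    error_pct = 196 * math.sqrt(1 / c) if c else float("inf")
    return c * sampler.rate, error_pct


# Simulated traffic: roughly 30% of one million transactions hit key "hot".
sampler = Sampler(rate=100, seed=1)
traffic = random.Random(2)
for _ in range(1_000_000):
    sampler.observe("hot" if traffic.random() < 0.3 else "cold")

total, err = estimate(sampler, "hot")
print(total, round(err, 1))
print(sampler.samples.most_common(2))  # top keys by sample count
```

Each thread or process can keep its own `skip` counter and its own samples, which is why no synchronization is needed. The collector simply merges the samples it receives.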
Big Picture: Comprehensive, multi-layer visibility
 [Diagram: monitored layers, top to bottom]
 Applications: Apache/PHP, Tomcat/Java, Memcached
 Virtual Servers
 Virtual Network
 Servers
 Network
 Embedded monitoring of all switches, all servers, all applications, all the time
 Consistent measurements shared between multiple management tools
Tagged Uses sFlow!

   Apache via mod_sflow
   Java via sflowagent (-javaagent:sflowagent.jar)
   Memcached via source patches
   Host sFlow
sFlow + Ganglia



                  make a much better graphic
                  integration with Ganglia
                  deployed via Puppet
HTTP

 response codes (200, 300, 400, etc.)
 method (GET, HEAD, etc.)
 URL duration, frequency, bytes
View the stack at once
Not just GETs!
Apache URLs by Duration
Slice URLs how YOU want
Memcached

   Hits/Misses
   Operations (GET, SET, etc.)
   Traffic bytes, duration, operations
   Top Keys
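
As a small illustration of how the hit and miss counters are used, the Python sketch below computes the cache hit ratio over one polling interval from two readings of `get_hits` and `get_misses`. The readings are invented, and the function ignores counter wraps and memcached restarts:

```python
def hit_ratio(prev, curr):
    """Fraction of GETs served from cache between two counter readings."""
    hits = curr["get_hits"] - prev["get_hits"]
    misses = curr["get_misses"] - prev["get_misses"]
    total = hits + misses
    return hits / total if total else None


# Hypothetical readings, one polling interval apart.
prev = {"get_hits": 9_000, "get_misses": 1_000}
curr = {"get_hits": 9_400, "get_misses": 1_600}
print(hit_ratio(prev, curr))  # 0.4
```

Working on the difference between readings matters here: the lifetime ratio in this made-up data is still about 85% (9,400 of 11,000), which would hide the drop to 40% during the latest interval.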
Cold Cache Ramp Up
Protect the DB
Protect the DB
Sees the keys
Java (e.g. Tomcat)

 Heap/Non-Heap Utilization
 File descriptors
 GC & compilation timings/counts
 Classes Loaded/Unloaded
 Threads
Reap the Heap
Not just heap!
TCP +
sflowtool

 Open Source
 Command Line
 Understands tcpdump!
 Output: delimited text
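
Because the output is delimited text, it is easy to post-process with a short script. The Python sketch below tallies the most frequent values in one column of comma-delimited rows. The sample rows and the field positions are made up for illustration and do not reproduce sflowtool's actual column layout, which should be checked against its documentation.

```python
import csv
import io
from collections import Counter


def top_values(lines, record_type, field_index, n=3):
    """Count values at `field_index` in rows whose first field matches."""
    counts = Counter()
    for row in csv.reader(lines):
        if row and row[0] == record_type and len(row) > field_index:
            counts[row[field_index]] += 1
    return counts.most_common(n)


# Made-up rows: record type, source address, destination address.
sample = io.StringIO(
    "FLOW,10.0.0.1,10.0.0.9\n"
    "FLOW,10.0.0.2,10.0.0.9\n"
    "CNTR,10.0.0.1,0\n"
    "FLOW,10.0.0.1,10.0.0.8\n"
)
result = top_values(sample, "FLOW", 1)
print(result)  # [('10.0.0.1', 2), ('10.0.0.2', 1)]
```

In practice `sys.stdin` would take the place of the `io.StringIO` sample, with the tool's output piped into the script.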
Thanks!
 The Ganglia Team
 The SiteOps team @ Tagged & Tagged Inc.
 Bay Area LSPE Meetup - actually meeting tonight!
 TubeMogul
 PayPal
 O’Reilly
 http://clipart-for-free.blogspot.com/2008/06/free-truck-clipart.html
Questions?



 We are also doing office hours today @ 2:30 in the exhibit hall!

Editor's Notes

  1. * sounds funny to say “standard”
     * repeatable & consistent, transport and approach, apply to each instrumented protocol
  2. * met Peter giving a talk on Graphite, needed metrics for graphs
     * “Who here has networking gear by a vendor not named Cisco? Who here has used sFlow on their switches or hosts?”
     * my history with sFlow
     * super easy to integrate in Graphite talk with a little Perl
     * Tagged fanatical about monitoring, Peter wanted to validate approach so it is a good match (he is also fanatical about sFlow), so much so, I asked him if he’d thought about supporting node.js, had it working the next day!
     * You can ask me about the 404s in the hall or at office hours
  3. * been relying on OPEN SOURCE Host sFlow almost 1 year
     1) automatic visibility into applications
     2) network more efficient (PPS)
     * Welcome Peter
  4. * one of the authors of the sFlow standard
     * InMon develops performance management software
     * contributes to sFlow related projects
     * introduction to sFlow, put context behind examples Dave will present
  5. * diagram is typical of scale-out, multi-tier “cloud” architectures like Tagged’s
     * server pools ensure high availability and allow capacity to be adjusted to demand
     * size and dynamic nature of cloud architecture makes it a challenge to monitor
     * unusual to show network: often ignored, complexity hidden behind APIs
     * scale-out application performance tightly coupled to network
     * network shared between tiers, can propagate failures
     * a basic problem is lack of network visibility - request timeout, congestion vs. failure
     * network visibility reveals dependencies and congested resources
  12. * switch vendors embed instrumentation in their hardware
      * sFlow standard was developed by switch vendors to ensure interoperability
      * today, most vendors support sFlow
      * network visibility is a matter of selecting devices with sFlow support
      * recently, sFlow standard extended to include server and application performance
  13. * Host sFlow is an open source agent that exports server metrics
      * core of an ecosystem of related open source projects
      * these integrate monitoring into an increasing range of applications
      * we have now seen the current scope of sFlow implementations
      * let’s take a look at the types of measurement that sFlow provides
  14. * don’t worry - I don’t expect you to read this slide
      * counters are a staple of network and system management
      * counters are maintained by switch hardware, operating systems and applications
      * counters aren’t useful if they are stranded within each device
      * sFlow provides an efficient way to collect counters from large numbers of devices
      * makes performance information actionable
  15. * sFlow is a simple protocol
      * each block of counters from the previous slide is efficiently encoded using XDR
      * sent as UDP datagrams to an sFlow analyzer
      * each datagram can carry hundreds of individual metrics
      * minimal configuration: IP address of the collector and a polling interval
      * cloud environments: hosts constantly added, removed, started and stopped
      * challenge: maintaining lists of devices to poll for statistics
      * sFlow: each device automatically sends metrics as soon as it starts up
      * devices immediately detected and continuously monitored
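To make the "XDR over UDP" point concrete, here is a minimal sketch of the push model. It assumes a made-up toy payload (hostname, counter count, counters); it is not the real sFlow version 5 datagram layout, and `encode_counters` and `send_counters` are hypothetical names. The only parts taken as given are XDR's 4-byte big-endian integers with padded strings, and 6343 as the usual sFlow collector port.

```python
import socket
import struct

SFLOW_PORT = 6343  # usual sFlow collector port


def xdr_uint(value: int) -> bytes:
    """Encode an unsigned 32-bit integer the XDR way: 4 bytes, big-endian."""
    return struct.pack(">I", value)


def xdr_string(text: str) -> bytes:
    """Encode a string as XDR: length prefix, bytes, zero padding to 4 bytes."""
    raw = text.encode("utf-8")
    padding = (4 - len(raw) % 4) % 4
    return xdr_uint(len(raw)) + raw + b"\x00" * padding


def encode_counters(hostname: str, counters: list[int]) -> bytes:
    """Toy payload (NOT the sFlow v5 layout): hostname, count, counters."""
    body = xdr_string(hostname) + xdr_uint(len(counters))
    return body + b"".join(xdr_uint(c) for c in counters)


def send_counters(payload: bytes, collector_ip: str, port: int = SFLOW_PORT) -> None:
    """Fire-and-forget: one UDP datagram, no connection, no reply expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (collector_ip, port))


# Hypothetical host name and counter values, for illustration only.
payload = encode_counters("web01", [120, 4500, 7])
print(len(payload), payload.hex())
```

The agent only needs a collector address and a timer to call something like `send_counters` every polling interval, which is why a newly started host shows up at the collector without anyone adding it to a poll list.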
  16. * example: 50 metrics, every 30 seconds from 100,000 servers
      * roughly 3,300 sFlow datagrams per second (one per server per interval)
      * easily decoded and processed
      * storing and querying takes a little more effort
      * easily managed by a single server
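The arithmetic behind this example can be checked in a few lines. The sketch assumes each server packs all 50 metrics into a single datagram per polling interval, which fits the earlier note that one datagram can carry hundreds of metrics.

```python
servers = 100_000
metrics_per_server = 50
polling_interval_s = 30

# Assumption: one datagram per server per polling interval.
datagrams_per_second = servers / polling_interval_s
metrics_per_second = servers * metrics_per_server / polling_interval_s

print(f"{datagrams_per_second:,.0f} datagrams/s")
print(f"{metrics_per_second:,.0f} metrics/s")
```

The datagram rate is what the collector has to decode; the metric rate is what it has to store and query, which is where the extra effort goes.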
  17. * metrics are extremely useful for characterizing performance
      * operations dashboards covered with trend charts
      * trend chart summarizes vast amounts of information
      * example: chart for link carrying over 14 million packets per second
      * nearly 1 billion packets are summarized in each data point shown on the graph
      * detect a spike - where do you go next?
  18. * random sampling is an integral part of sFlow monitoring
      * overhead: maintaining one additional counter
      * samples capture details of transaction attributes, data volumes, response times and status codes
      * example: 3 network connections showing client, server, protocol and traffic
      * understand the increase in traffic and plan actions
      * sampling applies equally to HTTP requests, Memcache operations etc.
      * Dave will be presenting additional examples later in this talk
      * stepping back: sFlow allows pervasive instrumentation of the data center
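The "one additional counter" idea can be sketched as a countdown: decrement a skip counter on every event, take a sample when it reaches zero, then draw a new random skip. This is an illustration under assumptions, not code from any sFlow agent; the `SkipSampler` class and the uniform skip distribution are choices made for the example.

```python
import random


class SkipSampler:
    """Pick on average 1 in `rate` events using a single countdown counter."""

    def __init__(self, rate: int, rng: random.Random) -> None:
        self.rate = rate
        self.rng = rng
        self.skip = self._next_skip()

    def _next_skip(self) -> int:
        # Uniform on 1..2*rate-1 has mean `rate`; an illustrative choice.
        return self.rng.randint(1, 2 * self.rate - 1)

    def observe(self) -> bool:
        """Call once per event; returns True when this event is sampled."""
        self.skip -= 1
        if self.skip == 0:
            self.skip = self._next_skip()
            return True
        return False


rng = random.Random(42)
sampler = SkipSampler(rate=100, rng=rng)
total_events = 100_000
samples = sum(sampler.observe() for _ in range(total_events))
estimate = samples * sampler.rate  # scale the sample count back up
print(samples, estimate)
```

Only the sampled events pay the cost of recording attributes such as client, server or URI; every other event costs one decrement and one comparison, and totals are estimated by multiplying by the sampling rate.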
  19. * embedding instrumentation reduces operational complexity
      * deploys with services
      * ensures all resources continuously monitored
      * integrated view of applications and server/network resources they depend on
      * e.g. drop in Memcache throughput: misconfigured client, swapping, packet loss
      * standardizing metrics breaks the dependency between agents and tools
      * consistent reporting across analysis tools
      * consistent metrics across agents: e.g. web statistics from Apache, Tomcat or NGINX
      * Dave will describe Tagged’s experiences with deploying and using sFlow
  20. * Cisco switches/routers with NetFlow
      * Some SNMP done by polling, but polling for metrics sucks
  21. * Questions about this diagram or your own diagram? Find me afterwards, happy to go over it with you.
      * integration with later versions of Ganglia, scale Ganglia normally
      * deployed via Puppet, all ERB templates fed by CMDB
      * can send data to as many places as you want: to a collector and then into a message bus like Kafka, a db, whatever
      * UDP joke
  22. * Our first example is with HTTP.
      * HTTP can be from anything that speaks HTTP: NGINX, Node.js, Tomcat, Apache, same metrics.
      * tool that consumes your HTTP metrics: write it once; standard, repeatable information flow even if you switch from Apache to NGINX.
      * No text log parsing, all streamed to you in real time
  23. * entire stack: CPU, network, application
      * I/O wait on CPU for storage, fronted by CDN.
      * can see network traffic and HTTP metrics
      * some 404s, banned content?
      * static assets tier, ALL GET requests, no POSTs.
  24. * Lots of GETs and POSTs
      * Easy to make rollup graph in Ganglia, few lines JSON
      * Updated every 15 seconds, comprehend, refresh
  25. * individual URI performance
      * top 15 URI paths by time
      * UPLOAD longest, makes sense, upload pictures
      * Pets, most popular game
      * Work with devs, faster pages, more revenue, happier pointy-haired people
      * DevOps collaboration
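A "top URI paths by time" view boils down to summing sampled request durations per URI and sorting. The sketch below assumes the samples have already been reduced to `(uri, duration in microseconds)` pairs; the function name and the sample values are invented for illustration.

```python
from collections import defaultdict


def top_uris_by_time(samples, limit=15):
    """Sum sampled durations per URI and return the `limit` largest."""
    totals = defaultdict(int)
    for uri, duration_us in samples:
        totals[uri] += duration_us
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:limit]


# Hypothetical sampled requests: (URI path, duration in microseconds).
samples = [
    ("/upload", 900_000),
    ("/pets", 120_000),
    ("/pets", 110_000),
    ("/home", 40_000),
]
ranking = top_uris_by_time(samples)
for uri, total_us in ranking:
    print(f"{uri:10s} {total_us / 1_000_000:.3f} s")
```

With a uniform sampling rate the ranking is unaffected by sampling; multiplying the totals by the rate gives an estimate of absolute time spent per URI.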
  26. * Not just duration, ops/sec, bytes/sec
      * graphs on bottom updated every minute
      * can see prevalence of URIs in graph, can even click in this tool
  27. * previously only STATS command
      * STATS SIZES locks entire cache
      * non-invasive granular instrumentation used to require Gear6 Advanced Reporter
      * Some memcache patches: hits, misses, etc. get streamed to us, also individual keys
  28. * cold cache ramp, top to bottom view of instances
      * could have CPU, file descriptors, whatever
      * Cache rapidly achieves steady state
      * GET hits rapidly overwhelm GET misses
  29. Numbers in legend
      * 6 instances
      * throughput on startup
      * at steady state, orders of magnitude more reads
      * saving database
      * what memcache adds to your db
  30. * Not just metrics from STATS command, actual data
      * number of ops/sec on top 15 hottest keys
      * Just like HTTP, durations, throughput as well
      * look at MISSES, try to explain
  31. * Anyone monitor Java apps?
      * Tomcat example, could be any Java process
      * Used to be jstat -gc or poller
      * Drop in a JAR, restart, visible in Ganglia or wherever
      * Elasticsearch and Logstash (JRuby)
      * drop-in visibility
  32. * used to get Nagios alerts every few days
      * heap builds over days
  33. * Not just heap, easy to make rollup of any metrics with some JSON
  34. * “Has anyone here used tcpdump?”
      * “How many people know Perl, Ruby, Python?”
      * You have all the tools you need to utilize sflowtool
  35. * take raw data from network, do what you want
      * familiar if you have used tcpdump
      * reads from network, presents in human- or machine-readable form
      * get data you see in Ganglia charts, plus URLs, memcache keys, etc.
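As a rough sketch of what "do what you want" can look like, the script below groups sflowtool-style `key value` lines into samples and counts one field. It assumes the default text output brackets each sample with `startSample` and `endSample` lines; the field names in `SAMPLE_OUTPUT` (`http_uri`, `http_status`) are illustrative assumptions, so check them against the output of your own sflowtool build.

```python
from collections import Counter


def iter_samples(lines):
    """Yield one dict per startSample/endSample group of 'key value' lines."""
    current = None
    for line in lines:
        key, _, value = line.strip().partition(" ")
        if key == "startSample":
            current = {}
        elif key == "endSample":
            if current is not None:
                yield current
            current = None
        elif current is not None and key:
            current[key] = value.strip()


def count_field(lines, field):
    """Count how often each value of `field` appears across samples."""
    return Counter(s[field] for s in iter_samples(lines) if field in s)


# Illustrative text only; field names are assumptions, not captured output.
SAMPLE_OUTPUT = """\
startDatagram =================================
agent 10.0.0.1
startSample ----------------------
http_uri /upload
http_status 200
endSample   ----------------------
startSample ----------------------
http_uri /pets
http_status 200
endSample   ----------------------
startSample ----------------------
http_uri /pets
http_status 404
endSample   ----------------------
endDatagram   =================================
"""

counts = count_field(SAMPLE_OUTPUT.splitlines(), "http_uri")
print(counts.most_common(15))
```

In live use the same functions could read `sys.stdin` with sflowtool piped into the script, much as one would pipe tcpdump output into a one-off Perl or Ruby filter.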
  36. * drop data into MongoDB like you can see on my GitHub account, or a CEP like Esper, or write an input plugin for Logstash: up to you
      * aggregate and send to StatsD or Graphite? No problem
      * an example of the building blocks good tools give you, so you can do what YOU imagine
      * would encourage you to join the community and take advantage
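Forwarding aggregated values to Graphite or StatsD mostly means formatting short text lines. The sketch below builds a Graphite plaintext line (`path value timestamp`) and a StatsD counter line (`bucket:value|c`, with an optional `@rate` suffix for sampled data). The metric names are hypothetical, and actually sending the lines over the network is left out.

```python
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric for Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def statsd_counter(bucket, value, sample_rate=None):
    """Format a StatsD counter; `sample_rate` marks the value as sampled."""
    line = f"{bucket}:{value}|c"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line


# Hypothetical metric names, for illustration only.
print(graphite_line("sflow.http.pets.requests", 2, 1700000000), end="")
print(statsd_counter("sflow.http.requests", 1, sample_rate=0.01))
```

The StatsD sample-rate suffix maps naturally onto sFlow's 1-in-N sampling: a sample taken at 1-in-100 can be sent with `@0.01` so the receiver scales it back up.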