The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and Applications
Velocity 2012 talk on the sFlow standard, with examples based on Tagged.com's experience deploying sFlow to monitor web, Memcache and Java workloads.

Slide Notes

  • * sounds funny to say “standard”
    * repeatable & consistent transport and approach, applied to each instrumented protocol
  • * met Peter giving a talk on Graphite, needed metrics for graphs
    * “Who here has networking gear by a vendor not named Cisco? Who here has used sFlow on their switches or hosts?”
    * my history with sFlow
    * super easy to integrate into Graphite with a little Perl
    * Tagged fanatical about monitoring, Peter wanted to validate the approach, so it is a good match (he is also fanatical about sFlow); so much so, I asked him if he’d thought about supporting Node.js, had it working the next day!
    * you can ask me about the 404s in the hall or at office hours
  • * been relying on the open source Host sFlow agent for almost a year
    1) automatic visibility into applications
    2) network more efficient (PPS)
    * welcome Peter
  • * one of the authors of the sFlow standard
    * InMon develops performance management software
    * contributes to sFlow related projects
    * an introduction to sFlow to put context behind the examples Dave will present
  • * diagram is typical of scale-out, multi-tier “cloud” architectures like Tagged’s
    * server pools ensure high availability and allow capacity to be adjusted to demand
    * size and dynamic nature of cloud architecture makes it a challenge to monitor
    * unusual to show network: often ignored, complexity hidden behind APIs
    * scale-out application performance tightly coupled to network
    * network shared between tiers, can propagate failures
    * a basic problem is lack of network visibility - request timeout, congestion vs. failure
    * network visibility reveals dependencies and congested resources
  • * switch vendors embed instrumentation in their hardware
    * sFlow standard was developed by switch vendors to ensure interoperability
    * today, most vendors support sFlow
    * network visibility is a matter of selecting devices with sFlow support
    * recently, sFlow standard extended to include server and application performance
  • * Host sFlow is an open source agent that exports server metrics
    * core of an ecosystem of related open source projects
    * integrates monitoring into an increasing range of applications
    * we’ve seen the current scope of sFlow implementations
    * let’s take a look at the types of measurement that sFlow provides
  • * don’t worry - I don’t expect you to read this slide
    * counters are a staple of network and system management
    * counters are maintained by switch hardware, operating systems and applications
    * counters aren’t useful if they are stranded within each device
    * sFlow provides an efficient way to collect counters from large numbers of devices
    * makes performance information actionable
  • * sFlow is a simple protocol
    * each of the blocks of counters from the previous slide is efficiently encoded using XDR
    * sent as UDP datagrams to an sFlow analyzer
    * each datagram can carry hundreds of individual metrics
    * minimal configuration: IP address of the collector and a polling interval
    * cloud environments: hosts constantly added, removed, started and stopped
    * challenge: maintaining lists of devices to poll for statistics
    * sFlow: each device automatically sends metrics as soon as it starts up
    * devices immediately detected and continuously monitored (see the collector sketch below)
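To make the push model concrete, here is a minimal sketch (in Python, not from the talk) of the receiving side, assuming the standard sFlow UDP port 6343. It decodes only the fixed sFlow v5 datagram header; a real analyzer would go on to walk the (tag, length, value) encoded samples that follow.

    # Minimal sFlow collector sketch: decode the fixed sFlow v5 datagram header.
    import socket
    import struct

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 6343))                 # agents push datagrams here

    while True:
        data, (src, _) = sock.recvfrom(65535)
        # XDR is big-endian and quad-aligned, so struct's ">" formats line up
        version, addr_type = struct.unpack_from(">II", data, 0)
        if version != 5 or addr_type != 1:       # 1 = IPv4 agent address
            continue                             # sketch ignores IPv6 and older versions
        agent_ip = socket.inet_ntoa(data[8:12])
        sub_agent, seq_no, uptime_ms, n_samples = struct.unpack_from(">IIII", data, 12)
        print(f"{src}: agent={agent_ip} seq={seq_no} samples={n_samples}")
        # a full collector would now iterate over n_samples (tag, length, value) records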
  • * example: 50 metrics, every 30 seconds, from 100,000 servers
    * roughly three thousand sFlow datagrams per second
    * easily decoded and processed
    * storing and querying takes a little more effort
    * easily managed by a single server
  • * metrics are extremely useful for characterizing performance
    * operations dashboards covered with trend charts
    * a trend chart summarizes vast amounts of information
    * example: chart for a link carrying over 14 million packets per second
    * nearly 1 billion packets are summarized in each data point shown on the graph
    * detect a spike - where do you go next?
  • * random sampling is an integral part of sFlow monitoring
    * overhead of maintaining one additional counter (see the sampler sketch below)
    * details of transaction attributes, data volumes, response times and status codes
    * example: 3 network connections showing client, server, protocol and traffic
    * understand the increase in traffic and plan actions
    * sampling applies equally to HTTP requests, Memcache operations etc.
    * Dave will be presenting additional examples later in this talk
    * stepping back: sFlow allows pervasive instrumentation of the data center
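A minimal sketch of the skip-count technique behind that "one additional counter" claim, in Python; Sampler and record are illustrative names, not part of any sFlow agent's API.

    import random

    def record(txn):
        # a real agent would capture attributes (URL, key, status, bytes)
        # and export them in an sFlow datagram
        print("sampled:", txn)

    class Sampler:
        def __init__(self, rate):
            self.rate = rate                     # sample 1-in-rate transactions on average
            self.skip = self._next_skip()

        def _next_skip(self):
            # random skips (mean = rate) keep results unbiased even for
            # periodic traffic; a fixed stride could alias with the workload
            return random.randint(1, 2 * self.rate - 1)

        def sample(self, txn):
            self.skip -= 1                       # the only per-transaction cost
            if self.skip == 0:
                self.skip = self._next_skip()
                record(txn)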
  • * embedding instrumentation reduces operational complexity
    * deploys with services
    * ensures all resources are continuously monitored
    * integrated view of applications and the server/network resources they depend on
    * e.g. drop in Memcache throughput: misconfigured client, swapping, packet loss
    * standardizing metrics breaks the dependency between agents and tools
    * consistent reporting across analysis tools
    * consistent metrics across agents: e.g. web statistics from Apache, Tomcat or NGINX
    * Dave will describe Tagged’s experiences with deploying and using sFlow
  • * Cisco switches/routers with NetFlow
    * some SNMP done by polling, but polling for metrics sucks
  • * questions about this diagram or your own diagram: find me after, happy to go over it with you
    * integration with later versions of Ganglia, scale Ganglia normally
    * deployed via Puppet, all ERB templates fed by the CMDB
    * can send data to as many places as you want: a collector, then into a message bus like Kafka, a db, whatever
    * UDP joke
  • * our first example is HTTP
    * HTTP can come from anything that speaks HTTP: Nginx, Node.js, Tomcat, Apache - same metrics
    * write the tool that consumes your HTTP metrics once; standard, repeatable information flow even if you switch from Apache to NGINX
    * no text log parsing, all streamed to you in real time
  • * entire stack: CPU, network, application
    * I/O wait on CPU for storage, fronted by CDN
    * can see network traffic and HTTP metrics
    * some 404s, banned content?
    * static assets tier: ALL GET requests, no POSTs
  • * lots of GETs and POSTs
    * easy to make a rollup graph in Ganglia, a few lines of JSON
    * updated every 15 seconds, comprehend, refresh
  • * individual URI performance
    * top 15 URI paths by time
    * UPLOAD longest, makes sense, upload pictures
    * Pets, most popular game
    * work with devs: faster pages, more revenue, happier pointy-haired people
    * DevOps collaboration
  • * not just duration: ops/sec, bytes/sec
    * graphs on bottom updated every minute
    * can see prevalence of URIs in graph, can even click in this tool
  • * previously only the STATS command
    * STATS SIZES locks the entire cache
    * non-invasive granular instrumentation used to require Gear6 Advanced Reporter
    * some memcache patches: hits, misses, etc. get streamed to us, also individual keys
  • * cold cache ramp, top to bottom view of instances
    * could have CPU, file descriptors, whatever
    * cache rapidly achieves steady state
    * GET hits rapidly overwhelm GET misses
  • * numbers in legend
    * 6 instances
    * throughput on startup
    * after steady state, orders of magnitude more reads
    * saving the database
    * what memcache adds to your db
  • * not just metrics from the STATS command, actual data
    * # ops/sec on top 15 hottest keys
    * just like HTTP: durations, throughput as well
    * look at MISSES, try to explain
  • * anyone monitor Java apps?
    * Tomcat example, could be any Java process
    * used to be jstat -gc or a poller
    * drop in the JAR, restart, visible in Ganglia or wherever
    * Elasticsearch and Logstash (JRuby)
    * drop-in visibility
  • * used to get Nagios alerts every few days
    * heap builds over days
  • * not just heap, easy to make a rollup of any metrics with some JSON
  • * “Has anyone here used tcpdump?”
    * “How many people know Perl, Ruby, Python?”
    * you have all the tools you need to utilize sflowtool
  • * take raw data from the network, do what you want
    * familiar if you’ve used tcpdump
    * reads from the network, presents it in human or computer consumable form
    * get the data you see in Ganglia charts, plus URLs, memcache keys, etc.
  • * drop data into MongoDB like you can see on my GitHub account, or a CEP like Esper, or write an input plugin for Logstash - up to you
    * aggregate and send to statsd or Graphite? No problem (see the pipeline sketch below)
    * the example building block good tools give you to allow you to do what YOU imagine
    * would encourage you to join the community and take advantage
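As one example of that building block, here is a sketch of an sflowtool-to-Graphite pipeline in Python. It assumes sflowtool's default name/value line output and Graphite's plaintext protocol on port 2003; the hostname and the choice of counters are illustrative.

    # usage: sflowtool | python sflow2graphite.py
    import socket
    import sys
    import time

    CARBON = ("graphite.example.com", 2003)      # assumed carbon endpoint
    WANTED = {"ifInOctets", "ifOutOctets"}       # counters to forward

    def main():
        carbon = socket.create_connection(CARBON)
        agent = "unknown"
        for line in sys.stdin:
            parts = line.split()
            if len(parts) != 2:
                continue                         # not a simple name/value pair
            name, value = parts
            if name == "agent":
                agent = value.replace(".", "_")  # remember which device is reporting
            elif name in WANTED:
                # Graphite plaintext: "<metric path> <value> <timestamp>\n"
                msg = f"sflow.{agent}.{name} {value} {int(time.time())}\n"
                carbon.sendall(msg.encode())

    if __name__ == "__main__":
        main()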

Presentation Transcript

  • The sFlow standard: scalable, unified monitoring of networks, systems and applications
    Dave Mangot (Tagged Inc.) tech.mangot.com
    Peter Phaal (InMon Corp.) blog.sflow.com
  • Tagged Inc.
    Social networking
    5 billion page views a month
    4 TB of main memcached
    Heavy use of Apache/PHP and Java
    Ganglia critical to business function
    Puppet for configuration management
  • InMon Corp.
    Performance management software developer
    Originators of the sFlow standard
    Founding member of sFlow.org
    Initial implementation and contributor to Host sFlow and related projects
    - Memcached sFlow patch
    - Apache mod_sflow
    - NGINX sFlow module
    - sFlow Java Agent
    Contributed sFlow support to Ganglia project
  • Challenge: Monitoring large, scale-out, multi-tiered sites
    [Diagram: load balancers, web servers, Memcache servers, application servers and a database server, connected by the network]
    Large number of servers in each pool
    Servers constantly being added/removed
    Network performance is critical
    - scale-out applications dependent on network performance
    - potential for propagating failures between tiers
  • sFlow is the industry standard for monitoring switches
  • Open source sFlow agents for hosts and applications
  • sFlow exports standard counters
    Network (maintained by hardware in network devices)
    - MIB-2 ifTable: ifInOctets, ifInUcastPkts, ifInMulticastPkts, ifInBroadcastPkts, ifInDiscards, ifInErrors, ifInUnknownProtos, ifOutOctets, ifOutUcastPkts, ifOutMulticastPkts, ifOutBroadcastPkts, ifOutDiscards, ifOutErrors
    Host (maintained by operating system kernel)
    - CPU: load_one, load_five, load_fifteen, proc_run, proc_total, cpu_num, cpu_speed, uptime, cpu_user, cpu_nice, cpu_system, cpu_idle, cpu_wio, cpu_intr, cpu_sintr, interrupts, contexts
    - Memory: mem_total, mem_free, mem_shared, mem_buffers, mem_cached, swap_total, swap_free, page_in, page_out, swap_in, swap_out
    - Disk IO: disk_total, disk_free, part_max_used, reads, bytes_read, read_time, writes, bytes_written, write_time
    - Network IO: bytes_in, packets_in, errs_in, drops_in, bytes_out, packets_out, errs_out, drops_out
    Application (maintained by application)
    - HTTP: method_option_count, method_get_count, method_head_count, method_post_count, method_put_count, method_delete_count, method_trace_count, method_connect_count, method_other_count, status_1xx_count, status_2xx_count, status_3xx_count, status_4xx_count, status_5xx_count, status_other_count
    - Memcache: cmd_set, cmd_touch, cmd_flush, get_hits, get_misses, delete_hits, delete_misses, incr_hits, incr_misses, decr_hits, decr_misses, cas_hits, cas_misses, cas_badval, auth_cmds, auth_errors, threads, conn_yields, listen_disabled_num, curr_connections, rejected_connections, total_connections, connection_structures, evictions, reclaimed, curr_items, total_items, bytes_read, bytes_written, bytes, limit_maxbytes
  • sFlow’s scalable “push” protocol
    Simple
    - standard structures
    - densely packed blocks of counters
    - extensible (tag, length, value)
    - RFC 1832: XDR encoded (big endian, quad-aligned, binary) - simple to encode/decode
    - unicast UDP transport
    Minimal configuration
    - collector address
    - polling interval
    Cloud friendly
    - flat, two-tier architecture: many embedded agents → central “smart” collector
    - sFlow agents automatically start sending metrics on startup, automatically discovered
    - eliminates complexity of maintaining polling daemons (and their associated configurations)
    (see the encoding sketch below)
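A sketch of the (tag, length, value) convention from the agent side, in Python. The generic interface counters record is structure 1 in the sFlow v5 spec; the payload below packs only the first few fields and is deliberately incomplete, so it is an illustration of the encoding, not a wire-ready record.

    import struct

    def xdr_record(tag, payload):
        # wrap an XDR-encoded payload as a (tag, length, value) record;
        # unknown tags can simply be skipped by length, which is what
        # makes the format extensible
        return struct.pack(">II", tag, len(payload)) + payload

    # first fields of if_counters (structure tag 1): big-endian, quad-aligned
    payload = struct.pack(">IIQ", 2, 6, 10_000_000_000)   # ifIndex, ifType, ifSpeed
    payload += struct.pack(">II", 1, 3)                   # ifDirection, ifStatus
    payload += struct.pack(">QI", 123456789, 98765)       # ifInOctets, ifInUcastPkts
    # ...remaining counter fields omitted in this sketch
    record = xdr_record(1, payload)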
  • Example Collect 50 metrics per server Every 30 seconds From 100,000 servers 100,000 / 30 ≈ 3,333 sFlow datagrams per second
  • Example Collect 50 metrics per server Every 30 seconds From 100,000 servers 100,000 / 30 ≈ 3,333 sFlow datagrams per second Single sFlow analyzer can monitor entire data center!
  • Counters aren’t enough
    Counters tell you there is a problem, but not why.
    Counters summarize performance by dropping high cardinality attributes:
    - IP addresses
    - URLs
    - Memcache keys
    Need to be able to efficiently disaggregate counters by attributes in order to understand the root cause of performance problems.
    How do you get this data when there are millions of transactions per second?
    [Chart: Why the spike in traffic? 100Gbit link carrying 14,000,000 packets/second]
  • sFlow also exports random samples
    Random sampling is lightweight
    - critical path roughly the cost of maintaining one counter: if(--skip == 0) sample();
    - sampling is easy to distribute among modules, threads, processes without any synchronization
    - minimal resources required to capture attributes of sampled transactions
    Easily identify top keys, connections, clients, servers, URLs etc.
    Unbiased results with known accuracy (see the estimation sketch below)
    [Chart: Break out traffic by client, server and port (graph based on samples from 100Gbit link carrying 14,000,000 packets/second)]
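A sketch of how samples become unbiased totals with known accuracy, in Python. With 1-in-N sampling each sample stands in for roughly N transactions, and the 95% confidence bound used here, error% ≤ 196 × sqrt(1/c) for c samples of a class, is the standard result quoted for sFlow's random sampling.

    from collections import Counter
    from math import sqrt

    def estimate_top(samples, sampling_rate, n=10):
        # samples: sampled attribute values, e.g. URLs or memcache keys
        counts = Counter(samples)
        top = []
        for key, c in counts.most_common(n):
            estimate = c * sampling_rate         # scale samples up to a total
            error_pct = 196.0 * sqrt(1.0 / c)    # 95% confidence bound for this key
            top.append((key, estimate, error_pct))
        return top

    # e.g. 10,000 samples of one URL at 1-in-1000 sampling
    # -> estimated 10,000,000 requests, within about ±2%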
  • Big Picture: Comprehensive, multi-layer visibility
    [Diagram: applications (Apache/PHP, Memcached, Tomcat/Java), virtual servers, servers, virtual network, network]
    Embedded monitoring of all switches, all servers, all applications, all the time
    Consistent measurements shared between multiple management tools
  • Tagged Uses sFlow!
    Apache via mod_sflow
    Java via sflowagent (-javaagent:sflowagent.jar)
    Memcached via source patches
    Host sFlow
  • sFlow + Ganglia make a much better graphic
    integration with Ganglia
    deployed via Puppet
  • HTTP response codes (200, 300, 400, etc.) method (GET, HEAD, etc.) URL duration, frequency, bytes
  • View the stack at once
  • Not just GETs!
  • Apache URLs by Duration
  • Slice URLs how YOU want
  • Memcached Hits/Misses Operations (GET, SETS, etc.) Traffic bytes, duration, operations Top Keys
  • Cold Cache Ramp Up
  • Protect the DB
  • Protect the DB
  • Sees the keys
  • Java (e.g. Tomcat)
    Heap/Non-Heap Utilization
    File descriptors
    GC & compilation timings/counts
    Classes Loaded/Unloaded
    Threads
  • Reap the Heap
  • Not just heap!
  • TCP +
  • sflowtool Open Source Command Line Understands tcpdump! Output: delimited text
  • Thanks! The Ganglia Team The SiteOps team @ Tagged & Tagged Inc. Bay Area LSPE Meetup - actually meeting tonight! TubeMogul PayPal O’Reilly http://clipart-for-free.blogspot.com/2008/06/free-truck- clipart.html
  • Questions? We are also doing office hours today @ 2:30 in the exhibit hall!