All of Your Network Monitoring is (probably) Wrong
Monitorama 2016 talk about network monitoring covering topics like network device drivers, ethtool, and some interesting bugs/features.

For more information about monitoring and tuning the entire Linux network stack, see: blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/

1. All of Your Network Monitoring is (probably) Wrong -- joe damato, packagecloud.io
2. greetings
3. i’m joe. i like computers. i once had a blog called timetobleed.com. @joedamato
4. packagecloud.io @packagecloudio
5. follow along: blog.packagecloud.io
6. cognitive load
7. too much stuff
8. cognitive load: copy & paste configs
9. BTW: this is actually part of another talk I’m working on called Programmers should get paid more & work less
10. anw
11. cognitive load: ever copy/paste conf or tuning settings you didn’t understand?
12. (probably)
13. there’s too much damn code
14. similarly...
15. cognitive load: do you really understand every single graph you are generating?
16. (probably not)
17. there’s too much damn code
18. If there’s too much damn code to configure and tune
19. what makes you think you can actually monitor it?
20. spoiler: you can’t
21. (prob. doesn't matter, more on this later)
22. claim: the more complex the system is, the harder it is to monitor
23. NOTE: complexity != bad
24. from: https://www.flickr.com/photos/49128298@N04/24290342544/
25. NOTE
    • you want to:
    • download and play the cat game on your an Phone (cause small portable electronic devices)
    • while messaging your buds (cause ur lonely)
    • with in app purchases (cause you need more gold fish)
    • over a TLS encrypted connection (cause payments)
    • over a VPN (cause china)
    • while flying (cause boredom)
26. and that’s OK. it doesn’t mean complexity is bad
27. 2 complicated things that aren’t necessarily bad
28. from: https://flic.kr/p/aWXpWZ
29. from: https://flic.kr/p/56XWHr
30. so, like, you know one thing that’s p. complicated?
31. the Linux networking stack
32. all kiiiiiiiiiiiiiiiiiiiiiiiinda features
    • different NICs have different rx and tx queue size limits and defaults (see the sketch after this slide)
    • ethernet bonding
    • IRQ modulation, ntuple filtering, ….
    • RSS, RPS, RFS, aRFS
    • GRO, GSO, hw accelerated VLAN IDs, timestamping, ….
    • you are probably using at least 2 protocol stacks (IP and TCP/UDP)
    • all kiiiiiiiiiiiiiiinda tuning levers and knobs for everything from top to bottom
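(a quick way to see that first bullet for yourself; a minimal sketch, assuming your interface is named eth0 and your driver supports the query:

    $ # print the hardware maximums and the currently configured rx/tx ring sizes
    $ sudo ethtool -g eth0

drivers that don't implement the query just return an error, which is sort of the point of this talk.)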
33. all kiiiiiinda bugs
34. and
35. basically
36. literally
37. actually
38. really
39. no docs
40. https://www.redhat.com/archives/rhl-list/2007-September/msg03735.html
    > How can I find out the /proc/net info
    > eg: softnet_stat is for what purpose
    Much of this is only well-documented in the code. Here's an attempt at interpreting softnet_stat [no guarantee that it is correct; read the code!]:
41. total # of packets (not including netpoll) received by the interrupt handler. There might be some double counting going on [ … ] I think the intention was that these were originally on separate receive paths
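(the file being asked about is still there if you want to stare at the undocumented columns yourself:

    $ # one row per CPU, columns of hex counters, mostly explained only in the kernel source
    $ cat /proc/net/softnet_stat
)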
42. Full networking writeup: literally 90 pages, literally everything about linux networking, literally available here: http://bit.ly/linux-networking
43. it’s fine, as long as we are honest that it’s just reality
44. [random os] has a better/faster/leaner/whatever networking stack than linux
45. anw
46. complex
    • not necessarily inefficient
    • not necessarily bad
    • people expect a lot of complicated features
    • so there’s a lot of code needed to support all this random stuff you want to do
    • see also: cat game example
47. complex
    • the bad news…..
    • you are supposed to monitor this complicated code
    • and then you are supposed to look at some graphs
    • and then you are supposed to Know The Answer™
48. sounds difficult… but it gets better ;)
49. what if i told you…
50. a driver bug caused stats to output incorrectly in /proc/net/dev?
51. igb
    • driver stats updated via a timer every 2 secs
    • reading stats via /proc/net/dev produced stale stats
    • but not via ethtool (different code path; see the sketch after this slide)
    • fixed by forcing a stat update whenever stats are read
    • i saw this in production -- did you????
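(a sketch of how you might have spotted this yourself; assumes an igb interface named eth0:

    $ # byte/packet counters via the /proc code path (the one with the stale-stats bug)
    $ grep 'eth0:' /proc/net/dev
    $ # the same driver's counters via the ethtool code path
    $ sudo ethtool -S eth0 | grep -w rx_bytes

run both a few times in quick succession; with the bug, the /proc numbers lag up to 2 seconds behind.)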
52. only matters if you are monitoring your network stats more often than every 2 sec.
53. maybe you aren’t because you dont care (that’s fine and you are prob. right)
54. but if you do care…
55. your future status
    • you’d need to:
    • notice the problem in your graph
    • start reading your stats collecting code/plugin
    • realize the bug is not there
    • read your driver code
    • realize the bug is in the code path that /proc/net/dev hits
    • write a patch to fix it
    • rebuild the driver and deploy it everywhere
56. that’s a lot of work just to monitor bytes tx/rx. greetings!
57. things that dont exist
    • descending order of probability it doesn't exist:
    • free open source
    • the an singularity
    • calorie free chocolate covered bacon
    • etc
58. but joe, my devops are the literal strongest and they doesn't afraid of anything
59. i’ll just use the ethtool
60. ethtool
    • a command line tool
    • uses the ioctl system call to talk to network drivers (see the trace sketch after this slide)
    • not all drivers actually implement the interface
    • and the ones which do, generally, do it differently
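(you can watch that ioctl happen yourself; a minimal sketch, assuming strace is installed and your interface is eth0:

    $ # ethtool opens a socket and issues SIOCETHTOOL ioctls against it;
    $ # look for SIOCETHTOOL in the trace output
    $ sudo strace -e trace=ioctl ethtool -S eth0
)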
61. what if i told you…
62. ethtool
    • no standardized way of outputting driver stats
    • some drivers don’t even implement the interface
    • the ones which do use diff field names
63. c’mon, let’s compare!
    • ec2 vif driver
    • ixgbe driver
    • igb driver
64. ec2 vif driver
    $ sudo ethtool -S eth0
    NIC statistics:
         rx_gso_checksum_fixup: 0
    ethtool outputs 1 statistic
65. what even is rx_gso_checksum_fixup?
66. joe’s ixgbe driver on an Real Computer: ethtool outputs 377 statistics
67. NIC statistics:
    rx_packets: 9665600259  tx_packets: 12198470686
    rx_bytes: 6790400019470  tx_bytes: 2169046666156
    rx_pkts_nic: 11107310349  tx_pkts_nic: 12198470686
    rx_bytes_nic: 6929982126806  tx_bytes_nic: 2217848697965
    lsc_int: 1  tx_busy: 0  non_eop_descs: 1044042523
    rx_errors: 0  tx_errors: 0  rx_dropped: 1  tx_dropped: 0
    multicast: 7876979  broadcast: 2633  rx_no_buffer_count: 0  collisions: 0
    rx_over_errors: 0  rx_crc_errors: 0  rx_frame_errors: 0
    hw_rsc_aggregated: 6600573569  hw_rsc_flushed: 5158863479
    fdir_match: 175127  fdir_miss: 11098854004  fdir_overflow: 1
    rx_fifo_errors: 0  rx_missed_errors: 0  tx_aborted_errors: 0  tx_carrier_errors: 0
    tx_fifo_errors: 0  tx_heartbeat_errors: 0  tx_timeout_count: 0  tx_restart_queue: 0
    rx_long_length_errors: 0  rx_short_length_errors: 0
    tx_flow_control_xon: 0  rx_flow_control_xon: 0  tx_flow_control_xoff: 0  rx_flow_control_xoff: 0
    rx_csum_offload_errors: 0  alloc_rx_page_failed: 0  alloc_rx_buff_failed: 0  rx_no_dma_resources: 0
    os2bmc_rx_by_bmc: 0  os2bmc_tx_by_bmc: 0  os2bmc_tx_by_host: 0  os2bmc_rx_by_host: 0
    fcoe_bad_fccrc: 0  rx_fcoe_dropped: 0  rx_fcoe_packets: 0  rx_fcoe_dwords: 0
    fcoe_noddp: 0  fcoe_noddp_ext_buff: 0  tx_fcoe_packets: 0  tx_fcoe_dwords: 0
    tx_queue_0_packets: 650250933  tx_queue_0_bytes: 109734794973
    tx_queue_1_packets: 734133738  tx_queue_1_bytes: 123318917069
    tx_queue_2_packets: 772808083  tx_queue_2_bytes: 131183014063
    tx_queue_3_packets: 741428236  tx_queue_3_bytes: 125821603228
    tx_queue_4_packets: 692281561  tx_queue_4_bytes: 118278086880
    tx_queue_5_packets: 783438226  tx_queue_5_bytes: 133234307795
    tx_queue_6_packets: 719335931  tx_queue_6_bytes: 123662184314
    tx_queue_7_packets: 668577198  tx_queue_7_bytes: 114915688397
    tx_queue_8_packets: 711699909  tx_queue_8_bytes: 122460443627
    tx_queue_9_packets: 681741781  tx_queue_9_bytes: 118032356999
    tx_queue_10_packets: 585639061  tx_queue_10_bytes: 98009207733
    tx_queue_11_packets: 640487443  tx_queue_11_bytes: 107781535416
    tx_queue_12_packets: 706304786  tx_queue_12_bytes: 118963058912
    tx_queue_13_packets: 716825472  tx_queue_13_bytes: 121032769231
    tx_queue_14_packets: 699280537  tx_queue_14_bytes: 118119557225
    tx_queue_15_packets: 675274048  tx_queue_15_bytes: 114916452394
    tx_queue_16_packets: 123509474  tx_queue_16_bytes: 25473914817
    tx_queue_17_packets: 101309066  tx_queue_17_bytes: 23513562050
    tx_queue_18_packets: 92291301  tx_queue_18_bytes: 21830243983
    tx_queue_19_packets: 87287348  tx_queue_19_bytes: 20887753665
    tx_queue_20_packets: 34518707  tx_queue_20_bytes: 9837323388
    tx_queue_21_packets: 24009284  tx_queue_21_bytes: 6760172375
    tx_queue_22_packets: 23628875  tx_queue_22_bytes: 6707751077
    tx_queue_23_packets: 25969617  tx_queue_23_bytes: 7343742932
    tx_queue_24_packets: 30112206  tx_queue_24_bytes: 8614816667
    tx_queue_25_packets: 28812367  tx_queue_25_bytes: 8186825345
    tx_queue_26_packets: 31710307  tx_queue_26_bytes: 9139202059
    tx_queue_27_packets: 40835241  tx_queue_27_bytes: 11499713701
    tx_queue_28_packets: 39265877  tx_queue_28_bytes: 11045989548
    tx_queue_29_packets: 41775414  tx_queue_29_bytes: 11804871879
    tx_queue_30_packets: 12497615  tx_queue_30_bytes: 3405490173
    tx_queue_31_packets: 11021513  tx_queue_31_bytes: 2659215149
    tx_queue_32_packets: 10464342  tx_queue_32_bytes: 2632864135
    tx_queue_33_packets: 11341007  tx_queue_33_bytes: 2818638887
    tx_queue_34_packets: 12782059  tx_queue_34_bytes: 3307226594
    tx_queue_35_packets: 12795212  tx_queue_35_bytes: 3400547658
    tx_queue_36_packets: 59272452  tx_queue_36_bytes: 17286517363
    tx_queue_37_packets: 85631445  tx_queue_37_bytes: 25126772743
    tx_queue_38_packets: 84708817  tx_queue_38_bytes: 24920451495
    tx_queue_39_packets: 83763431  tx_queue_39_bytes: 24662854523
    tx_queue_40_packets: 0  tx_queue_40_bytes: 0  tx_queue_41_packets: 0  tx_queue_41_bytes: 0
    tx_queue_42_packets: 0  tx_queue_42_bytes: 0  tx_queue_43_packets: 0  tx_queue_43_bytes: 0
    tx_queue_44_packets: 0  tx_queue_44_bytes: 0  tx_queue_45_packets: 0  tx_queue_45_bytes: 0
    tx_queue_46_packets: 0  tx_queue_46_bytes: 0  tx_queue_47_packets: 0  tx_queue_47_bytes: 0
    tx_queue_48_packets: 0  tx_queue_48_bytes: 0  tx_queue_49_packets: 0  tx_queue_49_bytes: 0
    tx_queue_50_packets: 0  tx_queue_50_bytes: 0  tx_queue_51_packets: 0  tx_queue_51_bytes: 0
    tx_queue_52_packets: 0  tx_queue_52_bytes: 0  tx_queue_53_packets: 0  tx_queue_53_bytes: 0
    tx_queue_54_packets: 0  tx_queue_54_bytes: 0  tx_queue_55_packets: 0  tx_queue_55_bytes: 0
    tx_queue_56_packets: 0  tx_queue_56_bytes: 0  tx_queue_57_packets: 0  tx_queue_57_bytes: 0
    tx_queue_58_packets: 0  tx_queue_58_bytes: 0  tx_queue_59_packets: 0  tx_queue_59_bytes: 0
    tx_queue_60_packets: 0  tx_queue_60_bytes: 0  tx_queue_61_packets: 0  tx_queue_61_bytes: 0
    tx_queue_62_packets: 0  tx_queue_62_bytes: 0  tx_queue_63_packets: 0  tx_queue_63_bytes: 0
    tx_queue_64_packets: 0  tx_queue_64_bytes: 0  tx_queue_65_packets: 0  tx_queue_65_bytes: 0
    tx_queue_66_packets: 0  tx_queue_66_bytes: 0  tx_queue_67_packets: 0  tx_queue_67_bytes: 0
    tx_queue_68_packets: 0  tx_queue_68_bytes: 0  tx_queue_69_packets: 0  tx_queue_69_bytes: 0
    tx_queue_70_packets: 0  tx_queue_70_bytes: 0  tx_queue_71_packets: 0  tx_queue_71_bytes: 0
    rx_queue_0_packets: 67  rx_queue_0_bytes: 4688
    rx_queue_1_packets: 75  rx_queue_1_bytes: 5523
    rx_queue_2_packets: 79  rx_queue_2_bytes: 5987
    rx_queue_3_packets: 75  rx_queue_3_bytes: 5631
    rx_queue_4_packets: 71  rx_queue_4_bytes: 5273
    rx_queue_5_packets: 81
68. of those 377…. none of them are: rx_gso_checksum_fixup
69. joe’s igb driver on an Real Computer: ethtool outputs 112 statistics
70. similarly, of those 112…. none of them are: rx_gso_checksum_fixup
71. surely 2 intel drivers will have similar stats
72. ixgbe diff igb => 316 diff stats
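(roughly how you can reproduce that count; a sketch assuming each box has its NIC at eth0, and ixgbe.stats / igb.stats are just scratch file names:

    $ # dump only the stat names, one per line, sorted
    $ sudo ethtool -S eth0 | tail -n +2 | awk -F: '{gsub(/^ +/, "", $1); print $1}' | sort > ixgbe.stats
    $ # run the same pipeline on the igb machine to produce igb.stats, then:
    $ comm -3 ixgbe.stats igb.stats | wc -l
)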
73. and it gets better!
74. some measured in driver, some measured in hw
75. monitor all the things!!11!!1!
76. non_eop_descs ??? os2bmc_rx_by_bmc ??? rx_no_dma_resources ???
77. this is fine. i’ll read the driver source. i’m really good at the kernels
78. this is fine
    case ixgbe_mac_82599EB:
            /* read a per-rx-queue register straight off the NIC and accumulate it */
            for (i = 0; i < 16; i++)
                    adapter->hw_rx_no_dma_resources +=
                            IXGBE_READ_REG(hw, IXGBE_QPRDC(i));
79. IXGBE_QPRDC. uh, wat?
80. similar?
81. register read
    • this driver, like many others, gets some stats from the NIC
    • it does this by reading register values (see the sketch after this slide)
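(you can look at raw registers yourself, for drivers that support it; a sketch assuming eth0:

    $ # dump the device's registers (-d / --register-dump); decoding them
    $ # still requires the data sheet, and not every driver implements this
    $ sudo ethtool -d eth0
)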
82. documentation? so we should be able to find this in the NIC data sheet………. right?
83. page 689
84. success
    • so just repeat this process:
    • get the driver code
    • read it for every stat
    • figure out if the stat is in software or hardware
    • if it’s in software, read the driver and figure out what it means
    • if it’s in hardware, find the data sheet and figure out what it means
    • then graph it
    • and then figure out what the graph means
85. greetings
86. what if i told you…
87. some of these stats aren’t documented in the data sheet?
88. so, like, there’s nothing you can do except literally guess.
89. you could email the device manufacturer….
90. no one cares, joe
    • no one cares about NIC level stats
    • too low level
    • /proc/net/dev works on my computer for tx/rx
    • and it has high level summaries
    • errors! drops! fifo! frame! compressed!
91. but what do errors! drops! fifo! frame! compressed! mean?
92. /proc/net/dev
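(for reference, the file itself: two header lines, then one line per interface carrying 16 rx/tx counters, including the fields above:

    $ cat /proc/net/dev
)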
93. OK, so
    • all these fields come from the driver
    • some are from software, others from the NIC
    • some fields are sums of the other fields
    • this reduces your data sheet search space
    • just search for the fields you care about
94. but what if… the drivers don’t agree with each other on what the individual statistics represent?
95. in other words.. what if: driver_stats->rx_missed_errors means something different for each driver you ask?
96. greetings
97. meanings of driver stats are not standardized
98. BTW
99. stat meanings for a driver/device can change over time.
100. so:
    • you need to figure out which NICs are in prod for all boxes
    • which firmware versions used on each NIC (see the sketch after this slide)
    • which versions of drivers used for each NIC
    • read all the driver sources for the fields you care about
    • read the data sheet to figure out what the fields mean
    • build An Collectd plugin (or w/e) to encapsulate this knowledge
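(the NIC/driver/firmware inventory, at least, is scriptable; a sketch assuming eth0:

    $ # driver name, driver version, firmware version, and bus address in one shot
    $ sudo ethtool -i eth0
)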
101. maybe you don’t care: too low level. you care about protocol level stats
102. odd, b/c ethtool settings can eliminate protocol stack problems. you dont care?
103. but, w/e
104. let’s just read protocol stats from /proc/net/snmp!
105. /proc/net/snmp
    • there’s an RFC !!!!!!!! (rfc 2013)
    • the fields are standardized!!!!
    • it’s higher level, so i can figure out where the protocol layers are breaking down!!!
    • they are gathered mostly in software
    • much easier than reading a 1200 pg data sheet (see the sketch after this slide)
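(a sketch of actually reading it: each protocol gets a header line of field names and a matching line of values, so something like this pairs them up for UDP; the awk is a quick illustration, not a robust parser:

    $ grep '^Udp:' /proc/net/snmp | awk 'NR==1 {split($0, k)} NR==2 {for (i = 2; i <= NF; i++) print k[i], $i}'
)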
106. what if i told you…
107. BUGS
    • several cases where counters are incremented in the wrong place
    • several cases where counters double count
108. BUGS: several cases where counters aren’t incremented where you might think they should be
109. BUGS
    from linux 3.13.0, net/ipv4/udp.c:
    /*
     * ENOBUFS = no kernel mem, SOCK_NOSPACE = no sndbuf space.
     * Reporting ENOBUFS might not be good
     * (it's not tunable per se), but otherwise
     * we don't have a good statistic (IpOutDiscards but it can be too many
     * things). We could add another new stat but at least for now that
     * seems like overkill.
     */
110. If this is an important statistic for you, your monitoring might be wrong
111. so, what does this mean?
112. so, what does this mean?
    • monitoring something requires very deep understanding
    • otherwise your graphs, alerts, etc might not actually be measuring what you think they are measuring
113. so, what does it mean?
    • this is why people build entire businesses around monitoring networks (or other stuff).
    • resist the urge to think you can solve every problem with a “quick” bash script
114. so, what does it mean?
    • properly monitoring, setting alerts, etc requires significant investment
    • i.e. not a bash script over the weekend
    • again, not necessarily bad, just important to think about
115. so, what does it mean?
    • nothing is free
    • this doesn’t mean that the software is bad
    • so plz don't jump to that conclusion
    • this is just reality
116. and now an aside
117. time, money, and business
    • engineers always think they can solve everything by writing enough software
    • the problem is that: sometimes spending your time doing that makes no business sense.
    • other times doing that is actively detrimental
118. time, money, and business
    • how long would it take you to:
    • figure out if all your networking metrics are right
    • figure out what they all mean
    • set alerts that are sensible
    • remember: you need to read a lot of code, data sheets, and potentially several versions of different drivers.
119. time, money, and business: https://baremetrics.com/calculator (add at least 35% overhead to salaries)
120. time, money, and business: and this is why monitoring all the things makes no business sense for most businesses below a certain revenue level
121. time, money, and business: and this is also why it doesn’t really matter if these stats are wrong. if these stats actually matter to your business, your business will invest the $$$$ to figure this out
122. in conclusion:
    • complexity is not necessarily bad
    • even simple software is buggy and hard to monitor correctly
    • it all comes down to value and time
    • your network monitoring is probably wrong
    • but it probably doesn't matter because if it did, your company would invest $$$$ in figuring it out
123. ? packagecloud.io @packagecloudio
