Statistics 101 for System Administrators
EuroPython 2014, 22nd July - Berlin
Roberto Polli - roberto.polli@babel.it
Learning and using elements of statistics (distributions, standard deviation, linear correlation) in Python is very simple.

The slides show an example of managing some data series for network troubleshooting.

Statistics 101 for System Administrators

  1. Statistics 101 for System Administrators
     EuroPython 2014, 22nd July - Berlin
     Roberto Polli - roberto.polli@babel.it
     Babel Srl, P.zza S. Benedetto da Norcia, 33, 00040 Pomezia (RM) - www.babel.it
     22 July 2014
  2. Who? What? Why?
     • Using (and learning) elements of statistics with Python.
     • Roberto Polli - Community Manager @ Babel.it. Loves writing in C, Java and Python. Red Hat Certified Engineer and Virtualization Administrator.
     • Babel – Proud sponsor of this talk ;) Delivers large mail infrastructures based on Open Source software for Italian ISPs and PA. Contributes to various FLOSS projects.
  3. Agenda
     • A latency issue: what happened?
     • Correlation in 30”
     • Combining data
     • Plotting time
     • Modules: scipy, matplotlib
  4. A Latency Issue
     • Episodic network latency issues
     • Log traces: message size, #peers, retransmissions
     • Do we need to scale? Was it a peak problem?
     Find a rapid answer with Python!
  5. Basic statistics
     Python (via numpy) provides basic statistics, like

         from numpy import mean  # x̄
         from numpy import std   # σ_X

         T = {
             'ts': (1, 2, 3),               # .. more samples
             'late': (0.12, 6.31, 0.43),    # .. more samples
             'peers': (2313, 2313, 2312),   # .. more samples
             # .. more series
         }
         print([[k, max(X), min(X), mean(X), std(X)] for k, X in T.items()])
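The same summary can be sketched with the standard library alone. This is only an illustration under stated assumptions: it uses just the three sample values visible on the slide for each series, and it swaps in `statistics.mean` and `statistics.pstdev` (the population standard deviation, which is what `numpy.std` computes by default) for the slide's imports. The `summarize` helper is a hypothetical name, not part of the talk.

```python
from statistics import mean, pstdev

# Only the samples visible on the slide; the real series are longer.
T = {
    "ts": (1, 2, 3),
    "late": (0.12, 6.31, 0.43),
    "peers": (2313, 2313, 2312),
}


def summarize(series):
    """Return (max, min, mean, population std) for one data series."""
    return max(series), min(series), mean(series), pstdev(series)


summary = {name: summarize(values) for name, values in T.items()}
for name, (hi, lo, avg, sigma) in summary.items():
    print(f"{name}: max={hi} min={lo} mean={avg:.3f} std={sigma:.3f}")
```

A large standard deviation relative to the mean, as in the `late` series, is a first hint that a few samples behave very differently from the rest.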
  6. Distributions
     Data distribution - aka δX - shows event frequency.

         # The fastest way to get a distribution is
         from matplotlib import pyplot as plt
         freq, bins, _ = plt.hist(T['late'])
         # plt.hist returns the bin counts and the bin edges
         distribution = zip(bins, freq)

     [Figure: a ping RTT distribution; x axis: rtt in ms]
  7. Correlation I
     Are two data series X, Y related?
     Given $\Delta x_i = x_i - \bar{x}$, Mr. Pearson answered with this formula

         $\rho(X, Y) = \frac{\sum_i \Delta x_i \, \Delta y_i}{\sqrt{\sum_i \Delta^2 x_i \, \sum_i \Delta^2 y_i}} \in [-1, +1]$    (1)

     ρ identifies whether the values of X and Y ‘move’ together on the same line.
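Pearson's formula translates almost literally into code. The following is a minimal, dependency-free sketch rather than the talk's implementation: `pearson` is a hypothetical helper, and the two short input series are made up purely to show the extreme values of ρ.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's rho computed directly from the deviations from the mean."""
    xs, ys = list(xs), list(ys)
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 samples)")
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    dx = [x - x_mean for x in xs]  # the delta x_i terms
    dy = [y - y_mean for y in ys]  # the delta y_i terms
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    # A constant series makes the denominator 0 and raises ZeroDivisionError.
    return numerator / denominator


rising = pearson(range(0, 100), range(0, 400, 4))  # y grows with x
falling = pearson([1, 2, 3, 4], [8, 6, 4, 2])      # y shrinks as x grows
print(f"rising: {rising:.3f}  falling: {falling:.3f}")
```

In practice a library routine such as `pearsonr` is preferable, since it also reports a p-value; the sketch only makes the formula concrete.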
  8. You must (scatter) plot
     ρ doesn’t find non-linear correlation!
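A quick way to see this limit is to feed ρ a series that is completely determined by the other one, but not linearly. The sketch below is an illustration with synthetic data (a parabola sampled symmetrically around zero); `pearson` is a compact hypothetical helper, not code from the talk.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's rho for two equal-length series (hypothetical helper)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    dx = [x - x_mean for x in xs]
    dy = [y - y_mean for y in ys]
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy)
    )


x = list(range(-10, 11))
parabola = [v * v for v in x]   # fully determined by x, but not linearly
line = [2 * v + 1 for v in x]   # a plain linear relation

rho_parabola = pearson(x, parabola)
rho_line = pearson(x, line)
print(f"parabola: {rho_parabola:.3f}  line: {rho_line:.3f}")
```

The parabola yields a coefficient that is essentially zero even though `y` depends entirely on `x`, which is why a scatter plot is needed before ruling a relation out.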
  9. Probability Indicator
     Python scipy provides a correlation function, returning two values:
     • the ρ correlation coefficient ∈ [−1, +1]
     • the probability that such datasets are produced by uncorrelated systems

         from random import randint
         from scipy.stats import pearsonr  # our beloved ρ

         a, b = range(0, 100), range(0, 400, 4)
         c, d = [randint(0, 100) for x in a], [randint(0, 100) for x in a]
         correlation, probability = pearsonr(a, b)  # ρ = 1.000, p = 0.000
         correlation, probability = pearsonr(c, d)  # e.g. ρ = −0.041, p = 0.683 (random data)
  10. Combinations
      itertools is a pot of gold of useful tools.

          from itertools import combinations
          # returns all possible combinations of
          # items grouped N at a time
          items = "hearts spades clubs diamonds".split()
          combinations(items, 2)
          # And now all possible combinations between
          # dataset fields!
          combinations(T, 2)

      Combining 4 suits, 2 at a time:
      ♥♠ ♥♣ ♥♦ ♠♣ ♠♦ ♣♦
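To make the pairing concrete, here is a small sketch that materializes the combinations and checks their count against the binomial coefficient n(n-1)/2. The dictionary `T` below is an assumption: it holds only the sample values shown earlier in the slides, standing in for the real dataset.

```python
from itertools import combinations
from math import comb

suits = "hearts spades clubs diamonds".split()
pairs = list(combinations(suits, 2))

# Iterating a dict yields its keys, so combinations(T, 2) pairs field names.
T = {"ts": (1, 2, 3), "late": (0.12, 6.31, 0.43), "peers": (2313, 2313, 2312)}
field_pairs = list(combinations(T, 2))

print(pairs)
print(field_pairs)
# The number of pairs is the binomial coefficient "n choose 2".
print(len(pairs) == comb(len(suits), 2))
```

Because the number of pairs grows quadratically with the number of fields, plotting every pair stays practical only for a moderate number of series.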
  11. Netfishing correlation I

          # Now we have all the ingredients for
          # net-fishing relations between our data!
          for (k1, v1), (k2, v2) in combinations(T.items(), 2):
              # Look for correlations between every dataset!
              corr, prob = pearsonr(v1, v2)
              if corr > .6:
                  print("Series", k1, k2, "can be correlated", corr)
              elif prob < 0.05:
                  print("Series", k1, k2, "probability lower than 5%", prob)
  12. Netfishing correlation II
      Now plot all combinations: there’s more than meets the eye!

          # Plot everything, and insert data in plots!
          for (k1, v1), (k2, v2) in combinations(T.items(), 2):
              corr, prob = pearsonr(v1, v2)
              plt.scatter(v1, v2)
              # 3 digit precision on title
              plt.title("R={:0.3f} P={:0.3f}".format(corr, prob))
              plt.xlabel(k1)
              plt.ylabel(k2)
              # save and close the plot
              plt.savefig("{}_{}.png".format(k1, k2))
              plt.close()
  13. Plotting Correlation
  14. Color is the 3rd dimension

          from itertools import cycle
          colors = cycle("rgb")  # use more than 3 colors!
          labels = cycle("morning afternoon night".split())
          datalen = len(T['ts'])  # assumes all series share the same length
          size = datalen // 3  # 3 colors, right?
          for (k1, v1), (k2, v2) in combinations(T.items(), 2):
              # one scatter call per slice of the day, each with its own color
              for i in range(0, datalen, size):
                  plt.scatter(v1[i:i+size], v2[i:i+size],
                              color=next(colors), label=next(labels))
              # set title, save plot & co
  15. Example Correlation
  16. Latency Solution
      • Latency wasn’t related to packet size or system throughput
      • Errors were not related to packet size
      • Discovered system throughput
  17. Wrap Up
      • Use statistics: it’s easy
      • Don’t use ρ to exclude relations
      • Plot, Plot, Plot
      • Continue collecting results
  18. That’s all folks!
      Thank you for the attention!
      Roberto Polli - roberto.polli@babel.it