The Green Lab - [07-A] Data Analysis

These are the slides of a lecture I gave in the "Green Lab" course of the Computer Science master's programme (Software Engineering and Green IT track) at the Vrije Universiteit Amsterdam: http://masters.vu.nl/en/programmes/computer-science-software-engineering-green-it/index.aspx
http://www.procaccianti.me

Published in: Education

  1. Data Analysis: Descriptive Statistics and EDA. Giuseppe Procaccianti
  2. Vrije Universiteit Amsterdam / Giuseppe Procaccianti / S2 group / The Green Lab
     Quick Recap: Idea → Experiment scoping → Experiment planning → Experiment operation → Analysis & interpretation → Presentation & package
  3. Analysis and Interpretation
     ● Understanding the data
       ○ descriptive statistics
       ○ exploratory data analysis (EDA, e.g. boxplots, scatter plots)
     ● (Optional) data reduction
     ● Hypothesis testing
     ● Results interpretation
  4. Descriptive Statistics
     ● Goal: get a ‘feeling’ for how the data is distributed
     ● Properties:
       ○ Central Tendency (e.g. Mean, Median)
       ○ Dispersion (e.g. Frequency, Standard Deviation)
       ○ Dependency (e.g. Correlation)
  5. Parameter vs. statistic
     ● Parameter: feature of the population
       ○ μ: mean
       ○ σ: standard deviation
     ● Statistic: feature of the sample
       ○ x̄: mean
       ○ s: standard deviation
     ● Statistics are estimates of parameters
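
The distinction between parameter and statistic can be sketched in a few lines of Python. This is only an illustration: the population values `MU` and `SIGMA`, the sample size, and the random seed are assumptions chosen for the example, not figures from the lecture.

```python
import random
import statistics

# Hypothetical population parameters (assumed for illustration only).
MU, SIGMA = 100.0, 15.0

# Draw a sample from the population; the fixed seed makes the run repeatable.
rng = random.Random(42)
sample = [rng.gauss(MU, SIGMA) for _ in range(1000)]

# Statistics computed on the sample estimate the population parameters.
x_bar = statistics.mean(sample)   # estimates mu
s = statistics.stdev(sample)      # estimates sigma (n - 1 denominator)

print(f"parameter mu    = {MU:.2f}, statistic x_bar = {x_bar:.2f}")
print(f"parameter sigma = {SIGMA:.2f}, statistic s     = {s:.2f}")
```

The statistics will be close to, but not exactly equal to, the parameters; a larger sample generally narrows the gap.
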
  6. Central Tendency
     ● Arithmetic mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
     ● Geometric mean: $\left( \prod_{i=1}^{n} x_i \right)^{1/n}$
  7. Central Tendency: example
     ● Average of scores: 6, 7, 8, 9, 10
     ● Arithmetic mean: 8
     ● Geometric mean: ~7.87
  8. Central Tendency: example
     ● Average of returns on investments: 90%; 10%; 20%; 30%; -90%
     ● Arithmetic mean: (90 + 10 + 20 + 30 - 90) / 5 = 12%
     ● Geometric mean: (1.9 × 1.1 × 1.2 × 1.3 × 0.1)^(1/5) - 1 = -0.2008 = -20.08%
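
The two examples above can be reproduced with Python's standard library. This is a minimal sketch; the helper name `geometric_mean_return` is hypothetical and introduced only for this illustration. Returns are converted to growth factors (1 + r) before taking the geometric mean, because the geometric mean is only defined for positive values.

```python
import math
import statistics


def geometric_mean_return(returns):
    """Average per-period return for fractional returns (0.90 means +90%)."""
    growth_factors = [1.0 + r for r in returns]  # e.g. -0.90 -> 0.1
    return math.prod(growth_factors) ** (1.0 / len(growth_factors)) - 1.0


scores = [6, 7, 8, 9, 10]
returns = [0.90, 0.10, 0.20, 0.30, -0.90]

print(statistics.mean(scores))                       # 8
print(round(statistics.geometric_mean(scores), 2))   # 7.87
print(round(statistics.mean(returns), 4))            # 0.12
print(round(geometric_mean_return(returns), 4))      # -0.2008
```

The arithmetic mean suggests a 12% average gain, while the geometric mean shows that the investment actually lost about 20% per period on average.
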
  9. Central Tendency
     ● Median (or 50th percentile): middle value separating the greater and lesser halves of a data set
       ○ X = [13, 18, 13, 14, 13, 16, 14, 21, 13]
       ○ Xsort = [13, 13, 13, 13, 14, 14, 16, 18, 21]
       ○ Median = 14 (the 5th of the 9 sorted values)
  10. Central Tendency
     ● Mode: most frequent value in a data set
       ○ X = [13, 18, 13, 14, 13, 16, 14, 21, 13]
       ○ $Mo_X = 13$
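
As a quick check of the median and mode slides, the same data set can be run through Python's `statistics` module. The final call uses a small made-up list (an assumption, not from the slides) to show what happens when several values tie for most frequent.

```python
import statistics

X = [13, 18, 13, 14, 13, 16, 14, 21, 13]

print(sorted(X))              # [13, 13, 13, 13, 14, 14, 16, 18, 21]
print(statistics.median(X))   # 14
print(statistics.mode(X))     # 13

# A data set can have more than one mode; multimode returns all of them.
print(statistics.multimode([1, 1, 2, 2]))  # [1, 2]
```
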
  11. Central Tendency - Skewness
  12. Dispersion
     ● Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
     ● Standard deviation: $s = \sqrt{s^2}$
     ● The standard deviation has the same unit as the data
  13. Dispersion - three-sigma rule
     Figure: "Empirical Rule" by Dan Kernler - Own work. Licensed under CC BY-SA 4.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Empirical_Rule.PNG#/media/File:Empirical_Rule.PNG
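
The figure on this slide is not reproduced in the transcript. As a rough stand-in, the percentages behind the three-sigma (empirical) rule can be computed for a normal distribution. This is a sketch using the standard library's `NormalDist`; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sigma: {share_within(k):.2%}")
# within 1 sigma: 68.27%
# within 2 sigma: 95.45%
# within 3 sigma: 99.73%
```

The rule only holds for (approximately) normally distributed data.
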
  14. Dispersion
     ● Range: $x_{max} - x_{min}$
     ● Coefficient of variation: $s / \bar{x}$ (as a percentage of the mean)
     ● The coefficient of variation is only meaningful if all values are positive (ratio scale, not interval scale, e.g. temperatures)
  15. Dispersion - example
     ● Dataset: [100, 100, 100]
     ● Mean: 100
     ● Variance: 0
     ● Standard Deviation: 0
     ● Coeff. Variation: 0
     ● Range: 0
  16. Dispersion - example
     ● Dataset: [90, 100, 110]
     ● Mean: 100
     ● Sample Variance: 100
     ● Standard Deviation: 10
     ● Coeff. Variation: 10%
     ● Range: 20
  17. Dispersion - example
     ● Dataset: [1, 5, 6, 8, 10, 40, 65, 88]
     ● Mean: 27.875
     ● Sample Variance: 1082.70
     ● Standard Deviation: 32.9
     ● Coeff. Variation: 118% (32.9 / 27.875 ≈ 1.18)
     ● Range: 87
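
The three dispersion examples can be recomputed with a short script. This is a minimal sketch; the function name `dispersion` and the dictionary keys are hypothetical names chosen for this illustration. The coefficient of variation is returned as a fraction, so 1.18 corresponds to 118%.

```python
import statistics


def dispersion(data):
    """Return the dispersion measures from the slides for a list of numbers."""
    mean = statistics.mean(data)
    std_dev = statistics.stdev(data)          # square root of the sample variance
    return {
        "mean": mean,
        "variance": statistics.variance(data),  # sample variance (n - 1 denominator)
        "std_dev": std_dev,
        "coeff_variation": std_dev / mean,       # fraction of the mean; needs mean > 0
        "range": max(data) - min(data),
    }


for dataset in ([100, 100, 100], [90, 100, 110], [1, 5, 6, 8, 10, 40, 65, 88]):
    summary = dispersion(dataset)
    print(dataset)
    for name, value in summary.items():
        print(f"  {name}: {value:.3f}")
```
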
  18. Basic visualizations: Box Plot (figure labels: 1st quartile, median, 3rd quartile)
  19. Basic visualizations: Box Plot
  20. Basic visualizations: Box Plot (figure labels: outliers, positive skewness)
     Figure: By Gbdivers (Own work) [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
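
The box plot figures are not reproduced in the transcript, but the numbers a box plot draws can be computed directly. The sketch below assumes the common 1.5 × IQR convention for flagging outliers, which the slides do not state explicitly; the function name `box_plot_summary` and the extra value 40 are hypothetical. Quartile values also depend on the interpolation method, so other tools may report slightly different quartiles.

```python
import statistics


def box_plot_summary(data, whisker=1.5):
    """Quartiles and outliers as a box plot would show them (1.5 x IQR rule assumed)."""
    q1, median, q3 = statistics.quantiles(data, n=4)  # default 'exclusive' method
    iqr = q3 - q1                                     # interquartile range
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    outliers = [x for x in sorted(data) if x < lower_fence or x > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}


X = [13, 18, 13, 14, 13, 16, 14, 21, 13]   # data set from the median slide
print(box_plot_summary(X))

# Adding one extreme (made-up) value shows how an outlier is flagged.
print(box_plot_summary(X + [40]))
```
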
  21. Dependency: correlation
     ● Sample correlation coefficient (Pearson): $r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
     ● Meaningful when comparing paired values/datasets
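  22. Dependency: correlation
     ● Spearman’s rank correlation coefficient: $\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$, where $d_i$ is the difference between the ranks of $x_i$ and $y_i$ (no ties)
     ● Kendall’s rank correlation coefficient: $\tau = \frac{n_c - n_d}{n(n-1)/2}$, where $n_c$ and $n_d$ are the numbers of concordant and discordant pairs
       ○ smaller values
       ○ more accurate on small samples
     ● Pearson correlation coefficient assumes normally distributed data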
  23. Dependency: example
     Age vs. body fat %
     ● Pearson: r = 0.7921
     ● Spearman: ρ = 0.7539
     ● Kendall: τ = 0.5762
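
The three coefficients can be compared on a small data set with a pure-Python sketch. The age vs. body fat data behind the slide is not included in the transcript, so the `x` and `y` values below are made up for illustration, and all function names are hypothetical. Kendall's coefficient is implemented in its simplest form (tau-a), which does not correct for ties.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation coefficient r."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / math.sqrt(spread_x * spread_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            direction = (x[i] - x[j]) * (y[i] - y[j])
            if direction > 0:
                concordant += 1
            elif direction < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Made-up data: y grows monotonically but not linearly with x.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

print(f"Pearson  r   = {pearson(x, y):.4f}")        # below 1: relation is not linear
print(f"Spearman rho = {spearman(x, y):.4f}")       # 1: ranks agree perfectly
print(f"Kendall  tau = {kendall_tau_a(x, y):.4f}")  # 1: every pair is concordant
```

Because the rank-based coefficients only look at ordering, they reach 1 for any strictly increasing relationship, while Pearson's r reaches 1 only for a linear one.
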
  24. Basic Visualizations: Scatter Plot
  25. Basic Visualizations: scatter plots for different values of r
     Image source: http://www.cqeacademy.com/cqe-body-of-knowledge/continuous-improvement/quality-control-tools/the-scatter-plot-linear-regression/
  26. Correlation does NOT imply causation!
     ● Spurious Correlations: http://tylervigen.com/
  27. Thank you! g.procaccianti@vu.nl, i.malavolta@vu.nl
