# The Green Lab - [07-A] Data Analysis

This presentation is from a lecture I gave in the "Green Lab" course of the Computer Science master's programme (Software Engineering and Green IT track) at the Vrije Universiteit Amsterdam: http://masters.vu.nl/en/programmes/computer-science-software-engineering-green-it/index.aspx
http://www.procaccianti.me

Published in: Education

### The Green Lab - [07-A] Data Analysis

1. It starts with an idea. Data Analysis: Descriptive Statistics and EDA. Giuseppe Procaccianti
2. Vrije Universiteit Amsterdam. Giuseppe Procaccianti / S2 group / The Green Lab. Quick Recap: Idea → Experiment scoping → Experiment planning → Experiment operation → Analysis & interpretation → Presentation & package
3. Analysis and Interpretation
    - Understanding the data
        - descriptive statistics
        - exploratory data analysis (EDA, e.g. boxplots, scatter plots)
    - (Optional) data reduction
    - Hypothesis testing
    - Results interpretation
4. Descriptive Statistics
    - Goal: get a feel for how the data is distributed
    - Properties:
        - Central Tendency (e.g. Mean, Median)
        - Dispersion (e.g. Frequency, Standard Deviation)
        - Dependency (e.g. Correlation)
5. Parameter vs. statistic
    - Parameter: feature of the population
        - $\mu$: mean
        - $\sigma$: standard deviation
    - Statistic: feature of the sample
        - $\bar{x}$: mean
        - $s$: standard deviation
    - Statistics are estimates of parameters
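To make the parameter/statistic distinction concrete, here is a minimal sketch. It assumes a simulated population (the values 50 and 10 used to generate it, the population size, and the sample size of 200 are all illustrative choices, not from the lecture) and compares the population parameters with the statistics of one random sample.

```python
import random
import statistics

# Hypothetical population: 100,000 values generated once (illustrative numbers only).
rng = random.Random(42)  # fixed seed so the run is repeatable
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

# Parameters: computed over the whole population.
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)  # population SD divides by N

# Statistics: computed over a sample of 200 units drawn without replacement.
sample = rng.sample(population, 200)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)  # sample SD divides by n - 1

print(f"parameters: mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"statistics: x_bar = {x_bar:.2f}, s = {s:.2f}")
```

The sample values should land near the population values but not exactly on them; a different sample would give slightly different estimates.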
6. Central Tendency
    - Arithmetic mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
    - Geometric mean: $\left( \prod_{i=1}^{n} x_i \right)^{1/n}$
7. Central Tendency: example
    - Average of scores: 6, 7, 8, 9, 10
    - Arithmetic mean: 8
    - Geometric mean: ~7.87
8. Central Tendency: example
    - Average of returns on investments: 90%, 10%, 20%, 30%, -90%
    - Arithmetic mean: (90 + 10 + 20 + 30 - 90) / 5 = 12%
    - Geometric mean: (1.9 × 1.1 × 1.2 × 1.3 × 0.1)^(1/5) - 1 = -0.2008 = -20.08%
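The two examples above can be checked with Python's standard library. This is a small sketch; the helper name `geometric_mean_return` is introduced here for illustration and is not part of the slides.

```python
import math
import statistics


def geometric_mean_return(returns):
    """Average growth rate per period for fractional returns (0.10 means +10%)."""
    factors = [1.0 + r for r in returns]  # turn each return into a growth factor
    return math.prod(factors) ** (1.0 / len(factors)) - 1.0


scores = [6, 7, 8, 9, 10]
print(statistics.mean(scores))                      # 8
print(round(statistics.geometric_mean(scores), 2))  # 7.87

returns = [0.90, 0.10, 0.20, 0.30, -0.90]
print(round(statistics.fmean(returns), 4))          # 0.12
print(round(geometric_mean_return(returns), 4))     # -0.2008
```

The arithmetic mean suggests a 12% gain per period, yet the compounded growth factor is about 0.326, so the investment actually loses roughly two thirds of its value. For quantities that multiply, the geometric mean is the one that reflects this.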
9. Central Tendency
    - Median (or 50th percentile): middle value separating the greater and lesser halves of a data set
    - X = [13, 18, 13, 14, 13, 16, 14, 21, 13]
    - X_sort = [13, 13, 13, 13, 14, 14, 16, 18, 21], so the median is the 5th of the 9 sorted values: 14
10. Central Tendency
    - Mode: most frequent value in a data set
    - X = [13, 18, 13, 14, 13, 16, 14, 21, 13]
    - $Mo_X$ = 13
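As a quick check of the median and mode examples, the following sketch uses Python's `statistics` module on the same data set X. The second list passed to `multimode` is a made-up illustration of a data set with two modes.

```python
import statistics

X = [13, 18, 13, 14, 13, 16, 14, 21, 13]

print(sorted(X))             # [13, 13, 13, 13, 14, 14, 16, 18, 21]
print(statistics.mean(X))    # 15
print(statistics.median(X))  # 14
print(statistics.mode(X))    # 13

# A data set can have more than one mode; multimode returns all of them.
print(statistics.multimode([1, 1, 2, 2, 3]))  # [1, 2]
```

For this data set mode < median < mean (13 < 14 < 15), a pattern often seen when a few large values pull the mean to the right.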
11. Central Tendency - Skewness
12. Dispersion
    - Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
    - Standard deviation: $s = \sqrt{s^2}$
    - The standard deviation is expressed in the same unit as the data
13. Dispersion - three-sigma rule
    - Image: "Empirical Rule" by Dan Kernler - Own work. Licensed under CC BY-SA 4.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Empirical_Rule.PNG#/media/File:Empirical_Rule.PNG
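The percentages behind the three-sigma rule can be reproduced from the normal distribution. The sketch below assumes normally distributed data and uses the identity P(|X - μ| ≤ kσ) = erf(k / √2); the function name `normal_mass_within` is illustrative.

```python
import math


def normal_mass_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(f"within {k} sigma: {normal_mass_within(k):.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```

These figures hold only for normal data; for skewed or heavy-tailed measurements the shares can differ noticeably.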
14. Dispersion - three-sigma-rule
    - Range: $x_{max} - x_{min}$
    - Coefficient of variation: $100 \cdot s / \bar{x}$ (in percentage of mean)
    - The coefficient of variation is only meaningful if all values are positive (ratio scale, not interval scale, e.g. temperatures)
15. Dispersion - example
    - Dataset: [100, 100, 100]
    - Mean: 100
    - Variance: 0
    - Standard Deviation: 0
    - Coeff. Variation: 0
    - Range: 0
16. Dispersion - example
    - Dataset: [90, 100, 110]
    - Mean: 100
    - Sample Variance: 100
    - Standard Deviation: 10
    - Coeff. Variation: 10%
    - Range: 20
17. Dispersion - example
    - Dataset: [1, 5, 6, 8, 10, 40, 65, 88]
    - Mean: 27.875
    - Sample Variance: 1082.70
    - Standard Deviation: 32.9
    - Coeff. Variation: 118% (32.9 / 27.875 ≈ 1.18)
    - Range: 87
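The three dispersion examples can be recomputed with a short helper. This is a sketch using the sample (n - 1) definitions; the function name `describe_dispersion` and the dictionary keys are illustrative choices.

```python
import statistics


def describe_dispersion(data):
    """Return mean, sample variance, sample SD, CV (in % of the mean) and range."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)  # divides by n - 1
    return {
        "mean": mean,
        "variance": statistics.variance(data),  # divides by n - 1
        "sd": sd,
        "cv_percent": 100.0 * sd / mean,  # only meaningful for positive, ratio-scale data
        "range": max(data) - min(data),
    }


for dataset in ([100, 100, 100], [90, 100, 110], [1, 5, 6, 8, 10, 40, 65, 88]):
    summary = {name: round(value, 2) for name, value in describe_dispersion(dataset).items()}
    print(dataset, summary)
```

For the last data set the standard deviation is larger than the mean, so the coefficient of variation comes out at about 118%.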
18. Basic visualizations: Box Plot (figure labels: 1st quartile, median, 3rd quartile)
19. Basic visualizations: Box Plot
20. Basic visualizations: Box Plot (figure labels: outliers, positive skewness)
    - Image: By Gbdivers (Own work) [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
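The numbers a box plot is drawn from can be computed directly. The sketch below rests on two assumptions not stated in the slides: the readings are hypothetical values invented for illustration, and outliers are flagged with Tukey's 1.5 × IQR rule. Quartile conventions also differ between tools, so other software may report slightly different cut points.

```python
import statistics

# Hypothetical energy readings in joules (not from the slides); one run is unusually high.
readings = [12, 13, 13, 14, 14, 15, 15, 16, 17, 30]

# With n=4, statistics.quantiles returns the three quartile cut points.
q1, q2, q3 = statistics.quantiles(readings, n=4)
iqr = q3 - q1  # interquartile range: the height of the box

# Tukey's rule (an assumption here): points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in readings if x < lower_fence or x > upper_fence]
inliers = [x for x in readings if lower_fence <= x <= upper_fence]

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"whiskers reach {min(inliers)} and {max(inliers)}; outliers: {outliers}")
```

Here the single reading of 30 falls above the upper fence and would be drawn as an isolated point, while the whiskers stop at 12 and 17.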
21. Dependency: correlation
    - Sample correlation coefficient (Pearson): $r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$
    - Meaningful when comparing paired values/datasets
22. Dependency: correlation
    - Spearman's rank correlation coefficient: $\rho$
    - Kendall's rank correlation coefficient: $\tau$
        - usually smaller values than Spearman's $\rho$
        - more accurate on small samples
    - Pearson correlation coefficient assumes normally distributed data
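To show how the three coefficients differ, here is a minimal pure-Python sketch. The data are hypothetical (not the age vs. body fat data of the lecture), the rank-based functions assume there are no tied values, and Kendall's coefficient is computed in its simplest tau-a form. In practice a statistics package would normally be used instead.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson's r: covariance of x and y scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks 1..n of the values (smallest gets rank 1); assumes no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rho: Pearson's r applied to the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / all pairs; assumes no ties."""
    pairs = list(combinations(range(len(x)), 2))
    score = 0
    for i, j in pairs:
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)  # +1 if concordant, -1 if discordant
    return score / len(pairs)


# Hypothetical paired data: mostly increasing, with one extreme y value.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 50]

print(round(pearson(x, y), 4))   # 0.7077
print(round(spearman(x, y), 4))  # 0.8857
print(round(kendall(x, y), 4))   # 0.7333
```

The extreme value 50 pulls Pearson's r down because the relation is far from linear, while the rank-based coefficients only see that it is the largest value. Kendall's value is lower than Spearman's on the same data.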
23. Dependency: example. Age vs. body fat %
    - Pearson: $r$ = 0.7921
    - Spearman: $\rho$ = 0.7539
    - Kendall: $\tau$ = 0.5762
24. Basic Visualizations: Scatter Plot
25. Basic Visualizations: scatter plots for different values of r
    - Image source: http://www.cqeacademy.com/cqe-body-of-knowledge/continuous-improvement/quality-control-tools/the-scatter-plot-linear-regression/
26. Correlation does NOT imply causation!
    - Spurious Correlations: http://tylervigen.com/
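One common way a strong correlation arises without any causal link is a shared driver. The simulation below is a hypothetical illustration (all variable names, coefficients, and noise levels are invented): two quantities each depend on temperature but not on each other, and still correlate strongly.

```python
import math
import random


def pearson(x, y):
    """Pearson's r for two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(7)  # fixed seed so the run is repeatable

# Hidden common cause (hypothetical): daily temperature in degrees Celsius.
temperature = [rng.uniform(10.0, 35.0) for _ in range(500)]

# Two outcomes that each depend on temperature plus independent noise.
ice_cream_sales = [2.0 * t + rng.gauss(0.0, 5.0) for t in temperature]
sunburn_cases = [0.5 * t + rng.gauss(0.0, 2.0) for t in temperature]

r = pearson(ice_cream_sales, sunburn_cases)
print(f"r(ice cream, sunburn) = {r:.2f}")
```

Neither outcome appears in the formula of the other, yet r is high. The correlation reflects the shared dependence on temperature, not an effect of one outcome on the other.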
27. Thank you! g.procaccianti@vu.nl, i.malavolta@vu.nl