- 1. Statistics and Data Mining with Perl Data Language. Maggie Xiong, maggie at shutterstock.com
- 2. PDL::Stats
- 3. Get Used to Variability
- 4. Get Used to Variability
- 5. Get Used to Variability
- 6. Know Your Data
  - Descriptive statistics
  - Frequency distribution, aka histogram
  - E.g., length of words in search queries
  - PDL::Stats::Distr: $data->plot_distr('gaussian')
- 7. Central Tendency and Spread
  - Central tendency
    - Mean, i.e. the average: 5.36. M = (x1 + x2 + x3 + ... + xn) / N
    - Median, the 50th percentile: 5
    - Mode, the most frequent value: 4
  - Spread
    - Range, min to max: [1, 13]
    - Variance, the mean of squared deviations: 3.61. Var = [(x1 - M)**2 + (x2 - M)**2 + ... + (xn - M)**2] / N
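
The measures on this slide can be cross-checked outside PDL. The sketch below uses Python's standard `statistics` module rather than PDL::Stats, and the `word_lengths` sample is hypothetical (it is not the data behind the slide's 5.36 / 5 / 4 figures).

```python
import statistics

# Hypothetical word lengths; NOT the data set used on the slide.
word_lengths = [4, 4, 5, 6, 4, 7, 5]

mean = statistics.mean(word_lengths)            # (x1 + ... + xn) / N
median = statistics.median(word_lengths)        # 50th percentile
mode = statistics.mode(word_lengths)            # most frequent value
value_range = (min(word_lengths), max(word_lengths))
variance = statistics.pvariance(word_lengths)   # mean squared deviation (divides by N)

print(mean, median, mode, value_range, round(variance, 3))
```

`pvariance` divides by N, matching the slide's formula; `statistics.variance` would divide by N - 1 instead.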
- 8. Abstraction of the Frequency Distribution
  - Mean
  - Variance -> standard deviation: SD = sqrt(variance)
  - Normalized score (z score): z = (x - M) / SD
  - z score and probability
  - E.g., client file sizes
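
As a rough illustration of how a z score maps to a probability, the following Python sketch uses `statistics.NormalDist` from the standard library. The file-size mean and SD are made-up values, and the calculation assumes the sizes are approximately normally distributed.

```python
from statistics import NormalDist


def z_score(x, mean, sd):
    """Normalized score: how many SDs x lies from the mean."""
    return (x - mean) / sd


# Hypothetical client file sizes: mean 100 KB, SD 15 KB (assumed values).
z = z_score(130, 100, 15)

# Probability of a value at or below x under a standard normal curve.
p_below = NormalDist().cdf(z)

print(z, round(p_below, 4))
```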
- 9. Inferential Statistics
  - Sample and population
  - The amount of chocolate required to finish a task
    - Sample 1 mean: pdl(4,2,8,4)->avg; # 4.5
    - Sample 2 mean: pdl(4,8,6,6)->avg; # 6
  - Sampling distribution of the mean
  - Standard error: standard deviation of the means
    - SE = standard_deviation / sqrt(N)
    - pdl([4,2,8,4],[4,8,6,6])->se; # [1.2583, 0.8165]
  - How do we know if the difference is between two samples from the same population or between two populations?
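
The standard errors quoted on this slide can be reproduced by hand. This is a plain-Python cross-check, not the PDL implementation; it assumes the sample (N - 1) standard deviation, which is what matches the slide's numbers.

```python
import math
import statistics


def standard_error(sample):
    """SE = sample standard deviation / sqrt(N)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


sample1 = [4, 2, 8, 4]
sample2 = [4, 8, 6, 6]

se1 = standard_error(sample1)
se2 = standard_error(sample2)

print(round(se1, 4), round(se2, 4))  # 1.2583 0.8165
```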
- 10. Hypothesis Testing
  - The null hypothesis, H0
  - It is difficult to assume that the means are different and try to confirm it. Instead, assume that the means are not different and see if we have evidence to reject that assumption.
  - H0: there is no real difference between the means
  - Estimate the probability of observing such a difference if the means come from the same population.
  - Reject H0 if p < 0.05, i.e. accept that the means are different
  - z = (M1 - M2) / SE = (6 - 4.5) / 1.5 = 1, where SE is the standard error of the difference, sqrt(1.2583**2 + 0.8165**2) = 1.5
  - p = 2 * (1 - gsl_cdf_gaussian_P(abs(z), 1)) = 0.317
  - Actually we use the t distribution instead of the z distribution for p values.
  - t_test(pdl(4,8,6,6), pdl(4,2,8,4)) # 1 6 (t = 1, df = 6)
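
The arithmetic on this slide can be checked with a short Python sketch using only the standard library. It reproduces z, the two-tailed normal p value, and the pooled-variance t statistic with its degrees of freedom. The t-distribution p value is omitted because the standard library has no t CDF.

```python
import math
from statistics import NormalDist, mean, variance

sample1 = [4, 2, 8, 4]
sample2 = [4, 8, 6, 6]
n1, n2 = len(sample1), len(sample2)
diff = mean(sample2) - mean(sample1)  # 6 - 4.5

# Standard error of the difference between the two means.
se_diff = math.sqrt(variance(sample1) / n1 + variance(sample2) / n2)
z = diff / se_diff
p_z = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed p under the normal curve

# Independent-samples t statistic, assuming equal variances (pooled).
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / df
t = diff / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

print(round(z, 3), round(p_z, 3), round(t, 3), df)
```

With equal sample sizes the pooled and unpooled standard errors coincide, which is why z and t are both 1 here.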
- 11. A/B Test
  - Continuous vs. nominal scale
  - Binomial distribution
  - Two proportion z test
    - P = (x1 + x2) / (N1 + N2)
    - SE = sqrt( P(1 - P)(1/N1 + 1/N2) )
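
A minimal sketch of the two proportion z test in Python, following the pooled-proportion formulas on this slide. The function name `two_proportion_z_test` and the conversion counts are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, two-tailed p) for successes x1 of n1 vs. x2 of n2."""
    pooled = (x1 + x2) / (n1 + n2)                      # P
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical A/B counts: 120 of 1000 vs. 150 of 1000 conversions.
z, p = two_proportion_z_test(120, 1000, 150, 1000)
print(round(z, 3), round(p, 3))
```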
- 12. Relationship between Variables
  - Pearson correlation (r)
    - Covariance = [(x1-Mx)(y1-My) + ... + (xn-Mx)(yn-My)] / N
    - r = COV / (SDx * SDy)
    - r ranges over [-1, 1]
  - length(kw) and kw [search_count, download_count, lightbox_count]: [-0.09, 0.10, 0.07]
  - download_count and lightbox_count: 0.92
  - zy = r * zx + ε
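
The covariance and r formulas translate directly into code. The Python sketch below is an illustration with made-up data, not the keyword data from the slide, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """r = COV / (SDx * SDy), using population (divide-by-N) formulas."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sdx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sdy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sdx * sdy)


# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 4))
```

Because N cancels between the covariance and the two standard deviations, dividing by N - 1 throughout would give the same r.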
- 13. Linear Regression and Sum of Squares
  - The linear model: Y = A0 + A1X1 + A2X2 + ... + AnXn
  - Estimate values for parameters A0 .. An using observed X and Y scores.
  - Given new X's, calculate predicted Y's with the estimated A0 .. An values.
  - RMSE, root mean squared error
    - Standard deviation around predicted scores.
    - About 68% of the time the actual score will fall within 1 RMSE of the predicted score, assuming normally distributed errors.
  - PDL::Stats::GLM
    - Ordinary least squares regression: ols and ols_t
    - %m = $y->ols( $x )
    - print "$_\t$m{$_}\n" for (sort keys %m)
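
To make the estimate-then-predict steps concrete, here is a small Python sketch of ordinary least squares for the single-predictor case (Y = A0 + A1X1). It is not the PDL::Stats::GLM implementation; `fit_line` and `rmse` are hypothetical helpers, the data are made up, and RMSE is computed by dividing by N, which is one of several conventions.

```python
import math


def fit_line(xs, ys):
    """Closed-form least squares estimates (A0, A1) for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return intercept, slope


def rmse(ys, preds):
    """Root mean squared error around the predicted scores (divides by N)."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))


# Hypothetical observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a0, a1 = fit_line(xs, ys)
preds = [a0 + a1 * x for x in xs]
error = rmse(ys, preds)
print(round(a0, 3), round(a1, 3), round(error, 4))
```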
- 14. Sum of Squared Deviations (SS)
  - SStotal = (x1 - M)**2 + (x2 - M)**2 + ... + (xn - M)**2
- 15. Sum of Squared Deviations (SS)
  - SStotal = (x1 - M)**2 + (x2 - M)**2 + ... + (xn - M)**2
  - Var = [(x1 - M)**2 + (x2 - M)**2 + ... + (xn - M)**2] / N
  - SD = sqrt(Var)
- 16. Sum of Squared Deviations (SS)
  - SSerror = (y1 - Ypred.1)**2 + (y2 - Ypred.2)**2 + ... + (yn - Ypred.n)**2
- 17. Sum of Squared Deviations (SS)
  - SSerror = (y1 - Ypred.1)**2 + (y2 - Ypred.2)**2 + ... + (yn - Ypred.n)**2
  - SSmodel = SStotal - SSerror
- 18. Sum of Squared Deviations (SS)
  - SSerror = (y1 - Ypred.1)**2 + (y2 - Ypred.2)**2 + ... + (yn - Ypred.n)**2
  - SSmodel = SStotal - SSerror
  - R2 = SSmodel / SStotal
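
The sum-of-squares decomposition can be traced with a few lines of Python. In this sketch the observed scores and predictions are hypothetical (the predictions are what a least squares line through made-up data would give), and `sum_of_squares` is an illustrative helper name.

```python
def sum_of_squares(ys, preds):
    """Return (SStotal, SSerror, SSmodel, R2) for observed and predicted scores."""
    m = sum(ys) / len(ys)
    ss_total = sum((y - m) ** 2 for y in ys)
    ss_error = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_model = ss_total - ss_error
    return ss_total, ss_error, ss_model, ss_model / ss_total


# Hypothetical observed scores and the predictions of a fitted line.
ys = [2, 4, 5, 4, 5]
preds = [2.8, 3.4, 4.0, 4.6, 5.2]

ss_total, ss_error, ss_model, r2 = sum_of_squares(ys, preds)
print(round(ss_total, 3), round(ss_error, 3), round(ss_model, 3), round(r2, 3))
```

R2 read this way is the share of the total variability that the model accounts for.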
- 19. K-means Cluster Analysis
  - SStotal = (x1 - Mx)**2 + (x2 - Mx)**2 + ... + (xn - Mx)**2 + (y1 - My)**2 + (y2 - My)**2 + ... + (yn - My)**2 + ...
- 20. K-means Cluster Analysis
  - SSerror = (xc1.1 - Mc1.x)**2 + (xc1.2 - Mc1.x)**2 + ... + (xcn.n - Mcn.x)**2 + (yc1.1 - Mc1.y)**2 + (yc1.2 - Mc1.y)**2 + ... + (ycn.n - Mcn.y)**2 + ...
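
The two k-means quantities above differ only in which mean each point is compared against: the grand mean for SStotal, the point's own cluster mean for SSerror. The Python sketch below computes both for a fixed, hypothetical cluster assignment. It does not run the k-means iteration itself, and the helper names are illustrative.

```python
def centroid(points):
    """Per-dimension mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def ss_around(points, center):
    """Sum of squared deviations from center, over all dimensions."""
    return sum((v - c) ** 2 for p in points for v, c in zip(p, center))


def total_ss(points):
    """SStotal: deviations from the grand mean."""
    return ss_around(points, centroid(points))


def within_cluster_ss(points, labels):
    """SSerror: deviations of each point from its own cluster mean."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return sum(ss_around(members, centroid(members))
               for members in clusters.values())


# Hypothetical 2-D points and a hypothetical two-cluster assignment.
points = [(1, 1), (1, 2), (9, 9), (9, 10)]
labels = [0, 0, 1, 1]

ss_total = total_ss(points)
ss_error = within_cluster_ss(points, labels)
print(ss_total, ss_error, round(1 - ss_error / ss_total, 4))
```

A good clustering leaves SSerror small relative to SStotal, by analogy with R2 in regression.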
