# Statistics and Data Mining with Perl Data Language

Published in: Technology
### Statistics and Data Mining with Perl Data Language

1. Statistics and Data Mining with Perl Data Language. Maggie Xiong, maggie at shutterstock.com
2. PDL::Stats
3. Get Used to Variability
4. Get Used to Variability
5. Get Used to Variability
6. Know Your Data
    - Descriptive statistics
    - Frequency distribution, aka histogram
        - E.g., length of words in search queries
    - PDL::Stats::Distr: `$data->plot_distr('gaussian')`
7. Central Tendency and Spread
    - Central tendency
        - Mean, i.e. average: 5.36
        - M = (x1 + x2 + x3 + ... + xn) / N
        - Median, the 50th percentile: 5
        - Mode, the most frequent value: 4
    - Spread
        - Range, min to max: [1, 13]
        - Variance, the mean of squared deviations: 3.61
        - Var = [(x1 - M)**2 + (x2 - M)**2 + ... + (xn - M)**2] / N
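The measures on this slide can be sketched in plain Perl, without PDL, so that each formula is visible. The data set is hypothetical (the slide's own word-length data is not in the transcript), and the sub names are illustrative rather than PDL::Stats functions.

```perl
use strict;
use warnings;
use List::Util qw(sum min max);

# Mean: sum of the values divided by their count.
sub mean { my @x = @_; return sum(@x) / @x; }

# Median: middle value of the sorted data (mean of the two middle values for even N).
sub median {
    my @s   = sort { $a <=> $b } @_;
    my $mid = int(@s / 2);
    return @s % 2 ? $s[$mid] : ($s[$mid - 1] + $s[$mid]) / 2;
}

# Mode: the most frequent value (the smallest value wins a tie).
sub mode {
    my %count;
    $count{$_}++ for @_;
    my ($m) = sort { $count{$b} <=> $count{$a} || $a <=> $b } keys %count;
    return $m;
}

# Variance as on the slide: mean of squared deviations (divides by N).
sub variance {
    my @x = @_;
    my $m = mean(@x);
    return sum(map { ($_ - $m) ** 2 } @x) / @x;
}

my @len = (1, 4, 4, 4, 5, 6, 7, 9);   # hypothetical word lengths
printf "mean=%.2f median=%s mode=%s range=[%s,%s] var=%.2f\n",
    mean(@len), median(@len), mode(@len), min(@len), max(@len), variance(@len);
# prints: mean=5.00 median=4.5 mode=4 range=[1,9] var=5.00
```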
8. Abstraction of the Frequency Distribution
    - Mean
    - Variance -> standard deviation
        - SD = sqrt(variance)
    - Normalized score (z score)
        - z = (x - M) / SD
    - z score and probability
        - E.g., client file sizes
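As a small illustration of the z score formula, the following plain-Perl sketch standardizes a list of values. The file sizes are made up, and the SD here divides by N, matching the variance formula used in these slides.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# z = (x - M) / SD for every value, with SD computed over N (not N - 1).
sub z_scores {
    my @x  = @_;
    my $m  = sum(@x) / @x;
    my $sd = sqrt(sum(map { ($_ - $m) ** 2 } @x) / @x);
    return map { ($_ - $m) / $sd } @x;
}

my @sizes = (2, 4, 4, 4, 5, 5, 7, 9);   # hypothetical file sizes; M = 5, SD = 2
my @z     = z_scores(@sizes);
printf "%d -> z = %+.2f\n", $sizes[$_], $z[$_] for 0 .. $#sizes;
```

A value of 9 in this data set sits two standard deviations above the mean, so its z score is 2.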
9. Inferential Statistics
    - Sample and population
        - The amount of chocolate required to finish a task
        - Sample 1 mean: `pdl(4,2,8,4)->avg; # 4.5`
        - Sample 2 mean: `pdl(4,8,6,6)->avg; # 6`
    - Sampling distribution of the mean
        - Standard error: the standard deviation of the sample means
        - SE = standard_deviation / sqrt(N)
        - `pdl([4,2,8,4],[4,8,6,6])->se; # [1.2583, 0.8165]`
    - How do we know whether the difference is between two samples from the same population or between two populations?
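The standard errors quoted on this slide can be reproduced by hand. The sketch below assumes the standard deviation is the sample (N - 1) estimate, which is the choice that matches the slide's figures; the `se` sub here is a plain-Perl stand-in, not the PDL method itself.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Standard error of the mean: sample SD (N - 1 denominator) divided by sqrt(N).
sub se {
    my @x   = @_;
    my $n   = @x;
    my $m   = sum(@x) / $n;
    my $var = sum(map { ($_ - $m) ** 2 } @x) / ($n - 1);
    return sqrt($var / $n);
}

printf "%.4f %.4f\n", se(4, 2, 8, 4), se(4, 8, 6, 6);
# prints: 1.2583 0.8165
```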
10. Hypothesis Testing
    - The null hypothesis, H0
        - It is difficult to assume that the means are different and try to confirm it. Instead, assume that the means are not different and see whether we have evidence to reject that assumption.
        - H0: there is no real difference between the means
    - Estimate the probability of observing such a difference if the means come from the same population.
    - Reject H0 if p < 0.05, i.e. accept that the means are different
        - z = (M1 - M2) / SE = (6 - 4.5) / 1.5 = 1, where SE is the standard error of the difference, sqrt(1.2583**2 + 0.8165**2) = 1.5
        - p = 2 * (1 - gsl_cdf_gaussian_P(abs(z), 1)) = 0.317
    - In practice we use the t distribution instead of the z distribution for p values.
        - `t_test(pdl(4,8,6,6), pdl(4,2,8,4)) # 1 6`
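The `1 6` returned by `t_test` on this slide is consistent with a pooled-variance (equal variance) independent-samples t test, giving t = 1 with 6 degrees of freedom. The following plain-Perl sketch works under that assumption; the sub names are illustrative, and the p value lookup is left out because it needs a t distribution CDF that core Perl does not provide.

```perl
use strict;
use warnings;
use List::Util qw(sum);

sub mean { return sum(@_) / @_; }

# Sum of squared deviations from the sample mean.
sub ss {
    my @x = @_;
    my $m = mean(@x);
    return sum(map { ($_ - $m) ** 2 } @x);
}

# Pooled-variance two-sample t statistic and its degrees of freedom.
sub t_pooled {
    my ($x, $y) = @_;                                  # two array refs
    my ($n1, $n2) = (scalar @$x, scalar @$y);
    my $df         = $n1 + $n2 - 2;
    my $pooled_var = (ss(@$x) + ss(@$y)) / $df;
    my $se         = sqrt($pooled_var * (1 / $n1 + 1 / $n2));
    return ((mean(@$x) - mean(@$y)) / $se, $df);
}

my ($t, $df) = t_pooled([4, 8, 6, 6], [4, 2, 8, 4]);
printf "t = %.3f, df = %d\n", $t, $df;
# prints: t = 1.000, df = 6
```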
11. A/B Test
    - Continuous vs. nominal scale
    - Binomial distribution
    - Two-proportion z test
        - SE = sqrt( P(1 - P)(1/N1 + 1/N2) )
        - P = (x1 + x2) / (N1 + N2)
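A two-proportion z test for an A/B experiment might look like the sketch below. The conversion counts are hypothetical, the standard error is taken as the square root of P(1 - P)(1/N1 + 1/N2) with P the pooled proportion, and 1.96 is used as the two-sided cutoff for p < 0.05.

```perl
use strict;
use warnings;

# Two-proportion z statistic using the pooled proportion P.
sub two_prop_z {
    my ($x1, $n1, $x2, $n2) = @_;
    my $p  = ($x1 + $x2) / ($n1 + $n2);                   # pooled proportion
    my $se = sqrt($p * (1 - $p) * (1 / $n1 + 1 / $n2));   # SE of the difference
    return ($x1 / $n1 - $x2 / $n2) / $se;
}

# Hypothetical test: 100 of 1000 visitors convert on A, 150 of 1000 on B.
my $z = two_prop_z(100, 1000, 150, 1000);
printf "z = %.2f, %s at the 0.05 level\n",
    $z, abs($z) > 1.96 ? "significant" : "not significant";
# prints: z = -3.38, significant at the 0.05 level
```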
12. Relationship between Variables
    - Pearson correlation (r)
        - Covariance = [(x1-Mx)(y1-My) + ... + (xn-Mx)(yn-My)] / N
        - r = COV / (SDx * SDy)
        - r lies in [-1, 1]
    - length(kw) and kw [search_count, download_count, lightbox_count]: [-0.09, 0.10, 0.07]
    - download_count and lightbox_count: 0.92
    - zy = r * zx + ε
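The correlation formula can be checked with a few lines of plain Perl. This is a minimal sketch on made-up data; `pearson_r` is an illustrative name, and both the covariance and the standard deviations divide by N as in the slide.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Pearson r = covariance / (SDx * SDy), all computed over N.
sub pearson_r {
    my ($x, $y) = @_;                                   # two array refs of equal length
    my $n = @$x;
    my ($m_x, $m_y) = (sum(@$x) / $n, sum(@$y) / $n);
    my $cov  = sum(map { ($x->[$_] - $m_x) * ($y->[$_] - $m_y) } 0 .. $n - 1) / $n;
    my $sd_x = sqrt(sum(map { ($_ - $m_x) ** 2 } @$x) / $n);
    my $sd_y = sqrt(sum(map { ($_ - $m_y) ** 2 } @$y) / $n);
    return $cov / ($sd_x * $sd_y);
}

my @x = (1, 2, 3, 4, 5);   # hypothetical scores
my @y = (2, 1, 4, 3, 5);
printf "r = %.2f\n", pearson_r(\@x, \@y);
# prints: r = 0.80
```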
13. Linear Regression and Sum of Squares
    - The linear model: Y = A0 + A1X1 + A2X2 + ... + AnXn
        - Estimate values for parameters A1 .. An using observed X and Y scores.
        - Given new X's, calculate predicted Y's with the estimated A1 .. An values.
    - RMSE, root mean squared error
        - Standard deviation around predicted scores.
        - If the errors are roughly normally distributed, the actual score will fall within 1 RMSE of the predicted score about 68% of the time.
    - PDL::Stats::GLM
        - Ordinary least squares regression: ols and ols_t
        - `%m = $y->ols( $x )`
        - `print "$_\t$m{$_}\n" for (sort keys %m)`
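For intuition, the single-predictor case of the linear model (Y = A0 + A1X) can be fitted by hand. The sketch below is plain Perl on hypothetical data and is not how `ols` is implemented; it also assumes RMSE divides the squared error by N, whereas some definitions divide by the residual degrees of freedom.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Least-squares intercept (A0) and slope (A1) for one predictor.
sub fit_line {
    my ($x, $y) = @_;
    my $n = @$x;
    my ($m_x, $m_y) = (sum(@$x) / $n, sum(@$y) / $n);
    my $s_xy = sum(map { ($x->[$_] - $m_x) * ($y->[$_] - $m_y) } 0 .. $n - 1);
    my $s_xx = sum(map { ($_ - $m_x) ** 2 } @$x);
    my $a1 = $s_xy / $s_xx;
    return ($m_y - $a1 * $m_x, $a1);
}

# Root mean squared error of the fitted line (divides by N).
sub rmse {
    my ($x, $y, $a0, $a1) = @_;
    my $n = @$x;
    return sqrt(sum(map { ($y->[$_] - ($a0 + $a1 * $x->[$_])) ** 2 } 0 .. $n - 1) / $n);
}

my @x = (1, 2, 3, 4, 5);   # hypothetical predictor
my @y = (2, 1, 4, 3, 5);   # hypothetical outcome
my ($a0, $a1) = fit_line(\@x, \@y);
printf "A0 = %.2f, A1 = %.2f, RMSE = %.3f\n", $a0, $a1, rmse(\@x, \@y, $a0, $a1);
# prints: A0 = 0.60, A1 = 0.80, RMSE = 0.849
```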
14. Sum of Squared Deviations (SS)
    - SStotal = (x1 - M)**2 + (x2 - M)**2 + ... + (xn - M)**2
15. Sum of Squared Deviations (SS)
    - SStotal = (x1 - M)**2 + (x2 - M)**2 + ... + (xn - M)**2
    - Var = [(x1 - M)**2 + (x2 - M)**2 + ... + (xn - M)**2] / N
    - SD = sqrt(Var)
16. Sum of Squared Deviations (SS)
    - SSerror = (y1 - Ypred.1)**2 + (y2 - Ypred.2)**2 + ... + (yn - Ypred.n)**2
17. Sum of Squared Deviations (SS)
    - SSerror = (y1 - Ypred.1)**2 + (y2 - Ypred.2)**2 + ... + (yn - Ypred.n)**2
    - SSmodel = SStotal - SSerror
18. Sum of Squared Deviations (SS)
    - SSerror = (y1 - Ypred.1)**2 + (y2 - Ypred.2)**2 + ... + (yn - Ypred.n)**2
    - SSmodel = SStotal - SSerror
    - R2 = SSmodel / SStotal
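The sum-of-squares decomposition on these slides can be worked through numerically. In this plain-Perl sketch the observed scores are hypothetical and the predictions are assumed to come from a least-squares line (0.6 + 0.8x for x = 1..5); `ss_decompose` is an illustrative helper name.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Returns (SStotal, SSerror, SSmodel, R2) for observed and predicted scores.
sub ss_decompose {
    my ($y, $pred) = @_;                               # two array refs
    my $m        = sum(@$y) / @$y;
    my $ss_total = sum(map { ($_ - $m) ** 2 } @$y);
    my $ss_error = sum(map { ($y->[$_] - $pred->[$_]) ** 2 } 0 .. $#$y);
    my $ss_model = $ss_total - $ss_error;
    return ($ss_total, $ss_error, $ss_model, $ss_model / $ss_total);
}

my @y    = (2, 1, 4, 3, 5);             # hypothetical observed scores
my @pred = (1.4, 2.2, 3.0, 3.8, 4.6);   # assumed predictions, 0.6 + 0.8 * x
printf "SStotal=%.2f SSerror=%.2f SSmodel=%.2f R2=%.2f\n", ss_decompose(\@y, \@pred);
# prints: SStotal=10.00 SSerror=3.60 SSmodel=6.40 R2=0.64
```

With a single predictor, R2 equals the squared Pearson correlation between the predictor and the outcome; here 0.64 is 0.8 squared.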
19. K-means Cluster Analysis
    - SStotal = (x1 - Mx)**2 + (x2 - Mx)**2 + ... + (xn - Mx)**2 + (y1 - My)**2 + (y2 - My)**2 + ... + (yn - My)**2 + ...
20. K-means Cluster Analysis
    - SSerror = (xc1.1 - Mc1.x)**2 + (xc1.2 - Mc1.x)**2 + ... + (xcn.n - Mcn.x)**2 + (yc1.1 - Mc1.y)**2 + (yc1.2 - Mc1.y)**2 + ... + (ycn.n - Mcn.y)**2 + ...
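The same SStotal and SSerror bookkeeping carries over to clustering, where SSerror is measured around each cluster's own centroid. The following is a minimal plain-Perl sketch of the standard k-means loop on hypothetical 2-D points; it is not the PDL::Stats implementation, the initial centroids are chosen by hand, and all sub names are illustrative.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Squared Euclidean distance between two points (array refs).
sub sq_dist {
    my ($p, $q) = @_;
    return sum(map { ($p->[$_] - $q->[$_]) ** 2 } 0 .. $#$p);
}

# Centroid (per-dimension mean) of a list of points.
sub centroid {
    my @pts = @_;
    my $dim = @{ $pts[0] };
    return [ map { my $d = $_; sum(map { $_->[$d] } @pts) / @pts } 0 .. $dim - 1 ];
}

# Basic k-means; returns (array ref of cluster indices, array ref of centroids).
sub kmeans {
    my ($points, $centers, $max_iter) = @_;
    my @assign;
    for (1 .. $max_iter) {
        # Assignment step: each point joins its nearest centroid.
        my @new = map {
            my $p = $_;
            my ($best) = sort { sq_dist($p, $centers->[$a]) <=> sq_dist($p, $centers->[$b]) }
                         0 .. $#$centers;
            $best;
        } @$points;
        last if @assign && "@new" eq "@assign";        # assignments stable: converged
        @assign = @new;
        # Update step: move each centroid to the mean of its members.
        $centers = [ map {
            my $c = $_;
            my @members = map { $points->[$_] } grep { $assign[$_] == $c } 0 .. $#$points;
            @members ? centroid(@members) : $centers->[$c];
        } 0 .. $#$centers ];
    }
    return (\@assign, $centers);
}

my @pts = ([1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]);   # hypothetical data
my ($assign, $centers) = kmeans(\@pts, [ $pts[0], $pts[3] ], 20);

my $grand    = centroid(@pts);
my $ss_total = sum(map { sq_dist($_, $grand) } @pts);
my $ss_error = sum(map { sq_dist($pts[$_], $centers->[ $assign->[$_] ]) } 0 .. $#pts);
printf "clusters: %s  R2 = %.3f\n", "@$assign", ($ss_total - $ss_error) / $ss_total;
# prints: clusters: 0 0 0 1 1 1  R2 = 0.982
```

R2 here reads as the share of total variability accounted for by cluster membership, by analogy with the regression case.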