Statistics and Data Mining with Perl Data Language

Published in: Technology

  1. Statistics and Data Mining with Perl Data Language
     Maggie Xiong
     maggie at shutterstock.com
  2. PDL::Stats
  3. Get Used to Variability
  4. Get Used to Variability
  5. Get Used to Variability
  6. Know Your Data
     • Descriptive statistics
       • Frequency distribution, aka histogram
         • E.g., length of words in search queries
     • PDL::Stats::Distr
       • $data->plot_distr('gaussian')
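
To make the idea of a frequency distribution concrete, here is a minimal plain-Python sketch (not PDL, and not the author's code) that tallies word lengths and prints a text histogram. The query strings are hypothetical sample data.

```python
from collections import Counter

# Hypothetical search queries; the slide's real data is not available.
queries = ["red sports car", "happy family on beach", "abstract background"]

# Length of every word across all queries.
word_lengths = [len(word) for query in queries for word in query.split()]

# Frequency distribution: how many words have each length.
freq = Counter(word_lengths)

# Text histogram, one row per observed word length.
for length in sorted(freq):
    print(f"{length:2d} {'#' * freq[length]}")
```
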
  7. Central Tendency and Spread
     • Central tendency
       • Mean, i.e. average: 5.36
         • M = (x1 + x2 + x3 + ... + xn) / N
       • Median, the 50th percentile: 5
       • Mode, the most frequent value: 4
     • Spread
       • Range, min to max: [1, 13]
       • Variance, the mean of squared deviations: 3.61
         • [(x1 - M)**2 + (x2 - M)**2 + ... + (xn - M)**2] / N
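
The measures on this slide can be computed as in the plain-Python sketch below (a stand-in for PDL). The `lengths` list is hypothetical and is not the data behind the slide's figures of 5.36, 5, 4 and 3.61.

```python
import statistics

# Hypothetical word lengths; not the data behind the slide's figures.
lengths = [2, 4, 4, 5, 7, 8]

mean = statistics.mean(lengths)               # M = (x1 + ... + xn) / N
median = statistics.median(lengths)           # 50th percentile
mode = statistics.mode(lengths)               # most frequent value
value_range = (min(lengths), max(lengths))    # min to max

# Population variance: squared deviations from the mean, divided by N.
variance = sum((x - mean) ** 2 for x in lengths) / len(lengths)

print(mean, median, mode, value_range, variance)
```
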
  8. Abstraction of the Frequency Distribution
     • Mean
     • Variance -> standard deviation
       • SD = sqrt(variance)
     • Normalized score (z score)
       • z = (x - M) / SD
     • z score and probability
       • E.g., client file sizes
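
As a sketch of how a z score maps to a probability, the plain-Python example below assumes file sizes are roughly normally distributed. The mean, SD and file size are made-up numbers chosen for illustration.

```python
from statistics import NormalDist

# Hypothetical summary of client file sizes, in MB.
mean_mb = 100.0
sd_mb = 20.0
file_mb = 140.0

# Normalized score: distance from the mean in SD units.
z = (file_mb - mean_mb) / sd_mb

# Under the normality assumption, the share of files no larger than this one.
share_below = NormalDist().cdf(z)

print(f"z = {z:.2f}, share of files at or below this size = {share_below:.4f}")
```
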
  9. Inferential Statistics
     • Sample and population
       • The amount of chocolate required to finish a task
         • Sample 1 mean: pdl(4,2,8,4)->avg; # 4.5
         • Sample 2 mean: pdl(4,8,6,6)->avg; # 6
     • Sampling distribution of the mean
       • Standard error: the standard deviation of the sample means
         • SE = standard_deviation / sqrt(N)
         • pdl([4,2,8,4],[4,8,6,6])->se; # [1.2583, 0.8165]
     • How do we know if the difference is between two samples from the same population or between two populations?
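
The standard errors quoted on this slide can be reproduced in plain Python. The sketch below assumes the unbiased (N - 1) standard deviation, which is the choice that matches the slide's figures; `standard_error` is a helper name introduced here for illustration.

```python
import math
import statistics

def standard_error(sample):
    """Standard error of the mean: unbiased SD divided by sqrt(N)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

# The two chocolate samples from the slide.
sample1 = [4, 2, 8, 4]
sample2 = [4, 8, 6, 6]

se1 = standard_error(sample1)
se2 = standard_error(sample2)

print(round(se1, 4), round(se2, 4))  # 1.2583 0.8165
```
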
  10. Hypothesis Testing
      • The null hypothesis, H0
        • It is difficult to assume that the means are different and try to confirm it.
        • Instead, assume that the means are not different and see if we have evidence to reject that assumption.
        • H0: there is no real difference between the means.
      • Estimate the probability of observing such a difference if the means come from the same population.
        • Reject H0 if p < 0.05, i.e. accept that the means are different.
        • SE of the difference = sqrt(1.2583**2 + 0.8165**2) = 1.5
        • z = (M1 - M2) / SE = (6 - 4.5) / 1.5 = 1
        • p = 2 * (1 - gsl_cdf_gaussian_P(abs(z), 1)) = 0.317
      • Actually we use the t-distribution instead of the z-distribution for p values.
        • t_test(pdl(4,8,6,6), pdl(4,2,8,4)) # t = 1, df = 6
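
The arithmetic on this slide can be checked with a short plain-Python sketch. It computes the standard error of the difference, the z score and its two-tailed p value, and then a pooled-variance t statistic with its degrees of freedom. The standard library has no t-distribution, so the p value for t is left out.

```python
import math
import statistics
from statistics import NormalDist

# The two chocolate samples from the previous slide.
sample1 = [4, 2, 8, 4]
sample2 = [4, 8, 6, 6]
n1, n2 = len(sample1), len(sample2)
m1, m2 = statistics.mean(sample1), statistics.mean(sample2)
var1, var2 = statistics.variance(sample1), statistics.variance(sample2)

# Standard error of the difference between the two means.
se_diff = math.sqrt(var1 / n1 + var2 / n2)

# z score of the observed difference and its two-tailed p value.
z = (m2 - m1) / se_diff
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))

# Pooled-variance t statistic and degrees of freedom.
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
t = (m2 - m1) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

print(f"z = {z:.3f}, p = {p_two_tailed:.3f}, t = {t:.3f}, df = {df}")
```

With p around 0.317, well above 0.05, these two small samples give no grounds to reject H0.
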
  11. A/B Test
      • Continuous vs. nominal scale
      • Binomial distribution
        • Two-proportion z test
          • SE = sqrt( P(1 - P)(1/N1 + 1/N2) )
          • P = (x1 + x2) / (N1 + N2)
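
A two-proportion z test with the pooled proportion P can be sketched in plain Python as follows. The conversion counts are hypothetical, and the SE includes the square root that the standard formula requires.

```python
import math
from statistics import NormalDist

# Hypothetical A/B results: conversions (x) out of visitors (n).
x_a, n_a = 120, 1000
x_b, n_b = 150, 1000

p_a = x_a / n_a
p_b = x_b / n_b

# Pooled proportion under H0 (no difference between A and B).
p_pooled = (x_a + x_b) / (n_a + n_b)

# Standard error of the difference between the two proportions.
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"P = {p_pooled:.3f}, SE = {se:.4f}, z = {z:.3f}, p = {p_value:.4f}")
```
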
  12. Relationship between Variables
      • Pearson correlation (r)
        • Covariance = [(x1 - Mx)(y1 - My) + ... + (xn - Mx)(yn - My)] / N
        • r = COV / (SDx * SDy)
        • r ranges over [-1, 1]
      • length(kw) and kw [search_count, download_count, lightbox_count]: [-0.09, 0.10, 0.07]
      • download_count and lightbox_count: 0.92
      • zy = r * zx + ε
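
The covariance and correlation formulas on this slide translate directly into plain Python. The `x` and `y` values below are hypothetical, and both covariance and SD divide by N, as on the slides.

```python
import math

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mx = sum(x) / n
my = sum(y) / n

# Covariance: mean product of paired deviations.
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n

# Population standard deviations (dividing by N).
sd_x = math.sqrt(sum((xi - mx) ** 2 for xi in x) / n)
sd_y = math.sqrt(sum((yi - my) ** 2 for yi in y) / n)

# Pearson correlation; note the parentheses around the denominator.
r = cov / (sd_x * sd_y)

print(f"cov = {cov:.3f}, r = {r:.4f}")
```
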
  13. Linear Regression and Sum of Squares
      • The linear model: Y = A0 + A1X1 + A2X2 + ... + AnXn
        • Estimate values for parameters A1 .. An using observed X and Y scores.
        • Given new X's, calculate predicted Y's with the estimated A1 .. An values.
      • RMSE: root mean squared error
        • Standard deviation around predicted scores.
        • About 68% of the time the actual score will fall within 1 RMSE of the predicted score.
      • PDL::Stats::GLM
        • Ordinary least squares regression: ols and ols_t
        • %m = $y->ols( $x );
        • print "$_\t$m{$_}\n" for (sort keys %m);
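
For the single-predictor case, the least-squares estimates and RMSE can be sketched in plain Python. This is not the PDL::Stats::GLM `ols` implementation; the data are hypothetical, and RMSE here divides by N (some texts divide by N - 2).

```python
import math

# Hypothetical observed scores.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mx = sum(x) / n
my = sum(y) / n

# Least-squares estimates for Y = A0 + A1 * X.
a1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
      / sum((xi - mx) ** 2 for xi in x))
a0 = my - a1 * mx

# Predicted scores and root mean squared error (dividing by N).
y_pred = [a0 + a1 * xi for xi in x]
rmse = math.sqrt(sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred)) / n)

print(f"A0 = {a0:.2f}, A1 = {a1:.2f}, RMSE = {rmse:.4f}")
```
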
  14. Sum of Squared Deviations (SS)
      • SStotal = (x1 - M)**2 + (x2 - M)**2 + ... + (xn - M)**2
  15. Sum of Squared Deviations (SS)
      • SStotal = (x1 - M)**2 + (x2 - M)**2 + ... + (xn - M)**2
      • Var = [(x1 - M)**2 + (x2 - M)**2 + ... + (xn - M)**2] / N
      • SD = sqrt(Var)
  16. Sum of Squared Deviations (SS)
      • SSerror = (y1 - Ypred.1)**2 + (y2 - Ypred.2)**2 + ... + (yn - Ypred.n)**2
  17. Sum of Squared Deviations (SS)
      • SSerror = (y1 - Ypred.1)**2 + (y2 - Ypred.2)**2 + ... + (yn - Ypred.n)**2
      • SSmodel = SStotal - SSerror
  18. Sum of Squared Deviations (SS)
      • SSerror = (y1 - Ypred.1)**2 + (y2 - Ypred.2)**2 + ... + (yn - Ypred.n)**2
      • SSmodel = SStotal - SSerror
      • R2 = SSmodel / SStotal
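
The sum-of-squares decomposition can be illustrated with a small plain-Python sketch. The data are hypothetical, and the predictions come from a one-predictor least-squares line fitted inside the example.

```python
# Hypothetical observed scores.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mx = sum(x) / n
my = sum(y) / n

# One-predictor least-squares line, used only to obtain predictions.
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
intercept = my - slope * mx
y_pred = [intercept + slope * xi for xi in x]

# Squared deviations of y from its mean, and from the predictions.
ss_total = sum((yi - my) ** 2 for yi in y)
ss_error = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
ss_model = ss_total - ss_error

# Proportion of the total variability accounted for by the model.
r_squared = ss_model / ss_total

print(f"SStotal = {ss_total:.2f}, SSerror = {ss_error:.2f}, R2 = {r_squared:.2f}")
```
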
  19. K-means Cluster Analysis
      • SStotal = (x1 - Mx)**2 + (x2 - Mx)**2 + ... + (xn - Mx)**2
                + (y1 - My)**2 + (y2 - My)**2 + ... + (yn - My)**2
                + ...
  20. K-means Cluster Analysis
      • SSerror = (xc1.1 - Mc1.x)**2 + (xc1.2 - Mc1.x)**2 + ... + (xcn.n - Mcn.x)**2
                + (yc1.1 - Mc1.y)**2 + (yc1.2 - Mc1.y)**2 + ... + (ycn.n - Mcn.y)**2
                + ...
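
The two sums of squares for clustering can be sketched with a minimal k-means in plain Python. This is not the PDL::Stats::Kmeans implementation. The points, the starting centroids and the helper names `sq_dist` and `kmeans` are all assumptions made for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance, summed over all dimensions."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def kmeans(points, start_centroids, max_iter=100):
    """Basic k-means: reassign points and update centroids until stable."""
    centroids = list(start_centroids)
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids

# Hypothetical 2-D observations forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
assignment, centroids = kmeans(points, [points[0], points[3]])

# SStotal: squared deviations from the grand mean, over x and y.
grand_mean = tuple(sum(dim) / len(points) for dim in zip(*points))
ss_total = sum(sq_dist(p, grand_mean) for p in points)

# SSerror: squared deviations from each point's own cluster mean.
ss_error = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))

print(f"SStotal = {ss_total:.2f}, SSerror = {ss_error:.2f}")
```

A good clustering leaves SSerror small relative to SStotal, in the same way that a good regression model leaves little error variance.
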