R - what do the numbers mean? #RStats

This is the presentation for my demo at Orlando Live60 AILIve. We go through statistics interpretation with examples.

Published in: Technology

  1. R and AI: what do the numbers mean? Level: Intermediate
  2. JenStirrup • Boutique Consultancy Owner of Data Relish • Postgraduate degrees in Artificial Intelligence and Cognitive Science • Twenty year career in industry • Author JenStirrup.com DataRelish.com
  3. Get in touch! • http://bit.ly/JenStirrupRD • http://bit.ly/JenStirrupLinkedIn • http://bit.ly/JenStirrupMVP • http://bit.ly/JenStirrupTwitter
  4. Let your data surprise you!
  5. AutoML: How do you know if your results are correct?
  6. AutoML Demo
  7. What does Anscombe’s Quartet look like?
  8. Looks good, doesn’t it?
  9. So, it is correct?
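The question on these slides can be checked numerically. The following is a minimal sketch, assuming R's built-in `anscombe` data frame (columns `x1` to `x4` and `y1` to `y4`); the name `quartet_summary` is ours, not from the talk.

```r
# Anscombe's quartet ships with base R as the data frame `anscombe`.
data(anscombe)

# Compute the same summary statistics for each of the four x/y pairs.
quartet_summary <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y), cor_xy = cor(x, y))
})
colnames(quartet_summary) <- paste0("set", 1:4)

# The four columns are nearly identical, although the four plots differ a lot.
print(round(quartet_summary, 2))
```

Summary statistics that agree to two decimal places can hide very different shapes, which is why plotting comes before interpreting.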
  10. Correlation r = 0.96
      Year: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
      Number of people who died by becoming tangled in their bedsheets, deaths (US) (CDC): 327 456 509 497 596 573 661 741 809 717
      Total revenue generated by skiing facilities (US), dollars in millions (US Census): 1,551 1,635 1,801 1,827 1,956 1,989 2,178 2,257 2,476 2,438
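The correlation on this slide can be recomputed from the figures shown. This is a small sketch; the vector names are our own, and the values are copied from the slide.

```r
# Values copied from the slide (2000 to 2009).
bedsheet_deaths  <- c(327, 456, 509, 497, 596, 573, 661, 741, 809, 717)
ski_revenue_musd <- c(1551, 1635, 1801, 1827, 1956, 1989, 2178, 2257, 2476, 2438)

# Pearson correlation coefficient between the two series.
r <- cor(bedsheet_deaths, ski_revenue_musd)
print(r)

# A value this close to 1 says the series rise together;
# it says nothing about one causing the other.
```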
  11. Why R? • Most widely used data analysis software - used by 2M+ data scientists, statisticians and analysts • Most powerful statistical programming language • Flexible, extensible and comprehensive for productivity • Create beautiful and unique data visualisations - as seen in New York Times, Twitter and Flowing Data • Thriving open-source community - leading edge of analytics research • Fills the talent gap - new graduates prefer R.
  12. What are we testing? • We have one or two samples and a hypothesis, which may be true or false. • The NULL hypothesis – nothing happened. • The Alternative hypothesis – something did happen.
  13. Strategy • We set out to prove that something did happen. • We look at the distribution of the data. • We choose a test statistic. • We look at the p value.
  14. What do I need to install? • Install R – www.r-project.org • Install RStudio – www.rstudio.com • AzureML • AutoML
  15. “Every American should have above average income, and my Administration is going to see they get it.” (Bill Clinton on campaign trail) “It’s clearly a budget. It’s got lots of numbers in it.” (George W. Bush)
  16. The Guinness Overall Enjoyment Score
  17. William Sealy Gosset
  18. What does the t-test give us? • The t-test helps us to work out whether two sets of data are actually different. • It takes two sets of data, and calculates the mean, the variance and the standard deviation.
  19. What does the t-test give us? • Then it does a more sophisticated test to tell us if the means of those two populations are different.
  20. Enter the t-test • The t-test: a simple way of establishing whether there are significant differences between two groups of data. • The lower the p value, the stronger the evidence that the two groups differ. • We want the p value to be less than 5% (0.05) to show a difference between two groups. [Chart: sample size, mean and standard deviation for Ireland vs. elsewhere]
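A two-sample t-test of this kind might look as follows in R. This is only a sketch: the scores are simulated, made-up numbers (not the study's data), and the group means, spread and sample sizes are assumptions chosen for illustration.

```r
set.seed(42)  # reproducible made-up data, NOT the study's data

# Hypothetical enjoyment scores (0 to 100) for two groups of pints.
ireland   <- rnorm(40, mean = 74, sd = 7)
elsewhere <- rnorm(40, mean = 57, sd = 7)

# Welch two-sample t-test (R's default; does not assume equal variances).
result <- t.test(ireland, elsewhere)

print(result$statistic)  # the t statistic
print(result$p.value)    # compared against the 0.05 threshold below

# TRUE means we reject the null hypothesis of "no difference" at the 5% level.
significant <- result$p.value < 0.05
print(significant)
```

`t.test()` returns an object holding the test statistic, degrees of freedom, p value and confidence interval, so the 5% rule from the slide becomes a single comparison.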
  21. The Results! • Using the averages, researchers concluded that Guinness served in Ireland is significantly better than pints served elsewhere.
  22. Summary • The t-test is a valuable tool for showing differences or similarities between groups. • It has been used here to identify whether Guinness is better in Ireland or outside of Ireland.
  23. Business and Statistics? Why? • Statistical analysis is used widely in businesses • Marketing – customer classification, spending patterns • Management consulting – efficient use of resources
  24. Statistically Significant • If you have a significant result, it means that your results are unlikely to have happened by chance. • If you don’t have statistically significant results, you throw your test data out (as it doesn’t show anything!); in other words, you can’t reject the null hypothesis.
  25. Numerical Measures – what is interesting? • Centre of the data • Spread of the data
  26. Measures of Central Tendency • Mean – this is the average • Median – splits the data in two halves • Mode – the most popular value
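The three measures can be sketched in R on a small made-up vector. Note that base R has no built-in statistical mode (`mode()` returns the storage type), so `stat_mode` below is a hypothetical helper written for this example.

```r
x <- c(2, 3, 3, 5, 7, 10)  # made-up sample

x_mean   <- mean(x)    # arithmetic average
x_median <- median(x)  # middle value; average of the two middle values here

# Hypothetical helper: the most frequent value (first one wins on ties).
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}
x_mode <- stat_mode(x)

print(c(mean = x_mean, median = x_median, mode = x_mode))
```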
  27. Measures of Dispersion • Variance – average squared difference between the data points and the mean • Standard Deviation – square root of the variance, more intuitive
  28. Measures of Dispersion • Percentiles – dataset is divided into 100 equal parts • Quartiles – dataset is divided into four equal parts • Interquartile range – middle 50% of data points
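These measures of dispersion map directly onto base R functions. A minimal sketch on a made-up vector; one detail worth knowing is that `var()` and `sd()` compute the sample versions, dividing by n - 1 rather than n.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up sample, mean is 5

x_var <- var(x)  # sample variance: sum of squared deviations / (n - 1)
x_sd  <- sd(x)   # standard deviation: square root of the variance

# Quartiles (25th, 50th and 75th percentiles), default interpolation method.
x_quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75))

# Interquartile range: width of the middle 50% of the data.
x_iqr <- IQR(x)

print(x_var)
print(x_sd)
print(x_quartiles)
print(x_iqr)
```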
  29. Measures of Association • Covariance – how variables vary together, rise together, fall together • Correlation – very similar, shown between -1 and 1
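The link between the two measures can be shown in a few lines: correlation is covariance rescaled by both standard deviations, which is what confines it to the range -1 to 1. The data below are made up for illustration.

```r
x <- c(1, 2, 3, 4, 5)  # made-up data
y <- c(2, 4, 5, 4, 5)

cov_xy <- cov(x, y)  # depends on the units of x and y
cor_xy <- cor(x, y)  # unit-free, always between -1 and 1

# Correlation is covariance divided by the product of the standard deviations.
cor_by_hand <- cov_xy / (sd(x) * sd(y))

print(c(covariance = cov_xy, correlation = cor_xy, by_hand = cor_by_hand))
```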
  30. Measuring Uncertainty • Probability is based on SETS, which we use in SQL • We determine the probability of outcomes: – Addition Rule – Multiplication Rule – Complement Rule
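Because these rules are set-based, they can be sketched with R's set functions (`union`, `intersect`, `setdiff`). The example assumes one roll of a fair six-sided die; the helper `p()` and the event names are ours.

```r
# Sample space: one roll of a fair six-sided die.
faces <- 1:6

# Probability of an event (a set of faces) when all faces are equally likely.
p <- function(event) length(event) / length(faces)

even <- c(2, 4, 6)
high <- c(5, 6)

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_even_or_high <- p(even) + p(high) - p(intersect(even, high))

# Complement rule: P(not A) = 1 - P(A)
p_not_even <- 1 - p(even)

# Multiplication rule for independent events: two separate rolls, both even.
p_two_evens <- p(even) * p(even)

print(c(p_even_or_high, p_not_even, p_two_evens))
```

The addition rule result matches `p(union(even, high))`, which is the same idea as a SQL `UNION` removing rows counted twice.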
  31. Probability Distributions • Binomial Distribution – one of two outcomes on each trial • Geometric Distribution – number of failures before the first success • Poisson Distribution – probability that a number of events will occur within a time frame • Uniform Distribution – evenly distributed variables • Normal Distribution – bell shaped curve
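Each of these distributions has a family of functions in base R (`d` for density or mass, `p` for cumulative probability). A short sketch with parameter values chosen only for illustration:

```r
# Binomial: probability of exactly 3 heads in 10 fair coin flips.
p_binom <- dbinom(3, size = 10, prob = 0.5)

# Geometric: probability of exactly 2 failures before the first success,
# when each trial succeeds with probability 0.2.
p_geom <- dgeom(2, prob = 0.2)

# Poisson: probability of 0 events in a period where 3 are expected.
p_pois <- dpois(0, lambda = 3)

# Uniform on [0, 1]: probability of a value at or below 0.25.
p_unif <- punif(0.25, min = 0, max = 1)

# Standard normal: probability of a value at or below 1.96.
p_norm <- pnorm(1.96)

print(c(p_binom, p_geom, p_pois, p_unif, p_norm))
```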
  32. Statistical Inference • Process of drawing conclusions about a population from randomly drawn samples
  33. Linear Regression • We use sample data to work out the strength and direction of a relationship between two variables.
  34. Linear Regression • The formula works out the • X: predictor variable, also known as the independent variable • Y: response variable, also known as the dependent variable • lm(y ~ x, data = dataframe)
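As a concrete instance of the `lm()` call, here is a minimal sketch using R's built-in `cars` dataset (stopping distance against speed). The dataset choice is ours and is not necessarily the one used in the demo.

```r
# Built-in dataset: `speed` (mph) is the predictor, `dist` (ft) the response.
data(cars)

# Fit response ~ predictor.
fit <- lm(dist ~ speed, data = cars)

# Intercept and slope: the slope's sign gives the direction of the relationship.
print(coef(fit))

# R-squared: the share of variance in `dist` explained by `speed`.
print(summary(fit)$r.squared)
```

`summary(fit)` prints the full table (coefficients, standard errors, t values, p values, residual standard error and the F statistic) in one go.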
  35. First Impressions? • How do you go about it? • Check the plot first; how does it look?
  36. What tools do we have in R? • In data wrangling, what are the main tasks? – Filtering rows – Selecting columns of data – Adding new variables – Sorting – Aggregating
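The five wrangling tasks can be sketched in base R alone. This example assumes the built-in `mtcars` dataset; packages such as dplyr offer equivalents, but nothing beyond base R is needed here, and the variable names are ours.

```r
data(mtcars)

# Filtering rows: cars that do more than 25 miles per gallon.
efficient <- mtcars[mtcars$mpg > 25, ]

# Selecting columns of data.
slim <- mtcars[, c("mpg", "cyl", "wt")]

# Adding a new variable: `wt` is in units of 1000 lb, convert to kilograms.
slim$wt_kg <- slim$wt * 1000 * 0.45359237

# Sorting: highest mpg first.
sorted <- slim[order(-slim$mpg), ]

# Aggregating: mean mpg for each cylinder count.
mean_mpg_by_cyl <- aggregate(mpg ~ cyl, data = slim, FUN = mean)

print(head(sorted, 3))
print(mean_mpg_by_cyl)
```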
  37. What tools do we have in R? • 80% of your time will be spent preparing and wrangling data • The remainder of your time will be spent complaining about it.
  38. Plotted Example Data
  39. Plotted Example Data
  40. Multiple Regression • In simple linear regression, a criterion variable is predicted from one predictor variable. • In multiple regression, the criterion is predicted by two or more variables.
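In R, moving from simple to multiple regression only means adding predictors to the formula. A minimal sketch, assuming the built-in `mtcars` dataset with fuel economy predicted from weight and horsepower (our choice of example):

```r
data(mtcars)

# Simple regression: one predictor.
fit_simple <- lm(mpg ~ wt, data = mtcars)

# Multiple regression: two predictors joined with `+`.
fit_multi <- lm(mpg ~ wt + hp, data = mtcars)

# One coefficient per predictor, plus the intercept.
print(coef(fit_multi))

# Compare how much variance each model explains.
print(c(simple = summary(fit_simple)$r.squared,
        multiple = summary(fit_multi)$r.squared))
```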
  41. Residuals
  42. Interpreting our Results
  43. Evaluate Model • Receiver Operating Characteristic (ROC) curves • Precision/Recall curves • Lift curves
  44. P value • Compare the p-value for the F-test to your significance level. – If the p-value is less than the significance level, your sample data provide sufficient evidence to conclude that your regression model fits the data better than the model with no independent variables.
  45. F-Test • An F statistic is a value you get when you run an ANOVA test or a regression analysis to find out if the means between two populations are significantly different.
  46. F-Test • A t-test will tell you if a single variable is statistically significant and an F-test will tell you if a group of variables are jointly significant.
  47. The F-Test • If none of the variables are significant, then the overall F-test is not significant. – It’s an early test so you can throw the model out. • The F-Test can show if the variables are jointly significant • F-test sums the predictive power of all variables
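The overall F-test and its p-value can be pulled out of a fitted model. A sketch under the assumption of a `mtcars` model with two predictors; `summary()` prints the p-value but stores only the F statistic and its degrees of freedom, so the p-value is recomputed with `pf()`.

```r
data(mtcars)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Named vector: the F value, numerator df and denominator df.
f_stat <- summary(fit)$fstatistic
print(f_stat)

# Upper-tail probability of the F distribution gives the overall p-value.
p_value <- unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                     lower.tail = FALSE))
print(p_value)

# TRUE: the predictors are jointly significant at the 5% level, i.e. the
# model fits better than the intercept-only model.
jointly_significant <- p_value < 0.05
print(jointly_significant)
```

The per-variable t-tests are in `summary(fit)$coefficients`, so both kinds of test come from the same summary object.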
  48. RMSE • RMSE measures how accurately the model predicts the response. • It is the most important criterion for model fit if the main purpose of the model is prediction.
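RMSE is the square root of the mean squared residual. A minimal sketch on the built-in `cars` dataset; note that this is the in-sample RMSE, and for a prediction-focused model it would normally be computed on held-out data instead.

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)

# Residual = observed value minus fitted value.
res <- residuals(fit)

# Root mean squared error, in the same units as the response (feet here).
rmse <- sqrt(mean(res^2))
print(rmse)
```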
  49. Model validation - probability • Most of the model validation centers around the residuals (essentially the distance of the data points from the fitted regression line)
  50. Model validation – Q-Q • Quantile-Quantile plots help evaluate the fit of sample data to the normal distribution. Is the data close to being normally distributed, or are there a lot of outliers, for example?
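A Q-Q check of the residuals might be sketched as below, again assuming a simple model on the built-in `cars` data. `plot.it = FALSE` returns the plotted coordinates without drawing; in an interactive session you would typically call `qqnorm(std_res)` followed by `qqline(std_res)`.

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)

# Standardized residuals, so they are comparable to a standard normal.
std_res <- rstandard(fit)

# Theoretical normal quantiles (x) against sample quantiles (y).
qq <- qqnorm(std_res, plot.it = FALSE)

# Points close to a straight line suggest approximately normal residuals;
# the correlation of the two sets of quantiles is a rough numeric summary.
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)
```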
  51. How do you interpret the results? Scale-Location Plot • The scale-location plot in the upper right shows the square root of the absolute standardized residuals (sort of a square root of relative error) as a function of the fitted values. • We hope not to see an obvious trend in this plot.
  52. How do you interpret the results? Importance of each Point • Cook’s Distance – Measure of the importance of each observation to the regression – Distances larger than 1 are suspicious – Outlier
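Cook's distance is available directly in base R. A minimal sketch on the built-in `cars` model, applying the "larger than 1" rule of thumb from the slide; the variable names are ours.

```r
data(cars)
fit <- lm(dist ~ speed, data = cars)

# One Cook's distance per observation.
cd <- cooks.distance(fit)

# Rule of thumb from the slide: distances larger than 1 are suspicious.
suspicious <- which(cd > 1)

# The single most influential observation, whether or not it crosses 1.
most_influential <- names(which.max(cd))

print(round(max(cd), 3))
print(most_influential)
print(length(suspicious))
```

`plot(fit, which = 4)` draws the same distances as a chart, which makes influential rows easy to spot.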
  53. Thank you! @jenstirrup
  54. JenStirrup • Boutique Consultancy Owner of Data Relish • Postgraduate degrees in Artificial Intelligence and Cognitive Science • Twenty year career in industry • Author JenStirrup.com DataRelish.com
  55. Get in touch! • http://bit.ly/JenStirrupRD • http://bit.ly/JenStirrupLinkedIn • http://bit.ly/JenStirrupMVP • http://bit.ly/JenStirrupTwitter
  56. Let your data surprise you!
  57. References and Thanks • R and Data Mining: Examples and Case Studies by Yanchang Zhao