Statistics for data scientists


A humble collection of statistics needed for data scientists

Published in: Data & Analytics

  1. 1. Statistics for Data Scientists
  2. 2. Agenda: Revision; Data; Statistics (Descriptive, Central Tendency, Variation, Distributions); Data Mining
  3. 3. Basics of Data Science. Hacking skills: being able to manipulate text files at the command line, understanding vectorized operations, and thinking algorithmically; these are the hacking skills that make for a successful data hacker. Machine learning: data plus math and statistics only gets you machine learning, which is great if that is what you are interested in, but not if you are doing data science. Academia: the culture of academia does not reward researchers for understanding technology. Danger zone: this overlap of skills gives people the ability to create what appears to be a legitimate analysis without any understanding of how they got there or what they have created.
  4. 4. What is Business Analytics? Definition: the study of business data using statistical techniques and programming to create decision support and insights for achieving business goals. Predictive: to predict the future. Descriptive: to describe the past.
  5. 5. Data: Data is a set of values of qualitative or quantitative variables. An example of qualitative data would be an anthropologist's handwritten notes about her interviews. Data is collected by a huge range of organizations and institutions, including businesses (e.g., sales data, revenue, profits, stock price), governments (e.g., crime rates, unemployment rates, literacy rates) and non-governmental organizations (e.g., censuses of the number of homeless people by non-profit organizations). Data is measured, collected, reported, and analyzed, whereupon it can be visualized using graphs, images or other analysis tools. Data is distinct pieces of information, usually formatted in a special way. All software is divided into two general categories: data and programs. Programs are collections of instructions for manipulating data. Data can exist in a variety of forms: as numbers or text on pieces of paper, as bits and bytes stored in electronic memory, or as facts stored in a person's mind.
  6. 6. Data: Definition of data in English: data, noun [mass noun]. (1) Facts and statistics collected together for reference or analysis: ‘there is very little data available’. (2) The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media. (3) Philosophy: things known or assumed as facts, making the basis of reasoning or calculation.
  7. 7. Variable: something that varies.
  8. 8. Variable: Nominal variables are variables that have two or more categories, but which do not have an intrinsic order. Dichotomous variables are nominal variables which have only two categories or levels. Ordinal variables are variables that have two or more categories, just like nominal variables, except that the categories can also be ordered or ranked, for example a rating scale running from Excellent to Horrible. Interval variables are variables whose central characteristic is that they can be measured along a continuum and have a numerical value (for example, temperature measured in degrees Celsius or Fahrenheit). Ratio variables are interval variables with the added condition that 0 (zero) of the measurement indicates that there is none of that variable; for example, a distance of ten metres is twice a distance of five metres.
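As a rough illustration in R (the language used on the later slides), nominal and ordinal variables can be stored as unordered and ordered factors. The variable names and values below are made up for the example and do not come from the slides.

```r
# Nominal variable: categories with no intrinsic order (hypothetical values)
colour <- factor(c("red", "blue", "green", "blue"))

# Ordinal variable: same idea, but the levels are ranked from worst to best
rating <- factor(c("Good", "Horrible", "Excellent"),
                 levels = c("Horrible", "Poor", "Good", "Excellent"),
                 ordered = TRUE)

is.ordered(colour)        # FALSE: colours have no ranking
is.ordered(rating)        # TRUE
rating[1] > rating[2]     # TRUE: "Good" ranks above "Horrible"

# Ratio variable: zero means "none", so ratios are meaningful
distance_m <- c(5, 10)
distance_m[2] / distance_m[1]   # 2
```

Comparisons such as `>` are only defined for the ordered factor, which mirrors the nominal versus ordinal distinction on the slide.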
  9. 9. Central Tendency. Mean: the arithmetic mean is the sum of the values divided by the number of values. The geometric mean is an average that is useful for sets of positive numbers that are interpreted according to their product and not their sum (as is the case with the arithmetic mean), e.g. rates of growth. Median: the number separating the higher half of a data sample, a population, or a probability distribution from the lower half. Mode: the value that occurs most often.
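The measures of central tendency above can be sketched in base R as follows. The sample `x` and the helper name `stat_mode` are assumptions made for illustration; base R has no built-in function for the statistical mode.

```r
x <- c(2, 4, 4, 8, 16)             # hypothetical sample of positive values

arith_mean <- mean(x)              # sum of values / number of values
geo_mean   <- exp(mean(log(x)))    # n-th root of the product (positive x only)
med        <- median(x)            # middle value of the sorted sample

# Base R's mode() reports the storage type, not the statistical mode,
# so this small helper returns the most frequent value instead.
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

arith_mean      # 6.8
geo_mean        # 2^2.4, roughly 5.28
med             # 4
stat_mode(x)    # 4
```

If several values tie for the highest count, this helper returns only the first of them.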
  10. 10. Dispersion. Range: the difference between the largest and smallest values in a set of data. Variance: the mean of the squared differences between each value and the mean. Standard deviation: the square root of the variance. Frequency: a frequency distribution is a table that displays the frequency of various outcomes in a sample.
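A minimal sketch of these dispersion measures in base R, using a made-up sample. One point worth noting: the variance as defined on the slide divides by n, whereas R's `var()` and `sd()` divide by n - 1 (the sample versions), so the two differ slightly.

```r
x <- c(2, 4, 4, 8, 16)             # hypothetical sample

data_range <- max(x) - min(x)      # 14; note that range(x) returns c(min, max)

# Variance as defined on the slide: mean squared deviation from the mean
pop_var <- mean((x - mean(x))^2)   # 24.96
pop_sd  <- sqrt(pop_var)

# R's built-ins use the n - 1 denominator
n        <- length(x)
samp_var <- var(x)                 # 31.2, equal to pop_var * n / (n - 1)
samp_sd  <- sd(x)

freq <- table(x)                   # frequency distribution of the outcomes
freq
```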
  11. 11. Distribution: The distribution of a statistical data set (or a population) is a listing or function showing all the possible values (or intervals) of the data and how often they occur. When a distribution of categorical data is organized, you see the number or percentage of individuals in each group.
  12. 12. Distributions: Normal. The simplest case of a normal distribution is known as the standard normal distribution. This is a special case where the mean μ = 0 and the standard deviation σ = 1.
  13. 13. Skewed Distribution
  14. 14. Skewed Distribution: Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive or negative, or even undefined. Image: ile:Negative_and_positive_skew_diagrams_(English).svg
  15. 15. Skewed Distribution: Kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable; it is a descriptor of the shape of a probability distribution. Image: section3/eda35b.htm
  16. 16. Skewed Distribution: In the R package moments, skewness returns the value of skewness and kurtosis returns the value of kurtosis (web/packages/moments/moments.pdf). Image: http://www.janzengroup.net/stats/lessons/descriptive.html
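For readers without the moments package installed, skewness and kurtosis can be sketched in base R from central moments. The helper names and sample data below are hypothetical. These formulas are believed to match the package's `skewness` and `kurtosis` (the latter being plain kurtosis, about 3 for normal data, rather than excess kurtosis), but that is an assumption worth checking against the package manual.

```r
# k-th central moment of a sample (hypothetical helper)
central_moment <- function(x, k) mean((x - mean(x))^k)

# Moment-based skewness and kurtosis
skew <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
kurt <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

symmetric  <- c(1, 2, 3, 4, 5)        # made-up symmetric sample
right_tail <- c(1, 1, 2, 2, 3, 10)    # made-up sample with a long right tail

skew(symmetric)     # 0 for a symmetric sample
skew(right_tail)    # positive: the tail extends to the right
kurt(symmetric)     # 1.7
```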
  17. 17. Distributions: Bernoulli. The distribution of a random variable which takes value 1 with success probability p and value 0 with failure probability q = 1 − p. It can be used, for example, to represent the toss of a coin.
  18. 18. Distributions: Chi Square. The chi-square distribution with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables.
  19. 19. Distributions: Poisson. A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space, if these events occur with a known average rate and independently of the time since the last event.
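The Bernoulli, chi-square and Poisson definitions on the last three slides can be explored with R's built-in distribution functions. This is only a sketch: the seed, the parameter values and the number of simulations are arbitrary choices, not values from the slides.

```r
set.seed(1)                               # arbitrary seed, for reproducibility

# Bernoulli(p): a binomial distribution with a single trial
p    <- 0.5
coin <- rbinom(10, size = 1, prob = p)    # ten coin tosses, coded 1 = success
p_success <- dbinom(1, size = 1, prob = p)   # P(X = 1) = p

# Chi-square with k degrees of freedom: sum of k squared standard normals
k <- 3
chisq_sim <- replicate(10000, sum(rnorm(k)^2))
mean(chisq_sim)                           # should be close to k, its expected value

# Poisson(lambda): probability of observing 0 events at an average rate of 2
lambda <- 2
p_zero <- dpois(0, lambda)                # equals exp(-lambda)
```

Comparing `hist(chisq_sim)` with `curve(dchisq(x, df = k), add = TRUE)` on a density scale is one way to check the simulated values against the theoretical chi-square density.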
  20. 20. Probability: Probability Distribution. The figure shows the probability density function (pdf) of the normal distribution, also called the Gaussian or "bell curve", the most important continuous probability distribution. As annotated on the figure, the probabilities of intervals of values correspond to the area under the curve.
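The statement that interval probabilities correspond to areas under the density curve can be checked numerically. The sketch below uses the standard normal distribution and the interval from -1 to 1 as an arbitrary example.

```r
# Probability that a standard normal value falls between -1 and 1
area_cdf <- pnorm(1) - pnorm(-1)              # via the cumulative distribution
area_int <- integrate(dnorm, -1, 1)$value     # via numerical integration of the pdf

area_cdf    # about 0.683
area_int    # agrees up to numerical error

# The total area under the density is 1
total <- integrate(dnorm, -Inf, Inf)$value
total
```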
  21. 21. Refresher in Statistics
  22. 22. Using RCmdr for Statistics
  23. 23. Using RCmdr for Statistics
  24. 24. Using RCmdr for Statistics
  25. 25. Using RCmdr
  26. 26. Central Limit Theorem: In probability theory, the central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution.
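A small simulation can make the theorem concrete. The sketch below assumes an exponential distribution as the underlying (clearly non-normal) distribution; the seed, sample size and number of repetitions are arbitrary.

```r
set.seed(42)                       # arbitrary seed

n_obs  <- 50                       # observations per sample
n_reps <- 5000                     # number of samples drawn

# Exponential(rate = 1) is strongly right-skewed, with mean 1 and variance 1
sample_means <- replicate(n_reps, mean(rexp(n_obs, rate = 1)))

mean(sample_means)                 # close to the population mean, 1
sd(sample_means)                   # close to 1 / sqrt(n_obs)

# hist(sample_means) shows a roughly bell-shaped histogram,
# even though individual draws are far from normal
```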
  27. 27. Hypothesis testing: Hypothesis testing is the use of statistics to assess whether the observed data are compatible with a given hypothesis. The usual process of hypothesis testing consists of four steps.
      1. Formulate the null hypothesis (commonly, that the observations are the result of pure chance) and the alternative hypothesis (commonly, that the observations show a real effect combined with a component of chance variation).
      2. Identify a test statistic that can be used to assess the truth of the null hypothesis.
      3. Compute the P-value, which is the probability that a test statistic at least as extreme as the one observed would be obtained assuming that the null hypothesis were true. The smaller the P-value, the stronger the evidence against the null hypothesis.
      4. Compare the P-value to an acceptable significance value (sometimes called an alpha value, α). If P ≤ α, the observed effect is statistically significant, and the null hypothesis is rejected in favor of the alternative hypothesis.
  28. 28. Hypothesis testing
  29. 29.
  30. 30. Hypothesis testing
  31. 31. Hypothesis testing
  32. 32. Hypothesis testing
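  33. 33. T test

          > x <- rnorm(10)            # first sample: 10 standard normal draws
          > y <- rnorm(10)            # second sample
          > t.test(x, y)              # print the two-sample t-test
          > ttest <- t.test(x, y)     # store the result object
          > names(ttest)              # list its components
          > ttest$statistic           # extract the t statistic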
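The t.test call can be tied back to the four-step hypothesis testing procedure described earlier. In this sketch the data, the seed and the significance level of 0.05 are assumptions chosen for illustration; the two groups are simulated with a true difference in means so that the test has something to detect.

```r
set.seed(123)                      # arbitrary seed

# Step 1: null hypothesis = equal means; alternative = different means
x <- rnorm(30, mean = 0)           # hypothetical group 1
y <- rnorm(30, mean = 2)           # hypothetical group 2, shifted by 2

alpha  <- 0.05                     # significance level, chosen in advance
result <- t.test(x, y)             # Welch two-sample t-test (R's default)

result$statistic                   # step 2: the test statistic
result$p.value                     # step 3: the P-value

# Step 4: compare the P-value with alpha
reject_null <- result$p.value <= alpha
reject_null
```

With a shift this large relative to the spread, the P-value falls far below 0.05; with two samples drawn from the same distribution, as on the slide, the null hypothesis would usually not be rejected.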
  34. 34. Chi Square Distribution. Problem: Find the 95th percentile of the Chi-Squared distribution with 7 degrees of freedom. Solution: We apply the quantile function qchisq of the Chi-Squared distribution against the decimal value 0.95.

          > qchisq(.95, df=7)        # 7 degrees of freedom
          [1] 14.067
  35. 35. Normal Distribution. To find the percentage of students scoring higher than 84, we apply the function pnorm of the normal distribution with mean 72 and standard deviation 15.2. We are interested in the upper tail of the normal distribution, so about 21.5% of students score higher than 84.

          > pnorm(84, mean=72, sd=15.2, lower.tail=FALSE)
          [1] 0.21492
  36. 36. Student T Distribution. Problem: Find the 2.5th and 97.5th percentiles of the Student t distribution with 5 degrees of freedom. Solution: We apply the quantile function qt of the Student t distribution against the decimal values 0.025 and 0.975.

          > qt(c(.025, .975), df=5)   # 5 degrees of freedom
          [1] -2.5706  2.5706
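The quantile functions used on the last three slides (`qchisq`, `qt`) have cumulative-probability counterparts (`pchisq`, `pt`, and `pnorm` as used above) that act as their inverses. A short sketch of that relationship, reusing the slide values:

```r
# q* functions return quantiles; p* functions return cumulative probabilities
q95 <- qchisq(0.95, df = 7)        # about 14.067, as on the slide
pchisq(q95, df = 7)                # recovers 0.95

t_crit <- qt(c(0.025, 0.975), df = 5)
pt(t_crit, df = 5)                 # recovers 0.025 and 0.975

# The t distribution is symmetric about zero, so the two quantiles mirror each other
all.equal(t_crit[1], -t_crit[2])
```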
  37. 37. Some code
  38. 38. Some code
  39. 39. Bayes Theorem
  40. 40. Bayes Theorem