Statistics for data scientists

Agenda
Revision
Data
Statistics: Descriptive, Central Tendency, Variation, Distributions
Data Mining

Basics of Data Science
The data science Venn diagram: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Hacking skills: being able to manipulate text files at the command line, understanding
vectorized operations, and thinking algorithmically are the hacking skills that make
for a successful data hacker.
Machine learning: data plus math and statistics only gets you machine learning, which is
great if that is what you are interested in, but not if you are doing data science.
Traditional research: ... the culture of academia, which does not reward researchers for
understanding technology.
Danger zone: this overlap of skills gives people the ability to create what appears to be
a legitimate analysis without any understanding of how they got there or what they have
created.

What is Business Analytics?
Definition: the study of business data using statistical techniques and
programming to create decision support and insights for achieving
business goals.
Predictive: to predict the future.
Descriptive: to describe the past.

Data
Data is a set of values of qualitative or quantitative variables. An example of qualitative
data would be an anthropologist's handwritten notes about her interviews. Data is
collected by a huge range of organizations and institutions, including businesses (e.g.,
sales data, revenue, profits, stock price), governments (e.g., crime rates, unemployment
rates, literacy rates) and non-governmental organizations (e.g., censuses of the number
of homeless people by non-profit organizations). Data is measured, collected and
reported, and analyzed, whereupon it can be visualized using graphs, images or other
analysis tools.
https://en.wikipedia.org/wiki/Data
Data is distinct pieces of information, usually formatted in a special way. All software is
divided into two general categories: data and programs. Programs are collections of
instructions for manipulating data. Data can exist in a variety of forms -- as numbers or
text on pieces of paper, as bits and bytes stored in electronic memory, or as facts stored
in a person's mind.
http://www.webopedia.com/TERM/D/data.html

Data
Definition of data in English (https://en.oxforddictionaries.com/definition/data):
data
noun
[mass noun] Facts and statistics collected together for reference or analysis:
‘there is very little data available’
The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted
in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
[Philosophy] Things known or assumed as facts, making the basis of reasoning or calculation.

Variable
Something that varies

Variable
Nominal variables are variables that have two or more categories, but which do not have an intrinsic order.
Dichotomous variables are nominal variables which have only two categories or levels.
Ordinal variables are variables that have two or more categories, just like nominal variables, only the categories can also be ordered or
ranked (e.g., from Excellent to Horrible).
Interval variables are variables whose central characteristic is that they can be measured along a continuum and have a
numerical value (for example, temperature measured in degrees Celsius or Fahrenheit).
Ratio variables are interval variables with the added condition that 0 (zero) of the measurement indicates that there is none of that
variable. For example, a distance of 10 metres is twice a distance of 5 metres.
https://statistics.laerd.com/statistical-guides/types-of-variable.php
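
A minimal sketch of how these variable types might be represented in R. The data values and variable names are made up for illustration and do not come from the slides.

```r
# Hypothetical survey data, invented for illustration only.
blood_group <- factor(c("A", "B", "O", "A"))        # nominal: categories without order
smoker      <- factor(c("yes", "no", "no", "yes"))  # dichotomous: exactly two levels
rating      <- factor(c("Good", "Horrible", "Excellent", "Good"),
                      levels = c("Horrible", "Good", "Excellent"),
                      ordered = TRUE)               # ordinal: ordered categories
temp_c      <- c(21.5, 19.0, 23.2, 20.1)            # interval: 0 degrees C is not "no temperature"
distance_m  <- c(5, 10, 2.5, 20)                    # ratio: 0 m means no distance

nlevels(smoker)                 # number of categories
rating[1] < rating[3]           # order comparisons work on ordered factors
distance_m[2] / distance_m[1]   # ratios are meaningful only on a ratio scale
```

Storing ordinal data as an ordered factor lets R compare and sort the categories, while a plain factor keeps nominal categories unordered.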

Central Tendency
Mean
Arithmetic mean: the sum of the values divided by the number of values.
Geometric mean: an average that is useful for sets of positive numbers that are interpreted according to their product and
not their sum (as is the case with the arithmetic mean), e.g., rates of growth.
Median
The median is the number separating the higher half of a data sample, a population, or a probability distribution from the lower
half.
Mode
The mode is the value that occurs most often.
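
A short sketch of these measures in base R, using made-up values. Base R has no function for the statistical mode, so the sketch counts occurrences with table(). On ties it returns the first of the most frequent values, which is a simplification.

```r
x <- c(2, 4, 4, 8, 16)   # made-up positive values

arithmetic_mean <- mean(x)             # sum(x) / length(x)
geometric_mean  <- exp(mean(log(x)))   # nth root of the product; requires x > 0
middle          <- median(x)

# Base R's mode() reports the storage type, not the statistical mode,
# so count occurrences with table() and take the most frequent value.
counts        <- table(x)
most_frequent <- as.numeric(names(counts)[which.max(counts)])
```

For these values the arithmetic mean is 6.8, while the median and the mode are both 4, which shows how a single large value pulls the mean upward.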

Dispersion
Range
The range of a set of data is the difference between the largest and smallest values.
Variance
The mean of the squared differences between the values and their mean.
Standard Deviation
The square root of the variance.
Frequency
A frequency distribution is a table that displays the frequency of various outcomes in a sample.
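
A sketch of the dispersion measures in base R on made-up values. R's var() and sd() use the sample formula, which divides by n - 1. The variance defined above divides by n, so the sketch computes both.

```r
x <- c(2, 4, 4, 8, 16)   # made-up values

value_range    <- diff(range(x))          # range() returns c(min, max)
population_var <- mean((x - mean(x))^2)   # divides by n, as in the definition above
sample_var     <- var(x)                  # R's var() divides by n - 1
population_sd  <- sqrt(population_var)
sample_sd      <- sd(x)                   # equals sqrt(var(x))
frequencies    <- table(x)                # frequency distribution of the values
```

The difference between the two variance formulas matters most for small samples; with large n the two values converge.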

Distribution
The distribution of a statistical data set (or a population) is a listing or function showing all the possible values (or intervals) of
the data and how often they occur. When a distribution of categorical data is organized, you see the number or percentage of
individuals in each group.
http://www.dummies.com/education/math/statistics/what-the-distribution-tells-you-about-a-statistical-data-set/

Distributions
Normal
The simplest case of a normal distribution is known as the standard normal distribution. This is a special case where μ=0 and σ=1.

Skewed Distribution
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The
skewness value can be positive or negative, or even undefined.
Image: https://en.wikipedia.org/wiki/File:Negative_and_positive_skew_diagrams_(English).svg

Skewed Distribution
Kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable. It is a descriptor of the
shape of a probability distribution.
Image: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm

Skewed Distribution
In the R package moments, skewness returns the value of skewness and kurtosis returns the value of kurtosis.
https://cran.r-project.org/web/packages/moments/moments.pdf
Image: http://www.janzengroup.net/stats/lessons/descriptive.html
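
A base R sketch of moment-based skewness and kurtosis, so it runs without installing any package. The helper names are hypothetical and are not the functions of the moments package. The formulas are the standard ratios of central moments, which appears to be the convention that package follows (check its manual before relying on that).

```r
# Hypothetical helpers written in base R; not functions from the moments package.
central_moment <- function(x, k) mean((x - mean(x))^k)
skew_sketch <- function(x) central_moment(x, 3) / central_moment(x, 2)^(3 / 2)
kurt_sketch <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

symmetric    <- c(1, 2, 3, 4, 5)       # made-up symmetric data
right_skewed <- c(1, 1, 2, 2, 3, 10)   # made-up data with a long right tail

skew_sketch(symmetric)      # 0 for perfectly symmetric data
skew_sketch(right_skewed)   # positive: the tail is on the right
kurt_sketch(symmetric)      # Pearson kurtosis (a normal distribution has 3)
```

This kurtosis is not the "excess" kurtosis, which subtracts 3 so that a normal distribution scores 0.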

Distributions
Bernoulli
The distribution of a random variable which takes the value 1 with success probability p and the value 0 with failure probability
q = 1 - p. It can be used, for example, to represent the toss of a coin.
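
A small sketch of coin tosses as Bernoulli trials in R. A Bernoulli distribution is a binomial distribution with size = 1, so the sketch uses rbinom and dbinom. The success probability of 0.5 and the seed are assumptions chosen for illustration.

```r
set.seed(42)   # arbitrary seed so the draws are reproducible
p <- 0.5       # assumed success probability (a fair coin)

tosses       <- rbinom(10, size = 1, prob = p)   # ten trials: 1 = success, 0 = failure
prob_success <- dbinom(1, size = 1, prob = p)    # P(X = 1) = p
prob_failure <- dbinom(0, size = 1, prob = p)    # P(X = 0) = 1 - p
```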

Distributions
Chi Square
The distribution of a sum of the squares of k independent standard normal random variables.
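
The definition can be checked by simulation. The sketch below sums the squares of k standard normal draws many times and compares the result with R's built-in chi-squared quantile function. The seed, k = 7, and the number of simulations are arbitrary choices.

```r
set.seed(1)   # arbitrary seed for reproducibility
k <- 7        # degrees of freedom
n <- 100000   # number of simulated sums

z <- matrix(rnorm(n * k), nrow = n, ncol = k)   # each row: k independent N(0, 1) draws
sum_of_squares <- rowSums(z^2)                  # one chi-squared(k) draw per row

simulated_mean  <- mean(sum_of_squares)             # should be close to k
simulated_q95   <- quantile(sum_of_squares, 0.95)   # empirical 95th percentile
theoretical_q95 <- qchisq(0.95, df = k)             # exact 95th percentile
```

The simulated mean should land near k, since a chi-squared distribution with k degrees of freedom has mean k.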

Distributions
Poisson
A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time
and/or space, if these events occur with a known average rate and independently of the time since the last event.
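
A brief sketch of Poisson probabilities in R. The average rate of 3 events per interval is an assumed value for illustration.

```r
lambda <- 3   # assumed average number of events per interval

p_exactly_2 <- dpois(2, lambda = lambda)   # P(X = 2)
p_at_most_2 <- ppois(2, lambda = lambda)   # P(X <= 2)
p_none      <- dpois(0, lambda = lambda)   # P(X = 0) = exp(-lambda)
```

Here dpois gives the probability of an exact count and ppois gives the cumulative probability up to a count.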

Probability
Probability Distribution
The normal distribution, also called the Gaussian distribution or "bell curve", is the most important continuous probability
distribution. For its probability density function (pdf), the probabilities of intervals of values correspond to the area
under the curve.
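
The link between probability and area can be sketched in R by computing the same interval probability two ways for the standard normal distribution. The interval from -1 to 1 is an arbitrary choice.

```r
# P(-1 < Z < 1) for a standard normal variable, computed two ways.
area_cdf <- pnorm(1) - pnorm(-1)             # difference of cumulative probabilities
area_num <- integrate(dnorm, -1, 1)$value    # numerical integral of the density
```

Both values are approximately 0.6827, the familiar result that about 68% of a normal distribution lies within one standard deviation of the mean.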

Central Limit Theorem
In probability theory, the central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently
large number of independent random variables, each with a well-defined expected value and well-defined variance, will
be approximately normally distributed, regardless of the underlying distribution.
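
A simulation sketch of the theorem in R. It draws many samples from an exponential distribution, which is strongly right-skewed, and examines the sample means. The sample size, number of repetitions, and seed are arbitrary assumptions.

```r
set.seed(123)   # arbitrary seed for reproducibility
n    <- 30      # size of each sample
reps <- 10000   # number of samples

# Exponential(rate = 1) is right-skewed, with mean 1 and variance 1.
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

mean_of_means <- mean(sample_means)   # should be close to 1
sd_of_means   <- sd(sample_means)     # should be close to 1 / sqrt(n)
```

Calling hist(sample_means) should show a roughly bell-shaped histogram, even though the individual draws are far from normal.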

Hypothesis testing
Hypothesis testing is the use of statistics to assess how compatible the observed data are with a given hypothesis. The
usual process of hypothesis testing consists of four steps.
1. Formulate the null hypothesis (commonly, that the observations are the result of pure chance) and the
alternative hypothesis (commonly, that the observations show a real effect combined with a component of
chance variation).
2. Identify a test statistic that can be used to assess the truth of the null hypothesis.
3. Compute the P-value, which is the probability that a test statistic at least as extreme as the one observed
would be obtained assuming that the null hypothesis were true. The smaller the P-value, the stronger the
evidence against the null hypothesis.
4. Compare the P-value to an acceptable significance value α (sometimes called an alpha value). If P ≤ α, the
observed effect is statistically significant, and the null hypothesis is rejected in favor of the alternative
hypothesis.
http://mathworld.wolfram.com/HypothesisTesting.html
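
The four steps can be sketched in R with a two-sample t-test. The data are simulated, and the group names, means, seed, and significance level of 0.05 are assumptions made for illustration.

```r
set.seed(2024)   # arbitrary seed for reproducibility
# Made-up data: two groups whose true means differ by 2.
control   <- rnorm(30, mean = 10, sd = 2)
treatment <- rnorm(30, mean = 12, sd = 2)

# Step 1: H0: the group means are equal; H1: they differ.
# Step 2: the test statistic is Welch's t (the default for t.test).
result <- t.test(treatment, control)

# Step 3: the P-value, computed assuming H0 is true.
p_value <- result$p.value

# Step 4: compare the P-value with the chosen significance level.
alpha       <- 0.05
reject_null <- p_value <= alpha
```

The significance level is fixed before looking at the data; reject_null then records the decision for that level.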

http://cmapskm.ihmc.us/rid=1052458963987_678930513_8647/Hypothesis%20testing.cmap

T test
http://statistics.berkeley.edu/computing/r-t-tests
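> x <- rnorm(10)          # first sample: 10 draws from N(0, 1)
> y <- rnorm(10)          # second sample: 10 draws from N(0, 1)
> ttest <- t.test(x, y)   # Welch two-sample t-test (R's default)
> ttest                   # print the full test result
> names(ttest)            # components stored in the result
> ttest$statistic         # the t statistic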

Chi Square Distribution
Problem
Find the 95th percentile of the Chi-Squared distribution with 7 degrees of freedom.
Solution
We apply the quantile function qchisq of the Chi-Squared distribution to the decimal value 0.95.
> qchisq(.95, df=7) # 7 degrees of freedom
[1] 14.067
http://www.r-tutor.com/elementary-statistics/probability-distributions/chi-squared-distribution

Normal Distribution
Problem
Assuming test scores follow a normal distribution with mean 72 and standard deviation 15.2, find the percentage of students
scoring higher than 84.
Solution
We apply the function pnorm of the normal distribution with mean 72 and standard deviation 15.2. We
are interested in the upper tail of the normal distribution.
> pnorm(84, mean=72, sd=15.2, lower.tail=FALSE)
[1] 0.21492

Student T Distribution
Problem
Find the 2.5th and 97.5th percentiles of the Student t distribution with 5 degrees of freedom.
Solution
We apply the quantile function qt of the Student t distribution to the decimal values 0.025 and 0.975.
> qt(c(.025, .975), df=5) # 5 degrees of freedom
[1] -2.5706 2.5706

Some code
http://rpubs.com/newajay/stats1

Some code
http://rpubs.com/newajay/stats4

Bayes Theorem
https://artax.karlin.mff.cuni.cz/r-help/library/LaplacesDemon/html/BayesTheorem.html

Bayes Theorem
https://en.wikipedia.org/wiki/Bayes'_theorem
