The document discusses univariate analysis and key concepts in probability and statistics. It covers:
1) The univariate approach, which explores variables one at a time through central tendency, dispersion, and distribution.
2) Common measures of central tendency such as the mean, median, and mode, as well as measures of dispersion such as the standard deviation.
3) Definitions of probability (classical, frequency-based, and subjectivist), along with key probability axioms and theorems.
4) Additional topics like skewness, kurtosis, transformations of variables, and nominal variables.
The univariate approach is based on the analysis of:
1. Central tendency: mean, median, mode, counts of missingness and non-missingness.
2. Dispersion: standard deviation, inter-quartile range, range.
3. Distribution: histogram (or density estimate), quantile plot, cumulative distribution function, table of relative frequencies.
4. Exploring variables one by one.
5. Variables can be continuous (interval based) or nominal (will concentrate on binaries).
6. Statistical inference (part of all of EDA and modeling).
Quick definitions that you should know even in your sleep.

Variable X, n observations.

Mean: $\bar{X} = \frac{1}{n}\sum_{i} X_i$

50th percentile (median): if n is even, Med(X) = average of the two central sorted values of X; if n is odd, Med(X) = central value of the sorted values of X.

Variance: $\mathrm{Var}(X) = \frac{1}{n-1}\sum_{i}(X_i - \bar{X})^2$

Standard deviation: $\mathrm{std}(X) = \sqrt{\mathrm{Var}(X)}$

Standard error of the mean: $\mathrm{std}(X)/\sqrt{n}$

Range: $\max(X) - \min(X)$

Median absolute deviation: $\mathrm{MAD} = \mathrm{med}\,|x_i - \mathrm{med}(x_i)|$

Inter-quartile range: 75th percentile - 25th percentile.

Mode: most frequent value (more useful for nominal variables).

With so many measures of central tendency and dispersion, a variable is distributed along many values, usually graphed ==> are there distributions that usually resemble or describe them?
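As a minimal sketch (not from the deck), the measures above can be obtained in SAS with PROC MEANS and PROC UNIVARIATE; the data set WORK.SAMPLE and variable X are hypothetical placeholders.

/* Sketch: summary measures for a hypothetical variable X in WORK.SAMPLE. */
proc means data=work.sample n mean median mode var std stderr range qrange;
  var x;
run;

/* PROC UNIVARIATE adds robust measures such as the MAD (ROBUSTSCALE). */
proc univariate data=work.sample robustscale;
  var x;
run;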
Additional Measures

Harmonic mean, used to average rates. Let $x_1, \ldots, x_n$ be positive numbers:

$H(x_1, \ldots, x_n) = \dfrac{n}{\sum_{i=1}^{n} 1/x_i} = \dfrac{1}{\mathrm{avg}(1/x_1, \ldots, 1/x_n)}, \qquad x_1, \ldots, x_n > 0$

(H tends strongly to min(x_i); it mitigates the effect of large outliers and enhances the effect of small outliers.) Used in finance for time-series data, e.g., P/E data.
Additional Measures

Geometric mean, for any $x_1, \ldots, x_n$. Used to average growth rates:

$G(x_1, \ldots, x_n) = (x_1 \cdots x_n)^{1/n}, \qquad H \le G \le \mathrm{Avg}$

Example: Let {x} = (1, 2, 3).
H = 1.636 < g = 1.817 < mean = 2
Example: Let {x} = (1, 1/2, 1/3).
H = 0.5 < g = 0.55 < mean = 0.61
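A small illustrative SAS step (not part of the deck) reproducing the first example {1, 2, 3}:

/* Sketch: arithmetic, harmonic and geometric means of {1, 2, 3}. */
data _null_;
  array x[3] _temporary_ (1 2 3);
  n = dim(x);
  do i = 1 to n;
    sum      + x[i];
    sumrecip + 1 / x[i];
    sumlog   + log(x[i]);
  end;
  arithmetic = sum / n;          /* 2           */
  harmonic   = n / sumrecip;     /* about 1.636 */
  geometric  = exp(sumlog / n);  /* about 1.817 */
  put harmonic= geometric= arithmetic=;
run;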
Some Basic Definitions: Univariate Distributions.

Skewness: the opposite of symmetry. It measures the direction and degree of asymmetry: zero indicates a symmetrical distribution, a positive value indicates right skewness (long-tailedness to the right), and a negative value indicates left skewness.

In a perfectly symmetrical, non-skewed distribution, the mean, median and mode are equal. With positive skewness, mean > median, so most values are below the mean; the opposite holds for negative skewness.

For instance, store sales are typically skewed.

[Figure: a positively skewed distribution.]
Symmetry and measures of central tendency

If the data distribution is symmetric: mean = median = mode.
If positively skewed: mode < median < mean.
If negatively skewed: mean < median < mode.
Kurtosis: peakedness / heaviness of the tails of a distribution. The two most frequently used measures are Pearson's $b_2$ and Fisher's $g_2$:

$b_2 = \dfrac{m_4}{m_2^2}, \qquad g_2 = \dfrac{(n+1)(n-1)}{(n-2)(n-3)}\left[\, b_2 - \dfrac{3(n-1)}{n+1} \right]$

The usual reference point is the normal distribution. If $b_2 = 3$ ($g_2 = 0$) and skewness = 0, the distribution is normal. Uni-modal distributions with kurtosis > 3 have heavier tails than the normal; these distributions also tend to have higher peaks in the center of the distribution. Uni-modal distributions whose tails are lighter than the normal distribution tend to have kurtosis < 3; in this case, the peak of the distribution tends to be broader than the normal.
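A hedged sketch: PROC MEANS reports sample skewness and (excess) kurtosis directly; WORK.SAMPLE and X are hypothetical names, and SAS's KURT statistic is the Fisher-corrected excess kurtosis, so 0 (not 3) corresponds to the normal reference.

/* Sketch: skewness and excess kurtosis for a hypothetical variable X. */
proc means data=work.sample n mean std skew kurt;
  var x;
run;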
Homework and Interview Questions.

Since the mean estimation divides by n, why is the variance divisor (n - 1), and not 'n', or (n - 2), or (n + 1), or sqrt(5)? Do some reading, no need for a mathematical proof.

Why do we work with 'squaring' (e.g., variance) and not with straight absolute values, for instance?
DS1: Study measures of central tendency and dispersion.
Note median = 0 and MAD = 0 for No_claims. Can you explain it?

Basics and measures of centrality:

Variable          # Nonmiss Obs   % Missing        Mean      Median        Mode
DOCTOR_VISITS             5,960        0.00        8.941       8.000       9.000
MEMBER_DURATION           5,960        0.00      179.615     178.000     180.000
NO_CLAIMS                 5,960        0.00        0.406       0.000       0.000
NUM_MEMBERS               5,960        0.00        1.986       2.000       1.000
OPTOM_PRESC               5,960        0.00        1.170       1.000       0.000
TOTAL_SPEND               5,960        0.00   18,607.970  16,300.000  15,000.000

Measures of dispersion:

Variable                 Variance   Std Deviation   Std of Mean   Median Abs Dev   Nrmlzd MAD
DOCTOR_VISITS               52.31            7.23          0.09             5.00         7.41
MEMBER_DURATION          6,736.56           82.08          1.06            57.00        84.51
NO_CLAIMS                    1.16            1.08          0.01             0.00         0.00
NUM_MEMBERS                  0.99            1.00          0.01             1.00         1.48
OPTOM_PRESC                  2.74            1.65          0.02             1.00         1.48
TOTAL_SPEND        125,607,617.29       11,207.48        145.17         6,000.00     8,895.60
Too quick a note on missing values.

In the previous slide, no variable has missing values, A VERY RARE EVENT. Typically, all large databases have missing values, even if in small percentages.

Since most software operates on 'full' rows, i.e., a single missing point in any variable drops the whole row, missingness propagates quickly (full detail in MEDA under missing values).

Thus, UEDA can proceed to obtain measures of central tendency and variation except when missingness is 100% for a specific variable. But BEDA can already suffer tremendously.

ADVICE: Find out missings by UEDA first. Then decide whether to impute (see the MEDA section later on) or delete observations (try hard not to). Then continue your analysis and even modeling.
SAS code for the next slide of variable distributions: histograms.

proc univariate data = &indata. (keep = &vars.) CIBASIC CIPCTLNORMAL ROBUSTSCALE normal;
  %PUT;
  histogram &vars. / normal (color = black w = 7 l = 25)
                     kernel (k = normal c = 0.2 0.5 0.8 1 color = green w = 5 l = 1);
  inset nmiss  = "# missing" (5.0)
        n      = "N"
        min    = "Min"    mean = "Mean"
        median = "Median" mode = "Mode"
        max    = "Max" (6.3)
        normal kernel (type);
run;
quit;
Important note on nominal variables.

Also called categorical variables. Categories may denote our own constructed values and not realities. Defining races as Black, White, Hispanic, Asian, etc. implies considering Chinese and Indonesian to be the same. By contrast, red and blue color are distinct realities ➔ for races we use our own values to create these categories.

The problem is that we tend to create hypotheses, variables, conditions, testing environments, and of course conclusions from these constructs, which may be arbitrary. E.g., a marketing segment assignment places you in the Hispanic group because you learned Spanish in high school and speak it somewhat, when you're not Hispanic.

➔ Given our creation of intended hypotheses, categories, and conditions of data gathering, we match representatives of different 'races' to reach conclusions that may be plagued with errors due to category construction.
Why transform? Because transformed variables may be skew free (skewness obviously affects variance estimation), closer to 'normality' (if needed or wanted), or provide rank information. All these issues are strongly related to statistical inference and modeling. Transformations can be multivariate (e.g., principal components, etc.) as well.

More importantly, if the data is used in actual practice in transformed form, then transformation may be advisable. E.g., many medical decisions (e.g., prostate cancer and PSA) are based on thresholds on PSA counts. Do not transform while modeling; do transform when reporting and presenting results.

Continuous variables differ in range (min, max) and spread: it is often convenient to homogenize them to the same units for comparison.

Centering: Given variable x, for each observation, subtract the mean value. The resulting variable has mean 0. Note that the distribution is not centered (i.e., more symmetrical) unless mean = mode.
Standardization: given variable x, for each observation, subtract the overall mean and divide by the standard deviation. Mean removal is called 'centering', and the standardized variable measures how far, in units of std, each observation is from 0, because the new mean is 0 and the new std is 1.

NOTE: Standardization is not equivalent to normalization.

Log transformation: apply log(X), for X > 0. If X has negative values one could instead do log(X - min(X) + 0.001). PROBLEM: if the modeling situation is to be implemented on future data sets, min(X) of the original data set may differ from min(X) of a future data set.

Binning: Many different methods. A popular one is to divide the range into a number of equal-sized sub-ranges; the number could be 5 or 10 but is application dependent. The method is seriously abused.
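As a hedged sketch (data set and variable names are placeholders, not from the deck), the first three transformations above can be written as:

/* Sketch: centering, standardization and log transform of a variable X. */
proc means data=work.sample noprint;
  var x;
  output out=work._stats mean=xbar std=xstd;
run;

data work.transformed;
  if _n_ = 1 then set work._stats;       /* bring in xbar and xstd     */
  set work.sample;
  x_centered = x - xbar;                 /* new mean 0                 */
  x_std      = (x - xbar) / xstd;        /* new mean 0, new std 1      */
  if x > 0 then x_log = log(x);          /* log only defined for X > 0 */
run;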
Comments on Transformations.

The transformations shown are monotonic. A ratio (reciprocal) transformation (1/x, x ≠ 0) is reverse monotonic, as is -x.

HW question: Consider the transformations of the variable Doctor_visits in the previous slides. Compare them and report.
Probability

Classical (a priori) definition: if event E can occur in 'k' ways out of a total of 'n' equally likely possible ways, then the probability p is defined as p = k / n.

Frequency (a posteriori) definition: if event E happens k times in n repetitions, then p = k / n. This implicitly assumes equally likely and independent repetitions.

Both definitions have serious flaws. They are circular, since 'equally likely' is the same as 'with equal probability' and probability has not yet been defined. The a posteriori definition is also deficient because it does not specify the required number of repetitions. Still, it has intuitive appeal for experiments that are repeatable and for which the number of possible outcomes is available.

Besides, the property of symmetry is usually hiding: each outcome is equi-probable, which of course raises the issue of not having defined probability when we are already using the notion of equi-probability.

The subjectivist definition of probability: it is the degree of belief in the occurrence of event E held by an individual, which of course can differ from others' beliefs.
Axioms of Probability ***

1) For any event E, P(E) >= 0.
2) For the certain event S, P(S) = 1.
3) For any two disjoint events E and F, P(E ∪ F) = P(E) + P(F), and similarly for an infinite sequence of disjoint events.

Theorems on probability spaces ***

1) The impossible event, the empty set ∅, has probability 0.
2) For any event A, the probability of its complement is P(A complement) = 1 - P(A).
3) For any event A, 0 <= P(A) <= 1.
4) If A ⊆ B, then P(A) <= P(B).
5) For any two events A and B, P(A \ B) = P(A) - P(A ∩ B).
6) For any two events, P(A ∪ B) = P(A) + P(B) - P(A ∩ B).
7) For any events A, B, C:
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) - P(A ∩ B) - P(A ∩ C) - P(B ∩ C) + P(A ∩ B ∩ C)
Beware of unconditional versus conditional probability interpretations.
Example of conditional versus unconditional statements.
(from https://articles.mercola.com/sites/articles/archive/2018/07/31/effects-of-cellphone-radiation.aspx?utm_source=dnl&utm_medium=email&utm_content=artTest_B1&utm_campaign=20180725Z1_UCM&et_cid=DM223789&et_rid=375100020)
“Of 326 cellphone safety studies, 56 percent found a biological effect
from cellphone radiation while 44 percent did not. When funding
was analyzed, it was discovered that 67 percent of the independently funded studies found a biological effect, compared to just 28 percent of the industry-funded studies. This funding bias creates a perceived
lack of scientific consensus.”
That is, Pr (Effect) = 56%.
Pr (Effect / No funding) = 67%
Pr (Effect / Funding) = 28%
Notice that while conditional effects involve at least two variables (e.g., effect
and funding), resulting analysis is univariate on Effect.
Application: Conditional probability is basis of Associations Analysis, heavily
used in marketing.
Probability obfuscation: O. J. Simpson's case.

By law, Pr(Simpson not guilty of killing his wife, without further evidence to the contrary) = 100% (later on called the Null Hypothesis).

The prosecutor argues that the defendant battered his wife. Defense lawyer Dershowitz provides the following counterargument about the relevance of battering: 4 million women are battered annually by male companions in the US. In 1992, 1,432 of them, or about 1 in 2,500, were killed by those companions ➔ women batterers seldom murder their partners.

But the probability that a man who batters his wife will go on to kill her is not the relevant information. The relevant probability is that a battered wife who was murdered was murdered by her abuser, and that number is 90%. The fact that she was already murdered should be part of the probability computation.
[Diagram: battered women; the roughly 1 in 2,500 who are murdered; and the subset murdered by their male companions.]
Data Distributions.

Univariate distributions take many forms. We focus on some visual attributes, analyzed in reference to central points and 'tails' of the distribution.

The Normal distribution is the most used and abused, with shape:
(https://www.kdnuggets.com/2018/07/explaining-68-95-99-7-rule-normal-distribution.html)

68% of observations lie within 1 std, 95% within 2 std, and 99.7% within 3 std. Most observations hover around the mean (= median = mode). The odds of deviating from this average (or the chances of a value being different from the mean) decline at an increasingly faster rate (exponentially) as we move away from the average. Tails are thin.

The mean and variance fully define the Normal distribution.
Normal Distribution.
When mean = 0 and variance = 1, N becomes standard normal
distribution, denoted N (0, 1).
LAW OF LARGE NUMBERS (LLN, Bernoulli)

As the sample size (n) increases, the sample mean converges to the population mean, if the latter exists. (For instance, sampling from the Cauchy distribution does not converge to anything because the Cauchy does not have a mean. However, Cauchy sample medians converge to the population median (Blume & Royall 2003).)

In a typical casino setting, the LLN assures the casino that it will not lose any money as long as bets keep on coming. It does not assure that at every specific bet the casino will make money.

Conclusion: either skip the casino, or own one.
Roulette illustration.

There are 36 colored slots, plus 2 non-colored, 0 and 00. If you bet $1 on slot 20 and you win, you get back a total of $36, for $35 of profit. Otherwise, you lose your $1.

Prob(winning) = 1/38, prob(losing) = 37/38.

Expected value: (1/38)(35) - (37/38)(1) = -0.0526. On average, every time you bet $1 on roulette, you lose a bit more than $0.05.

➔ Casinos know that the LLN is on their side and, for a large number of bets, they make about 5% on average.

If you play 38 times, on average you win once. That is, you pocket $36 but paid out $38. Loss = 2 / total spent = 0.0526.

If 1,000 people bet this way 35 times, about 610 will have won at least once (61%), but the winners' earnings will be less than the losers' losses.
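A hedged simulation sketch (not from the deck) that checks the -0.0526 expected value empirically; the seed and number of bets are arbitrary.

/* Sketch: simulate 100,000 single-number roulette bets of $1 on slot 20. */
data _null_;
  call streaminit(2019);
  do bet = 1 to 100000;
    slot = ceil(38 * rand('uniform'));    /* 38 equally likely slots    */
    win  = (slot = 20);
    total + (35 * win - (1 - win));       /* +$35 if win, -$1 otherwise */
  end;
  avg_per_bet = total / 100000;           /* should be near -0.0526     */
  put avg_per_bet=;
run;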
CENTRAL LIMIT THEOREM CLT (simplest form).
Distribution of sum of independent random variables, with common distribution
with finite mean and variance, approaches normal distribution ➔ sample mean
tends to N.
Note: convergence is to distribution, not to specific number.
(from bwgriffin.com/gsu/courses/edur8131/content/EDUR_8131_CLT_illustrated_one_page.pdf)
SAT data: Figures 2 through 9 show histogram, not of raw SAT scores, but of
means from samples of differing sizes. Figure 2, for example, shows means taken
from a sample size of 2. To construct Figure 2, a total of 5,000 samples (n = 2 for
each sample) of SAT scores were taken from those SAT scores displayed in
Figure 1. For Figure 3, another set of 5,000 samples was taken from SAT scores,
but with a sample size of 3. Each successive figure shows distribution of sample
means for varying sample sizes.
Thus, Figures 2 through 9 are histograms of sampling distributions for the mean.
Note that as sample sizes increase, the shape of the sampling distribution of
means approaches a normal curve and looks less and less like the bimodal
distribution of raw SAT scores. This is exactly what the central limit theorem
predicts.
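A hedged sketch of the same idea with simulated (exponential, deliberately non-normal) data rather than the SAT scores referenced above; all names and the seed are assumptions for illustration.

/* Sketch: sampling distribution of the mean for n = 2 and n = 30. */
data clt;
  call streaminit(2019);
  do sample = 1 to 5000;
    do n = 2, 30;
      s = 0;
      do i = 1 to n;
        s + rand('exponential');   /* skewed, non-normal parent */
      end;
      xbar = s / n;
      output;
    end;
  end;
  keep n xbar;
run;

proc univariate data=clt;
  class n;
  var xbar;
  histogram xbar / normal;         /* shape approaches a normal curve as n grows */
run;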
Importance of Normal Distribution.
Many statistical inferential tests (to decide whether product A is
different than B, patient A benefits more from drug A than B, etc.) rely
on normal distribution, mostly via the CLT.
This does not imply that all data sets or individual or groups of
variables are normally distributed. In fact, most variables, individually
or viewed multi-variably, are not normally distributed. For instance,
when modeling linear regressions, the usual assumption is that the error component is N(0, σ²), and we test this hypothesis on the residuals. In
many other areas, there are tests available to test many hypotheses.
But in many circumstances, it is possible to transform the variables to
normality, and sometimes desirable (un-reviewed in this class).
The height distribution of humans is NOT normal but bimodal (due to gender), no matter the sample size.
Bird’s eye view of probability distributions
datasciencecentral.com/profiles/blogs/common-probability-distributions-the-data-scientist-s-crib-sheet
Counting and Probability Homework
1) Toss 5 fair quarters once, H and T. How many possible
outcomes are there?
2) You flip a fair coin 8 times and obtain H H H H H H H T. What’s
the probability of obtaining 1 T and 7 H in 8 flips?
3) Flip a fair coin thrice without noting the result. A friend insists
that the second flip was T. What’s the probability that the first
flip was H?
Statistical Inference – Some Basic Definitions: Sample (n) and population (N) sizes ***

The sampling distribution is determined by 'n', NOT by n/N (the sampling proportion). Population observations must be independent of each other. Inference is always on samples, and sample determination is a huge topic. N = population size, n = sample size.

1) For a given N, there are (N choose n) different samples. Two cases: N = ∞ (e.g., the universe of coin flips) and N < ∞ (attendees at the present lecture).
2) For given N, sampling with increasing n generates ‘thinner’ distributions, i.e.,
smaller spread.
3) For constant n/N, and n and N increasing in same proportion, sample spreads
vary dependent on ‘n’.
4) For given n, increasing N does not change much sampling distribution.
5) Most statistical formulae divide by ‘n’, not by N.
6) For the case of multivariate samples, e.g., giga databases, 'n' is related to 'p' (the number of variables) with some heuristics, based on computer resources, specifics of the analytics, and rules of thumb.
Statistical Inference – Some Basic Definitions.
Curse of Dimensionality in high dimensions ***

Consider variables X, Y, Z uniformly distributed U(0, 1). If we choose a 10% sample on X alone, the expected edge length is .1; i.e., a 10% X sample covers 10% of the range. But to still cover 10% of the range when sampling on X and Y jointly, we need a per-axis sample proportion of .1 ** (1/2) = .32. For X, Y, Z and a 10% range, we need a 46% proportion (formula: .1 ** (1/p)).

➔ Adding dimensions (i.e., variables) ➔ a higher sampling proportion is needed to avoid sparseness in the data to analyze. In fact, studying further dimensions for a given sample size ➔ true aspects of the data (especially in modeling) are not represented in the sampled data.

➔ If n suffices for 1 dimension, n ** 2 is required for 2, n ** 3 for 3, etc. But there are not enough data points in the universe for very large p.

Sampling needs grow exponentially with dimensions.
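A small illustrative data step (not from the deck) evaluating the slide's formula .1 ** (1/p) for a few dimensions:

/* Sketch: per-axis proportion needed so a sample covers 10% of the volume. */
data curse;
  do p = 1 to 5;
    edge = 0.10 ** (1 / p);   /* 0.10, 0.32, 0.46, 0.56, 0.63 */
    output;
  end;
run;

proc print data=curse noobs;
run;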
Statistical Inference – Some Basic Definitions
Probability Intervals for the Sample Mean.

Let Z ~ N(0, 1), with known mean and variance, and sample repeatedly from this distribution. The 95% probability range of sample means likely to be observed is given by (1.96 is the 97.5th percentile of the normal distribution with μ = 0, σ² = 1):

$P(|Z| \le 1.96) = P\!\left(\dfrac{|\bar{X}_n - \mu|}{\sigma/\sqrt{n}} \le 1.96\right) = P\!\left(|\bar{X}_n - \mu| \le \dfrac{1.96\,\sigma}{\sqrt{n}}\right) = .95$

The expression inside the absolute-value brackets defines the probability interval. For fixed n, and assuming known μ and σ, the sample mean will be in that interval with very high probability. Note: repeated samples of size n and Z ~ N(0, 1).

For non-normal Z and large n, the probability is approximately .95 thanks to the CLT, and the sample mean is approximately normal.
***

In a more realistic setting, μ is unknown (assume known σ for ease); from the probability interval method we derive a confidence interval for the unknown parameter. The emphasis is on the method, which does not depend on having observed any data. The interval

$\bar{X}_n \pm 1.96\,\sigma/\sqrt{n}$

is random because $\bar{X}_n$ is random. The interval is fixed, however, for a given sample. Once calculated, the interval is called a "confidence interval" (CI) at the (1 - α)% level and is no longer random ➔ any other sample derived from the same population will generate a mean that either lies within the just-found confidence interval or not.

➔ The phrase "the Confidence Interval contains μ 95% of the time" is incorrect. Instead, the CI is the "set of parameter values best supported by the data at the 95% level", denoted as 1 - α, α = .05.

➔ Bayesian 'credible intervals' do answer that question.
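As a hedged illustration (data set and variable are placeholders), PROC MEANS reports 95% confidence limits for a mean directly:

/* Sketch: 95% confidence limits for the mean of a hypothetical variable X. */
proc means data=work.sample n mean std stderr clm alpha=0.05;
  var x;
run;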
Most basic test: the one-sample test.

We want to test whether a parameter β1, usually a "mean", is equal to a specific value, for instance 0. Take a sample, obtain the mean estimate (and the CLT, etc.). Is the mean estimate close enough to 0 so that we cannot deny that β1 = 0?

Formally: $H_0: \beta_1 = 0$ (null hypothesis), $H_1: \beta_1 \ne 0$ (alternative hypothesis), and the test is

$\dfrac{\hat\beta_1 - \beta_1}{s_{\hat\beta_1}} \sim t(n-1)$, where $\hat\beta_1$ is the parameter estimate.

Since the std is estimated, the distribution is NOT normal, but Student's t with n - 1 degrees of freedom (from a table). t approaches the Normal distribution asymptotically, very quickly.
Nomenclature of Hypothesis testing.
Level of significance of test called alpha (α) to make a
statistical decision about the null hypothesis. It represents the
probability of incorrectly rejecting Hₒ when it is true. Typically
5% but not mandatory. Also called type I error.
Type II error, or β: the probability of not rejecting H0 when it is false.
In general, little effort is made to assess the type II error; doing so is called power analysis, which works with 1 - β, the probability of correctly rejecting the null hypothesis.
p-value: probability, assuming H0 is true, of observing a result
equal to or more extreme than what was actually observed. If p
< α, this suggests data is inconsistent with H0 and it may be
rejected.
Some clarifying notes

If $\hat\beta_1 \sim N(\mu, \sigma^2)$ (appealing usually to the CLT; the distribution of the observations does not need to be normal), then standardizing ➔

$\dfrac{\hat\beta_1 - \mu}{\sigma} \sim N(0, 1)$, or, if $\sigma^2$ is estimated, $\dfrac{\hat\beta_1 - \mu}{\hat\sigma/\sqrt{n-1}} \sim t(n-1)$.

If the standardized $\hat\beta_1 \sim N(0, 1)$, then 95% of likely values are in the [-1.96; 1.96] interval ➔ $\hat\beta_1 \pm 1.96\,\hat\sigma/\sqrt{n-1}$ is a 95% confidence interval.

Full equivalence between CI and p-value.

Next: testing the Null β1 = 0 vs. alternatives. If the data yields an estimate of 1, we cannot reject the null. If it yields 2.5, the null is rejected with p-value < 0.05.
Examples of inferential tests:
1) Are there IQ change differences between 1st- and 3rd-grade elementary school children? (i.e., is the change = 0?)
2) Are Republican voters more affluent than Democrat ones?
3) Is fertilizer A better than fertilizer B?
4) Is treatment (drug, vaccine, etc) good at treating disease?
(extreme heavy use of statistical inference)
5) DNA testing for crimes, paternity tests, etc.
Used in almost every aspect of science.
Which one is Null Hypothesis?
Strong tendency to nominate as Null whichever scheme that is
statistically easier to compute. E.g., in linear models (next
chapters), Null is lack of predictor effect, and focus is on inference
on estimated parameter value.
But suppose question: Is there life in other planets? Should Null be
NO and Alternative YES? Or Reverse?
In drug testing, should Null be NO ADVERSE EFFECT and reverse
for Alternative? But this puts burden of proof on Alternative. Why
not reverse it and then Null: There are Adverse Effects and burden
is on disproving it.
E.g.: testing a screw on a bridge. In case A, if it breaks, not much happens; but in case B, a break-up ➔ a serious accident. In A, the Null is that the screw is safe; in B, that it is unsafe.
P-values.
Notion of p-values is central to statistical inference. P-value is probability of seeing
outcome (from data or experiment) under assumption that it originated from
starting hypothesis (called null), which could be in dispute but is still prevailing
view. E.g., flip assumed ‘fair’ coin 4 times, 50% probability of either Heads or
Tails, and obtain 4 tails (T). Question: is coin fair (i.e., prob (H) = .5)?
Under the assumption p = .5, prob(4 tails / p = 0.5) = .5^4 = .0625, i.e., the p-value is .0625. Is this probability small enough to indicate that we do not believe that p = .5?
For instance, we could flip coin 100 times in sets of 4 to record how many sets
contained 4 tails and compare that proportion to .0625.
The typical threshold to reject the null hypothesis is 5%. In this example, we fail to
reject the null and that is all that can be said. If we had set the threshold at 10%
instead, we would judge the coin to be biased.
➔ p-value: probability of obtaining value of experiment, assuming that null
hypothesis holds. I.e., prob obtaining extreme value of experiment just by
chance. If p-value low (<= 5%?) ➔ null hypothesis should be abandoned, but we
don’t know in favor of what specific alternative.
More on the p-value

It is NOT "the probability that H0 is false", because H0 is 100% true by assumption in the computation. Remember, the p-value conditions on H0 and evaluates the data at hand; it is not 1 - Pr(H0 / data), and it does not give us Pr(H0 / data).

If the p-value is low, the method cannot tell whether the NULL is true and the sample was unusual, or the NULL might be false.

Assume a comparison of a new marketing campaign's results against the established one, and p-value = 3%. Interpretation: assuming that the campaigns' results are similar, a difference this large or larger would be obtained in 3% of studies due to sampling error. It is INCORRECT to say that if we reject the NULL, there is a 3% chance of making a mistake. P-value ≠ [Prob(error rate) = Prob(reject true H)].

For the relationship between p-values and error rates see Sellke et al. (2001): Calibration of p Values for Testing Precise Null Hypotheses, The American Statistician, February 2001, Vol. 55, No. 1.
T-tests: one-way, paired, 2 groups.

The test can be one-sample, e.g., comparing a variable's mean to a specific value to decide on the POPULATION mean μ.

Example (DS1): Is the mean of member_duration equal to 150? H0: mean = 150, H1: mean ≠ 150. Since the p-value < 0.05 (our threshold) ➔ the mean is not 150.

Paired: could be testing math and reading scores on students, so the data is dependent within each subject. Or front left and right tires: the data is paired within each car. (No example provided.)

$H_0: \mu = x_0, \; H_1: \mu \ne x_0$ (2-sided test), with $\bar{x}$ the estimated mean. Test: $t = \dfrac{|\bar{x} - x_0|}{s/\sqrt{n}}$, compared to the critical value of $t(n-1)$; if $t \ge$ the critical t-value, reject $H_0$ ("significant finding"), else fail to reject $H_0$.
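A hedged sketch of the slide's DS1 example (data set and variable names come from the slide; the library and casing are assumptions):

/* Sketch: one-sample t-test of H0: mean member_duration = 150. */
proc ttest data=ds1 h0=150;
  var member_duration;
run;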
T-tests cont.: testing across 2 groups.

Compare the same variable across two groups, say male vs. female, young vs. old, treated patients vs. non-treated, etc. H0: $\mu_1 - \mu_2 = 0$ (and the alternative: that the difference is different from zero). We now have 2 variances.

Issue: whether the variance of each group is equal to the other. If equal, use a weighted average of the two (pooled variances):

$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$

Otherwise (unpooled), Satterthwaite's method provides an asymptotic degrees-of-freedom computation, with

$\text{Unpooled Std Error} = \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$

DS1 example: compare the means of member duration by fraud level.
Inference tests: reject H0 in all cases.

Variable          model_name      Method          t Value   P-value        DF
member_duration   TTEST_2GROUPS   Pooled            13.49      0.00   5958.00
                                  Satterthwaite     14.36      0.00   1982.49
                  TTEST_ONEWAY                     154.84      0.00   5959.00

SAS CODE

ODS LISTING CLOSE;
ods output ttests = ttests statistics = statistics;
%put;
proc ttest data = &INDATA.;
  var &groupvar.;
  class &classvar.;
run;
ods listing;
Important comments on NHST (Null Hypothesis Statistical Testing)

Is a "significant finding" important, or substantial? That is not given by statistical testing but by expert knowledge of the event at hand. Notice that by increasing the sample size 'n', the t-test denominator decreases ➔ higher chance of finding 'significance', but the finding may be irrelevant.

But more than this: we are testing $\Pr(\hat\beta_1 \mid H_0: \beta_1 = 0)$ when we want $\Pr(H_0 \text{ is true} \mid \hat\beta_1)$. NHST fails to reject H0 because it is not being tested.

Example:

Valid reasoning:
1) All persons are mortal.
2) Socrates is a person.
3) ➔ Socrates is mortal.

Invalid reasoning:
1) Most women don't like soccer.
2) Mia Hamm played soccer.
3) ➔ Mia Hamm is not a woman.

The invalid conclusion 3) is a false negative due to 'most' in 1), which is H0. The test part is given by 2) and, accepting that 1) is true, concludes 3), which is wrong.
Important comment on the present practice of NHST
Strong tendency to become priests and adorers of <= 5% p-value,
regardless of whether findings are of practical importance.
Strong tendency to search, transform, manipulate to obtain sacred <=
5% number.
Strong tendency to underreport results that don’t reach 5% p-value.
Strong tendency to disregard Type I vs. Type II error (see next slide).
NHST mostly focuses on TYPE I error.
5% p-value is not part of Constitution, Magna Carta, Ten
Commandments, Koran or any other important document.
And worse: present practice, p-hacking (or data-dredging-
snooping-fishing, cherry-picking….).
Type I vs Type II errors.
From https://heartbeat.fritz.ai/classification-model-evaluation-90d743883106
1) Focusing the analysis just on the data subset where an interesting pattern was found.
2) Disregarding multiple-comparison adjustments, and not reporting non-significant results.
3) Using different tests (e.g., parametric vs. non-parametric) on the same hypothesis and only reporting significant results.
4) Removing 'outliers' to prove a given hypothesis, or choosing data points to obtain significance. Or dropping variables because of problems, imaginary or not (e.g., co-linearity in linear models).
5) Transforming data, especially when modeling, to obtain significant discoveries.

How bad is it in applied work?
P-hacking prevalence: almost universal, except for saints (see Berman, 2018 for p-hacking in marketing).

Note that p-hacking is based on using statistical inference. Methods that do not use statistical inference are not affected. Also, p-hacking can be understood as inferential EDA and thus requires additional data sets to validate the just-found hypotheses.

Data Science typically works by splitting the (enormous) data set into at least 3 data sets (typically randomly):
- Training: find hypotheses and models.
- Validation: check the hypotheses without using inference and without further model search, and obtain estimates.
- Testing: verify the hypotheses if validated. Use the prior estimates to obtain final model results.
But ...

Just validating and testing does not ensure full verification. Further techniques:
- Cross-validation (see the reference at the end of the presentation).
- Bootstrapping and jackknifing (Efron, 1979).
Multiple comparisons ***

Assume not one but 20 hypotheses are tested on the same data set with α = 5%. While the probability of rejecting H0 when H0 is true is 5% for each test, prob(at least one significant result out of 20) = 1 - P(no significant result) = 1 - (1 - 0.05)^20 = 64%.

Different methods handle this: Bonferroni, False Discovery Rate (FDR), positive FDR (pFDR), etc.

Present unfortunate practices:
1) It is very common to focus only on significant findings and to omit the information on non-significant findings. Example: a study on opera singing and sleeping posture finds that singers tend to sleep in a specific position, but omits that most comparisons were insignificant, from diet differences, sleep medication usage, gender, sex, etc.
2) It is very common to report only what is convenient. A company tests a new drug's efficacy on 2 different groups of people and the results are positive for one but negative for the other. The company reports just the positive result.
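A small illustrative data step (not from the deck) reproducing the 64% figure and the corresponding Bonferroni-adjusted threshold:

/* Sketch: family-wise error for 20 independent tests at alpha = 0.05. */
data _null_;
  alpha = 0.05;
  k     = 20;
  p_at_least_one = 1 - (1 - alpha)**k;   /* about 0.64 */
  bonferroni     = alpha / k;            /* 0.0025     */
  put p_at_least_one= bonferroni=;
run;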
Sampling practices

Sampling is seldom done with replacement ➔ the sample size should be large enough to:
1) Diminish the probability of spurious dependence across observations.
2) Ensure that rare categories in nominal variables are still represented. Essentially, sampling w/o replacement in small samples ➔ proportions are distorted.
3) Provide enough observations so that the overall cardinality of the data set is still representative of the population.

Under these conditions, a rule of thumb is to sample at least 10% of the population. Bear in mind that for very large p / n (undefined), the percentage should be higher.
Sampling.
Often entire population data too large or expensive to obtain.
E.g., population data on US males average weight in December
2016.
In other cases, study sample to analyze results expected in
future populations. E.g., population of credit card payers in
future years. Expectation is that future population properties
represented in present sample data.
Two big types of sampling:
a) probabilistic – random sampling.
b) non-probabilistic
Random Sampling

Simple random sampling: equal probability of selection for each observation.

Stratified sampling: random sampling within the strata levels of one (or a few) variable/s, typically categorical (heavily used in classification methods).

Systematic sampling: observations chosen at regular data intervals.

Cluster sampling: sampling within clusters of observations.

Multi-stage sampling: a mixture of the previous methods.
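A hedged sketch of the first two schemes with PROC SURVEYSELECT (data set, stratum variable, rate and seed are placeholders):

/* Sketch: 10% simple random sample. */
proc surveyselect data=work.population out=work.srs
                  method=srs samprate=0.10 seed=2019;
run;

/* Sketch: stratified random sample, 10% within each level of GROUP. */
proc sort data=work.population out=work.pop_sorted;
  by group;
run;
proc surveyselect data=work.pop_sorted out=work.stratified
                  method=srs samprate=0.10 seed=2019;
  strata group;
run;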
Non-random sampling

Convenience sampling: based on data availability.

Purposive sampling: sample only from those elements of interest. E.g., sample those who responded to a credit card solicitation.

Quota sampling: sample until an exact proportion of certain characteristics is achieved.

Referral / snowball sampling: the sample is created by referrals from participants to participants. E.g., sampling of HIV patients, pickpockets... The method is subject to high bias and variance, and cannot claim the statistical significance status of random sampling.
Issues in sampling
Assume research on bus quality of service, for which you
ask questions from people waiting at the bus stop. One
main concern for quality of service is late buses.
You obtain many answers and report on your results.
Problem: people who respond are those more likely to have
waited longer than those already on the bus because you
sample at bus stop. Thus, answers will be biased toward
poor service.
Normality Check: QQ Plots

The plots compare the ordered values of the variable of interest (Y axis) with quantiles of the normal distribution. If the pattern is linear ➔ the variable is normally distributed. The overlaid reference straight line indicates perfect normality.

Description of point pattern                                         Possible interpretation
all but a few points fall on a line                                  outliers in the data (reviewed later)
left end of pattern is below the line; right end is above the line  long tails at both ends of the data distribution
left end of pattern is above the line; right end is below the line  short tails at both ends of the data distribution
curved pattern with slope increasing from left to right             data distribution is skewed to the right
curved pattern with slope decreasing from left to right             data distribution is skewed to the left
staircase pattern (plateaus and gaps)                               data have been rounded or are discrete
data hello2;
  length varname $ 32;
  set training (in = in1 keep = DOCTOR_VISITS   rename = (DOCTOR_VISITS   = varvalue))
      training (in = in2 keep = FRAUD           rename = (FRAUD           = varvalue))
      training (in = in3 keep = MEMBER_DURATION rename = (MEMBER_DURATION = varvalue))
      training (in = in4 keep = NO_CLAIMS       rename = (NO_CLAIMS       = varvalue))
      training (in = in5 keep = NUM_MEMBERS     rename = (NUM_MEMBERS     = varvalue))
      training (in = in6 keep = OPTOM_PRESC     rename = (OPTOM_PRESC     = varvalue))
      training (in = in7 keep = TOTAL_SPEND     rename = (TOTAL_SPEND     = varvalue));
  if in1 then varname = "DOCTOR_VISITS";
  if in2 then varname = "FRAUD";
  if in3 then varname = "MEMBER_DURATION";
  if in4 then varname = "NO_CLAIMS";
  if in5 then varname = "NUM_MEMBERS";
  if in6 then varname = "OPTOM_PRESC";
  if in7 then varname = "TOTAL_SPEND";
  label varname = "Var Name" varvalue = "Variable";
run;

proc univariate data = hello2 noprint;
  class varname;
  var varvalue;
  qqplot / ncol = 3 nrow = 1;
run;
Homework

Select your data set and software. Obtain QQ plots for some variable/s and diagnose them.
The Dow Jones index fell 22.61% in one day in October of 1987. It was a “25-
standard deviation event” according to the prevailing calculations.
That is an occurrence so rare that if the stock market had been open every single
day since the Big Bang… it still shouldn’t happen. And yet it did.
Comment on how the standard deviation was possibly derived and how the 25 std
event was used to justify that it shouldn’t have happened. And let’s not think
about 2007-2009 for the time being.
TV game show. Host (Monty Hall) shows 3 doors to contestant. Behind two of the
doors, there are goats, behind the other one, a car. If the contestant chooses the
right door, he/she wins the car.
Twist: Once the participant has chosen a door and before Monty opens it (and
Monty knows behind which door the car is located), Monty opens one door that
he knows has a goat behind.
At this point, Monty offers the participant the chance to switch his/her election to
the remaining door.
Should the participant switch? Yes? No? Why?
The immediate reaction is that the initial probability of success is 1/3 and after
Monty opened the goat door, the probability is then ½ and thus there is no gain in
switching or staying with the original choice.
Wrong. Let us call the doors 1, 2 and 3. The following table summarizes it all when door 1 is chosen initially (https://en.wikipedia.org/wiki/Monty_Hall_problem):

Car behind door   Monty opens   Result of staying with door 1   Result of switching
1                 2 or 3        wins car                        loses
2                 3             loses                           wins car
3                 2             loses                           wins car

Thus, the probability of winning if the participant switches is 2/3.
Note: This puzzle raised intense and acrimonious debates among statisticians.
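A hedged simulation sketch (not part of the deck) confirming the 2/3 figure; switching wins exactly when the first pick was not the car door.

/* Sketch: simulate 100,000 Monty Hall games. */
data _null_;
  call streaminit(2019);
  do game = 1 to 100000;
    car  = ceil(3 * rand('uniform'));
    pick = ceil(3 * rand('uniform'));
    wins_stay   + (pick  = car);
    wins_switch + (pick ne car);
  end;
  p_stay   = wins_stay   / 100000;   /* about 1/3 */
  p_switch = wins_switch / 100000;   /* about 2/3 */
  put p_stay= p_switch=;
run;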
3. Tea and/or coffee?

In a large group of people, it is known that 70% of them drink coffee and 80% drink tea. What are the lower and upper bounds on the proportion who drink both? (In obvious notation, C and T.)

We know that the total probability satisfies:

$P(T \cup C) = P(T) + P(C) - P(T \cap C) \le 1 \;\Rightarrow\; P(T \cap C) \ge .7 + .8 - 1 = .5$

If $P(T \cup C) = 1$ (everybody drinks something), then $P(T \cap C) = .5$, the lower bound.

Since $P(T \cap C) \le \min(P(T), P(C)) = P(C) = .7$, the upper bound is .7.
Bayesian inference (BI).

Example: coin tossing. An important part of Bayesian inference is the establishment of parameters and models.

Models: mathematical formulation of the observed events.
Parameters: factors in the models affecting the observed data.

For example, in coin tossing, fairness is the parameter of the coin, denoted by θ. The outcome is denoted by D.

Q.: What is Pr(4 heads out of 9 tosses), i.e., Pr(D), given the coin's fairness, P(D / θ)? This is the frequentist way of looking at the problem.

We are more interested in knowing: given D (4 heads out of 9 tosses), what is Pr(θ = 0.5)? By Bayes' Theorem:

$p(\theta / D) = \dfrac{P(D / \theta)\, p(\theta)}{p(D)}$
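A hedged sketch (not from the deck): with a flat Beta(1,1) prior, the posterior for θ after 4 heads in 9 tosses is Beta(5, 6), which can be tabulated and plotted as below; the prior choice and all names are assumptions for illustration.

/* Sketch: posterior density of coin fairness theta, flat prior, 4 heads in 9 tosses. */
data posterior;
  a = 1 + 4;   /* prior alpha + number of heads */
  b = 1 + 5;   /* prior beta  + number of tails */
  do i = 1 to 99;
    theta = i / 100;
    density = pdf('BETA', theta, a, b);
    output;
  end;
run;

proc sgplot data=posterior;
  series x=theta y=density;   /* posterior concentrates near 4/9 */
run;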
p(θ / D): the posterior belief, or probability after observing D.

If we knew θ, the likelihood would give P(D) for any event. p(D) is the 'evidence', estimated over all possible values of θ (there could be more than one parameter).

Significance Tests

P-value. Frequentist: calculate the t-score from a fixed-size sample and obtain the p-value. If the p-value is 3% for mean = 50 ➔ there is a 3% probability of observing a sample mean at least as extreme as 50, assuming some H.

Problem: when n changes for different sample sizes, the t-scores change also ➔ a 5% threshold is meaningless to decide on whether or not to reject H.
CIs

CIs are the p-value's cousins ➔ the same problem with changing n. Since a CI is not a probability distribution, it is impossible to know which values are most probable.

Bayes Factor

The Bayesian equivalent of the p-value. The Bayesian H assumes a point-mass ('spike') distribution at one particular value of the parameter (θ = 0.5 for a fair coin) and zero probability elsewhere (M1). The alternative hypothesis: all values of θ are possible ➔ a flat distribution (M2).

The next slide shows the prior situation (left panel) and the posterior situation (right panel).
Given 4 heads out of 9, the probability has shifted from 0.5 to something larger, as given by panel B. In panel A, the left and right bars indicate the prior probability values of H and A. In panel B, H is now < 0.5 and A > 0.5.

Bayes factor: the ratio of posterior odds to prior odds. Reject H if BF < .1 (suggested).

$BF = \dfrac{p(H / D)}{p(A / D)} \Big/ \dfrac{p(H)}{p(A)}$

Benefit of using the Bayes Factor instead of p-values: no sample size effect.
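A hedged numeric sketch (not from the deck): for 4 heads in 9 tosses, compare the likelihood under the point hypothesis θ = 0.5 with the marginal likelihood under a flat prior on θ, approximated on a grid; this likelihood ratio is one common form of the Bayes factor for H vs. A. All names and the grid size are assumptions.

/* Sketch: Bayes factor for H: theta = 0.5 vs A: theta ~ flat, given 4 heads in 9 tosses. */
data _null_;
  k = 4; n = 9;
  p_D_given_H = pdf('BINOMIAL', k, 0.5, n);          /* likelihood at theta = 0.5 */
  do i = 1 to 100;                                   /* grid approximation of the */
    theta = (i - 0.5) / 100;                         /* marginal under a flat prior */
    p_D_given_A + pdf('BINOMIAL', k, theta, n) * 0.01;
  end;
  bf = p_D_given_H / p_D_given_A;                    /* > 1 favors the point hypothesis */
  put p_D_given_H= p_D_given_A= bf=;
run;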
Credible interval

The Bayesian solution yields a 'posterior' probability distribution function (pdf). Assume it is continuous. Then for any two points a < b in the range of the pdf, it is possible to assert

Pr(a <= X <= b) = c.

If we invert the problem and ask what values a and b yield a probability c for X under the pdf, then [a; b] is a c% credible interval.

There is an infinite number of intervals that can yield c% probability. If the distribution is symmetric, the range is taken such that (1 - c)/2 is left out in each tail.

With credible intervals, it is possible to assert that the interval contains the true value with c% probability, which is not possible with traditional CIs.
Random walks: random events along a discrete time index t (months, steps, number of tests, etc.), measuring a cumulative outcome effect. The time index theoretically goes to infinity.

Example: roll a fair die; if even, walk 100 meters north, else 100 meters south, and repeat 100 times.
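A hedged simulation sketch of the example (names and seed are arbitrary):

/* Sketch: 100-step random walk, +/- 100 meters per step with equal probability. */
data walk;
  call streaminit(2019);
  position = 0;
  do t = 1 to 100;
    step = 100 * (2 * (rand('uniform') > 0.5) - 1);   /* +100 or -100 */
    position + step;
    output;
  end;
run;

proc sgplot data=walk;
  series x=t y=position;   /* cumulative position over time */
run;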
References

Berman, R., et al. (2018): p-Hacking and False Discovery in A/B Testing. Web.
Cross-validation: https://www.cs.cmu.edu/~schneide/tut5/node42.html
Efron, B. (1979): Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics, 7(1): 1-26.
Bollerslev, T. and Hodrick, R. J. (1992): Financial Market Efficiency Tests. NBER Working Paper 4108, National Bureau of Economic Research, Inc.