PHIL 6334 - Probability/Statistics Lecture Notes 1:
Introduction to Probability and Statistical Inference
Aris Spanos [Spring 2014]

1 Introduction

1.1 Important distinctions

1. The first distinction that needs to be kept in mind when dealing with probability
and inference is that between:
(i) the real world of actual data,
(ii) Plato’s world of mathematics
Conflating the two can easily lead to serious category mistakes.
2. The second distinction pertains to two different types of information in empirical modeling:
[i] statistical information reflected in chance regularity patterns,
[ii] substantive subject matter information
This distinction is important because the primary objective of empirical modeling is
‘to learn from data’ about observable stochastic phenomena of interest using statistical models. A statistical model aims to account for all the systematic statistical
information in the data and functions as a mediator between the data and the phenomenon of interest. Substantive subject matter information sharpens and enhances
this learning from data when it does not contradict the statistical information. The
ultimate aim is to blend the two sources of information after validating that information vis-a-vis the data.
3. The third distinction within probability theory pertains to the basic concepts
defining the mathematical set up:
(a) ( F P()): events and probabilities, e.g. P(|) ∈F ∈F
(b) {(; ) ∈Θ ∈R } : random variables and probabilities
In statistical modeling the appropriate set up to be used is (b), because real-world
data are usually expressed in terms of numbers. However, people who have been
trained to think in terms of (a) often conflate events with random variables and
statistical hypotheses, leading to serious muddiness in discussions of inference.
Example 1: in the question "what is the probability that Joan has the disease given that she tested positive?", 'Joan has the disease' refers to an event, not to a legitimate hypothesis for frequentist testing.
Example 2: be suspicious of notation like P(e|H) or P(H|e), where e and H denote evidence and a hypothesis; H is never an event in frequentist inference; see
Spanos, A. (2010), “Is Frequentist Testing Vulnerable to the Base-Rate Fallacy?”
Philosophy of Science, 77: 565-583.
1.2 Dual role of probability theory

Probability theory is part of mathematics that provides the foundation and the overarching framework for:
(1) Statistical modeling: the mathematical concepts needed to model (describe) chance regularity patterns in the form of a statistical model M_θ(x): a set of probabilistic assumptions specifying a generating mechanism that could have given rise to the data x0 := (x1, x2, ..., xn), and
(2) Statistical inference: the framework for model-based (M_θ(x)) statistical inference - estimation (point and interval), testing, prediction and policy analysis.
Probability theory provides ways to ‘calibrate’ the reliability and optimality of inference procedures using error probabilities.

2 Phenomena amenable to statistical modeling

Stochastic: observable phenomena which exhibit chance (non-deterministic) regularities are amenable to statistical modeling. Such phenomena vary from simple
games of chance (tossing coins, casting dice, playing roulette, etc.), to highly
complicated experiments in physics and chemistry, as well as observable phenomena
in economics, astronomy, geology, the biological and social sciences.
• Example 1. Tossing a coin and noting the outcome: Heads (H) or Tails (T).
• Example 2. Sampling with replacement from an urn which contains red (R)
and black (B) balls.
• Example 3. Observing the gender (B or G) of newborns during a certain
period (a month) in NY city.
• Example 4. Casting two dice and adding the dots of the two sides facing up.
• Example 5. Tossing a coin twice and noting the outcome.
• Example 6. Tossing a coin until the first "Heads" occurs.
• Example 7. Counting the number of emergency calls to a regional hospital
during a certain period (a week).
––—
• Example 8. Sampling without replacement from an urn which contains red
(R) and black (B) balls.
• Example 9. Observing the daily changes of the Dow Jones (D-J) index during
a certain period.
• Example 10. Observing the annual average global surface air temperature
1856-2011.
2.1 Chance regularity vs. deterministic regularity

Deterministic regularity can be fully described using mathematical functions among variables, e.g. y_t = cos(t), t = 1, 2, ..., n.
Fig. 1: t-plot of y_t = cos(t)
Stochastic regularity: data x0 := (x1, x2, ..., xn) (table 1) shown in fig. 2.

Fig. 2: t-plot of data that exhibit chance regularities
Table 1 - Observed data on –––

 3 10 11  5  6  7 10  8  5 11  2  9  9  6  7  8  5  4  6 11
 7 10  5 11  8  9  5  7  3  4  9 10  3  6  9  7  5  8  5  7
 7  6 12  9 10  8  4  7  6  5 12  8  7  5  9  8 10  2  7  3
 8 10 10  4  7  4  6  9  6 12  8 11  9  6  2  9  6  4  7  8
10  5  4  8  6  5  4  7  8  7  6  8  7  9  6  7 11  7  8  3
Just glancing through the numbers one cannot easily detect chance regularity patterns, hence the various graphical techniques designed to bring out such patterns.
Chance regularities: [D] Distribution, [M] Dependence, [H] Heterogeneity
[D] Distribution chance regularities
Thought experiment 1. Think of the observations as little squares with equal area, rotate figure 2 clockwise by 90°, and let the squares representing the observations fall vertically, creating a pile on the x-axis.
[1] Distribution: when n is large enough the data form a perceptible shape (histogram).

Fig. 3: A typical realization of a NIID process

Fig. 4: Assessing the Distribution assumption
Let us return to the data in table 1.

Fig. 2: t-plot of data that exhibit chance regularities
Applying thought experiment 1 to the data depicted in figure 2 yields the histogram given below.

Fig. 5: The histogram of the data in fig. 2
As argued below, the above data x0 := (x1, x2, ..., xn) could be viewed as a typical realization of an Independent and Identically Distributed (IID) process {X1, X2, ..., Xn}, generically denoted by {X_t, t ∈ N = (1, 2, ...)}, whose distribution appears to be symmetrically triangular. How do we assess IID using graphical techniques?
Assessing [M] Dependence chance regularities
Thought experiment 2. Hide the observations that follow a certain value of the index, say t = 40, and try to guess the next outcome. Repeat this along the observation index axis. If it turns out to be impossible to use the previous observations to guess the value of the next observation (excluding the extreme cases 2 and 12), then the chance regularity pattern we call independence is present. Intuitively, we can summarize this notion as follows:
[2] Independence: in any sequence of trials the outcome of any one trial does
not influence and is not influenced by that of any other.

Fig. 6: A typical realization of a positively dependent process

Fig. 7: A typical realization of a negatively dependent process
Typical realization of a Bernoulli IID process
Typical realization of a dependent Bernoulli process
What can one say about dependence for the data in table 1 (fig. 2)? They exhibit
no dependence patterns.
Assessing [H] Heterogeneity chance regularities
Thought experiment 3. Take a frame wide enough to cover the spread of the fluctuations in a t-plot such as figure 2 and long enough (roughly less than half the length of the horizontal axis), and let it slide from left to right along the horizontal axis, looking at the picture inside the frame as it slides along. If the picture does not change significantly, the data exhibit homogeneity.
Another way to view this pattern is in terms of the arithmetic average, and the variation around that average, of the numbers as we move from left to right. It appears as though this sequential average is relatively constant around 7, and the variation around this constant average value stays within constant bands.
This chance regularity can be intuitively summarized by the following notion:
[3] Homogeneity: the probabilities associated with the various outcomes remain
identical for all trials.

Fig. 8: A typical realization of a NIID process

Fig. 9: A typical realization of a mean-heterogeneous process
What can one say about heterogeneity for the data in table 1 (fig. 2)? The data exhibit no heterogeneity patterns.

Fig. 2: t-plot of data that exhibit chance regularities
The data in fig. 2 should be contrasted with the data in figures 10-11.
Fig. 10: t-plot of the Dow Jones Index
Fig. 11: Surface air temperature, annual series
The data sets depicted in fig. 10-11 exhibit departures from the IID assumptions;
they indicate both temporal dependence and heterogeneity.
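To make the visual distinctions above concrete, the following Python sketch simulates the three kinds of series contrasted in this section - NIID, positively dependent, and mean-heterogeneous - so the reader can reproduce t-plots of the kind shown in figs. 3, 6 and 9. The AR(1) coefficient, trend slope and seed are illustrative choices, not values used in the notes.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 100

# NIID: independent N(0,1) draws, constant mean and variance (cf. fig. 3)
niid = rng.normal(0.0, 1.0, n)

# Positive dependence: AR(1) process x_t = 0.8*x_{t-1} + e_t (cf. fig. 6)
dep = np.zeros(n)
for t in range(1, n):
    dep[t] = 0.8 * dep[t - 1] + rng.normal(0.0, 1.0)

# Mean heterogeneity: deterministic trend plus NIID noise (cf. fig. 9)
het = 0.05 * np.arange(n) + rng.normal(0.0, 1.0, n)

fig, axes = plt.subplots(3, 1, figsize=(8, 6), sharex=True)
for ax, series, title in zip(axes, [niid, dep, het],
                             ["NIID", "positively dependent", "mean-heterogeneous"]):
    ax.plot(range(1, n + 1), series, marker=".", linestyle="-")
    ax.set_title(f"Typical realization of a {title} process")
axes[-1].set_xlabel("index (t)")
plt.tight_layout()
plt.show()
```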

2.2 Empirical regularities → probabilities

The question that naturally arises is whether the available substantive information
pertaining to the mechanism that gave rise to the data in fig. 2 would affect the
choice of a statistical model. Common sense suggests that it should, but it is not
clear what the extent of that influence should be. Let us discuss that in stages.
2.2.1 The Data Generating Mechanism (DGM)

The data in table 1 (fig. 2) were generated by a sequence of n = 100 trials of casting
two dice and adding the dots of the two sides facing up. This game of chance was very
popular in medieval times with soldiers waiting for weeks on end to assail European
cities they had under siege. After thousands of trials these illiterate soldiers learned
empirically that the number 7 occurs more often than any other number and that 6
occurs less often than 7 but more often than 5 etc.

Historically, the formalization process from chance regularity patterns to
probabilities has taken several centuries to come together.
[a] All possible distinct outcomes are known a priori. The first crucial feature of
the DGM is its stochastic nature: at each trial (the casting of two dice) the outcome
(the sum of the dots of the sides) cannot be predicted with any certainty. The only
thing one can say for sure is that the result of each trial will be one of the events
(numbers):
{2 3 4 5 6 7 8 9 10 11 12} 
[b] in any particular trial the outcome is not known a priori, but there exists
a perceptible regularity of occurrence associated with these outcomes.
The second key feature of the DGM is that, under certain conditions, these events
do not occur equally often in this game of chance. It was obvious to the medieval
soldiers that certain events like 7 occur more often than others like 12.
What are these conditions? The third feature of the generating mechanism is:
[c] it can be repeated under (roughly) identical conditions.
These conditions are of paramount importance in modeling stochastic phenomena
because they pertain to the validity of the premises of inference. In this case they
pertain to the physical symmetry of the two dice and the homogeneity of the replication process. In the actual experiment the dice were cast in the same wooden box to
secure a certain form of nearly identical circumstances for each trial.
The question that naturally arose at the time was whether one could explain
(account for) the differences in the empirical relative frequency of occurrence of these
results shown by the histogram (fig. 5) using some kind of mathematical argument.
If one could figure out the odds for different events, one could make money! The
first systematic account of the underlying mathematical reasoning behind fig. 5 was
given by Girolamo Cardano, a notorious gambler, during the 16th century in Italy.
2.2.2 The probabilities behind the empirical frequencies

Cardano was able to get to the probabilities behind the observed relative frequencies
in four steps.
Step 1. He reasoned that since each die has 6 faces (1, 2, ..., 6), casting two dice gives rise to 36 possible (elementary) outcomes associated with the different pairings of these numbers, (D1, D2) (table 2), where Di denotes the number of dots of the i-th die, i = 1, 2.

Table 2 - Casting two dice: set of elementary outcomes

 D1\D2    1      2      3      4      5      6
   1    (1,1)  (1,2)  (1,3)  (1,4)  (1,5)  (1,6)
   2    (2,1)  (2,2)  (2,3)  (2,4)  (2,5)  (2,6)
   3    (3,1)  (3,2)  (3,3)  (3,4)  (3,5)  (3,6)
   4    (4,1)  (4,2)  (4,3)  (4,4)  (4,5)  (4,6)
   5    (5,1)  (5,2)  (5,3)  (5,4)  (5,5)  (5,6)
   6    (6,1)  (6,2)  (6,3)  (6,4)  (6,5)  (6,6)

Step 2. He reasoned that if the casting is repeated under similar conditions and the two dice are symmetric and homogeneous, then all elementary outcomes (D1, D2) in table 2 are equally likely to occur, and thus one can attach probability 1/36 to each.
Step 3. He reasoned that the events of interest {2, 3, ..., 12} can be seen as arising from adding D1 + D2 for each pair in table 2. He collected the different ways the sums D1 + D2 can arise from elementary outcomes and formed table 3.

 2:  (1,1)
 3:  (2,1), (1,2)
 4:  (2,2), (3,1), (1,3)
 5:  (2,3), (3,2), (4,1), (1,4)
 6:  (3,3), (2,4), (4,2), (5,1), (1,5)
 7:  (4,3), (3,4), (5,2), (2,5), (6,1), (1,6)
 8:  (4,4), (5,3), (3,5), (6,2), (2,6)
 9:  (4,5), (5,4), (6,3), (3,6)
10:  (5,5), (6,4), (4,6)
11:  (6,5), (5,6)
12:  (6,6)

Table 3: Distribution of events (Plato’s world)
Step 4. He realized that if each of the elementary outcomes in table 2 has probability of occurrence 1/36, then one can evaluate the probabilities associated with the events {2, 3, ..., 12} by simply adding these probabilities.
For instance, the number 2 can arise from a single combination of faces, (1,1) - each die comes up 1 - hence P(D1+D2=2) = P({(1,1)}) = 1/36; the same applies to the number 12: P(D1+D2=12) = P({(6,6)}) = 1/36. On the other hand, the number 3 can arise from two combinations of faces, {(1,2), (2,1)}, hence P(D1+D2=3) = P({(1,2), (2,1)}) = P(1,2) + P(2,1) = 2/36. The same calculation applies to the number 11: P(D1+D2=11) = P({(6,5), (5,6)}) = 2/36. Similarly, the calculation for the number 7 is:
P(D1+D2=7) = P({(1,6), (6,1), (2,5), (5,2), (3,4), (4,3)}) =
= P(1,6) + P(6,1) + P(2,5) + P(5,2) + P(3,4) + P(4,3) = 6/36.
Continuing this line of reasoning, Cardano was able to construct the probability distribution in table 4, which relates each event of interest in {2, 3, ..., 12} to its probability of occurrence.
Events          2     3     4     5     6     7     8     9    10    11    12
Probabilities  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Table 4 - Probability distribution for the sum of two dice
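Cardano's four steps amount to a simple enumeration, and they can be checked mechanically. A minimal Python sketch (the variable names are illustrative, not part of the notes) that reproduces table 4:

```python
from fractions import Fraction
from itertools import product
from collections import Counter

# Step 1: the 36 equally likely elementary outcomes (d1, d2) of table 2
outcomes = list(product(range(1, 7), repeat=2))

# Steps 2-4: each outcome gets probability 1/36; collect them by the sum d1 + d2
counts = Counter(d1 + d2 for d1, d2 in outcomes)
table4 = {s: Fraction(counts[s], 36) for s in range(2, 13)}

for s, p in table4.items():
    print(s, p)   # 2 1/36, 3 1/18 (=2/36), ..., 7 1/6 (=6/36), ..., 12 1/36
```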

2.3 Probabilities → empirical regularities

The probability distribution in table 4 represents a probabilistic concept (Plato’s
world) formulated by mathematicians in order to capture the distribution chance
regularity exhibited by the data in fig. 2 and summarized by the histogram in fig. 5.
A direct comparison between figures 5 and 12 confirms the soldiers’ intuition in the
sense that the empirical frequencies in fig. 5 are close to the theoretical probabilities
shown in table 4 (fig. 12). Moreover, if we were to repeat the experiment not 100 but 1000 times, these frequencies would typically be even closer to the theoretical probabilities. In this sense we can think of the histogram in fig. 5 as an empirical instantiation of the probability distribution in table 4 (fig. 12).
Fig. 5: The histogram (real world)

Fig. 12: Probability distribution (Plato’s)
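The claim that a larger number of trials brings the relative frequencies closer to the probabilities in table 4 is easy to check by simulation. The sketch below is only illustrative (the seed and sample sizes are arbitrary); it casts two fair dice n times and reports the largest gap between relative frequency and theoretical probability.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2014)
theoretical = {s: (6 - abs(s - 7)) / 36 for s in range(2, 13)}   # table 4

for n in (100, 1000, 100000):
    sums = rng.integers(1, 7, n) + rng.integers(1, 7, n)         # n casts of two dice
    freq = Counter(sums)
    # maximum absolute gap between relative frequency and probability
    gap = max(abs(freq.get(s, 0) / n - p) for s, p in theoretical.items())
    print(f"n={n:6d}  max |rel. freq - prob| = {gap:.4f}")
```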
The probability distribution in table 4 represents an oversimplified statistical model in which one assumes certain conditions of symmetry, homogeneity and sameness in order to evaluate the relevant probabilities without the presence of any unknown parameters. In practice one needs to assess such preconditions before imposing them, and model validation is the first step before the statistical model is used for inference purposes. An important component of model validation is to formally test the relationship between fig. 5 and table 4, to ensure that, indeed, the former could have been generated by a statistical model with the latter as the assumed distribution.
The step from the observed regularities to their formalization (mathematization)
was prompted by the distributional regularity pattern of the data in fig. 2 as exemplified in fig. 5 in the form of a histogram. The formalization itself was initially very
slow, taking centuries to materialize, beginning with figuring out simple combinatorial (combinations and permutations) arguments. We can capture the essence of this
early formalization by taking the dice casting example one step further.

2.4 Same experiment, different questions of interest

When casting two dice, the medieval soldiers used to gamble on whether the outcome would be an odd or an even number (the Greeks had introduced these concepts around 300 B.C.). That is, the soldiers would bet on one of two events:
A = {3, 5, 7, 9, 11},  B = {2, 4, 6, 8, 10, 12}.
At first sight it looks as though betting on B will be a sure winner because there are more even than odd numbers. The medieval soldiers, however, knew by empirical observation that this was not true; it was a fair bet! We can confirm their intuition using the distribution in table 4:
P(A) = P(3) + P(5) + P(7) + P(9) + P(11) = 2/36 + 4/36 + 6/36 + 4/36 + 2/36 = 1/2,
P(B) = P(2) + P(4) + P(6) + P(8) + P(10) + P(12) = 1/36 + 3/36 + 5/36 + 5/36 + 3/36 + 1/36 = 1/2.
This gives rise to the distribution in table 5.
Events          A = {3, 5, 7, 9, 11}    B = {2, 4, 6, 8, 10, 12}
Probabilities           1/2                       1/2

Table 5 - The probability distribution of odd (A) and even (B)
The question of interest is an important aspect of modeling because it determines the particular statistical model that can be used to answer it.
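The fair-bet conclusion can also be checked directly from table 4 by summing probabilities over each event; a minimal Python check (the dictionary encoding of table 4 is an illustrative construction):

```python
from fractions import Fraction

table4 = {s: Fraction(6 - abs(s - 7), 36) for s in range(2, 13)}   # table 4
p_odd = sum(p for s, p in table4.items() if s % 2 == 1)            # event A
p_even = sum(p for s, p in table4.items() if s % 2 == 0)           # event B
print(p_odd, p_even)   # 1/2 1/2
```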

3 Stochastic phenomenon → a probability model

Example 1. Tossing a coin and noting the outcome: Heads (H) or Tails (T).
Example 2. Sampling with replacement from an urn which contains red (R) and
black (B) balls.
Example 3. Observing the gender (B or G) of newborns during a certain period
(a month) in NY city.
Example 4. Casting two dice and adding the dots of the two sides facing up.
Example 5. Tossing a coin twice and noting the outcome.
Example 6. Tossing a coin until the first "Heads" occurs.
Example 7. Counting the number of emergency calls to a regional hospital during
a certain period (a week).
Examples 1-7 constitute the simplest form of a stochastic phenomenon, which can
be both experimental (1-2, 4-6) and observational (3,7), described as
a Random Experiment (E):
[a] All possible distinct outcomes are known a priori,
[b] in any particular trial the outcome is not known a priori, but there exists
a perceptible regularity of occurrence associated with these outcomes,
[c] it can be repeated under (largely) identical conditions.
What renders them simple is condition [c], which gives rise to what are known as random samples.
Examples 8-10 (section 2) constitute more complicated stochastic phenomena,
both experimental (8) and observational (9-10), whose outcomes cannot be realistically viewed as random (IID) samples.
3.1 Mathematical framing I: Events and Probabilities

[a] Formalizing "all possible distinct outcomes"
This is achieved by defining the set S of all possible outcomes.
Examples:
S1 = {H, T}, S2 = {R, B}, S3 = {B, G}, S4 = {2, 3, ..., 12},
S5 = {(HH), (HT), (TH), (TT)}, S6 = {H, TH, TTH, TTTH, ...},
S7 = {0, 1, 2, ...}, S8 = {R, B}, S9 = R := (−∞, ∞), S10 = R+ := (0, ∞).
[b] Formalizing the "perceptible regularity"
The formalization begins with the crucial distinction:
outcome: s ∈ S  vs.  event: A ⊂ S.
That is, an outcome is an element of S, but an event is a subset of S, i.e. an event is any combination of outcomes in S.
Step 1: We define the set F of events of interest and related events associated with the experiment, comprising a set of subsets of S.
Examples:
F1 := {S, ∅, {H}, {T}}, where S denotes the sure event and ∅ the impossible event,
F2 := {S, ∅, {R}, {B}}, F3 := {S, ∅, {B}, {G}},
F4 := {S, ∅, A, B}, A = {3, 5, 7, 9, 11}, B = {2, 4, 6, 8, 10, 12}, assuming one is interested in whether the outcome is an odd or an even number,
F5 := {S, ∅, A, B}, A = {(HH), (HT)}, B = {(TH), (TT)},
F8 := {S, ∅, {R}, {B}}.

Mathematically F is a field: a set of subsets of S which is closed under the set-theoretic operations of union (∪), intersection (∩), and complementation (e.g. A̅ = S − A).
Example:
In the case of F4 := {S, ∅, A, B}: A ∪ B = S ∈ F4, A ∩ B = ∅ ∈ F4, A̅ = B ∈ F4, B̅ = A ∈ F4.
Step 2: We assign probabilities to all the elements of F using:
P(·): F → [0, 1], a set function which satisfies the axioms:
[A1] P(S) = 1,
[A2] P(A) ≥ 0, for any event A ∈ F,
[A3] for A and B in F such that A ∩ B = ∅: P(A ∪ B) = P(A) + P(B).
The probability space (S, F, P(·)) provides an idealized description of the stochastic mechanism that gives rise to the events of interest and related events F, with P(·) assigning probabilities to events in F. Intuitively, an event is something that might or might not occur in a specific trial, and the set-theoretic operations ∪, ∩, and complementation give rise to related events.
Example 1. In the case of Example 1 [tossing a coin and noting the outcome], the probabilities assigned are:
P(S) = 1, P(∅) = 0, P({H}) = θ, P({T}) = 1 − θ, 0 ≤ θ ≤ 1.    (1)
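For a finite outcome set, the triple (S, F, P(·)) can be written out explicitly and the axioms verified by brute force. A minimal Python sketch for the coin-toss example (θ = 0.5 is an arbitrary illustrative value):

```python
from itertools import combinations

S = frozenset({"H", "T"})
F = [frozenset(), frozenset({"H"}), frozenset({"T"}), S]   # the field of events

theta = 0.5                                                # illustrative value of P({H})
P = {frozenset(): 0.0, frozenset({"H"}): theta,
     frozenset({"T"}): 1 - theta, S: 1.0}

assert P[S] == 1.0                                         # [A1]
assert all(P[A] >= 0 for A in F)                           # [A2]
for A, B in combinations(F, 2):                            # [A3] for disjoint events
    if not (A & B):
        assert abs(P[A | B] - (P[A] + P[B])) < 1e-12
print("Axioms A1-A3 hold for this assignment.")
```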
3.2 Mathematical framing II: Random Variables and Probabilities

The above mathematical formulation based on (S, F, P(·)) is highly abstract and not particularly convenient for modeling purposes because data often come in the form of numbers!
For modeling purposes we need to recast the above mathematical formalization into a formulation that relates the events of interest to real numbers. The key to such a recasting is provided by the notion of a random variable.
A random variable X is a real-valued function from the set of outcomes to the real line:
X(·): S → R,
that upholds the event structure of interest in F. That is, each event defined by {s: X(s) = x} belongs to F for all x ∈ R.
Example 4. In light of the fact that F4 := {S, ∅, A, B}, A = {3, 5, 7, 9, 11}, B = {2, 4, 6, 8, 10, 12}, the real-valued function:
X(s) = 1 for s ∈ A, X(s) = 0 for s ∈ B,
defines a random variable with respect to F4, because the events generated by X are always elements of F4: {s: X(s) = 1} = A ∈ F4, {s: X(s) = 0} = B ∈ F4, and {s: X(s) = x} = ∅ ∈ F4 for all x ∉ {0, 1}.
Example 5. In light of the fact that S5 = {(HH), (HT), (TH), (TT)} and F5 := {S, ∅, A, B}, A = {(HH), (HT)}, B = {(TH), (TT)}, the real-valued function Y:
Y(HH) = Y(TT) = 0, Y(HT) = 1, Y(TH) = 2,
is not a random variable with respect to F5, because some events generated by Y(·) do not belong to F5, e.g. {s: Y(s) = 0} = {(HH), (TT)} ∉ F5.
However, this does not preclude the possibility of defining a different field relative to which Y constitutes a random variable. How? First, generate the primary events of interest associated with Y:
{s: Y = 0} = {(HH), (TT)} := B1, {s: Y = 1} = {(HT)} := B2, {s: Y = 2} = {(TH)} := B3,
and, second, use these primary events (B1, B2, B3) to define the related events specifying the field generated by Y:
σ(Y) = {S, ∅, B1, B2, B3, B1 ∪ B2, B1 ∪ B3, B2 ∪ B3}.
Verify that σ(Y) is closed under the set-theoretic operations of union (∪), intersection (∩), and complementation!
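The verification can be mechanized. The sketch below uses the outcome labels from the reconstruction of Example 5 above (an illustrative assignment of Y), builds σ(Y) as all unions of the primary events, and confirms closure under union, intersection and complementation.

```python
from itertools import combinations, chain

S = frozenset({"HH", "HT", "TH", "TT"})

# Primary events: preimages of Y, using Y(HH)=Y(TT)=0, Y(HT)=1, Y(TH)=2
B1, B2, B3 = frozenset({"HH", "TT"}), frozenset({"HT"}), frozenset({"TH"})
primary = [B1, B2, B3]

# sigma(Y): the empty set plus every union of primary events
sigma_Y = {frozenset()} | {frozenset(chain.from_iterable(c))
                           for r in range(1, 4)
                           for c in combinations(primary, r)}

# Closure checks: complement, union and intersection of members stay in sigma(Y)
assert all(S - A in sigma_Y for A in sigma_Y)
assert all(A | B in sigma_Y and A & B in sigma_Y
           for A in sigma_Y for B in sigma_Y)
print(sorted(len(A) for A in sigma_Y))   # sizes of the 8 events: [0, 1, 1, 2, 2, 3, 3, 4]
```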
In the case where X(·) is a random variable relative to F, all events {s: X(s) = x}, x ∈ R, belong to F, and thus one can attach probabilities via:
P({s: X = x}) = f(x) for all x ∈ R,
where we have now traded the set function P(·): F → [0, 1] for the density function f(·): R → [0, 1], which is a point-to-point numerical function.
In Example 4 above, the density function is the Bernoulli:
f(x; θ) = 1 − θ for x = 0 and θ for x = 1, or equivalently f(x; θ) = θ^x (1 − θ)^(1−x), x = 0, 1.
3.2.1 The notion of a Probability model

Using the concept of a random variable one transforms the original formalization (S, F, P(·)) into something more data-friendly and richer in mathematical structure:
(S, F, P(·))  --X(·)-->  {f(x; θ), θ ∈ Θ, x ∈ R_X},
where f(x; θ), x ∈ R_X, denotes a density function and θ its unknown parameters. This transformed probability set-up defines what is known as the probability model, Φ = {f(x; θ), θ ∈ Θ, x ∈ R_X}, which is one of the two components defining a statistical model; the other is known as the sample model.
Particular examples of probability models:
Normal:    Φ = { f(x; θ) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)), θ := (μ, σ²) ∈ R×R+, x ∈ R }
Beta:      Φ = { f(x; θ) = x^(α−1)(1 − x)^(β−1)/B[α, β], θ := (α, β), α > 0, β > 0, 0 ≤ x ≤ 1 }
Bernoulli: Φ = { f(x; θ) = θ^x (1 − θ)^(1−x), 0 < θ < 1, x = 0, 1 }
Poisson:   Φ = { f(x; θ) = (e^(−θ) θ^x)/x!, θ > 0, x = 0, 1, 2, 3, ... }

Example 1. Let us return to Example 4, where the random variable X takes the value 1 for an odd sum and 0 for an even sum; the relevant probability model is the Bernoulli given above. That is, we can use it to model the experiment of casting two dice and observing whether the sum of the dots is even or odd, without having to assume that we have perfectly symmetric and homogeneous dice, because we can leave θ as an unknown parameter.
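These four parametric families are available in standard statistical libraries; the sketch below (the parameter values are arbitrary illustrations) evaluates each density or mass function with scipy.stats, whose parametrizations match the ones given above (note that scipy's `scale` for the Normal is σ, not σ²).

```python
from scipy import stats

# Normal N(mu=0, sigma^2=1): scipy's `scale` is the standard deviation sigma
print(stats.norm.pdf(0.5, loc=0.0, scale=1.0))

# Beta(alpha=2, beta=3) density on [0, 1]
print(stats.beta.pdf(0.4, a=2, b=3))

# Bernoulli(theta=0.5) mass function at x = 1
print(stats.bernoulli.pmf(1, p=0.5))

# Poisson(theta=3.2) mass function at x = 2
print(stats.poisson.pmf(2, mu=3.2))
```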

4 Useful probability concepts

4.1 Discrete vs. continuous random variables

The above definition of a random variable is confined to functions that take only discrete values.
(a) For a countable S, the function X(·): S → R is a random variable (r.v.) relative to a particular event space of interest F if:
{s: X(s) = x} ∈ F for all x ∈ R.
In words, if all the events {s: X = x} for any real number x belong to F, then X(·) is a r.v. relative to this F. Intuitively, X is a numbering mapping that preserves the event structure of interest F.
(b) For an uncountable S, the function X(·): S → R is a random variable relative to a particular event space of interest F if:
{s: X(s) ≤ x} ∈ F for all x ∈ R.
That is, in this case the events of interest are defined by {s: X ≤ x}, and the connection between P(·) and f(x) is more complicated because it goes through the cumulative distribution function F(x):
P({s: X(s) ≤ x}) = F(x) = ∫ from −∞ to x of f(u)du, for all x ∈ R.
Note that under certain regularity conditions: f(x) = dF(x)/dx.

4.2 Density function

Properties of a density function:
      Continuous r.v. X                              Discrete r.v. X
f1:   f(x) ≥ 0 for all x ∈ R_X                       f(x) ≥ 0 for all x ∈ R_X
f2:   ∫ from −∞ to ∞ of f(x)dx = 1                   Σ over x ∈ R_X of f(x) = 1
f3:   F(b) − F(a) = ∫ from a to b of f(x)dx          F(b) − F(a) = Σ over a < x ≤ b of f(x)

However, there is one crucial difference that these properties do not bring out:
discrete:    f(x): R → [0, 1],
continuous:  f(x): R → [0, ∞).
That is, in the discrete r.v. case one can think of a density function as assigning probabilities to all events of the form {X = x}, x ∈ R_X, but in the continuous r.v. case one cannot, because f(x) can be bigger than one! Hence, in the continuous case we can think of assigning probabilities over a small interval (x ≤ X ≤ x + dx):
P(x ≤ X ≤ x + dx) ≃ f(x)dx for some small dx > 0,
where dx → 0 is the infinitesimal interval we use in calculus.
Example. Consider the function f(x) = cx², 0 < x < 1. This function satisfies f1 if c > 0, but to satisfy f2 we need to ensure that ∫ from 0 to 1 of f(x)dx = 1. For that to be the case we solve ∫ from 0 to 1 of cx²dx = 1 for c. Since ∫ from 0 to 1 of x²dx = 1/3, it follows that c = 3, i.e. f(x) = 3x², 0 < x < 1.
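The normalization step is a one-line symbolic computation; a minimal sketch using sympy:

```python
import sympy as sp

x, c = sp.symbols("x c", positive=True)
# Solve f2: the integral of c*x**2 over (0, 1) must equal 1
sol = sp.solve(sp.Eq(sp.integrate(c * x**2, (x, 0, 1)), 1), c)
print(sol)   # [3]  ->  f(x) = 3*x**2 on 0 < x < 1
```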

4.3 Moments of distributions

For modeling and inference purposes we need to relate the unknown parameter(s) θ to certain features of the assumed density function f(x; θ), x ∈ R_X, and the best way to do that is to relate θ to the moments of f(x; θ), because this makes modeling and inference easier to handle.
(a) Raw moments: E(X^r) = Σ over x ∈ R_X of x^r f(x; θ) (discrete), or E(X^r) = ∫ over x ∈ R_X of x^r f(x; θ)dx (continuous), r = 1, 2, ...

Mean. The most widely used raw moment is the mean:
E(X) = Σ over x ∈ R_X of x f(x; θ) (discrete),  E(X) = ∫ over x ∈ R_X of x f(x; θ)dx (continuous).

Example 1. In the case of the simple Bernoulli distribution:

x         0      1
f(x; θ)  1−θ     θ

E(X) = Σ over x ∈ R_X of x f(x; θ) = (0)(1−θ) + (1)θ = θ.
Example 2. Consider the distribution of the sum of two dice, X = (D1 + D2):

x      2     3     4     5     6     7     8     9    10    11    12
f(x)  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Table 4 - Probability distribution for the sum of two dice

E(X) = 2(1/36) + 3(2/36) + 4(3/36) + 5(4/36) + 6(5/36) + 7(6/36) +
     + 8(5/36) + 9(4/36) + 10(3/36) + 11(2/36) + 12(1/36) = 7.

That is, x = 7 is not just the number with the highest probability, it is also the mean of the distribution.
Example 3. For the density function f(x) = 3x², 0 < x < 1:
E(X) = ∫ from 0 to 1 of x(3x²)dx = 0.75.
(b) Central moments: E([X − E(X)]^r) = Σ over x ∈ R_X of [x − E(X)]^r f(x; θ) (discrete), or E([X − E(X)]^r) = ∫ over x ∈ R_X of [x − E(X)]^r f(x; θ)dx (continuous), r = 2, 3, ...
(i) Variance. The most widely used central moment is the variance:
Var(X) = E([X − E(X)]²) = Σ over x ∈ R_X of [x − E(X)]² f(x; θ).

Example 1. In the case of the simple Bernoulli distribution:
Var(X) = Σ over x ∈ R_X of (x − θ)² f(x; θ) = (1−θ)²θ + (0−θ)²(1−θ) = θ(1−θ).

Example 2. In the case of the sum of two dice, X = (D1 + D2):
Var(X) = (2−7)²(1/36) + (3−7)²(2/36) + (4−7)²(3/36) + (5−7)²(4/36) + (6−7)²(5/36) + (7−7)²(6/36) +
       + (8−7)²(5/36) + (9−7)²(4/36) + (10−7)²(3/36) + (11−7)²(2/36) + (12−7)²(1/36) = 5.833.

Example 3. For the density function f(x) = 3x², 0 < x < 1:
Var(X) = ∫ from 0 to 1 of [x − E(X)]² f(x)dx = ∫ from 0 to 1 of (x − 0.75)²(3x²)dx = 0.0375.
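Both moment calculations above - the discrete sum-of-two-dice case and the continuous f(x) = 3x² case - can be checked mechanically; a short illustrative sketch using fractions and sympy:

```python
from fractions import Fraction
import sympy as sp

# Discrete case: sum of two dice (table 4)
f = {s: Fraction(6 - abs(s - 7), 36) for s in range(2, 13)}
mean = sum(s * p for s, p in f.items())
var = sum((s - mean) ** 2 * p for s, p in f.items())
print(mean, var, float(var))          # 7  35/6  5.833...

# Continuous case: f(x) = 3*x**2 on (0, 1)
x = sp.symbols("x")
fx = 3 * x**2
mu = sp.integrate(x * fx, (x, 0, 1))
sigma2 = sp.integrate((x - mu) ** 2 * fx, (x, 0, 1))
print(mu, sigma2)                     # 3/4  3/80 (= 0.0375)
```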
(ii) Standard deviation: √Var(X). This measure can be used to render a random variable X unit-less, in the sense that the ratio X/√Var(X) does not depend on the units in which X is measured!
Warning: Statistics textbooks often flout the distinction between Plato’s world
and the real world by using the same term to refer to two very different concepts, as
shown below.

                 Plato’s world                                Real world
mean:       μ = Σ over x ∈ R_X of x f(x; θ)              x̄ = (1/n) Σ from t=1 to n of x_t
variance:   σ² = Σ over x ∈ R_X of (x − μ)² f(x; θ)      s² = (1/n) Σ from t=1 to n of (x_t − x̄)²

5 Important concepts

Deterministic vs. stochastic (chance) regularities
Chance regularities: Distributional, Dependence and Heterogeneity
Random Experiment (E)
Set of all possible outcomes (S)
Set of events of interest and related events (F)
Probability set function (P(·))
Random variables (X(·))
Field generated by a random variable X, σ(X)
Density function (f(x))
Probability model (Φ = {f(x; θ), θ ∈ Θ, x ∈ R_X})
Mean (E(X)), variance (Var(X))