Lecture: Joint, Conditional and Marginal Probabilities
1. Joint, Conditional and Marginal Probabilities
Last Updated: 24 March 2015
Slideshare: http://www.slideshare.net/marinasantini1/mathematics-for-language-technology
Mathematics for Language Technology: http://stp.lingfil.uu.se/~matsd/uv/uv15/mfst/
Marina Santini
santinim@stp.lingfil.uu.se
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Spring 2015
2. Acknowledgements
• Several slides borrowed from Prof. Joakim Nivre.
• Practical Activities by Prof. Joakim Nivre.
• Required Reading:
– E&G (2013): Ch. 5 (pp. 110-114)
– Compendium (4): 9.2, 9.3, 9.4
– E&G (2013): Ch. 5.2-5.3 (self-study)
• Recommended Reading:
– Sections 3-6 in Goldsmith J. (2007) Probability for Linguists. The University of Chicago, The Department of Linguistics:
• http://hum.uchicago.edu/~jagoldsm/Papers/probability.pdf
3. Outline
• Joint Probability
• Conditional Probability
• Multiplication Rule
• Marginal Probability
• Bayes' Law
• Independence
4. Linguistic Note
• Traditionally, the plural is dice, but the singular is die (i.e. 1 die, 2 dice).
• Modern lexicography says otherwise; see, e.g., MacMillan:
– http://www.macmillandictionary.com/dictionary/british/dice_1
5. Joint vs Conditional
In many situations where we want to make use of probabilities, there are dependencies between different variables or events. For this reason we need the notion of conditional probability, i.e. the probability of an event given some other event.

The conditional probability of A given B is defined as the probability of the intersection of A and B divided by the probability of B:

P(A | B) = P(A ∩ B) / P(B)

The probability of the intersection is referred to as the joint probability, because it is the probability that both A and B occur.

CONDITIONAL = NOT SYMMETRICAL: in general, P(A | B) ≠ P(B | A).
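As a quick numerical illustration of the definition, the following sketch uses made-up probabilities (they are not taken from the slides):

```python
# Conditional probability from the definition: P(A|B) = P(A ∩ B) / P(B).
# All numbers below are illustrative assumptions.

def conditional(p_joint, p_given):
    """Probability of one event given another: joint / marginal."""
    if p_given == 0:
        raise ValueError("conditioning event must have positive probability")
    return p_joint / p_given

p_a_and_b = 0.12  # assumed joint probability P(A ∩ B)
p_a = 0.40        # assumed marginal probability P(A)
p_b = 0.30        # assumed marginal probability P(B)

print(conditional(p_a_and_b, p_b))  # P(A|B) = 0.12 / 0.30 ≈ 0.4
print(conditional(p_a_and_b, p_a))  # P(B|A) = 0.12 / 0.40 ≈ 0.3 (not symmetrical)
```

Note that the two directions give different values, which is exactly the "not symmetrical" point above.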
6. Conditional
When we talk about the joint probability of A and B, we are considering the intersection of A and B, i.e. those outcomes that are both in A and in B. And we ask: how large is that set of events compared to the entire sample space?
7. Example: Bigrams
(Annotations from the slide's bigram figure:)
• 10⁻³ = 1/10³ = 1/1000 = one in a thousand
• one in a million
• joint probability = one in ten million
We apply the formula of conditional probability.
8. From the definition of conditional probability we can derive the Multiplication Rule
One way to compute the probability of A and B (i.e. the joint probability) is to take the probability of B by itself and multiply it by the probability of A given B. Another way to compute the joint probability of A and B is to start with the simple probability of A and multiply that by the probability of B given A:

P(A, B) = P(B) P(A | B) = P(A) P(B | A)
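The two factorizations can be sketched in a few lines; the probabilities here are illustrative assumptions, not from the slides:

```python
# Multiplication rule: P(A, B) = P(B) P(A|B) = P(A) P(B|A).
# Illustrative, assumed numbers.

p_b = 0.30          # assumed P(B)
p_a = 0.40          # assumed P(A)
p_a_given_b = 0.25  # assumed P(A|B)

p_joint = p_b * p_a_given_b  # first factorization: 0.30 * 0.25 = 0.075

# The second factorization must yield the same joint, which pins down P(B|A):
p_b_given_a = p_joint / p_a  # 0.075 / 0.40 = 0.1875

print(p_joint)
print(p_a * p_b_given_a)     # same value via the other factorization
```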
9. Quiz 1: only one answer is correct
Probability is the measure of the likelihood that an event will occur. The higher the probability of an event, the more certain we are that the event will occur.
10. Quiz 1: Solution
1. Smaller than 1 in a million — correct
[P(A, B) = P(B) P(A|B) = 0.0001 (1 in 10,000) × 0.000001 (1 in a million) = 10⁻¹⁰ < 0.000001; P is 1 in 10 billion.]
2. Greater than 1 in a million — incorrect
[Same computation: P(A, B) = 10⁻¹⁰, which is smaller than 0.000001.]
3. Impossible to tell — incorrect
[Given P(A | B) and P(B), we can derive P(A, B) exactly.]
11. Quiz 1: only one answer is correct
We apply the following multiplication rule: P(A,B) = P(B) P(A|B), since we know these elements:
P(B) (i.e. 1/10,000 = 0.0001); P(A|B) (i.e. 1/1,000,000 = 0.000001)
P(A,B) = P(B) P(A|B) = 0.0001 × 0.000001 = 0.0000000001 (= 1 in 10,000,000,000 = 1 in 10 billion)
Result: the intersection of A and B (i.e. people having BOTH a PhD in physics and winning a Nobel Prize) is 1 in 10 billion.
1: Is the probability of 1 in 10 billion smaller than 1 in a million? Yes! 0.0000000001 is smaller than 0.000001.
2: Is the probability of 1 in 10 billion greater than 1 in a million? No! 0.0000000001 is NOT greater than 0.000001.
3: Impossible to predict: INCORRECT! It is possible to compute the probability, because you have all the elements to apply the multiplication rule.
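The Quiz 1 arithmetic can be checked directly with the values given on the slide:

```python
# Checking the Quiz 1 arithmetic with the values from the slide.

p_b = 1 / 10_000             # P(B): 1 in 10,000
p_a_given_b = 1 / 1_000_000  # P(A|B): 1 in a million

p_joint = p_b * p_a_given_b  # multiplication rule: P(A, B) = P(B) P(A|B)

print(p_joint)                  # ≈ 1e-10, i.e. 1 in 10 billion
print(p_joint < 1 / 1_000_000)  # True: smaller than 1 in a million
```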
14. Introduction to the concept of Marginalization
Partition means: the events are disjoint, i.e. they do not have members in common. In other words: their intersection is empty, and their union is the entire sample space. This is a way to divide the sample space into non-overlapping events. (Pairwise comparison generally refers to any process of comparing entities in pairs.)

Given that we have such a partition B₁, …, Bₙ, and given that we are interested in another event A in the same sample space, we can compute the probability of A by summing up the joint probabilities of A with each member of the partition (this is the summation formula in the middle of the slide):

P(A) = Σᵢ P(A, Bᵢ)
15. … continued …
All this seems a very strange method, because we are computing something very simple, i.e. the probability of A, from something more complex involving summation, joint probabilities and conditional probabilities. But it is very useful in situations where we do not know the probability of A, but we do know the joint or the conditional probabilities of A with the members of a partition. Knowing the multiplication rule, we also know that the joint probability of A and Bᵢ can be expressed as the conditional probability of A given Bᵢ times the simple probability of Bᵢ:

Marginal probability: P(A) = Σᵢ P(A, Bᵢ)
Multiplication rule: P(A, Bᵢ) = P(A | Bᵢ) P(Bᵢ)
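Combining the two formulas gives the Law of Total Probability. The sketch below uses an assumed three-way partition with illustrative numbers (not from the slides):

```python
# Law of Total Probability over an assumed partition {B1, B2, B3}.
# All probabilities here are illustrative assumptions.

p_b = {"B1": 0.5, "B2": 0.3, "B3": 0.2}             # simple probs; sum to 1
p_a_given_b = {"B1": 0.10, "B2": 0.40, "B3": 0.25}  # assumed conditionals

# Multiplication rule gives each joint: P(A, Bi) = P(A|Bi) P(Bi)
joint = {b: p_a_given_b[b] * p_b[b] for b in p_b}

# Marginalization: P(A) = sum over i of P(A, Bi)
p_a = sum(joint.values())
print(p_a)  # 0.05 + 0.12 + 0.05 ≈ 0.22
```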
16. Joint, Marginal & Conditional Probabilities
What is important is to understand the relation between the joint, the marginal and the conditional probabilities, and the way we can derive them from each other. In particular, given that we know the joint probabilities of the events we are interested in, we can always derive the marginal and conditional probabilities from them, whereas the opposite does not hold (except under some special conditions). Note that the joint probabilities sum up to 1.

What if we want the simple probabilities? Once we have the joint probabilities and the simple probabilities, we can combine these to get the conditional probabilities.
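This derivation chain (joint → marginal → conditional) can be sketched with an assumed joint distribution over two variables; the numbers are illustrative, not from the slides:

```python
# Deriving marginals and conditionals from an assumed joint distribution
# over two variables X in {a, b} and Y in {c, d} (illustrative numbers).

joint = {
    ("a", "c"): 0.10, ("a", "d"): 0.30,
    ("b", "c"): 0.20, ("b", "d"): 0.40,
}  # the joint probabilities sum up to 1

# Marginals: sum the joint over the other variable.
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# Conditionals: joint divided by the relevant marginal.
p_x_given_c = {x: joint[(x, "c")] / p_y["c"] for x in p_x}

print(p_x)          # marginal of X: a -> 0.4, b -> 0.6
print(p_x_given_c)  # P(X | Y=c): a -> 1/3, b -> 2/3
```

Going the other way is not possible in general: the marginals alone do not determine the joint, which is the asymmetry the slide points out.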
18. Bayes' Law
Given events A and B in the sample space Ω, the conditional probability of A given B is equal to the simple probability of A times the inverse conditional probability, i.e. the probability of B given A, divided by the simple probability of B:

P(A | B) = P(A) P(B | A) / P(B)

We know, thanks to the multiplication/chain rule, that the joint probability can be replaced by the simple probability multiplied by the conditional probability.

Bayes' Law is a powerful tool that allows us to invert conditional probabilities. When we find ourselves in a situation where we need to know the probability of A given B, but our data gives us only the probability of B given A, we can invert the expression and get the probabilities that we need. (A little bit more on this next time.)
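A minimal sketch of this inversion, with made-up numbers (they are assumptions, not from the slides):

```python
# Bayes' Law with assumed numbers: recovering P(A|B) when the data only
# gives us P(B|A), P(A) and P(B).

def bayes(p_a, p_b_given_a, p_b):
    """P(A|B) = P(A) * P(B|A) / P(B)."""
    return p_a * p_b_given_a / p_b

p_a = 0.2          # assumed simple probability P(A)
p_b_given_a = 0.6  # assumed inverse conditional P(B|A)
p_b = 0.3          # assumed simple probability P(B)

p_a_given_b = bayes(p_a, p_b_given_a, p_b)
print(p_a_given_b)  # 0.2 * 0.6 / 0.3 ≈ 0.4

# Both factorizations of the joint probability must agree:
assert abs(p_a_given_b * p_b - p_b_given_a * p_a) < 1e-12
```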
19. Independence
Two events A and B are independent if and only if the joint probability of A and B is equal to the simple probability of A multiplied by the simple probability of B:

P(A, B) = P(A) P(B)

This is equivalent to saying that the probability of A by itself is equal to the conditional probability of A given B, or vice versa that the simple probability of B is equal to the probability of B given A.

One way to think of this is to say that if two events are independent, knowing that one of them has occurred does not give us any new information about the other event, because the conditional probability is the same as the simple probability.
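Independence can be checked exactly on a classic two-dice example (this example is an addition, not from the slides): A = "the first die shows 6", B = "the second die shows 6".

```python
# Exact independence check with two fair dice, using rational arithmetic.

from fractions import Fraction
from itertools import product

sample_space = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

def prob(event):
    """Exact probability of an event (a predicate on outcomes)."""
    return Fraction(sum(1 for o in sample_space if event(o)), len(sample_space))

A = lambda o: o[0] == 6
B = lambda o: o[1] == 6

p_a = prob(A)                            # 1/6
p_b = prob(B)                            # 1/6
p_joint = prob(lambda o: A(o) and B(o))  # 1/36

print(p_joint == p_a * p_b)  # True: the dice are independent
print(p_joint / p_b == p_a)  # True: P(A|B) = P(A), B gives no new information
```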
22. Quiz 2: Solutions (Joakim's original)
1. The probability is 0.1 — incorrect [We cannot compute P(A | B) from P(B | A) without additional information.]
2. The probability is 0.9 — incorrect [We cannot compute P(A | B) from P(B | A) without additional information.]
3. Nothing — correct [We cannot compute P(A | B) from P(B | A) without additional information.]
23. Quiz 2: Solutions
1. The probability is 0.1 — incorrect [We cannot compute P(Dis|Sym) from P(Sym|Dis) without additional information.]
2. The probability is 0.9 — incorrect [We cannot compute P(Dis|Sym) from P(Sym|Dis) without additional information.]
3. Nothing — correct [We cannot compute P(Dis|Sym) from P(Sym|Dis) without additional information.]
24. Break down
• P(Sym|Dis) = 0.9 → P(B|A) = 0.9
• P(Dis|Sym) = ? → P(A|B) = ?
• Bayes: P(A|B) = P(A) P(B|A) / P(B)
• P(A) = ?
• P(B) = ?
We need additional info, i.e. P(A) and P(B). Can we use marginalization / the Law of Total Probability to derive P(A) and P(B)?
Total number of individual outcomes.
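To make the break-down concrete, the sketch below completes it with hypothetical values: P(Dis) and P(Sym | no Dis) are NOT given on the slide, so the numbers here are assumptions for illustration only.

```python
# Completing the break-down with assumed values; only P(Sym|Dis) = 0.9
# comes from the slide, the other inputs are hypothetical.

p_dis = 0.01               # assumed prior P(A) = P(Dis)
p_sym_given_dis = 0.9      # P(B|A) = P(Sym|Dis), from the slide
p_sym_given_no_dis = 0.05  # assumed P(Sym|no Dis)

# Law of Total Probability over the partition {Dis, no Dis}:
p_sym = p_sym_given_dis * p_dis + p_sym_given_no_dis * (1 - p_dis)

# Bayes' Law: P(Dis|Sym) = P(Dis) P(Sym|Dis) / P(Sym)
p_dis_given_sym = p_dis * p_sym_given_dis / p_sym

print(round(p_sym, 4))            # 0.0585
print(round(p_dis_given_sym, 4))  # 0.1538
```

With these assumptions, even a 90% symptom rate among the diseased yields only about a 15% probability of disease given the symptom, because the prior P(Dis) is small.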