Upcoming SlideShare
×

# Probability 1011

2,648 views
2,454 views

Published on

an embarrasingly simple intro to probability

0 Likes
Statistics
Notes
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

• Be the first to like this

Views
Total views
2,648
On SlideShare
0
From Embeds
0
Number of Embeds
6
Actions
Shares
0
15
0
Likes
0
Embeds 0
No embeds

No notes for slide

### Probability 1011

1. 1. Elements of Mathematics: an embarrasignly simple (but practical) introduction to probability Jordi Vill` i Freixa (jordi.villa@upf.edu) a November 23, 2011 Nota de classe Amb notes de classe!Contents 1
2. 2. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences Most of the material for this report has been modiﬁed from [?, ?]. See also [?]. Nota de classe Proposar l’experiment de tirar dos cops el mateix dau. Un experiment ´s un trial i pot donar, ecom outcome, qualsevol parella de 1 a 6; El Sample space cont´ 12 trials; jugar a posar-hi events en eforma gr`ﬁca a1 Venn diagrams • We call a single performance of an experiment a trial and each possible result outcome. • The sample space Ω of the experiment is the set of all possible outcomes of an individual trial • Sample spaces can be ﬁnite (throwing a die) or inﬁnite (measuring people’s height) An event is a subset of the sample space. A venn diagram can help us understanding theclassiﬁcation of events (Figure ??). A B i iii ii iv ΩFigure 1: Example of Venn diagram representing two diﬀerent events in a sample space Ω. Everypossible outcome is assigned to an appropriate region. For example, when throwing a six sided die, Ω = {1, 2, 3, 4, 5, 6}the event “getting an odd result” is A ⊂ Ω; A = {2, 4, 6}or for the sample space of all heights of a human population, we may decide to have two events:greater than 180cm (event A) or less than or equal to 180cm (event B).Probability 2
3. 3. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences In the case of the six-sided die, Figure ?? represents two diﬀerent events: • A is the event representing that the number can be divided by 2 • B represents the event that the number can be divided by 3 A B 2, 4 6 3 1, 5 ΩFigure 2: Example of Venn diagram representing two diﬀerent events in a sample space Ω. Everypossible outcome is assigned to an appropriate region. Figure ?? shows how diﬀerent regions can be described by a simple inspection of the Venndiagram. For example, two events are called disjoint or mutually exclusive if their intersection isthe empty event, ∅. Thus, in the case of the six-sided die: A B = {2, 3, 4, 6}. Note too that ¯ ¯ ¯A = Ω − A = {1, 3, 5}, that A A = Ω and that A A = ∅. Using the Venn diagrams, it is easy to see how the following properties hold:commutative A B=B A, and A B=B A;associative (A B) C=A (B C), and (A B) C=A (B C);distributive A (B C) = (A B) (A C) and A (B C) = (A B) (A C); and, ﬁnallyidempotency A A = A and A A = A. Exercise 1show that A (A B) = A (A B) = A and that (A − B) (A B) = A. From the Venn diagrams we can also see how ¯ ¯ • if A ⊂ B then A ⊃ B; • A ¯ B=A ¯ B; • A ¯ B=A ¯ BProbability 3
4. 4. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences A B A B Ω Ω (a) (b) A ¯ A A B Ω Ω (c) (d)Figure 3: The shaded regions show (a) A B, the intersection between two events; (b) A B, the ¯union; (c) the complement A of an event; and (d) A − B, those outcomes in A that do not belongto B.2 ProbabilityPr(A) is deﬁned as the expected relative frequency of event A in a large number of trials. If anexperiment has a total number of possible outcomes nΩ , and nA of them correspond to event A,then: nA Pr(A) = nΩ From this deﬁnition we can deduce a number of properties: 1. For any event A in a sample space Ω, 0 ≤ Pr(A) ≤ 1. nΩ 2. For the entire sample space Ω, Pr(Ω) = nΩ = 1. 3. For any two events A and B in Ω, [Pr(A B) = Pr(A) + Pr(B) − Pr(A B). (1) ¯ 4. (complement law) Pr(A) = 1 − Pr(A).Probability 4
5. 5. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences Exercise 2Calculate the probability of drawing an ace or a spade from a pack of cards. Exercise 3 A biased six-sided die has probabilities 3p, 2p, p/2, p, p, 3p of showing 1, 2, 3, 4, 5, 6, respectively.Find p Eq. ?? above can be generalized saying: Pr(A1 A2 ··· An ) = Pr(Ai ) − Pr(Ai Aj ) + Pr(Ai Aj Ak ) i i,j i,j,k − · · · + (−1)n+1 Pr(A1 A2 ··· An ) Exercise 4Find the probability of drawing from a pack a card that has at least one of the following properties: • it is an ace • it is a spade • it is a black honour card (ace, king, queen, jack or 10) • it is a black ace Nota de classe comencem per les probabilitats de cada event individual: 4/52, 13/52, 10/52, 2/52. Desprstrobem les interseccions entre dos events, desprs entre tres, i ﬁnalment totes a l’hora (que resultaser 1/52. La frmula ens dna, ﬁnalment, P=20/52 Exercise 5 John and Mary are taking a mathematics course. The course has only three grades: A, B, and C.The probability that John gets a B is .3. The probability that Mary gets a B is .4. The probabilitythat neither gets an A but at least one gets a B is .1. What is the probability that at least one getsa B but neither gets a C? Nota de classe Let the sample space be: ω1 = {A, A} ω4 = {B, A} ω7 = {C, A} ω2 = {A, B} ω5 = {B, B} ω8 = {C, B} ω3 = {A, C} ω6 = {B, C} ω9 = {C, C}Probability 5
6. 6. MAT: 2011-31035-T1 MSc Bioinformatics for Health Scienceswhere the ﬁrst grade is John’s and the second is Mary’s. You are given that P (ω4 ) + P (ω5 ) + P (ω6 ) = .3, P (ω2 ) + P (ω5 ) + P (ω8 ) = .4, P (ω5 ) + P (ω6 ) + P (ω8 ) = .1.Adding the ﬁrst two equations and subtracting the third, we obtain the desired probability as P (ω2 ) + P (ω4 ) + P (ω5 ) = .6. Finally, the total probability law states that for a series of mutually exclusive events Ai , Pr(B) = Pr(Ai B) i3 Conditional probability and Bayes’ theorem3.1 Conditional probabilitySuppose we assign a distribution function to a sample space and then learn that an event E hasoccurred. How should we change the probabilities of the remaining events? We shall call the newprobability for an event F the conditional probability of F given E and denote it by P (F |E). This is the same than wishing to know the probability of an event given another one. FromPr(A B) = Pr(A)Pr(B|A), we obtain (see, Figure ??): Pr(B A) Pr(B|A) = (2) Pr(A) We can formalize this result in the following manner. Let Ω = {ω1 , ω2 , . . . , ωr } be the originalsample space with distribution function m(ωj ) assigned. Suppose we learn that the event E hasoccurred. We want to assign a new distribution function m(ωj |E) to Ω to reﬂect this fact. Clearly,if a sample point ωj is not in E, we want m(ωj |E) = 0. Moreover, in the absence of informationto the contrary, it is reasonable to assume that the probabilities for ωk in E should have the samerelative magnitudes that they had before we learned that E had occurred. For this we require that m(ωk |E) = cm(ωk )for all ωk in E, with c some positive constant. But we must also have m(ωk |E) = c m(ωk ) = 1 . E EThus, 1 1 c= = . E m(ωk ) P (E)Probability 6
7. 7. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences A B n = 10 n = 3 n = 4 n = 13 ΩFigure 4: Conditional probability: evaluate Pr(A B) = 3/30 is equivalent to the product ofprobabilities Pr(B) = 7/30 times Pr(A|B) = 3/7, or the product of Pr(A) = 13/30 times Pr(B|A) =3/13(Note that this requires us to assume that P (E) > 0.) Thus, we will deﬁne m(ωk ) m(ωk |E) = P (E)for ωk in E. We will call this new distribution the conditional distribution given E. For a generalevent F , this gives m(ωk ) P (F ∩ E) P (F |E) = m(ωk |E) = = . P (E) P (E) F ∩E F ∩E We call P (F |E) the conditional probability of F occurring given that E occurs, and compute itusing the formula P (F ∩ E) P (F |E) = . P (E) Exercise 6 An experiment consists of rolling a die once. If the die has rolled and we are said it was bigger than4, what is the probability of being 5? Nota de classe Let X be the outcome. Let F be the event {X = 6}, and let E be the event {X > 4}. We assignthe distribution function m(ω) = 1/6 for ω = 1, 2, . . . , 6. Thus, P (F ) = 1/6. Now suppose that thedie is rolled and we are told that the event E has occurred. This leaves only two possible outcomes:5 and 6. In the absence of any other information, we would still regard these outcomes to be equallylikely, so the probability of F becomes 1/2, making P (F |E) = 1/2.Probability 7
8. 8. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences Exercise 7 In a Life Table, one ﬁnds that in a population of 100,000 females, 89.835% can expect to live to age60, while 57.062% can expect to live to age 80. Given that a woman is 60, what is the probabilitythat she lives to age 80? Nota de classe This is an example of a conditional probability. In this case, the original sample space can bethought of as a set of 100,000 females. The events E and F are the subsets of the sample spaceconsisting of all women who live at least 60 years, and at least 80 years, respectively. We consider Eto be the new sample space, and note that F is a subset of E. Thus, the size of E is 89,835, and thesize of F is 57,062. So, the probability in question equals 57,062/89,835 = .6352. Thus, a womanwho is 60 has a 63.52% chance of living to age 80. In terms of the Venn diagram, we can consider Pr(B|A) to be the probability of B in the reducedsample space deﬁned by A. Thus, for two mutually exclusive events, Pr(A|B) = 0 = Pr(B|A). When randomly drawing an object from the set of total object we say we are sampling the space,and we can do so with replacement or without replacement. Example We have two urns, I and II. Urn I contains 2 black balls and 3 white balls. Urn II contains 1 blackball and 1 white ball. An urn is drawn at random and a ball is chosen at random from it. We canrepresent the sample space of this experiment as the paths through a tree as shown in Figure ??.The probabilities assigned to the paths are also shown. Let B be the event “a black ball is drawn,” and I the event “urn I is chosen.” Then the branchweight 2/5, which is shown on one branch in the ﬁgure, can now be interpreted as the conditionalprobability P (B|I). Suppose we wish to calculate P (I|B). Using the formula, we obtain P (I ∩ B) P (I|B) = P (B) P (I ∩ B) = P (B ∩ I) + P (B ∩ II) 1/5 4 = = . 1/5 + 1/4 9Another way to visualize this is by drawing the table: w b I 3/10 1/5 II 1/4 1/4Probability 8
9. 9. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences Figure 5: Tree diagram.which represents the probabilities for the events in the sample space Ω: 3/10 + 1/5 + 1/4 + 1/4 = 1. Exercise 8 Find the probability of drawing two aces if (i) we replace the card after the ﬁrst draw or (ii) we donot. Nota de classe en el primer cas tenim events estad´ ısticament independents (Pr(A B) = Pr(A)Pr(B)), i en elsegon no ho s´n i cal usar l’equaci´ ??. En general, cal pensar en dos events successius com un event o oen forma de dupla. The above exercise shows what are two statistically independent events A and B. This willoccur when Pr(A|B) = Pr(A) or, equivalently, Pr(B|A) = Pr(B) (or at least one of the events hasprobability 0). It can be shown that, for two independent events: Pr(A B) = Pr(A)Pr(B). A set of events {A1 , A2 , . . . , An } is said to be mutually independent if for any subset {Ai , Aj , . . . , Am }of these events we have P (Ai ∩ Aj ∩ · · · ∩ Am ) = P (Ai )P (Aj ) · · · P (Am ), ¯ ¯ ¯ ¯ ˜or equivalently, if for any sequence A1 , A2 , . . . , An with Aj = Aj or Aj , ¯ ¯ ¯ ¯ ¯ ¯ P (A1 ∩ A2 ∩ · · · ∩ An ) = P (A1 )P (A2 ) · · · P (An ).It is natural to ask: If all pairs of a set of events are independent, is the whole set mutuallyindependent? The answer is not necessarily.Probability 9
10. 10. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences Exercise 9A coin is tossed twice. Consider the following events.A: Heads on the ﬁrst toss.B: Heads on the second toss.C: The two tosses come out the same. 1. Show that A, B, C are pairwise independent but not independent. 2. Show that C is independent of A and B but not of A ∩ B. Nota de classe • We have P (A ∩ B) = P (A ∩ C) = P (B ∩ C) = 1/4 P (A)P (B) = P (A)P C) = P (B)P (C) = 1/4 P (A ∩ B ∩ C) = 1/4 = P (A)P (B)P (C) = 1/8 • We have P (A ∩ C) = P (A)P (C) = 1/4 so C and A are independent, P (C ∩ B) = P (B)P (C) = 1/4 so C and B are independent, P (C ∩ (A ∩ B)) = 1/4 = P (C)P (A ∩ B) = 1/8 so C and A ∩ B are not independent. It is important to note that the statement P (A1 ∩ A2 ∩ · · · ∩ An ) = P (A1 )P (A2 ) · · · P (An )does not imply that the events A1 , A2 , . . . , An are mutually independent (see exercise ??). Exercise 10Let Ω = {a, b, c, d, e, f }. Assume that m(a) = m(b) = 1/8 and m(c) = m(d) = m(e) = m(f ) = 3/16.Let A, B, and C be the events A = {d, e, a}, B = {c, e, a}, C = {c, d, f }. Show that P (A ∩ B ∩ C) =P (A)P (B)P (C) but no two of these events are independent.Probability 10
11. 11. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences Even more, if A is the union of n mutually exclusive events: Pr(A B) = Pr(Ai B) i Pr(A|B) = Pr(Ai |B) iwhich is the addition law for conditional probabilities. If the set of Ai ’s exhaust Ω, then Pr(B) = Pr(Ai )Pr(B|Ai ) (3) i Exercise 11 Suppose that we have a coin which comes up heads with probability p, and tails with probabilityq. Now suppose that this coin is tossed twice. Let E be the event that heads turns up on the ﬁrsttoss and F the event that tails turns up on the second toss. Show how P (E ∩ F ) = P (E)P (F ) Nota de classe Using a frequency interpretation of probability, it is reasonable to assign to the outcome (H, H)the probability p2 , to the outcome (H, T ) the probability pq, and so on. We will now checkthat with the above probability assignments, these two events are independent, as expected. Wehave P (E) = p2 + pq = p, P (F ) = pq + q 2 = q. Finally P (E ∩ F ) = pq, so P (E ∩ F ) =P (E)P (F ). Nota de classe It is often, but not always, intuitively clear when two events are independent. In the aboveexample, let A be the event “the ﬁrst toss is a head” and B the event “the two outcomes are thesame.” Then P (B ∩ A) P {HH} 1/4 1 P (B|A) = = = = = P (B). P (A) P {HH,HT} 1/2 2Therefore, A and B are independent, but the result was not so obvious.3.2 Bayes’ (inverse) probabilityFrom the fact that the probability of both A and B occurring, that can be calculated either byPr(A)Pr(B|A) or Pr(B)Pr(A|B), we obtain the Bayes theorem (in other words, ﬁnding the inverseprobability): Pr(A) Pr(A|B) = Pr(B|A) (4) Pr(B) This theorem also shows that Pr(B|A) = Pr(A|B), unless Pr(A) = Pr(B).Probability 11
12. 12. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences Figure 6: Reverse tree diagram. Nota de classe en l’exercici segent, usar la taula extreta de l’arbre en l’exercici ?? Example For the example of the white and black balls, the paths through the reverse tree (Figure ??) arein one-to-one correspondence with those in the forward tree, since they correspond to individualoutcomes of the experiment, and so they are assigned the same probabilities. From the forward tree,we ﬁnd that the probability of a black ball is 1 2 1 1 9 · + · = . 2 5 2 2 20 The probabilities for the branches at the second level are found by simple division. For example,if x is the probability to be assigned to the top branch at the second level, we must have 9 1 ·x= 20 5or x = 4/9. Thus, P (I|B) = 4/9, in agreement with our previous calculations. The reverse treethen displays all of the inverse, or Bayes, probabilities. ¯ ¯ If we take into account that Pr(B) = Pr(A)Pr(B|A)+Pr(A)Pr(B|A) (see Eq. ??), we can rewritethe above formula as: Pr(A) Pr(A|B) = ¯ Pr(B|A) (5) ¯ Pr(A)Pr(B|A) + Pr(A)Pr(B|A)Probability 12
13. 13. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences Following we give a more general deduction, for the case of more than one alternative event. We return now to the calculation of more general Bayes probabilities. Suppose we have a set ofevents H1 , H2 , . . . , Hm that are pairwise disjoint and such that the sample space Ω satisﬁes theequation Ω = H1 ∪ H2 ∪ · · · ∪ Hm .We call these events hypotheses. We also have an event E that gives us some information aboutwhich hypothesis is correct. We call this event evidence. Before we receive the evidence, then, we have a set of prior probabilities P (H1 ),P (H2 ), . . . , P (Hm ) for the hypotheses. If we know the correct hypothesis, we know the probabilityfor the evidence. That is, we know P (E|Hi ) for all i. We want to ﬁnd the probabilities for thehypotheses given the evidence. That is, we want to ﬁnd the conditional probabilities P (Hi |E).These probabilities are called the posterior probabilities. To ﬁnd these probabilities, we write them in the form P (Hi ∩ E) P (Hi |E) = . (6) P (E)We can calculate the numerator from our given information by P (Hi ∩ E) = P (Hi )P (E|Hi ) . (7)Since one and only one of the events H1 , H2 , . . . , Hm can occur, we can write the probability of Eas P (E) = P (H1 ∩ E) + P (H2 ∩ E) + · · · + P (Hm ∩ E) .Using Equation ??, the above expression can be seen to equal P (H1 )P (E|H1 ) + P (H2 )P (E|H2 ) + · · · + P (Hm )P (E|Hm ) . (8)Using (??), (??), and (??) yields Bayes’ formula: P (Hi )P (E|Hi ) P (Hi |E) = m . k=1 P (Hk )P (E|Hk ) Although this is a very famous formula, we will rarely use it. If the number of hypotheses issmall, a simple tree measure calculation is easily carried out, as we have done in our examples. Ifthe number of hypotheses is large, then we should use a computer. Bayes probabilities are particularly appropriate for medical diagnosis. A doctor is anxious toknow which of several diseases a patient might have. She collects evidence in the form of theoutcomes of certain tests. From statistical studies the doctor can ﬁnd the prior probabilities of thevarious diseases before the tests, and the probabilities for speciﬁc test outcomes, given a particulardisease. What the doctor wants to know is the posterior probability for the particular disease, giventhe outcomes of the tests. In other words, here: Diseases = Hi , and T estsOutcome = E.Probability 13
14. 14. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences Number having The results Disease this disease + + + – – + – – d1 3215 2110 301 704 100 d2 2125 396 132 1187 410 d3 4660 510 3568 73 509 Total 10000 Table 1: Diseases data. d1 d2 d3 + + .700 .131 .169 + – .075 .033 .892 – + .358 .604 .038 – – .098 .403 .499 Table 2: Posterior probabilities. Example A doctor is trying to decide if a patient has one of three diseases d1 , d2 , or d3 . Two tests are tobe carried out, each of which results in a positive (+) or a negative (−) outcome. There are fourpossible test patterns ++, +−, −+, and −−. National records have indicated that, for 10,000 peoplehaving one of these three diseases, the distribution of diseases and test results are as in Table ??. From this data, we can estimate the prior probabilities for each of the diseases and, given aparticular disease, the probability of a particular test outcome (Pr(E|Hi )). For example, the priorprobability of disease d1 may be estimated to be 3215/10,000 = .3215. The probability of the testresult +−, given disease d1 , may be estimated to be 301/3215 = .094. We can now use Bayes’ formula to compute various posterior probabilities. The results for thisexample are shown in Table ??. We note from the outcomes that, when the test result is ++, the disease d1 has a signiﬁcantlyhigher probability than the other two. When the outcome is +−, this is true for disease d3 . Whenthe outcome is −+, this is true for disease d2 . Note that these statements might have been guessedby looking at the data. If the outcome is −−, the most probable cause is d3 , but the probabilitythat a patient has d2 is only slightly smaller. If one looks at the data in this case, one can see thatit might be hard to guess which of the two diseases d2 and d3 is more likely. Our ﬁnal example shows that one has to be careful when the prior probabilities are small. Example A doctor gives a patient a test for a particular cancer. Before the results of the test, the onlyevidence the doctor has to go on is that 1 woman in 1000 has this cancer. Experience has shownthat, in 99 percent of the cases in which cancer is present, the test is positive; and in 95 percentof the cases in which it is not present, it is negative. If the test turns out to be positive, whatprobability should the doctor assign to the event that cancer is present? An alternative form of thisquestion is to ask for the relative frequencies of false positives and cancers.Probability 14
15. 15. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences Figure 7: Forward and reverse tree diagrams. We are given that prior(cancer) = .001 and prior(not cancer) = .999. We know also thatP (+|cancer) = .99, P (−|cancer) = .01, P (+|not cancer) = .05, and P (−|not cancer) = .95. Us-ing this data gives the result shown in Figure ??. We see now that the probability of cancer given a positive test has only increased from .001 to.019. While this is nearly a twenty-fold increase, the probability that the patient has the cancer isstill small. Stated in another way, among the positive results, 98.1 percent are false positives, and1.9 percent are cancers. When a group of second-year medical students was asked this question,over half of the students incorrectly guessed the probability to be greater than .5. Exercise 12 Suppose that a blood test for a disease gives a false negative of 0.01% and a false positive of 0.02%.In general, one person in 10000 individuals of the population is infected. What is the probabilitythat, taking an individual from the population and tested positive for the disease, he/she has thedisease, indeed? Nota de classe ¯ cal usar l’equaci´ ?? i substituir-hi: Pr(A) = 1/10000 = 1 − Pr(A), a part de Pr(B|A) = o ¯9999/10000 i Pr(B|A) = 2/10000Probability 15
16. 16. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences4 Permutations and combinationsWe need to be able to count the number of possible outcomes in various common situations: n n! Pk = (9) (n − k)! n n n! Ck = = (10) k (n − k)!k! Exercise 13Find the probability that in a group of k people at least two have the same brithday Nota de classe 365 365! el nombre de cops en els quals l’aniversari ´s diferent ´s: e e Pk = 365−k)! i la probabilitat de que 365!tots els aniversaris siguin diferents ´s: p = 365−k)!365k e Exercise 14 A hand of 13 playing cards is dealt from a well-shuﬄed pack of 52. What is the probability thatthe hand contains two aces? Nota de classe com que no ens importa l’ordre de les cartes, el nombre total de mans diferents s simplement elnombre de combinacions de 13 objectes agafats d’una pila de 52: 52 C13 . No obstant, el nombre demans que contenen 2 asos s igual al producte del nombre de cops que es poden obtenir dels 4 que hiha, 4 C2 multiplicat pel nombre de cops 48 C11 que les restants 11 cartes a la m es poden obtenir deles 48 que no tenen asos: 4 C2 48 C11 52 C = 0.213 135 Random Variables and distributionsThe outcome of an experiment need not be a number, for example, the outcome when a coin istossed can be ’heads’ or ’tails’. However, we often want to represent outcomes as numbers. Arandom variable is a function that associates a unique numerical value with every outcome of anexperiment. The value of the random variable will vary from trial to trial as the experiment isrepeated. Thus, it is possible to assign a probability distribution to any random variable. Randomvariables can be discrete or continuous.Probability 16
17. 17. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences5.1 Discrete random variablesA random variable X that takes only discrete values x1 , x2 , . . . , xn with probabilities p1 , p2 , . . . , pn .A probability distribution or probability function (PF) f (x) assigns probabilities to all the distinctvalues that X may take: pi if x = xi f (x) = Pr(X = x) = (11) 0 otherwise It is obvious that for a discrete random variable n f (xi ) = 1 i=1 We may also deﬁne a cumulative probability function (CPF), F (x), as F (x) = Pr(X ≤ x) = f (xi ) xi ≤xwhich can help obtaining, eg, the probability of getting X lying between two values: Pr(l1 < X ≤ l2 ) = f (xi ) = F (l2 ) − F (l1 ) l1 <xi ≤l2 Exercise 15 A bag contains seven red balls and three white balls. Three balls are drawn at random and notreplaced. Find the probability function for the number of red balls drawn. Nota de classe fer un tree, i immediatament: 3 2 1 1 Pr(X = 0) = f (0) = × × = 10 9 8 120 3 2 7 7 Pr(X = 1) = f (1) = × × ×3= 10 9 8 40 3 7 6 21 Pr(X = 2) = f (2) = × × ×3= 10 9 8 40 7 6 5 7 Pr(X = 3) = f (3) = × × = 10 9 8 245.2 Continuous random variables(See [?] for an example on the analysis of fat tailed distributions of biological networks)Probability 17
18. 18. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences X has a continuous distribution if it is deﬁned for a continuous range of values between givenlimits l1 and l2 (including −∞ to ∞). In this case, the probability density function (PDF) f(x) isdeﬁned as Pr(x < X ≤ x + dx) = f (x)dxIt is also obvious here than l2 f (x)dx = 1 l1The cumulative density function (CDF), F (X), for a continuous random variable is then written as: x F (X) = Pr(X ≤ x) = f (u)du l1and Pr(a < X ≤ b) = F (b) − F (a) Exercise 16 A random variable X has a PDF f (x) = A exp{−x} in ]0, ∞[ and zero elsewhere. Find theprobability that X lies in ]1, 2]. Nota de classe troba primer A normalitzant (integrant entre 0 i ∞. Exercise 17Use a Monte Carlo procedure to ﬁnd the area below the curve y = x2 in the interval [0, 1]. Example A real number is chosen at random from [0, 1] with uniform probability, and then this number issquared. Let X represent the result. What is the cumulative distribution function of X? What isthe density of X? We begin by letting U represent the chosen real number. Then X = U 2 . If 0 ≤ x ≤ 1, then wehave F (X) = P (X ≤ x) = P (U 2 ≤ x) √ = P (U ≤ x) √ = x.Probability 18
19. 19. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences Figure 8: Distribution and density for X = U 2 . It is clear that X always takes on a value between 0 and 1, so the cumulative distribution functionof X is given by   0, √ if x ≤ 0, F (X) = x, if 0 ≤ x ≤ 1, 1, if x ≥ 1.  From this we easily calculate that the density function of X is   0, √ if x ≤ 0, fX (x) = 1/(2 x), if 0 ≤ x ≤ 1, 0, if x > 1. Note that F (X) is continuous, but fX (x) is not. (See Figure ??.)5.3 Joint probability distributionsWe need to deﬁne them when we consider sets of random variables, not necessarily independent onefrom each other. We say then that for, for example, two random variables X and Y , one may deﬁnea joint probability density function by (discrete variables) Pr(X = xi , Y = yj ) = f (xi , yj )or (continuous variables) Pr(x < X ≤ x + dx, y < Y ≤ y + dy) = f (x, y)dxdyIf the variables are independent with respective probability distributions g(x) and h(x), Pr(X = xi , Y = yj ) = g(xi )h(yj ) Pr(x < X ≤ x + dx, y < Y ≤ y + dy) = g(x)h(x)dxdyProbability 19
20. 20. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences Figure 9: Solution for exercise ?? Exercise 18 The independent random variables X and Y have the PDFs g(x) = e−x and h(y) = 2e−2y ,respectively. Calculate the probability that X lies in the interval 1 < X ≤ 2 and Y lies in theinterval 0 <≤ 1. Nota de classe simplement caldr` integrar l’equaci´ anterior, donat que parlem de variables independents a o Exercise 19 A ﬂat table is ruled with parallel straight lines a distance D apart, and a thin needle of lengthl < D is tossed onto the table at random. What is the probability that the needle will cross a line?(see http://en.wikipedia.org/wiki/Buffon%27s_needle for a solution). Nota de classe pgina 1040 de [?] (ﬁg. ??)Probability 20
21. 21. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences6 Properties of distributions6.1 Expectation valueThe average or expectation value E[g(X)] of any function g(X) of the random variable X thatfollows a distribution f (x) is deﬁned as i g(xi )f (xi ) for a discrete distribution, E[g(X)] = ∞ −∞ g(x)f (x)dx for a continuous distributionThe extension to a joint distribution is immediate. Thus, for any probaility density function f (x, y),the expectation value of any function g(X, Y ) of the random vairables X and Y is given by:: i j g(xi , yi )f (xi , yi ) for a discrete distribution, E[g(X, Y )] = ∞ ∞ −∞ −∞ g(x, y)f (x, y)dxdy for a continuous distribution It easily follows that the expectation value fullﬁlls these properties: 1. if a is a constant, E[a] = a; 2. if a is a constant, then E[ag(X)] = aE[g(X)]; 3. if g(X) = s(X) + t(X) then E[g(X)] = E[s(X)] + E[t(X)]6.2 MeanThe mean of a random variable X is simply the expectation value of itself: µ =< x >= E[X] = i xi f (xi ) for a discrete distribution, xf (x)dx for a continuous distribution Exercise 20 The probability of ﬁnding a 1s electron in a hydrogen atom in a given inﬁnitessimal volume dV isψ ∗ ψdV , where the quantum mechanical wavefunction ψ is given by ψ = A exp −r/a0. Find the value of the real constant A and deduce the mean distance of the electron from the origin. Nota de classe Let us consider the random variable R being the distance of the electron from the origin. The1s orbital is spherically symmetric, we may consider the inﬁnitessimal volume element dV as theProbability 21
22. 22. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciencesspherical shell with inner radius r and outer radius r + dr. Thus, dV = 4πr2 dr and the PDF of Ris simply: Pr(r < R ≤ r + dr) = f (r)dr = 4πr2 A2 e−2r/a0 drThe value of A is obtained by requiring the total probability to be 1. By parts we obtain A =1/(πa3 )1/2 . Finally, by using the deﬁnition of the expectation value, we obtain: 0 ∞ ∞ 4 3a0 E[R] = rf (r)dr = r3 e−2r/a0 dr = 0 a3 0 0 26.3 Mode and medianMode value of the random variable X at which the probability (density) reaches its maximum. If this occurs more than once, all maximums are equally considered the mode.Median value of the random variable X at which the cumulative probability (density) function F (x) takes value 1/2. Related to it are the lower and upper quatiles Q1 and Qu , deﬁned as F (Q1 ) = 1/4 and F (Qu ) = 3/4. Exercise 21 Find the mode of the PDF for the distance from the origin of the electron whose wavefunction wasgiven in Exercise ??. Nota de classe noms cal derivar la f(r) de l’exercici anterior respecte r6.4 Variance and standard deviation; momentsThe variance V [X] of a distribution, also written σ 2 , is deﬁned as − µ)2 f (xj ) for a discrete distribution j (xj V [X] = E[(X − µ)2 ] = (12) (x − µ)2 f (x)dx for a continuous distributionThe standard deviation, σ is the square root of the variance, and, roughly speaking, shows the spreadof the distribution. From the deﬁnition of the variance it follows that: • V [a] = 0, • V [aX + b] = a2 V [X]Probability 22
23. 23. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences Exercise 22 Find the standard deviation of the PDF for the distance from the origin of the electron whosewavefunction was discussed above. Nota de classe pgina 988 de [?]. Basicament nomes cal aplicar la frmula, i s’obt V [R] = 3a2 /4 0 Using the deﬁntion of the variance can also be shown that (Bienaym-Chebyshev inequality) σ2 Pr(|X − µ| ≥ c) ≤ c2which is very useful to determine the amount of “space” outside a given interval around the mean.For example, 1 Pr(|X − µ| ≥ 2σ) ≤ 4and 1 Pr(|X − µ| ≥ 3σ) ≤ 9which holds for any distribution f (x) that possesses a variance. Deﬁning the mean as the ﬁrst moment of X, as it is deﬁned as the sum or the integral of theprobability density function multiplied by the ﬁrst power of x, we can extend it to other moments: xk f (xi ) for a discrete distribution, µk = E[X k ] = i i k x f (x)dx for a continuous distributionIt is easy to see that the variance can be expressed in terms of the second moment: V [X] = E[X 2 ] − µ2 Exercise 23 A biased die has probabilities p/2, p, p, p, 2p of showing 1, 2, 3, 4, 5, 6 respectively. Find the mean,the second moment and the variance of the probability distribution. Nota de classe veure ﬁgura ?? de la pgina 990 de [?] In an analogous way we deﬁne moments we also deﬁne central moments − µ)k f (xi ) for a discrete distribution, i (xi νk = E[(X − µ)k ] = (x − µ)k f (x)dx for a continuous distributionProbability 23
24. 24. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences Figure 10: Solution for exercise ??7 Functions of random variablesSometimes we are more interested in functions that are derived from a given random variable forwhich the PDF is known, f (x). The question is how to ﬁnd the PDF for the related random variableY = Y (X), where Y (X) is some function of X.7.1 Discrete random variablesThe probability function for Y is given by j f (xj ) if y = yi g(y) = 0 otherwisewhere the sum extends over those values of j for which yi = Y (xj ). This is easy to evaluate just inthe case that the function Y (X) possesses a single-valued inverse X(Y ).7.2 Continuous random variablesIn this case, we obtain dx g(y) = f (x(y)) dyAgain, this is easy to evalate if Y (X) possesses a single valued inverse, but not so easy otherwise.However, in general, a closed form expression may be obtained in the case where there exist singleProbability 24
25. 25. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences Figure 11: Solution for exercise ??valued functions x1 (y) and x2 (y) giving the two values of x that correspond to any given value of y.Thus: dx1 dx2 g(y) = f (x1 (y)) + f (x2 (y)) dy dy Exercise 24 The random variable X is Gaussian distributed with mean µ and variance σ 2 . Find the PDF ofthe new variable Y = (X − µ)2 /σ 2 . Nota de classe veure ﬁgura ?? de [?]8 Important distributions8.1 Discrete distributions8.1.1 Uniform distributionThe uniform distribution on a sample space Ω containing n elements is the function m deﬁned by 1 m(ω) = nfor every ω ∈ Ω.Probability 25
26. 26. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences Example Consider the experiment that consists of rolling a pair of dice. We take as the sample space Ω theset of all ordered pairs (i, j) of integers with 1 ≤ i ≤ 6 and 1 ≤ j ≤ 6. Thus, Ω = { (i, j) : 1 ≤ i, j ≤ 6 } .To determine the size of Ω, we note that there are six choices for i, and for each choice of i thereare six choices for j, leading to 36 diﬀerent outcomes. Let us assume that the dice are not loaded.In mathematical terms, this means that we assume that each of the 36 outcomes is equally likely, orequivalently, that we adopt the uniform distribution function on Ω by setting 1 m((i, j)) = , 1 ≤ i, j ≤ 6 . 36What is the probability of getting a sum of 7 on the roll of two dice—or getting a sum of 11? Theﬁrst event, denoted by E, is the subset E = {(1, 6), (6, 1), (2, 5), (5, 2), (3, 4), (4, 3)} .A sum of 11 is the subset F given by F = {(5, 6), (6, 5)} .Consequently, 1 1 P (E) = ω∈E m(ω) = 6 · 36 = 6 , 1 1 P (F ) = ω∈F m(ω) = 2 · 36 = 18 . What is the probability of getting neither snakeeyes (double ones) nor boxcars (double sixes)?The event of getting either one of these two outcomes is the set E = {(1, 1), (6, 6)} .Hence, the probability of obtaining neither is given by ˜ 2 17 P (E) = 1 − P (E) = 1 − = . 36 18 Example Suppose two pennies are ﬂipped once each. There are several “reasonable” ways to describe thesample space. One way is to count the number of heads in the outcome; in this case, the samplespace can be written {0, 1, 2}. Another description of the sample space is the set of all ordered pairsof H’s and T ’s, i.e., {(H, H), (H, T ), (T, H), (T, T )}.Both of these descriptions are accurate ones, but it is easy to see that (at most) one of these, ifassigned a constant probability function, can claim to accurately model reality. In this case, asopposed to the preceding example, the reader will probably say that the second description, witheach outcome being assigned a probability of 1/4, is the “right” description. This conviction is dueto experience; there is no proof that this is the way reality works.Probability 26
27. 27. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences8.1.2 Binomial distributionThe binomial distribution describes processes that consider a number of independent identical trials ¯with two possible outcomes, A and B = A (success and failure). X is the number of times A occurswhen doing n trials. The binomial distribution takes the form: f (x) = Pr(X = x) =n Cx px q n−x =n Cx px (1 − p)n−x (13) Exercise 25 If a single six-sided die is rolled ﬁve times, what is the probability that a six is thrown exactly threetimes? One can show that, for the binomial distribution: n n f (x) = .n Cx px q n−x = (p + q)n = 1 x=0 x=0and that E[X] = np and V [X] = npq.8.2 A continuous distribution: GaussianThe PDF for a Gaussian distribution of a random variable X, with mean µ and variance σ 2 takesthe form: 2 1 1 x−µ f (x) = √ exp − σ 2π 2 σ Exercise 26Find the inﬂection points for the Gaussian distribution Let us consider the random variable Z = (X − µ)/σ, which PDF takes the form: 1 z2 φ(z) = √ exp − 2π 2This is called the standard Gaussian distribution with µ = 0 and σ 2 = 1, being Z the standardvariable. The cumulative probability function for the standard Gaussian distribution can be evaluatedwith z 1 u2 Φ(z) = Pr(Z < z) = √ exp − du 2π −∞ 2Probability 27
28. 28. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences a22 s2 a21 a32 a13 a11 s1 s3 a33 a51 a34 a41 a35 a54 s5 s4 a45 a55 a44 Figure 12: An example of a Markov model with 5 states and diﬀerent transition probabilities 9 Markov chains A Markov model is a probabilistic process over a ﬁnite set of states, s1 , s2 , ..., sN . Each state transition generates a character from the alphabet of the process. The probability of each state coming up next is denoted by Pr(qt = sj ), where qt is the actual state at time t. Depending on whether such probability depends or not on previous states, we distinguish Markov models of diﬀerent orders:Order 0 which have no memory: Pr(qt = sj |qt−1 = si , qt−2 = sk , . . .) = Pr(qt = sj ) = Pr(qt = sj ) for all points t and t in the sequence. It is equivalent to a multinomial probability distribution (see Eq. ?? for a case in which would just have two possible states.Order 1 having memory of size 1: Pr(qt = sj |qt−1 = si , qt−2 = sk , . . .) = Pr(qt = sj |qt−1 = si ). Such models can be rationalized as N order 0 Markov models, one for each previous state si .Order m In general, m is called the length or context upon which the probabilities of the possible values of the next step depend. Here we will just consider order 1 processes in which the right hand side of the above expres- sions is independent of time (time-homogeneous processes), thereby leading to the set of transition probabilities aij of the form: aij = Pr(qt = sj |qt−1 = si ), 1 ≤ i, j ≤ N N with aij ≥ 0 and j=1 aij = 1. See Figure ?? for a graphical example. The above stochastic process is an observable Markov model, since the output of the process is the set of states at each time, where each state is a physical observable event. Probability 28
29. 29. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences A Markov chain is a random process where all information about the future is contained in thepresent state (Markov model of order 1). The probability of going from state i to state j in n states is given by (n) aij = Pr(qt = sj |q0 = si )so (n) (k) (n−k) aij = air arj r∈Swhere S is the state space of the Markov chain. A state j is accessible from a state i (i → j) if a a system started in state i has non-zeroprobability of reaching j at some point: (n) Pr(qt = sj |q0 = si ) = aij > 0 A state i communicates with state j if both i → j and j → i. Then we can deﬁne communicatingclass as sets of states that communicate each other but not with the exterior and closed classes whenthere is no probability of leaving the class. A Markov chain is said to be irreductible if S is a singlecommunicating class: if it is possible to get to any state from any state. A state si has period k if any return to it occurs in multiples of k (greatest common divisor). Astate is aperiodic if k = 1. A state si is transient if, given that we start at si , there is a non-zero probability that we willnever return to si . Formally: Pr(Ti = ∞) > 0being Ti a random variable that represents the ﬁrst return time to state si (the ”hitting time”). Onthe contrary, a state is said to be recurrent or persistent if not transient and positive recurrent if alsoexpected value for the return time is ﬁnite: Mi = E[Ti ] Finally, a state is called absorbing if it is not possible to leave it: aii = 1 and, thus, aij = 0 forall i = j. An important deﬁnition is that of an ergodic state: aperiodic and positive recurrent (all can bereached at a ﬁnite time). If all states in a Markov chain are ergodic, the chain is said to be ergodic.9.1 Steady state analysis of Markov chainsIn a time-homogeneous Markov chain the vector π is called a stationary distribution (or invariantmeasure) if πj = 1 and πj = aij i∈SProbability 29
30. 30. MAT: 2011-31035-T1 MSc Bioinformatics for Health SciencesAn irreductible chain has a stationary distribution if and only if all of its states are positive recurrent.In that case, π is unique and is related to the expected return time: 1 πj = Mjand, if the chain is also aperiodic, then for any i and j: (n) 1 lim a = n→∞ ij MjThe chain converges to the stationary distribution regardless of where it begins! Such π is calledthe equilibrium distribution of the chain.9.2 Transition matrix Nota de classe exemples macos a http://www.sosmath.com/matrix/markov/markov.html aij = Pr(qt+1 = sj |qt = si )is a right stochastic matrix, because each row sums 1 and all values are positive. If the Markov chainis time-homogeneous, the k-step transition probability can be computed by P k . The stationary distribution π is a row vector that fullﬁlls: π = πP(a left eigenvector of the transition matrix with the eigenvalue 1). One can also see vector π as aﬁxed (equilibrium) point of the linear transformation P , which always exists but is only unique whenthe Markov chain is irreductible and aperiodic. In such case, P k converges to a rank-one matrix inwhich each row is the stationary distribution π: lim P k = 1π k→∞ Nota de classe to be completed with reversible Markov chains and Bernouilli schemes and processes, as well asgeneral state spaces Exercise 27 For example, imagine a 3-state Markov model of the weather taken from [?]. Every day the weatheris observed at noon and anotated in one of three diﬀerent states:Probability 30
31. 31. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences Figure 13: idees de soluci de problemastate 1: rainystate 2: cloudystate 3: sunnyThe transition probabilities are set to   0.4 0.3 0.3 A = aij =  0.2 0.6 0.2  0.1 0.1 0.8provided that on day 1, it is sunny, what is the probability that we will have the series: sunny-cloudy-sunny-sunny-sunny-rainy? What is the stationary distribution for the MArkov chain? Howwould you design a random generation of ”weathers” based on this Markov Chain? Nota de classe noms cal multiplicar probabilitats un cop construt el tree Exercise 28 A large biased coin A is described by the order 0 Markov model Pr(H) = 0.6, Pr(T ) = 0.4. A smallbiased coin B by Pr(h) = 0.3, Pr(t) = 0.7 and a third biased coin C by Pr(H) = 0.8, Pr(T ) = 0.2.Build the Markov chain and the transition matrix for the following process: 1. Let X equal one of the two ﬁrst coins A or B 2. repeat (a) throw the coin X and write down the result (b) toss coin C (c) if C shows T then change X to the other coin; otherwise leave X unchanged 3. until enough Nota de classe Easy, nom´s cal construir un model de Markov de 4 estats H,T,h,t i pensar en la probabilitat de efer transicions ii o ij, que dependr dels resultats de llanar dues monedes cada cop (la C per decidirProbability 31
32. 32. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciencesi A o B per anotar)9.3 Selected applications • state of ion channels in cell membranes • distribution of shapes of cells in a tissue (Figure ??) • Brownian motion and the ergodic hypothesis in physics • the Metropolis-Hastings algorithm to sample a given distribution with known equilibrium distribution10 Hidden Markov ModelsReal world processes produce signals that can be: • discrete (letters in an alphabet) or continuous (temperature) • stationary (invariant to time) or nonstationary (signal properties varying with time) • pure (coming from just one source) or corrupted (by noise)We aim at ﬁnding characteristics of the real-world signals in terms of signal models. Why? 1. the signal model can provide the basis for a theoretical model that will allow reproducing the signal 2. on the contrary, without having the source available (reason is unknown or signal is expensive to reproduce), we may learn a lot by simply analyzing the signal 3. they work very well in practice... ;-) The signal model can be deterministic (we know something about the signal, like its mathematicalexpression, the physics behind...) or statistical (Gaussian processes, Poisson processes, Markovprocesses, hidden Markov processes...). In an statistical model we assume that the signal can be wellcharacterized as a parametric random process, with parameters that can be precisely determined. HMM design includes the solving of three fundamental problems[?]: 1. what is the probability (or likelihood) of a sequence of observations given an speciﬁc HMM? 2. what is the best sequence of model states? 3. how do we parametrize the HMM in order to best account for the observed signal?Probability 32
33. 33. MAT: 2011-31035-T1 MSc Bioinformatics for Health SciencesFigure 14: A simple Markov Chain Model to show why many diﬀerent multicellular organisms allhave the same epithelial cell topology. The variable ”how many neighbors each cell has”, shows thesame distribution in diﬀerent organisms.Probability 33
34. 34. MAT: 2011-31035-T1 MSc Bioinformatics for Health SciencesFigure 15: Toy model for an HMM applied to sequence analysis taken from [?]. The examplerepresents a simpliﬁed view of a 5’ splice site recognition problem with three possible states, exon,5’ splice site and intron An HMM involves having the following material: 1. the symbol alphabet to be emmitted, with size M . 2. the number of states in the model, N . 3. the emission probabilities bi (k) = Pr(vk att|qt = si ), 1 ≤ j ≤ N, 1 ≤ k ≤ M . 4. the transition probabilities, aij . 5. the initial state distribution, πi = Pr(q1 = si ), 1 ≤ i ≤ N . A toy model for a HMM is shown in Figure ??. HMMs do not deal well with correlations between observations, as they only depend on theunderlying states. For example, for secondary structure analysis of RNA or long range correlationsin protein 3D structure analysis, a HMM is not appropriate.11 Sources of information • Numerical recipes: http://www.nr.comProbability 34
35. 35. MAT: 2011-31035-T1 MSc Bioinformatics for Health Sciences • LTCC math department http://www.ltcconline.net/greenl/courses/204/204.htm • SOS MATH: http://www.sosmath.com/diffeq/diffeq.html • Probability: http://www.dartmouth.edu/~chance/teaching_aids/articles.html • Software: – R: http://www.r-project.org/ – Octave: http://www.octave.org/ – gnuplot http://www.gnuplot.info/Probability 35