Probability for Machine Learning
Here is a brief introduction to an important subject in Machine Learning. We are also planning to work with our Probability Professor, Ofelia Begovich, on a series of notes on basic probability that will improve this introduction. Nevertheless, several other topics still need to be addressed:
1.- Linear Algebra
2.- Topology
3.- Mathematical Analysis
4.- Optimization
Thus, I am working on a class on intelligent systems that covers them.
1. Machine Learning for Data Mining
Probability Review
Andres Mendez-Vazquez
May 14, 2015
2. Outline
1 Basic Theory
Intuitive Formulation
Axioms
2 Independence
Unconditional and Conditional Probability
Posterior (Conditional) Probability
3 Random Variables
Types of Random Variables
Cumulative Distribution Function
Properties of the PMF/PDF
Expected Value and Variance
4 Statistical Decision
Statistical Decision Model
Hypothesis Testing
Estimation
4. Gerolamo Cardano: Gambling out of Darkness
Gambling
Gambling shows our interest in quantifying the ideas of probability for
millennia, but exact mathematical descriptions arose much later.
Gerolamo Cardano (16th century)
While gambling he developed the following rule!!!
Equal conditions
“The most fundamental principle of all in gambling is simply equal
conditions, e.g. of opponents, of bystanders, of money, of situation, of the
dice box and of the dice itself. To the extent to which you depart from
that equity, if it is in your opponent’s favour, you are a fool, and if in your
own, you are unjust.”
7. Gerolamo Cardano’s Definition
Probability
“If therefore, someone should say, I want an ace, a deuce, or a trey, you know that there are 27 favourable throws, and since the circuit is 36, the rest of the throws in which these points will not turn up will be 9; the odds will therefore be 3 to 1.”
Meaning
Probability as a ratio of favorable to all possible outcomes!!! As long as all events are equiprobable...
Thus, we get
P(All favourable throws) = Number of all favourable throws / Number of all throws   (1)
10. Intuitive Formulation
Empiric Definition
Intuitively, the probability of an event A could be defined as:
P(A) = \lim_{n→∞} N(A)/n
Where N(A) is the number of times that event A happens in n trials.
Example
Imagine you have three dice, then
The total number of outcomes is 6^3
If we have event A = all numbers are equal, |A| = 6
Then, we have that P(A) = 6/6^3 = 1/36
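To make the empiric definition concrete, here is a minimal Python sketch (the function name and parameters are mine, not from the slides) that estimates P(all three dice equal) as N(A)/n and lets you watch it approach the exact value 1/36 as n grows:

    import random

    def estimate_all_equal(n_trials, seed=0):
        """Estimate P(all three dice show the same face) as N(A)/n."""
        rng = random.Random(seed)
        hits = 0
        for _ in range(n_trials):
            a, b, c = (rng.randint(1, 6) for _ in range(3))
            if a == b == c:
                hits += 1
        return hits / n_trials

    print(estimate_all_equal(1_000_000))  # tends to 1/36 ≈ 0.0278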
15. Axioms of Probability
Axioms
Given a sample space S of events, we have that
1 0 ≤ P(A) ≤ 1
2 P(S) = 1
3 If A1, A2, ..., An are mutually exclusive events (i.e. P(Ai ∩ Aj) = 0), then:
P(A1 ∪ A2 ∪ ... ∪ An) = \sum_{i=1}^{n} P(Ai)
21. Example
Setup
Throw a biased coin twice:
HH .36   HT .24
TH .24   TT .16
We have the following event
At least one head!!! Can you tell me which outcomes are part of it?
What about this one?
Tail on first toss.
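As a sketch of the answers (my own enumeration, not spelled out on the slide): “at least one head” is {HH, HT, TH} and “tail on first toss” is {TH, TT}, so their probabilities follow by adding the table entries:

    # Joint probabilities of the two tosses, taken from the slide's table.
    P = {"HH": 0.36, "HT": 0.24, "TH": 0.24, "TT": 0.16}

    at_least_one_head = [w for w in P if "H" in w]   # {HH, HT, TH}
    tail_on_first = [w for w in P if w[0] == "T"]    # {TH, TT}

    print(sum(P[w] for w in at_least_one_head))  # ≈ 0.84
    print(sum(P[w] for w in tail_on_first))      # ≈ 0.40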
24. We need to count!!!
We have four main methods of counting
1 Ordered samples of size r with replacement
2 Ordered samples of size r without replacement
3 Unordered samples of size r without replacement
4 Unordered samples of size r with replacement
28. Ordered samples of size r with replacement
Definition
The number of possible sequences (a_{i1}, ..., a_{ir}) for n different numbers is
n × n × ... × n = n^r
Example
If you throw three dice you have 6 × 6 × 6 = 216 outcomes
30. Ordered samples of size r without replacement
Definition
The number of possible sequences (a_{i1}, ..., a_{ir}) for n different numbers is
n × (n − 1) × ... × (n − (r − 1)) = n!/(n − r)!
Example
The number of different numbers that can be formed if no digit can be repeated. For example, if you have 4 digits and you want numbers of size 3, there are 4!/(4 − 3)! = 24 of them.
32. Unordered samples of size r without replacement
Definition
Actually, we want the number of possible unordered sets.
However
We have n!/(n − r)! collections where we care about the order. Thus
(n!/(n − r)!) / r! = n!/(r!(n − r)!) = \binom{n}{r}   (2)
34. Unordered samples of size r with replacement
Definition
We want to find an unordered set {a_{i1}, ..., a_{ir}} with replacement
Use a digit trick for that
Look at the Board
Thus
\binom{n + r − 1}{r}   (3)
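All four counting formulas can be checked against Python's itertools, whose four iterators correspond exactly to the four sampling schemes. A quick verification sketch, with n = 4 and r = 3 chosen arbitrarily:

    from itertools import (product, permutations, combinations,
                           combinations_with_replacement)
    from math import comb, factorial

    n, r = 4, 3
    items = range(n)

    # Ordered, with replacement: n^r
    assert len(list(product(items, repeat=r))) == n**r
    # Ordered, without replacement: n!/(n-r)!
    assert len(list(permutations(items, r))) == factorial(n) // factorial(n - r)
    # Unordered, without replacement: binom(n, r)
    assert len(list(combinations(items, r))) == comb(n, r)
    # Unordered, with replacement: binom(n+r-1, r)
    assert len(list(combinations_with_replacement(items, r))) == comb(n + r - 1, r)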
37. How?
Change encoding by adding more signs
Imagine all the strings of three numbers with {1, 2, 3}
We have
Old String   New String
111          1+0,1+1,1+2 = 123
112          1+0,1+1,2+2 = 124
113          1+0,1+1,3+2 = 125
122          1+0,2+1,2+2 = 134
123          1+0,2+1,3+2 = 135
133          1+0,3+1,3+2 = 145
222          2+0,2+1,2+2 = 234
223          2+0,2+1,3+2 = 235
233          2+0,3+1,3+2 = 245
333          3+0,3+1,3+2 = 345
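The trick is a bijection: adding 0, 1, 2 to the sorted entries turns each multiset over {1, 2, 3} into a strictly increasing string over {1, ..., 5}, which is why the count is \binom{3 + 3 − 1}{3} = \binom{5}{3} = 10. A small sketch of the encoding (the function name is mine):

    from itertools import combinations_with_replacement

    def encode(multiset):
        """Map a sorted multiset (a1 <= a2 <= a3) to a strictly increasing tuple."""
        return tuple(a + i for i, a in enumerate(multiset))

    multisets = list(combinations_with_replacement([1, 2, 3], 3))
    codes = {encode(m) for m in multisets}
    print(len(multisets), len(codes))  # 10 10, so the encoding is injective
    print(encode((2, 2, 3)))           # (2, 3, 5)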
40. Example
We have two dice
Thus, we have all pairs (i, j) such that i, j = 1, 2, 3, ..., 6
We have the following events
A = {First die 1, 2 or 3}
B = {First die 3, 4 or 5}
C = {The sum of the two faces is 9}
So, we can do
Look at the board!!! Independence between A, B, C
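One way to carry out the board computation (my own enumeration, under the usual assumption that the 36 pairs are equally likely) is to compare P(E ∩ F) with P(E)P(F) for each pair of events; with these particular events, none of the three pairs comes out independent:

    from fractions import Fraction
    from itertools import product

    omega = list(product(range(1, 7), repeat=2))  # 36 equally likely pairs (i, j)

    def prob(event):
        return Fraction(sum(1 for w in omega if event(w)), len(omega))

    A = lambda w: w[0] in (1, 2, 3)   # first die is 1, 2 or 3
    B = lambda w: w[0] in (3, 4, 5)   # first die is 3, 4 or 5
    C = lambda w: sum(w) == 9         # faces sum to 9

    for name, (E, F) in {"A,B": (A, B), "A,C": (A, C), "B,C": (B, C)}.items():
        joint = prob(lambda w: E(w) and F(w))
        print(name, joint, prob(E) * prob(F), joint == prob(E) * prob(F))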
45. We can use this to derive the Binomial Distribution
WHAT?????
46. First, we use a sequence of n Bernoulli Trials
We have this
“Success” has a probability p.
“Failure” has a probability 1 − p.
Examples
Toss a coin independently n times.
Examine components produced on an assembly line.
Now
We take S = all 2^n ordered sequences of length n, with components 0 (failure) and 1 (success).
51. Thus, taking a sample ω
ω = 11···10···0, i.e., k 1’s followed by n − k 0’s.
We have then
P(ω) = P(A1 ∩ A2 ∩ ... ∩ Ak ∩ A^c_{k+1} ∩ ... ∩ A^c_n)
     = P(A1) P(A2) ··· P(Ak) P(A^c_{k+1}) ··· P(A^c_n)
     = p^k (1 − p)^{n−k}
Important
The number of such samples is the number of sets with k elements.... or... \binom{n}{k}
54. Did you notice?
We do not care where the 1’s and 0’s are
Thus all these probabilities are equal to p^k (1 − p)^{n−k}
Thus, we are looking to sum the probabilities of all those combinations of 1’s and 0’s:
\sum_{ω with k 1’s} P(ω)
Then
\sum_{ω with k 1’s} P(ω) = \binom{n}{k} p^k (1 − p)^{n−k}
57. Proving this is a probability
Sum of these probabilities is equal to 1
\sum_{k=0}^{n} \binom{n}{k} p^k (1 − p)^{n−k} = (p + (1 − p))^n = 1
The other is simple
0 ≤ \binom{n}{k} p^k (1 − p)^{n−k} ≤ 1 ∀k
This is known as
The Binomial probability function!!!
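A brief numerical check of the derivation (n and p below are arbitrary choices of mine): enumerating all 2^n sequences and summing p^k (1 − p)^{n−k} over those with exactly k ones reproduces \binom{n}{k} p^k (1 − p)^{n−k}, and the resulting pmf sums to 1:

    from itertools import product
    from math import comb, isclose

    n, p = 5, 0.3

    def seq_prob(s):
        k = sum(s)
        return p**k * (1 - p)**(n - k)

    for k in range(n + 1):
        total = sum(seq_prob(s) for s in product([0, 1], repeat=n) if sum(s) == k)
        assert isclose(total, comb(n, k) * p**k * (1 - p)**(n - k))

    assert isclose(sum(comb(n, k) * p**k * (1 - p)**(n - k)
                       for k in range(n + 1)), 1.0)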
61. Different Probabilities
Unconditional
This is the probability of an event A prior to the arrival of any evidence; it is denoted by P(A). For example:
P(Cavity) = 0.1 means that “in the absence of any other information, there is a 10% chance that the patient has a cavity”.
Conditional
This is the probability of an event A given some evidence B; it is denoted by P(A|B). For example:
P(Cavity|Toothache) = 0.8 means that “there is an 80% chance that the patient has a cavity given that he has a toothache”.
66. Posterior Probabilities
Relation between conditional and unconditional probabilities
Conditional probabilities can be defined in terms of unconditional probabilities:
P(A|B) = P(A, B) / P(B)
which generalizes to the chain rule P(A, B) = P(B)P(A|B) = P(A)P(B|A).
Law of Total Probability
If B1, B2, ..., Bn is a partition of mutually exclusive events and A is an event, then
P(A) = \sum_{i=1}^{n} P(A ∩ Bi). A special case is P(A) = P(A, B) + P(A, B^c).
In addition, this can be rewritten as P(A) = \sum_{i=1}^{n} P(A|Bi)P(Bi).
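As an illustration, both definitions can be checked on a tiny joint table. The joint numbers below are hypothetical, chosen only so that they match the earlier slide's P(Cavity) = 0.1 and P(Cavity|Toothache) = 0.8:

    # Hypothetical joint distribution over (Cavity, Toothache).
    P = {("cavity", "toothache"): 0.04, ("cavity", "no toothache"): 0.06,
         ("no cavity", "toothache"): 0.01, ("no cavity", "no toothache"): 0.89}

    # Law of total probability: P(Cavity) = P(Cavity, T) + P(Cavity, T^c) ≈ 0.1
    p_cavity = sum(v for (c, t), v in P.items() if c == "cavity")

    # Conditional from the joint: P(Cavity | Toothache) = P(Cavity, T)/P(T) ≈ 0.8
    p_toothache = sum(v for (c, t), v in P.items() if t == "toothache")
    print(p_cavity, P[("cavity", "toothache")] / p_toothache)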
69. Example
Three cards are drawn from a deck
Find the probability of not obtaining a heart
We have
52 cards
39 of them are not a heart
Define
Ai = {Card i is not a heart}. Then?
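A sketch of the answer (my own arithmetic, not from the slides): by the chain rule, P(A1 ∩ A2 ∩ A3) = P(A1)P(A2|A1)P(A3|A1, A2) = (39/52)(38/51)(37/50) = 703/1700 ≈ 0.4135. With exact fractions in Python:

    from fractions import Fraction

    # Chain rule: no heart on three successive draws without replacement.
    p = Fraction(39, 52) * Fraction(38, 51) * Fraction(37, 50)
    print(p, float(p))  # 703/1700 ≈ 0.4135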
72. Independence and Conditional
From here, we have that...
P(A|B) = P(A) and P(B|A) = P(B).
Conditional independence
A and B are conditionally independent given C if and only if
P(A|B, C) = P(A|C)
Example: P(WetGrass|Season, Rain) = P(WetGrass|Rain).
74. Bayes Theorem
One Version
P(A|B) = P(B|A)P(A) / P(B)
Where
P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B.
P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B.
P(B|A) is the conditional probability of B given A. It is also called the likelihood.
P(B) is the prior or marginal probability of B, and acts as a normalizing constant.
79. General Form of the Bayes Rule
Definition
If A1, A2, ..., An is a partition of mutually exclusive events and B is any event, then:
P(Ai|B) = P(B|Ai)P(Ai) / P(B) = P(B|Ai)P(Ai) / \sum_{j=1}^{n} P(B|Aj)P(Aj)
where
P(B) = \sum_{i=1}^{n} P(B ∩ Ai) = \sum_{i=1}^{n} P(B|Ai)P(Ai)
81. Example
Setup
Throw two unbiased dice independently.
Let
1 A = {sum of the faces = 8}
2 B = {faces are equal}
Then calculate P(B|A)
Look at the board
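A sketch of the board computation (my own enumeration): A = {(2,6), (3,5), (4,4), (5,3), (6,2)} contains 5 equally likely outcomes, and B ∩ A = {(4,4)}, so P(B|A) = 1/5. Verified in Python:

    from fractions import Fraction
    from itertools import product

    omega = list(product(range(1, 7), repeat=2))
    A = [w for w in omega if sum(w) == 8]          # sum of the faces is 8
    B_and_A = [w for w in A if w[0] == w[1]]       # faces are also equal
    print(Fraction(len(B_and_A), len(A)))          # 1/5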
84. Another Example
We have the following
Two coins are available, one unbiased and the other two-headed
Assume
That you have a probability of 3/4 of choosing the unbiased one
Events
A = {head comes up}
B1 = {Unbiased coin chosen}
B2 = {Biased coin chosen}
If a head comes up, find the probability that the two-headed coin was chosen
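Working the example with Bayes' rule (my own arithmetic): P(A|B1) = 1/2, P(A|B2) = 1, P(B1) = 3/4, P(B2) = 1/4, so P(B2|A) = (1 · 1/4) / (1/2 · 3/4 + 1 · 1/4) = (1/4)/(5/8) = 2/5. A small sketch:

    from fractions import Fraction

    priors = {"B1": Fraction(3, 4), "B2": Fraction(1, 4)}     # which coin was chosen
    likelihood = {"B1": Fraction(1, 2), "B2": Fraction(1)}    # P(head | coin)

    p_head = sum(likelihood[b] * priors[b] for b in priors)   # total probability: 5/8
    print(likelihood["B2"] * priors["B2"] / p_head)           # 2/5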
89. Random Variables I
Definition
In many experiments, it is easier to deal with a summary variable than with the original probability structure.
Example
In an opinion poll, we ask 50 people whether they agree or disagree with a certain issue.
Suppose we record a “1” for agree and “0” for disagree.
The sample space for this experiment has 2^50 elements. Why?
Suppose we are only interested in the number of people who agree.
Define the variable X = number of “1”’s recorded out of 50.
It is easier to deal with this sample space (it has only 51 elements).
96. Thus...
It is necessary to define a function, the “random variable,” as follows:
X : S → R
Graphically, X maps each outcome s ∈ S to a real number X(s).
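As a quick sketch of this definition (an illustration of mine, not from the slides; the function names are assumptions), a random variable is just a function on outcomes. Shrinking the opinion poll from 50 people to 3 keeps the sample space printable:

```python
import itertools

# Sketch: a random variable X : S -> R is just a function on outcomes.
# Here S is a scaled-down opinion-poll sample space with 3 people.
def X(outcome):
    """Number of 1's (people who agree) in the outcome."""
    return sum(outcome)

S = list(itertools.product([0, 1], repeat=3))  # 2^3 = 8 outcomes
print(len(S))                                  # 8
print(sorted({X(s) for s in S}))               # X collapses S to {0, 1, 2, 3}
```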
98. Random Variables II
How?
How is the probability function of the random variable defined from the
probability function of the original sample space?
Suppose the sample space is S = {s1, s2, ..., sn}.
Suppose the range of the random variable is {x1, x2, ..., xm}.
Then, we observe X = xi if and only if the outcome of the random
experiment is an sj ∈ S such that X(sj) = xi, or
P(X = xi) = P({sj ∈ S | X(sj) = xi})
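A minimal sketch of this construction (mine, with assumed names), using two tosses of a coin with P(head) = 0.6, an example the deck revisits below:

```python
from collections import defaultdict

# Sketch: derive the pmf of X from the sample-space probabilities,
# P(X = x_i) = P({s_j in S : X(s_j) = x_i}).
P_S = {"HH": 0.36, "HT": 0.24, "TH": 0.24, "TT": 0.16}  # P(head) = 0.6
X = {"HH": 2, "HT": 1, "TH": 1, "TT": 0}                # number of heads

pmf = defaultdict(float)
for s, p in P_S.items():
    pmf[X[s]] += p          # accumulate P(s_j) into P(X = X(s_j))

print(dict(pmf))            # {2: 0.36, 1: 0.48, 0: 0.16} (up to float rounding)
```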
102. Example
Setup
Toss a coin 10 times, and let R be the number of heads.
Then
S = all sequences of length 10 with components H and T
We have, for example,
ω = HHHHTTHTTH ⇒ R(ω) = 6
105. Example
Setup
Let R be the number of heads in two independent tosses of a coin.
The probability of a head is 0.6.
What are the probabilities?
Ω = {HH, HT, TH, TT}
Thus, we can calculate
P(R = 0), P(R = 1), P(R = 2)
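Carrying out the calculation (the tosses are independent, so outcome probabilities multiply):
P(R = 0) = P(TT) = 0.4 × 0.4 = 0.16
P(R = 1) = P(HT) + P(TH) = 2 × 0.6 × 0.4 = 0.48
P(R = 2) = P(HH) = 0.6 × 0.6 = 0.36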
109. Types of Random Variables
Discrete
A discrete random variable can assume only a countable number of values.
Continuous
A continuous random variable can assume a continuous range of values.
111. Properties
Probability Mass Function (PMF) and Probability Density Function (PDF)
The pmf/pdf of a random variable X assigns a probability to each
possible value of X.
Properties of the pmf and pdf
Some properties of the pmf:
∑_x p(x) = 1 and P(a ≤ X ≤ b) = ∑_{k=a}^{b} p(k).
In a similar way for the pdf:
∫_{−∞}^{∞} p(x)dx = 1 and P(a < X < b) = ∫_a^b p(t)dt.
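A small numerical sanity check of both properties (a sketch of mine, assuming numpy; the exponential density e^{−x} reappears in a later example):

```python
import numpy as np

# PMF check on the two-toss example: sum_x p(x) = 1.
pmf = {0: 0.16, 1: 0.48, 2: 0.36}
print(sum(pmf.values()))                 # 1.0

# PDF check on p(x) = exp(-x), x >= 0, on a fine grid: the integral
# over the support is 1, and P(a < X < b) is the integral from a to b.
x = np.linspace(0.0, 50.0, 2_000_001)
dx = x[1] - x[0]
p = np.exp(-x)
print(p.sum() * dx)                      # ≈ 1 (Riemann sum)
a, b = 1.0, 2.0
mask = (x > a) & (x < b)
print(p[mask].sum() * dx)                # ≈ e^(-1) - e^(-2) ≈ 0.2325
```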
118. Cumulative Distribution Function I
Cumulative Distribution Function
With every random variable, we associate a function called the
Cumulative Distribution Function (CDF), which is defined as follows:
FX(x) = P(X ≤ x)
With properties:
FX(x) ≥ 0
FX(x) is a non-decreasing function of x.
Example
If X is discrete, its CDF can be computed as follows:
FX(x) = P(X ≤ x) = ∑_{k: xk ≤ x} P(X = xk).
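For the discrete case, a sketch of mine of that summation, using the two-toss pmf from above:

```python
# Sketch: F_X(x) = sum of P(X = x_k) over the x_k <= x.
pmf = {0: 0.16, 1: 0.48, 2: 0.36}

def cdf(x):
    return sum(p for xk, p in pmf.items() if xk <= x)

print([cdf(x) for x in (-1, 0, 0.5, 1, 2, 3)])
# approximately [0, 0.16, 0.16, 0.64, 1.0, 1.0]: non-decreasing, ends at 1
```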
123. Cumulative Distribution Function II
Continuous Case
If X is continuous, its CDF can be computed as follows:
F(x) = ∫_{−∞}^{x} f(t)dt.
Remark
Based on the fundamental theorem of calculus, we have the following
equality:
p(x) = dF(x)/dx
Note
This particular p(x) is known as the Probability Density Function (PDF).
126. Example: Continuous Function
Setup
A number X is chosen at random between a and b.
X has a uniform distribution:
fX(x) = 1/(b − a) for a ≤ x ≤ b
fX(x) = 0 for x < a and x > b
We have
FX(x) = P{X ≤ x} = ∫_{−∞}^{x} fX(t)dt (4)
P{a < X ≤ b} = ∫_a^b fX(t)dt (5)
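Carrying out the integral in (4), which the slide leaves implicit, gives the piecewise closed form:
FX(x) = 0 for x < a
FX(x) = (x − a)/(b − a) for a ≤ x ≤ b
FX(x) = 1 for x > b
Consistently with (5), P{a < X ≤ b} = FX(b) − FX(a) = 1.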
134. Properties of the PMF/PDF I
Conditional PMF/PDF
We have the conditional pdf:
p(y|x) = p(x, y)/p(x).
From this, we have the general chain rule:
p(x1, x2, ..., xn) = p(x1|x2, ..., xn)p(x2|x3, ..., xn)...p(xn).
Independence
If X and Y are independent, then:
p(x, y) = p(x)p(y).
136. Properties of the PMF/PDF II
Law of Total Probability
p(y) = ∑_x p(y|x)p(x).
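These identities are easy to check numerically; here is a sketch of mine (assuming numpy) on a small 2×3 joint pmf:

```python
import numpy as np

# A 2x3 joint pmf p(x, y): rows index x, columns index y.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.15, 0.25, 0.20]])
p_x = p_xy.sum(axis=1)              # marginal p(x)
p_y_given_x = p_xy / p_x[:, None]   # conditional p(y|x) = p(x, y)/p(x)

# Law of total probability: p(y) = sum_x p(y|x) p(x).
p_y = (p_y_given_x * p_x[:, None]).sum(axis=0)
print(p_y)                          # equals the column sums of p_xy
print(p_xy.sum(axis=0))
```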
138. Expectation
Something Notable
You have random variables R1, R2 representing the length of a call and
how much you pay for an international call:
if 0 ≤ R1 ≤ 3 (minutes), R2 = 10 (cents)
if 3 < R1 ≤ 6 (minutes), R2 = 20 (cents)
if 6 < R1 ≤ 9 (minutes), R2 = 30 (cents)
We have then the probabilities
P{R2 = 10} = 0.6, P{R2 = 20} = 0.25, P{R2 = 30} = 0.15
If we observe N calls and N is very large
We can say that about 0.6N calls cost 10 cents each, so those calls cost
10 × 0.6N = 6N cents in total.
141. Expectation
Similarly
{R2 = 20} ⇒ about 0.25N calls, with total cost 20 × 0.25N = 5N cents
{R2 = 30} ⇒ about 0.15N calls, with total cost 30 × 0.15N = 4.5N cents
The total cost
The total cost is 6N + 5N + 4.5N = 15.5N, or on average 15.5 cents per
call.
The average
[10(0.6N) + 20(0.25N) + 30(0.15N)] / N = 10(0.6) + 20(0.25) + 30(0.15) = ∑_y y P{R2 = y}
144. Expected Value
Definition
Discrete random variable X: E(X) = ∑_x x p(x).
Continuous random variable Y: E(Y) = ∫ y p(y)dy.
Extension to a function g(X)
E(g(X)) = ∑_x g(x)p(x) (discrete case).
E(g(X)) = ∫_{−∞}^{∞} g(x)p(x)dx (continuous case).
Linearity property
E(af(X) + bg(Y)) = aE(f(X)) + bE(g(Y))
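As a one-line check of the discrete definition against the phone-call example above (a sketch of mine):

```python
# E(R2) = sum_y y * P(R2 = y) for the phone-call example.
pmf_R2 = {10: 0.60, 20: 0.25, 30: 0.15}
print(sum(y * p for y, p in pmf_R2.items()))   # 15.5 cents per call
```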
147. Example
Imagine the following
We have the following density:
1 f(x) = e^{−x}, x ≥ 0
2 f(x) = 0, x < 0
Find
The expected value.
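The computation the slide asks for, carried out by integration by parts:
E(X) = ∫_0^∞ x e^{−x} dx = [−x e^{−x}]_0^∞ + ∫_0^∞ e^{−x} dx = 0 + 1 = 1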
154. Example
Suppose
The number of calls made per day at a given exchange has a Poisson
distribution with an unknown parameter θ:
p(x|θ) = θ^x e^{−θ} / x!,  x = 0, 1, 2, ... (6)
We need to obtain information about θ
For this, we need to observe the process!!!
For example
We could need more of certain equipment if θ > θ0
We do not need it if θ ≤ θ0
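Equation (6) as runnable code (a sketch of mine), with a check that the pmf sums to 1 for an arbitrary θ:

```python
from math import exp, factorial

# Poisson pmf p(x | theta) = theta^x e^{-theta} / x!
def poisson_pmf(x, theta):
    return theta**x * exp(-theta) / factorial(x)

print(sum(poisson_pmf(x, 4.2) for x in range(100)))  # ≈ 1.0
```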
157. Thus, we want to make a decision about θ
We want to avoid making an incorrect decision
To avoid losing money!!!
158. Ingredients of statistical decision models
First
N, the set of states.
Second
A random variable or random vector X, the observable, whose distribution
Fθ depends on θ ∈ N.
Third
A, the set of possible actions; in the telephone example,
A = N = (0, ∞)
Fourth
A loss (cost) function L(θ, a), θ ∈ N, a ∈ A:
It represents the loss incurred by taking action a when the state is θ.
163. Hypothesis Testing
Suppose
H0 and H1 are two subsets such that
H0 ∩ H1 = ∅
H0 ∪ H1 = N
In the telephone example
H0 = {θ | θ ≤ θ0}
H1 = {θ | θ > θ0}
In other words
The two hypotheses are “θ ∈ H0” versus “θ ∈ H1”.
170. Simple Hypothesis Vs. Simple Alternative
In this specific case
Each of H0 and H1 contains a single element, θ0 and θ1 respectively.
Thus
Our random variable X depends on θ:
If we are in H0, X ∼ f0
If we are in H1, X ∼ f1
Thus, the problem
Is deciding whether X has density f0 or f1.
174. What do we do?
We define a function
ϕ : E → [0, 1], interpreted as the probability of rejecting H0 when x is
observed.
We have then
If ϕ(x) = 1, we reject H0
If ϕ(x) = 0, we accept H0
If 0 < ϕ(x) < 1, we toss a coin with probability a = ϕ(x) of heads:
if the coin comes up heads, reject H0
if the coin comes up tails, accept H0
180. Thus
{x | ϕ(x) = 1}
This is called the rejection region or critical region.
And
ϕ is called a test!!!
Clearly the decision could be erroneous!!!
A type 1 error occurs if we reject H0 when H0 is true!!!
A type 2 error occurs if we accept H0 when H1 is true!!!
184. Thus, the probabilities of error when X = x
If H0 is rejected when true
Probability of a type 1 error:
α = ∫_{−∞}^{∞} ϕ(x) f0(x) dx (7)
If H0 is accepted when false
Probability of a type 2 error:
β = ∫_{−∞}^{∞} (1 − ϕ(x)) f1(x) dx (8)
186. Actually
If the test is an indicator function, ϕ(x) = I_{Reject H0}(x) and
1 − ϕ(x) = I_{Retain H0}(x), the four cases form the usual 2 × 2 table:
            Retain H0       Reject H0
H0 true     correct         type 1 error
H1 true     type 2 error    correct
187. Problem!!!
There is not a unique answer to the question of what a good test is
Thus, we suppose there is a nonnegative cost ci associated with a type i
error.
In addition, we have a prior probability p that H0 is true.
The over-all average cost associated with ϕ is
B(ϕ) = p × c1 × α(ϕ) + (1 − p) × c2 × β(ϕ) (9)
190. We can do the following
The over-all average cost associated with ϕ is
B(ϕ) = p × c1 × ∫_{−∞}^{∞} ϕ(x) f0(x) dx + (1 − p) × c2 × ∫_{−∞}^{∞} (1 − ϕ(x)) f1(x) dx
Thus
B(ϕ) = ∫_{−∞}^{∞} [p c1 ϕ(x) f0(x) + (1 − p) c2 (1 − ϕ(x)) f1(x)] dx
= ∫_{−∞}^{∞} [p c1 ϕ(x) f0(x) − (1 − p) c2 ϕ(x) f1(x) + (1 − p) c2 f1(x)] dx
= ∫_{−∞}^{∞} [p c1 ϕ(x) f0(x) − (1 − p) c2 ϕ(x) f1(x)] dx + (1 − p) c2 ∫_{−∞}^{∞} f1(x) dx
We have that
B(ϕ) = ∫_{−∞}^{∞} ϕ(x) [p c1 f0(x) − (1 − p) c2 f1(x)] dx + (1 − p) c2
193. Bayes Risk
We have that...
B(ϕ) is called the Bayes risk associated with the test function ϕ.
In addition
A test that minimizes B(ϕ) is called a Bayes test corresponding to the
given p, c1, c2, f0 and f1.
195. What do we want?
We want
To minimize ∫_S ϕ(x) g(x) dx
We want to find g(x)!!!
This will tell us how to select the correct hypothesis!!!
198. What do we want?
Case 1
If g(x) < 0, it is best to take ϕ(x) = 1 for all x ∈ S.
Case 2
If g(x) > 0, it is best to take ϕ(x) = 0 for all x ∈ S.
Case 3
If g(x) = 0, ϕ(x) may be chosen arbitrarily.
201. Finally
We choose
g(x) = p c1 f0(x) − (1 − p) c2 f1(x) (10)
We look at the boundary where g(x) = 0:
p c1 f0(x) − (1 − p) c2 f1(x) = 0
p c1 f0(x) = (1 − p) c2 f1(x)
p c1 / [(1 − p) c2] = f1(x) / f0(x)
203. Bayes Solution
Thus, we have
Let L(x) = f1(x)/f0(x).
If L(x) > pc1/[(1 − p)c2], then take ϕ(x) = 1, i.e., reject H0.
If L(x) < pc1/[(1 − p)c2], then take ϕ(x) = 0, i.e., accept H0.
If L(x) = pc1/[(1 − p)c2], then ϕ(x) may be anything.
207. Likelihood Ratio
We have
L is called the likelihood ratio.
For the test ϕ
There is a constant 0 ≤ λ ≤ ∞ such that:
ϕ(x) = 1 when L(x) > λ
ϕ(x) = 0 when L(x) < λ
Remark: This is known as the Likelihood Ratio Test (LRT).
212. Example
Let X be a discrete random variable
x ∈ {0, 1, 2, 3}
We have then
x       0    1    2    3
p0(x)  .1   .2   .3   .4
p1(x)  .2   .1   .4   .3
We have the following likelihood ratios, in increasing order
x       1    3    2    0
L(x)   1/2  3/4  4/3   2
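The LRT on this table, as a sketch in code (mine; names are assumptions, and exact fractions keep the comparison with λ exact):

```python
from fractions import Fraction as F

# p0, p1 and L(x) = p1(x)/p0(x) from the table above.
p0 = {0: F(1, 10), 1: F(2, 10), 2: F(3, 10), 3: F(4, 10)}
p1 = {0: F(2, 10), 1: F(1, 10), 2: F(4, 10), 3: F(3, 10)}

def lrt(x, lam):
    """1 = reject H0, 0 = accept H0, None = randomize (L(x) == lam)."""
    L = p1[x] / p0[x]
    return 1 if L > lam else (0 if L < lam else None)

for x in range(4):
    print(x, p1[x] / p0[x], lrt(x, lam=F(3, 4)))
# With lam = 3/4: reject x = 0 and x = 2; accept x = 1; randomize at x = 3.
```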
215. Example
We have the following situation
LRT             Rejection Region   Acceptance Region   α    β
0 ≤ λ < 1/2     all x              empty               1    0
1/2 < λ < 3/4   x = 0, 2, 3        x = 1               .8   .1
3/4 < λ < 4/3   x = 0, 2           x = 1, 3            .4   .4
4/3 < λ < 2     x = 0              x = 1, 2, 3         .1   .8
2 < λ ≤ ∞       empty              all x               0    1
216. Example
Assume λ = 3/4
Reject H0 if x = 0, 2
Accept H0 if x = 1
If x = 3, we randomize
i.e., reject H0 with probability a, 0 ≤ a ≤ 1; thus
α = p0(0) + p0(2) + a p0(3) = 0.4 + 0.4a
β = p1(1) + (1 − a) p1(3) = 0.1 + 0.3(1 − a)
221. The Graph of B(ϕ)
Thus, we have a value of B(ϕ) for each value of λ (the slide’s graph is not reproduced).
222. Thus, we have several tests
The classic one: the Minimax Test
The test that minimizes max{α, β}
Which
An admissible test with constant risk (α = β) is minimax.
Then
We have only one test where α = β = 0.4, namely 3/4 < λ < 4/3. Thus
We reject H0 when x = 0 or 2
We accept H0 when x = 1 or 3
228. Introduction
Suppose
γ is a real-valued function on the set N of states of nature.
Having observed X = x, we want to produce a number ψ(x) that is
close to γ(θ).
There are different ways of doing this
Maximum Likelihood (ML).
Expectation Maximization (EM).
Maximum A Posteriori (MAP).
233. Maximum Likelihood Estimation
Suppose the following
Let fθ be a density or probability function corresponding to the state of
nature θ.
Assume for simplicity that γ(θ) = θ
If X = x, the ML estimate of θ is ˆθ, the value of θ that maximizes
fθ(x).
236. Example
Let X have a binomial distribution
With parameters n and θ, 0 ≤ θ ≤ 1.
The pmf
pθ(x) = (n choose x) θ^x (1 − θ)^{n−x}, with x = 0, 1, 2, ..., n
Differentiate the log-likelihood with respect to θ and set it to zero:
∂/∂θ ln pθ(x) = 0
239. Example
We get
x/θ − (n − x)/(1 − θ) = 0 ⇒ ˆθ = x/n
Now, we can regard X as a sum of independent variables
X = X1 + X2 + ... + Xn
where each Xi is 1 with probability θ and 0 with probability 1 − θ.
We get finally
ˆθ(X) = (1/n) ∑_{i=1}^{n} Xi ⇒ lim_{n→∞} ˆθ(X) = E(Xi) = θ
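To close, a sketch of mine (assuming numpy; the variable names are my own) checking both conclusions: ˆθ = x/n maximizes the log-likelihood, and it approaches the true θ for large n:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true, n = 0.3, 10_000
x = rng.binomial(n, theta_true)        # one draw of X ~ Binomial(n, theta)

theta_hat = x / n                      # the ML estimate derived above
print(theta_hat)                       # ≈ 0.3 for large n

# The log-likelihood (dropping the binomial coefficient, which is
# constant in theta) is largest at theta_hat on a grid of candidates.
grid = np.linspace(0.01, 0.99, 99)
loglik = x * np.log(grid) + (n - x) * np.log(1 - grid)
print(grid[np.argmax(loglik)])         # ≈ theta_hat
```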