Learning from Examples
and
Learning Probabilistic Models
2017.07.06(Fri)
Junya Tanaka
Agenda
• Learning from Examples
oNonparametric Model
oEnsemble Learning
oPractical Machine Learning
• Learning Probabilistic Models
oStatistical Learning
oLearning with Complete Data
oLearning with Hidden Variables: The EM Algorithm
Learning from Examples
2017.07.06(Fri)
Junya Tanaka
1.Nonparametric Models
• Parametric Model
oA learning model that summarizes data with a set of parameters
of fixed size (independent of the number of training examples)
oNo matter how much data you throw at a parametric model, it won’t change
its mind about how many parameters it needs
oWhen data sets are small, it makes sense to have a strong
restriction on the allowable hypotheses, to avoid overfitting
• Nonparametric Model
oA learning model that cannot be characterized by a bounded
set of parameters
oThe simplest such method, table lookup, has the problem that it does not
generalize well
oeach hypothesis we generate simply retains within itself all of
the training examples and uses all of them to predict the next
example
Nearest neighbor models
• given a query xq, find the k examples that are nearest to xq
• Use the notation NN(k, xq) to denote the set of k nearest
neighbors
• To do classification
ofirst find NN(k, xq)
otake the plurality vote of the neighbors
• The figure shows the decision boundary of k-nearest-neighbors classification
for k = 1 and 5 on the earthquake data
o1-nearest-neighbors overfits
o5-nearest-neighbors gives a good fit
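A minimal sketch of the classification procedure described above (the toy data and function names are mine, not from the slides); it finds NN(k, xq) using Euclidean distance and takes the plurality vote:

import numpy as np
from collections import Counter

def nn(k, X, y, x_q):
    """Return the labels of the k training examples nearest to the query x_q."""
    dists = np.linalg.norm(X - x_q, axis=1)   # Euclidean (p = 2) distances
    nearest = np.argsort(dists)[:k]           # indices of the k nearest examples
    return y[nearest]

def knn_classify(k, X, y, x_q):
    """Plurality vote among the k nearest neighbors."""
    return Counter(nn(k, X, y, x_q)).most_common(1)[0][0]

# toy usage
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_classify(3, X, y, np.array([0.2, 0.1])))  # -> 0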
Distance
• The very word “nearest” implies a distance metric.
• Distances are measured with a Minkowski distance or Lp norm, defined below
op=2 is Euclidean distance
op=1 is Manhattan distance
oWith Boolean attribute values, the number of attributes on
which the two points differ is called the Hamming distance
• If we change the scale of measurement along one dimension i while keeping
the other dimensions the same, we get different nearest neighbors
• To avoid this, it is common to apply normalization to the
measurements
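For reference, the standard definition of the Minkowski distance (Lp norm) referred to above, in LaTeX notation:

L^p(\mathbf{x}_j, \mathbf{x}_q) = \Bigl( \sum_i |x_{j,i} - x_{q,i}|^p \Bigr)^{1/p}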
Curse of dimensionality
• Nearest neighbors works very well on low-dimensional data
• Computational cost becomes enormous as the number of dimensions increases
• In practice, the number of actual distance calculations is reduced with
ok-d trees
oLocality-sensitive hashing
Finding nearest neighbors with k-d trees
• k-d tree (k-dimensional tree)
oA balanced binary tree over data with an arbitrary number of
dimensions
• The construction of a k-d tree is similar to the construction of
a one-dimensional balanced binary tree
• Construction
oAs one moves down the tree, one cycles through the axes used
to select the splitting planes
oPoints are inserted by selecting the median of the points being
put into the subtree, with respect to their coordinates in the axis
being used to create the splitting plane.
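A minimal sketch of the construction just described (the helper names are mine; a real k-d tree would also need a nearest-neighbor search routine), assuming the points are given as tuples of floats:

from dataclasses import dataclass
from typing import Optional, List, Tuple

@dataclass
class Node:
    point: Tuple[float, ...]
    axis: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build_kdtree(points: List[Tuple[float, ...]], depth: int = 0) -> Optional[Node]:
    if not points:
        return None
    k = len(points[0])
    axis = depth % k                          # cycle through the axes
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                    # median along the current axis
    return Node(point=points[mid],
                axis=axis,
                left=build_kdtree(points[:mid], depth + 1),
                right=build_kdtree(points[mid + 1:], depth + 1))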
Locality-sensitive hashing
• Locality-sensitive hashing (LSH) reduces the
dimensionality of high-dimensional data
• LSH hashes input items so that similar items map to the
same “buckets” with high probability
• LSH differs from conventional and cryptographic hash
functions because it aims to maximize the probability of a
“collision” for similar items
• Locality-sensitive hashing has much in common with data
clustering and nearest neighbor search
Nonparametric regression
• Nonparametric approaches to regression
• (a) Connect the dots
oPiecewise linear nonparametric regression
• (b) 3-nearest-neighbors average
ok-nearest-neighbors regression
Nonparametric regression
• (c) 3-nearest-neighbors linear regression
oThis does a better job of capturing trends at the outliers, but is
still discontinuous
oWe are still left with the question of how to choose a good value for k
• (d) Locally weighted regression with a quadratic kernel of width k = 10
2.Ensemble Learning
• The idea is to select a collection, or ensemble, of hypotheses from the
hypothesis space and combine their predictions
• Ex.
oDuring cross-validation we might generate twenty different
decision trees, and have them vote on the best classification for
a new example
• The ensemble method provides a way to learn a much more
expressive class of hypotheses
• Example: three linear threshold hypotheses, each of which classifies
positively on the unshaded side; the ensemble classifies as positive any
example classified positively by all three
oThe resulting triangular region is a hypothesis not expressible in the
original hypothesis space
Boosting
• The most widely used ensemble method
• weighted training set
oIn such a training set, each example has an associated weight
wj ≥ 0.
oBoosting starts with wj =1 for all the examples
ogenerates the first hypothesis, h1
oThis hypothesis will classify some of the training examples
correctly and some incorrectly
oIn order for the next hypothesis to do better on the misclassified examples,
we increase their weights while decreasing the weights of the correctly
classified examples
oFrom this new weighted training set, we generate hypothesis h2
oThe process continues in this way until we have generated K hypotheses (a
sketch of this weighting scheme follows below)
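A minimal AdaBoost-style sketch of the reweighting scheme described above, using decision stumps from scikit-learn as the weak learner. This is an illustrative outline under my own naming, not the book's exact ADABOOST pseudocode; labels are assumed to be +1/-1:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, K):
    """Return a list of (hypothesis, weight) pairs; y must contain +1/-1 labels."""
    N = len(y)
    w = np.ones(N) / N                        # start with equal example weights
    ensemble = []
    for _ in range(K):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = w[pred != y].sum() / w.sum()
        if err >= 0.5:                        # weak learner no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)        # up-weight misclassified examples
        w /= w.sum()
        ensemble.append((h, alpha))
    return ensemble

def ensemble_predict(ensemble, X):
    """Weighted vote of the generated hypotheses."""
    return np.sign(sum(alpha * h.predict(X) for h, alpha in ensemble))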
Boosting
• How the boosting algorithm works
oEach shaded rectangle corresponds to an example
oThe height of each rectangle corresponds to that example’s weight
oThe checks and crosses indicate whether the example was
classified correctly by the current hypothesis
oThe size of the decision tree indicates the weight of that
hypothesis in the final ensemble.
3. Practical Machine Learning
• So far we have introduced a wide range of machine learning techniques, each
illustrated with simple learning tasks
• Consider two aspects of practical machine learning
ofinding algorithms capable of learning to recognize handwritten
digits and squeezing every last drop of predictive performance
out of them
opointing out that obtaining, cleaning, and representing the data
can be at least as important as algorithm engineering
Case study: Handwritten digit recognition
• A classic example of a machine learning problem
Learning Probabilistic Models
2017.07.06(Fri)
Junya Tanaka
Introduction
• We view learning as a form of uncertain reasoning from
observations
• Agents must first learn their probabilistic theories of the world from
experience
oThis chapter explains how they can do that, by formulating the
learning task itself as a process of probabilistic inference
oBayesian learning provides powerful solutions to the problems of noise,
overfitting, and optimal prediction
1. Statistical Learning
• The key concepts are data and hypotheses
oData are evidence
oInstantiations of some or all of the random variables describing
the domain
• Bayesian learning simply calculates the probability of each
hypothesis, given the data, and makes predictions
oPredictions are made by using all the hypotheses, weighted by
their probabilities
oLearning is reduced to probabilistic inference.
1. Statistical Learning
• Let D represent all the data, with observed value d; then the
probability of each hypothesis is obtained by Bayes’ rule
• Suppose we want to make a prediction about an unknown quantity X; then we
have the prediction rule shown below
• where we have assumed that each hypothesis determines a
probability distribution over X
• This equation shows that predictions are weighted averages
over the predictions of the individual hypotheses
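The two equations referenced on this slide, reconstructed in LaTeX from the textbook's standard formulation (α is a normalization constant):

P(h_i \mid \mathbf{d}) = \alpha \, P(\mathbf{d} \mid h_i) \, P(h_i)

P(X \mid \mathbf{d}) = \sum_i P(X \mid \mathbf{d}, h_i) \, P(h_i \mid \mathbf{d}) = \sum_i P(X \mid h_i) \, P(h_i \mid \mathbf{d})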
1. Statistical Learning
• Example
oA certain candy comes in two flavors: cherry (yum) and lime (ugh)
oThe manufacturer wraps each piece of candy in the same opaque wrapper,
regardless of flavor
oThe flavors are indistinguishable from the outside
oGiven a new bag of candy, the random variable H denotes the
type of the bag, with possible values h1 through h5
oAs candies are opened and inspected, data are revealed: D1, D2, . . . , DN,
where each Di is a random variable with possible values cherry and lime
oThe basic task faced by the agent is to predict the flavor of the
next piece of candy
1. Statistical Learning
• Apply Bayesian Learning to Candy Example
oThe prior distribution over h1, . . . , h5 is given by ⟨0.1, 0.2, 0.4,
0.2, 0.1⟩
oThe likelihood of the data is computed under the assumption that the
observations are i.i.d., so that P(d | hi) = Πj P(dj | hi)
oSuppose the bag is really an all-lime bag (h5) and the first 10 candies are
all lime; then P(d | h3) is 0.5^10, because half the candies in an h3 bag are
lime
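A small numeric sketch of this Bayesian update (the priors and per-bag lime probabilities are those given on these slides; the variable and function names are mine):

priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h1) ... P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | hi) for h1 ... h5

def posterior(n_limes):
    """P(hi | first n candies are all lime), by Bayes' rule."""
    unnorm = [p * q ** n_limes for p, q in zip(priors, p_lime)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def predict_next_lime(n_limes):
    """P(next candy is lime | first n candies are all lime)."""
    return sum(q * p for q, p in zip(p_lime, posterior(n_limes)))

print(posterior(10))           # h5 dominates after 10 limes in a row
print(predict_next_lime(10))   # close to 1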
1. Statistical Learning
• (a) Posterior probabilities P(hi | d1, . . . , dN)
oThe number of observations N ranges from 1 to 10, and each
observation is of a lime candy
oThe probabilities start out at their prior values, so h3 is initially the
most likely choice; after enough lime candies, h5 (the all-lime bag) dominates
• (b) Bayesian prediction P(dN+1 =lime | d1, . . . , dN)
opredicted probability that the next candy is lime
1. Statistical Learning
• The example shows that the Bayesian prediction eventually
agrees with the true hypothesis
• Bayesian prediction is optimal, whether the data set is small or large
• Given the hypothesis prior, any other prediction is expected
to be correct less often
2. Learning with Complete Data
• Density estimation
oThe general task of learning a probability model, given data that
are assumed to be generated from that model
• This section covers the simplest case, where we have
complete data
oData are complete when each data point contains values for
every variable in the probability model being learned
• Parameter learning
oFinding the numerical parameters for a probability model whose
structure is fixed
Maximum-likelihood parameter learning: Discrete models
• Suppose we buy a bag of lime and cherry candy whose lime–cherry proportions
are completely unknown
oThe parameter in this case, which we call θ, is the proportion of
cherry candies, and the hypothesis is hθ
oEach candy’s flavor is a random variable with values cherry and lime, where
the probability of cherry is θ
oNow suppose we unwrap N candies, of which c are cherries
and ℓ=N − c are limes
oThe likelihood of this particular data set is P(d | hθ) = θ^c (1 − θ)^ℓ
oThe maximum-likelihood hypothesis is given by the value of θ
that maximizes this expression
Maximum-likelihood parameter learning: Discrete models
• To find the maximum-likelihood value of θ, we differentiate L
with respect to θ and set the resulting expression to zero:
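For reference, the standard calculation, reconstructed in LaTeX from the likelihood P(d | hθ) = θ^c (1 − θ)^ℓ given above:

L(\mathbf{d} \mid h_\theta) = \log P(\mathbf{d} \mid h_\theta) = c \log\theta + \ell \log(1-\theta)

\frac{dL}{d\theta} = \frac{c}{\theta} - \frac{\ell}{1-\theta} = 0 \quad\Rightarrow\quad \theta = \frac{c}{c+\ell} = \frac{c}{N}

In other words, the maximum-likelihood hypothesis asserts that the proportion of cherries in the bag equals the observed proportion. The same recipe applies in general: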
• 1. Write down an expression for the likelihood of the data as
a function of the parameter(s).
• 2. Write down the derivative of the log likelihood with respect
to each parameter.
• 3. Find the parameter values such that the derivatives are
zero.
Naive Bayes models
• Most common Bayesian network model used in machine
learning
• the “class” variable C (which is to be predicted) is the root
and the “attribute” variables Xi are the leaves
• With observed attribute values x1, . . . , xn, the probability of
each class is given by
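The formula this slide refers to, reconstructed in LaTeX from the textbook's naive Bayes model (α is a normalization constant):

P(C \mid x_1, \ldots, x_n) = \alpha \, P(C) \prod_i P(x_i \mid C)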
Maximum-likelihood parameter learning: Continuous models
• Continuous variables are ubiquitous in real-world
applications
• The principles for maximum-likelihood learning are identical
in the continuous and discrete cases
• Simple case
olearning the parameters of a Gaussian density function
othe data are generated from a Gaussian density with mean μ and standard
deviation σ
oLet the observed values be x1, . . . , xN; then the log likelihood is given
below
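Reconstructed in LaTeX, the log likelihood and the resulting maximum-likelihood estimates (the standard results for a univariate Gaussian):

L = \sum_{j=1}^{N} \log \frac{1}{\sigma\sqrt{2\pi}} e^{-(x_j-\mu)^2/(2\sigma^2)} = N\bigl(-\log\sqrt{2\pi} - \log\sigma\bigr) - \sum_{j=1}^{N} \frac{(x_j-\mu)^2}{2\sigma^2}

Setting the derivatives with respect to μ and σ to zero gives

\mu = \frac{1}{N}\sum_{j} x_j, \qquad \sigma = \sqrt{\frac{1}{N}\sum_{j} (x_j-\mu)^2}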
Maximum-likelihood parameter learning: Continuous models
• Now consider a linear Gaussian model with one continuous
parent X and a continuous child Y
• Y has a Gaussian distribution whose mean depends linearly
on the value of X and whose standard deviation is fixed
• To learn the conditional distribution P(Y |X), we can
maximize the conditional likelihood
• Maximizing the log likelihood with respect to these parameters is the same
as minimizing the numerator of the exponent, the sum of squared errors
Σj (yj − (θ1xj + θ2))^2
• Now we can understand why: minimizing the sum of squared
errors gives the maximum-likelihood straight-line model,
provided that the data are generated with Gaussian noise of
fixed variance
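In LaTeX, the conditional model and the quantity being minimized (the standard linear Gaussian formulation):

P(y \mid x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\Bigl(-\frac{\bigl(y - (\theta_1 x + \theta_2)\bigr)^2}{2\sigma^2}\Bigr)

E = \sum_{j=1}^{N} \bigl(y_j - (\theta_1 x_j + \theta_2)\bigr)^2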
Maximum-likelihood parameter learning: Continuous models
Bayesian parameter learning
• The hypothesis prior
oThe Bayesian approach to parameter learning starts by defining
a prior probability distribution over the possible hypotheses
• In the Bayesian view, θ is the (unknown) value of a random
variable Θ that defines the hypothesis space
• The hypothesis prior is just the prior distribution P(Θ)
• If the parameter θ can take any value between 0 and 1, then P(Θ) must be a
continuous distribution that is nonzero only between 0 and 1 and that
integrates to 1
• the uniform density is a member of the family of beta
distributions
• Each beta distribution is defined by two hyperparameters a
and b such that
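The definition referred to above, reconstructed in LaTeX (α is a normalization constant); the mean of the distribution is a/(a + b), and larger values of a + b make the distribution more peaked:

\mathrm{beta}[a,b](\theta) = \alpha \, \theta^{a-1} (1-\theta)^{b-1}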
Bayesian parameter learning
• Figure shows what the distribution looks like for various
values of a and b
Bayesian parameter learning
• Let’s see how updating works. Suppose we observe a cherry candy; then the
posterior is P(θ | D1 = cherry) ∝ θ · beta[a, b](θ) = beta[a + 1, b](θ), so
the beta family is closed under updating (it is the conjugate prior for this
model)
Density estimation with nonparametric models
• The task of nonparametric density estimation is typically done in
continuous domains, such as the one shown in the figure
• The figure shows a probability density function on a space
defined by two continuous variables
• we see a sample of data points from this density function
• Can we recover the model from the samples?
Density estimation with nonparametric models
• Consider k-nearest-neighbors models
oGiven a sample of data points, to estimate the unknown
probability density at a query point x we can simply measure
the density of the data points in the neighborhood of x.
oFigure shows two query points (small squares)
oThe central circle is large, meaning there is a low density there
oThe circle on the right is small, meaning there is a high density
there
Density estimation with nonparametric models
• Three plots of density estimation using k-nearest-neighbors
for different values of k applied to the data in previous Fig
o(b) is about right (k is medium, k = 10), while (a) is too spiky (k is
too small, k = 3) and (c) is too smooth (k is too big, k = 40)
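A minimal sketch of the idea in two dimensions, using the common k-nearest-neighbors density estimator p(x) ≈ k / (N · area of the smallest disc around x containing k samples); the slides describe this only qualitatively, and the function name is mine:

import numpy as np

def knn_density(data, x_q, k):
    """Estimate the density at query point x_q from a 2-D sample array data."""
    dists = np.linalg.norm(data - x_q, axis=1)
    r_k = np.sort(dists)[k - 1]            # radius reaching the k-th nearest point
    area = np.pi * r_k ** 2                # area of the enclosing disc (2-D case)
    return k / (len(data) * area)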
3. Learning with Hidden Variables: The EM Algorithm
• Many real-world problems have hidden variables (sometimes
called latent variables)
• Hidden variables are important, but they do complicate the
learning problem
oIn the textbook’s medical diagnosis example, for instance, HeartDisease is a
hidden variable: it is not obvious how to learn its conditional distribution
given its parents, because we do not know the value of HeartDisease in each
case; the same problem arises in learning the distributions for the symptoms
• The expectation–maximization (EM) algorithm can solve this problem
Unsupervised clustering: Learning mixtures of Gaussians
• Unsupervised clustering is the problem of discerning multiple categories in
a collection of objects
• Example
osuppose we record the spectra of a hundred thousand stars
ohow many types and what are their characteristics?
oThe data points might correspond to stars
oThe attributes might correspond to spectral intensities at two
particular frequencies.
• Next, we need to understand what kind of probability
distribution might have generated the data
• Such a distribution is called a mixture distribution; it has k components,
each of which is a distribution in its own right
Unsupervised clustering: Learning mixtures of Gaussians
• Let the random variable C denote the component, with
values 1, . . . , k; then the mixture distribution is given by
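The mixture distribution referred to above, reconstructed in LaTeX (x denotes the attribute values of a data point):

P(\mathbf{x}) = \sum_{i=1}^{k} P(C=i) \, P(\mathbf{x} \mid C=i)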
Learning hidden Markov models
• A hidden Markov model can be represented by a dynamic Bayes net with a
single discrete state variable
• the problem is to learn the transition probabilities from a set
of observation sequences
• To estimate the transition probability from state i to state j, we simply
calculate the expected proportion of times that the system undergoes a
transition to state j when in state i (see below):
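Roughly, in the book's notation (where N̂ denotes an expected count computed by inference over the hidden state sequence), the update is:

\hat{\theta}_{ij} \leftarrow \frac{\sum_t \hat{N}(X_{t+1}=j,\, X_t=i)}{\sum_t \hat{N}(X_t=i)}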
The general form of the EM algorithm
• Let x be all the observed values in all the examples, let Z
denote all the hidden variables for all the examples, and let θ
be all the parameters for the probability model.
• Then the EM algorithm iterates the update shown below
• The E-step is the computation of the summation, which is the expectation of
the log likelihood of the “completed” data with respect to the distribution
P(Z = z | x, θ^(i))
• The M-step is the maximization of this expected log
likelihood with respect to the parameters.
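The general update referred to above, reconstructed in LaTeX (L denotes the log likelihood of the completed data):

\theta^{(i+1)} = \arg\max_{\theta} \sum_{\mathbf{z}} P(\mathbf{Z}=\mathbf{z} \mid \mathbf{x}, \theta^{(i)}) \; L(\mathbf{x}, \mathbf{Z}=\mathbf{z} \mid \theta)

To make the E-step and M-step concrete, here is a compact sketch of EM for a two-component one-dimensional Gaussian mixture (a toy example of my own, not from the slides):

import numpy as np

def em_gmm_1d(x, n_iter=50):
    # crude initialization of mixing weights, means, and variances
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()]) + 1e-6

    def gauss(x, m, v):
        return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

    for _ in range(n_iter):
        # E-step: responsibilities P(C = i | x_j, current parameters)
        r = np.stack([w[i] * gauss(x, mu[i], var[i]) for i in range(2)])
        r /= r.sum(axis=0)
        # M-step: re-estimate parameters from the expected "completed" data
        n = r.sum(axis=1)
        w = n / len(x)
        mu = (r * x).sum(axis=1) / n
        var = (r * (x - mu[:, None]) ** 2).sum(axis=1) / n + 1e-6
    return w, mu, var

# toy usage: two clusters around 0 and 5
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 300)])
print(em_gmm_1d(data))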
Editor's Notes
1. Illustration of the increased expressive power obtained by ensemble learning. We take three linear threshold hypotheses, each of which classifies positively on the unshaded side, and classify as positive any example classified positively by all three. The resulting triangular region is a hypothesis not expressible in the original hypothesis space.
2. Boosting builds a strong learner out of weak learners (learners only slightly better than random guessing). The basic idea is to train repeatedly while re-weighting the training data: the weights of examples misclassified in the previous round are increased and the learner is retrained. Repeating this process yields a high-performance learner. The early learners (classifiers) target the whole training set, while the later learners specialize in the examples that are hard to classify.
3. Notice that the probabilities start out at their prior values, so h3 is initially the most likely choice and remains so after 1 lime candy is unwrapped. After 2 lime candies are unwrapped, h4 is most likely; after 3 or more, h5 (the dreaded all-lime bag) is the most likely. After 10 in a row, we are fairly certain of our fate.
4. The mean of the distribution is a / (a + b), so larger values of a suggest a belief that Θ is closer to 1 than to 0. Larger values of a + b make the distribution more peaked (greater certainty about the value of Θ). Thus the beta family provides a useful range of possibilities for the hypothesis prior. In addition to this flexibility, the beta family has another nice property: if Θ has a prior beta[a, b], then after a data point is observed the posterior distribution for Θ is also a beta distribution. In other words, the beta family is closed under update; it is called the conjugate prior for the family of distributions over a Boolean variable. Let's see how this works.
5. The unsupervised clustering problem is to recover a mixture model like the one in Figure 20.11(a) from raw data like that in Figure 20.11(b). Clearly, if we knew which component generated each data point, it would be easy to recover the component Gaussians: we would just select all the data points from a given component and fit the parameters of a Gaussian to them using Equation (20.4) (or rather its multivariate version, p. 809). Conversely, if we knew the parameters of each component, we could assign each data point to a component, at least in a probabilistic sense. The problem is that we know neither the assignments nor the parameters.