The term "machine learning" was coined in 1959 by Arthur Samuel, an American pioneer in computer gaming and artificial intelligence, who said that it "gives computers the ability to learn without being explicitly programmed." In 1997, Tom Mitchell gave a "well-posed" mathematical definition: "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."
Machine learning is needed for tasks that are too complex for humans to code directly. Instead, we provide a large amount of data to a machine learning algorithm and let it explore that data, searching for a model that achieves what the programmers set out to achieve.
These slides try to communicate through pictures rather than technical jargon. If you don't understand any part, let me know.
An intuitive introduction, with easy-to-understand explanations of fundamental concepts in machine learning and neural networks. No prior machine learning or computing experience is required.
1. CSC2515 Fall 2007
Introduction to Machine Learning
Lecture 1: What is Machine Learning?
All lecture slides will be available as .ppt, .ps, & .htm at
www.cs.toronto.edu/~hinton
Many of the figures are provided by Chris Bishop
from his textbook: "Pattern Recognition and Machine Learning"
2. What is Machine Learning?
• It is very hard to write programs that solve problems like
recognizing a face.
– We don’t know what program to write because we don’t
know how our brain does it.
– Even if we had a good idea about how to do it, the
program might be horrendously complicated.
• Instead of writing a program by hand, we collect lots of
examples that specify the correct output for a given input.
• A machine learning algorithm then takes these examples
and produces a program that does the job.
– The program produced by the learning algorithm may
look very different from a typical hand-written program. It
may contain millions of numbers.
– If we do it right, the program works for new cases as well
as the ones we trained it on.
3. A classic example of a task that requires machine
learning: It is very hard to say what makes a 2
4. Some more examples of tasks that are best
solved by using a learning algorithm
• Recognizing patterns:
– Facial identities or facial expressions
– Handwritten or spoken words
– Medical images
• Generating patterns:
– Generating images or motion sequences (demo)
• Recognizing anomalies:
– Unusual sequences of credit card transactions
– Unusual patterns of sensor readings in a nuclear
power plant or unusual sound in your car engine.
• Prediction:
– Future stock prices or currency exchange rates
5. Some web-based examples of machine learning
• The web contains a lot of data. Tasks with very big
datasets often use machine learning
– especially if the data is noisy or non-stationary.
• Spam filtering, fraud detection:
– The enemy adapts so we must adapt too.
• Recommendation systems:
– Lots of noisy data. Million dollar prize!
• Information retrieval:
– Find documents or images with similar content.
• Data Visualization:
– Display a huge database in a revealing way (demo)
6. Displaying the structure of a set of documents
using Latent Semantic Analysis (a form of PCA)
Each document is converted
to a vector of word counts.
This vector is then mapped to
two coordinates and displayed
as a colored dot. The colors
represent the hand-labeled
classes.
When the documents are laid
out in 2-D, the classes are not
used. So we can judge how
good the algorithm is by
seeing if the classes are
separated.
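The document-map idea above can be sketched in a few lines of NumPy: convert each document to a word-count vector, then project onto the top two principal components via the SVD (the PCA step behind LSA). The tiny vocabulary and counts here are invented for illustration.

```python
import numpy as np

# Hypothetical word-count vectors: 4 tiny "documents" (rows) over a
# 5-word vocabulary (columns). Documents 0-1 and 2-3 share vocabulary.
counts = np.array([
    [4, 0, 1, 0, 0],
    [3, 1, 0, 0, 0],
    [0, 0, 1, 4, 2],
    [0, 1, 0, 3, 3],
], dtype=float)

# Centre the counts, then project onto the top two principal
# components via the SVD -- the PCA step behind this kind of map.
centred = counts - counts.mean(axis=0)
U, S, Vt = np.linalg.svd(centred, full_matrices=False)
coords_2d = centred @ Vt[:2].T   # one (x, y) dot per document

print(coords_2d.shape)  # (4, 2)
```

Documents with similar word usage land near each other in the 2-D map, so hand-labeled classes can be used afterwards to judge the layout, just as the slide describes.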
8. Machine Learning & Symbolic AI
• Knowledge Representation works with facts/assertions and
develops rules of logical inference. The rules can handle
quantifiers. Learning and uncertainty are usually ignored.
• Expert Systems used logical rules or conditional
probabilities provided by “experts” for specific domains.
• Graphical Models treat uncertainty properly and allow
learning (but they often ignore quantifiers and use a fixed
set of variables)
– A set of logical assertions → values of a subset of the
variables, plus local models of the probabilistic
interactions between variables.
– Logical inference → probability distributions over subsets
of the unobserved variables (or individual ones).
– Learning = refining the local models of the interactions.
9. Machine Learning & Statistics
• A lot of machine learning is just a rediscovery of things
that statisticians already knew. This is often disguised by
differences in terminology:
– Ridge regression = weight-decay
– Fitting = learning
– Held-out data = test data
• But the emphasis is very different:
– A good piece of statistics: Clever proof that a
relatively simple estimation procedure is
asymptotically unbiased.
– A good piece of machine learning: Demonstration that
a complicated algorithm produces impressive results
on a specific task.
• Data-mining: Using very simple machine learning
techniques on very large databases because computers
are too slow to do anything more interesting with ten
billion examples.
10. A spectrum of machine learning tasks
Statistics ---------------------- Artificial Intelligence
At the statistics end:
• Low-dimensional data (e.g. less than 100 dimensions)
• Lots of noise in the data
• There is not much structure in the data, and what structure
there is can be represented by a fairly simple model.
• The main problem is distinguishing true structure from noise.
At the AI end:
• High-dimensional data (e.g. more than 100 dimensions)
• The noise is not sufficient to obscure the structure in the
data if we process it right.
• There is a huge amount of structure in the data, but the
structure is too complicated to be represented by a simple model.
• The main problem is figuring out a way to represent the
complicated structure that allows it to be learned.
11. Types of learning task
• Supervised learning
– Learn to predict output when given an input vector
• Who provides the correct answer?
• Reinforcement learning
– Learn action to maximize payoff
• Not much information in a payoff signal
• Payoff is often delayed
– Reinforcement learning is an important area that will not
be covered in this course.
• Unsupervised learning
– Create an internal representation of the input e.g. form
clusters; extract features
• How do we know if a representation is good?
– This is the new frontier of machine learning because
most big datasets do not come with labels.
12. Hypothesis Space
• One way to think about a supervised learning machine is as a
device that explores a “hypothesis space”.
– Each setting of the parameters in the machine is a different
hypothesis about the function that maps input vectors to output
vectors.
– If the data is noise-free, each training example rules out a region
of hypothesis space.
– If the data is noisy, each training example scales the posterior
probability of each point in the hypothesis space in proportion to
how likely the training example is given that hypothesis.
• The art of supervised machine learning is in:
– Deciding how to represent the inputs and outputs
– Selecting a hypothesis space that is powerful enough to
represent the relationship between inputs and outputs but simple
enough to be searched.
13. Searching a hypothesis space
• The obvious method is to first formulate a loss function
and then adjust the parameters to minimize the loss
function.
– This allows the optimization to be separated from the
objective function that is being optimized.
• Bayesians do not search for a single set of parameter
values that do well on the loss function.
– They start with a prior distribution over parameter
values and use the training data to compute a
posterior distribution over the whole hypothesis
space.
14. Some Loss Functions
• Squared difference between actual and target real-
valued outputs.
• Number of classification errors
– Problematic for optimization because the derivative is
not smooth.
• Negative log probability assigned to the correct answer.
– This is usually the right function to use.
– In some cases it is the same as squared error
(regression with Gaussian output noise)
– In other cases it is very different (classification with
discrete classes needs cross-entropy error)
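The three loss functions above can be computed directly; the targets and predictions below are made-up numbers for illustration.

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0])   # correct answers (binary classes)
y_pred = np.array([0.9, 0.2, 0.6])   # model's predicted P(class = 1)

# 1. Squared difference between actual and target outputs.
squared_loss = np.sum((y_true - y_pred) ** 2)

# 2. Number of classification errors at a 0.5 threshold.
#    Piecewise constant, so its derivative is useless for optimization.
errors = int(np.sum((y_pred > 0.5) != (y_true > 0.5)))

# 3. Negative log probability assigned to the correct answer
#    (cross-entropy for discrete classes).
p_correct = np.where(y_true == 1.0, y_pred, 1.0 - y_pred)
nll = -np.sum(np.log(p_correct))

print(squared_loss, errors, nll)
```

Note how the third prediction (0.6) contributes nothing to the error count but dominates the negative log probability: the smooth loss still "sees" how unconfident the model is.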
15. Generalization
• The real aim of supervised learning is to do well on test
data that is not known during learning.
• Choosing the values for the parameters that minimize
the loss function on the training data is not necessarily
the best policy.
• We want the learning machine to model the true
regularities in the data and to ignore the noise in the
data.
– But the learning machine does not know which
regularities are real and which are accidental quirks of
the particular set of training examples we happen to
pick.
• So how can we be sure that the machine will generalize
correctly to new data?
16. Trading off the goodness of fit against the
complexity of the model
• It is intuitively obvious that you can only expect a model to
generalize well if it explains the data surprisingly well given
the complexity of the model.
• If the model has as many degrees of freedom as the data, it
can fit the data perfectly but so what?
• There is a lot of theory about how to measure the model
complexity and how to control it to optimize generalization.
– Some of this “learning theory” will be covered later in the
course, but it requires a whole course on learning theory
to cover it properly (Toni Pitassi sometimes offers such a
course).
17. A sampling assumption
• Assume that the training examples are drawn
independently from the set of all possible examples.
• Assume that each time a training example is drawn, it
comes from an identical distribution (i.i.d)
• Assume that the test examples are drawn in exactly the
same way – i.i.d. and from the same distribution as the
training data.
• These assumptions make it very unlikely that a strong
regularity in the training data will be absent in the test
data.
– Can we say something more specific?
18. The probabilistic guarantee
E_test ≤ E_train + sqrt( ( h (log(2N/h) + 1) + log(4/p) ) / N )
where N = size of training set
h = VC dimension of the model class = complexity
p = upper bound on probability that this bound fails
So if we train models with different complexity, we should
pick the one that minimizes this bound.
Actually, this is only sensible if we think the bound is
fairly tight, which it usually isn't. The theory provides
insight, but in practice we still need some witchcraft.
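Assuming the usual form of the VC bound (training error plus a complexity penalty that grows with the VC dimension h and shrinks with the training-set size N), a small helper makes the trade-off concrete; the error rates and dimensions below are invented.

```python
import math

def vc_bound(e_train, n, h, p=0.05):
    """Upper bound on test error from the VC generalization bound,
    assuming the common form: E_train plus a penalty that grows
    with the VC dimension h and shrinks with the training size n."""
    penalty = math.sqrt((h * (math.log(2 * n / h) + 1) + math.log(4 / p)) / n)
    return e_train + penalty

# With the same training set, a higher-capacity model class pays a
# much larger complexity penalty (error rates here are invented):
simple = vc_bound(e_train=0.10, n=10_000, h=10)
complex_ = vc_bound(e_train=0.05, n=10_000, h=1_000)
print(simple, complex_)  # the simpler class wins despite worse training error
```

Picking the model class that minimizes this bound is exactly the recipe the slide describes, and also why it warns that the advice is only useful when the bound is reasonably tight.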
19. A simple example: Fitting a polynomial
• The green curve is the true
function (which is not a
polynomial)
• The data points are uniform in
x but have noise in y.
• We will use a loss function
that measures the squared
error in the prediction of y(x)
from x. The loss for the red
polynomial is the sum of the
squared vertical errors.
from Bishop
20. Some fits to the data: which is best?
from Bishop
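A quick sketch of the experiment, assuming Bishop's sin(2πx) target (the true curve is not stated in these slides): the training loss always falls as the polynomial order grows, even when the high-order fit is just chasing noise, which is why training loss alone cannot answer "which fit is best?".

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 noisy samples of a smooth, non-polynomial target; sin(2*pi*x)
# is assumed here, echoing Bishop's figures (not stated in the slides).
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Training loss (sum of squared errors) for increasing polynomial order:
losses = {}
for degree in (0, 1, 3, 9):
    coeffs = np.polyfit(x, t, degree)
    losses[degree] = float(np.sum((t - np.polyval(coeffs, x)) ** 2))

print(losses)  # degree 9 interpolates all 10 points: near-zero training loss
```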
21. A simple way to reduce model complexity
• If we penalize polynomials that have big values for their
coefficients, we will get less wiggly solutions:
Ẽ(w) = (1/2) Σ_{n=1..N} { y(x_n, w) − t_n }² + (λ/2) ||w||²
where λ is the regularization parameter, t_n is the target
value, and Ẽ is the penalized loss function.
from Bishop
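For a polynomial written with a design matrix X (columns are powers of x), this penalized loss has the closed-form minimizer (XᵀX + λI)⁻¹Xᵀt. A minimal sketch with invented data, comparing coefficient norms with and without the penalty:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # invented data

# Design matrix for a degree-9 polynomial: columns x^0 ... x^9.
X = np.vander(x, 10, increasing=True)

def ridge_fit(X, t, lam):
    """Closed-form minimizer of 0.5*sum((X w - t)^2) + (lam/2)*||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ t)

w_free = np.linalg.lstsq(X, t, rcond=None)[0]  # unpenalized fit
w_reg = ridge_fit(X, t, lam=1e-3)              # penalized fit

# The penalty shrinks the coefficients, giving a less wiggly curve:
print(np.linalg.norm(w_free), np.linalg.norm(w_reg))
```

The unpenalized degree-9 fit interpolates the noise with huge, oscillating coefficients; the penalized fit keeps them small, which is exactly the "less wiggly solutions" the slide promises.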
24. Using a validation set
• Divide the total dataset into three subsets:
– Training data is used for learning the
parameters of the model.
– Validation data is not used for learning but is
used for deciding what type of model and
what amount of regularization works best.
– Test data is used to get a final, unbiased
estimate of how well the network works. We
expect this estimate to be worse than on the
validation data.
• We could then re-divide the total dataset to get
another unbiased estimate of the true error rate.
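A minimal sketch of the three-way split; the 60/20/20 proportions are an arbitrary but common choice, not something the slides specify.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
indices = rng.permutation(n)   # shuffle before splitting

# 60% / 20% / 20% is an arbitrary but common choice of proportions.
train_idx = indices[:600]   # for learning the parameters
val_idx = indices[600:800]  # for choosing model type / regularization
test_idx = indices[800:]    # for the final, unbiased error estimate

print(len(train_idx), len(val_idx), len(test_idx))  # 600 200 200
```

Shuffling before the split matters: it is what makes each subset look like an i.i.d. sample of the whole dataset, which the later sampling-assumption slide relies on.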
25. The Bayesian framework
• The Bayesian framework assumes that we always
have a prior distribution for everything.
– The prior may be very vague.
– When we see some data, we combine our prior
distribution with a likelihood term to get a posterior
distribution.
– The likelihood term takes into account how
probable the observed data is given the parameters
of the model.
• It favors parameter settings that make the data likely.
• It fights the prior
• With enough data the likelihood terms always win.
26. A coin tossing example
• Suppose we know nothing about coins except that each
tossing event produces a head with some unknown
probability p and a tail with probability 1-p. Our model of
a coin has one parameter, p.
• Suppose we observe 100 tosses and there are 53
heads. What is p?
• The frequentist answer: Pick the value of p that makes
the observation of 53 heads and 47 tails most probable.
P(D | p) = p^53 (1 − p)^47   (probability of a particular sequence)
dP(D | p)/dp = 53 p^52 (1 − p)^47 − 47 p^53 (1 − p)^46
= 0  if  p = 0.53
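The frequentist answer can also be checked numerically, by scanning candidate values of p and keeping the one that maximizes the (log) likelihood of 53 heads and 47 tails:

```python
import numpy as np

heads, tails = 53, 47

# Scan a grid of candidate values of p and keep the maximizer.
p_grid = np.linspace(0.001, 0.999, 999)
log_lik = heads * np.log(p_grid) + tails * np.log(1 - p_grid)

p_ml = p_grid[np.argmax(log_lik)]
print(p_ml)  # the maximum sits at heads / (heads + tails) = 0.53
```

Working with the log likelihood avoids multiplying a hundred small probabilities together, and it has the same maximizer because log is monotonic.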
27. Some problems with picking the parameters
that are most likely to generate the data
• What if we only tossed the coin once and we got
1 head?
– Is p=1 a sensible answer?
• Surely p=0.5 is a much better answer.
• Is it reasonable to give a single answer?
– If we don’t have much data, we are unsure
about p.
– Our computations of probabilities will work
much better if we take this uncertainty into
account.
28. Using a distribution over parameter values
• Start with a prior distribution
over p. In this case we used a
uniform distribution.
• Multiply the prior probability of
each parameter value by the
probability of observing a head
given that value.
• Then scale up all of the
probability densities so that
their integral comes to 1. This
gives the posterior distribution.
[Figure: probability density over p. The uniform prior (area = 1)
is multiplied by the likelihood of a head, then renormalized so
the posterior also has area = 1.]
29. Let's do it again: Suppose we get a tail
• Start with a prior
distribution over p.
• Multiply the prior
probability of each
parameter value by the
probability of observing a
tail given that value.
• Then renormalize to get
the posterior distribution.
Look how sensible it is!
[Figure: probability density over p. The posterior after one head
is multiplied by the likelihood of a tail and renormalized.]
30. Let's do it another 98 times
• After 53 heads and 47
tails we get a very
sensible posterior
distribution that has its
peak at 0.53 (assuming a
uniform prior).
[Figure: posterior density over p after 53 heads and 47 tails,
peaked at 0.53.]
32. A cheap trick to avoid computing the
posterior probabilities of all weight vectors
• Suppose we just try to find the most probable
weight vector.
– We can do this by starting with a random
weight vector and then adjusting it in the
direction that improves p( W | D ).
• It is easier to work in the log domain. If we want
to minimize a cost we use negative log
probabilities:
p(W | D) = p(D | W) p(W) / p(D)
Cost = − log p(W | D) = − log p(D | W) − log p(W) + log p(D)
33. Why we maximize sums of log probs
• We want to maximize the product of the probabilities of
the outputs on the training cases
– Assume the output errors on different training cases,
c, are independent.
• Because the log function is monotonic, it does not
change where the maxima are. So we can maximize
sums of log probabilities
p(D | W) = Π_c p(d_c | W)
log p(D | W) = Σ_c log p(d_c | W)
34. An even cheaper trick
• Suppose we completely ignore the prior over
weight vectors
– This is equivalent to giving all possible weight
vectors the same prior probability density.
• Then all we have to do is to maximize:
• This is called maximum likelihood learning. It is
very widely used for fitting models in statistics.
log p(D | W) = Σ_c log p(d_c | W)
35. Supervised Maximum Likelihood Learning
• Minimizing the squared
residuals is equivalent to
maximizing the log
probability of the correct
answer under a Gaussian
centered at the model’s
guess.
d_c = the correct answer
y_c = the model's estimate of the most probable value
y_c = f(input_c, W)
p(output = d_c | input_c, W) = (1 / sqrt(2πσ²)) e^( −(d_c − y_c)² / 2σ² )
− log p(output = d_c | input_c, W) = k + (d_c − y_c)² / 2σ²
36. Supervised Maximum Likelihood Learning
• Finding a set of weights, W, that minimizes the
squared errors is exactly the same as finding a W
that maximizes the log probability that the model
would produce the desired outputs on all the
training cases.
– We implicitly assume that zero-mean Gaussian
noise is added to the model’s actual output.
– We do not need to know the variance of the
noise because we are assuming it’s the same
in all cases. So it just scales the squared error.
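The equivalence can be checked on random numbers: with a fixed noise variance, the Gaussian negative log probability is just a constant plus the squared error divided by 2σ². The data and the value of σ below are invented.

```python
import numpy as np

rng = np.random.default_rng(7)
d = rng.normal(size=20)                   # correct answers (invented)
y = d + rng.normal(scale=0.5, size=20)    # model's guesses (invented)
sigma = 1.3                               # assumed constant noise std dev

squared_error = np.sum((d - y) ** 2)

# Negative log probability of the answers under Gaussians centred at
# the model's guesses: a constant k plus the scaled squared error.
k = d.size * 0.5 * np.log(2 * np.pi * sigma ** 2)
nll = k + squared_error / (2 * sigma ** 2)

# The two objectives differ only by k and the positive factor
# 1 / (2 * sigma^2), so they rank weight settings identically.
print(squared_error, nll)
```

This is why the slide can say the variance never needs to be known: changing σ rescales and shifts the objective but never changes which W minimizes it.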