Maximum Likelihood
Likelihood
The likelihood is the probability of the data given the
model.
If we flip a coin and get a head and we think the coin is
unbiased, then the probability of observing this head is 0.5.
If we think the coin is biased so that we expect to get a head
80% of the time, then the likelihood of observing this datum (a
head) is 0.8.
The likelihood of making some observation is entirely
dependent on the model that underlies our assumption.
The datum has not changed, our model has. Therefore under
the new model the likelihood of observing the datum has
changed.
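The coin example can be written out as a small sketch. The function name `likelihood_of_flips` and the 'H'/'T' encoding are illustrative assumptions, not part of the slides; the two probabilities are the ones quoted above.

```python
def likelihood_of_flips(flips, p_head):
    """Probability of an observed series of flips ('H' or 'T') given P(head)."""
    likelihood = 1.0
    for flip in flips:
        # Each flip contributes P(head) or P(tail) = 1 - P(head).
        likelihood *= p_head if flip == "H" else 1.0 - p_head
    return likelihood


# The same datum (a single head) evaluated under two different models.
print(likelihood_of_flips("H", 0.5))  # unbiased-coin model
print(likelihood_of_flips("H", 0.8))  # biased-coin model
```

The datum is identical in both calls; only the model parameter changes.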
Maximum Likelihood (ML)
ML assumes an explicit model of sequence evolution. This is
justifiable, since molecular sequence data can be shown to
have arisen according to a stochastic process.
ML attempts to answer the question:
What is the probability that I would observe these data (a
multiple sequence alignment) given a particular model of
evolution (a tree and a process)?
Likelihood calculations
In molecular phylogenetics, the data are an alignment of sequences
We optimize parameters and branch lengths to get the maximum likelihood
Each site has a likelihood
The total likelihood is the product of the site likelihoods
The maximum likelihood tree is the tree topology that gives the highest
(optimized) likelihood under the given model.
We use reversible models, so the position of the root does not matter.
What is the probability of observing a G nucleotide?
If we have a DNA sequence of 1 nucleotide in length and the identity of this
nucleotide is G, what is the likelihood that we would observe this G?
In the same way as the coin-flipping observation, the likelihood of observing
this G is dependent on the model of sequence evolution that is thought to
underlie the data.
Model 1: frequency of G = 0.4 => likelihood(G) = 0.4
Model 2: frequency of G = 0.1 => likelihood(G) = 0.1
Model 3: frequency of G = 0.25 => likelihood(G) = 0.25
What about longer sequences?
If we consider a gene of length 2:
gene 1: GA
The probability of observing this gene is the product of the
probabilities of observing each character.
Model: frequency of G = 0.4, frequency of A = 0.15
p(G) = 0.4, p(A) = 0.15
Likelihood (GA) = 0.4 x 0.15 = 0.06
…or even longer sequences?
gene 1: GACTAGCTAGACAGATACGAATTAC
Model: simple base frequency model
p(A)=0.15; p(C)=0.2; p(G)=0.4; p(T)=0.25;
(the sum of all probabilities must equal 1)
Likelihood (gene 1) = 0.000000000000000018452813
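The product over characters can be sketched in a few lines. The helper name `sequence_likelihood` is an assumption made for illustration; the sequence and base frequencies are those given above.

```python
import math


def sequence_likelihood(sequence, base_freqs):
    """Likelihood of a sequence under a model with independent base frequencies."""
    # Multiply the model probability of each observed character.
    return math.prod(base_freqs[base] for base in sequence)


GENE_1 = "GACTAGCTAGACAGATACGAATTAC"
MODEL = {"A": 0.15, "C": 0.2, "G": 0.4, "T": 0.25}

print(sequence_likelihood("GA", MODEL))    # the length-2 example
print(sequence_likelihood(GENE_1, MODEL))  # the length-25 example
```

Because every factor is below 1, the product shrinks quickly as the sequence grows, which is one reason log-likelihoods are preferred in practice.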
Note about models
You might notice that our model of base frequency is not the
optimal model for our observed data.
If we had used the following model
p(A)=0.4; p(C)=0.2; p(G)=0.2; p(T)=0.2;
the likelihood of observing the gene would be
L (gene 1) = 0.000000000000003435973837 (new model)
compared with
L (gene 1) = 0.000000000000000018452813 (original model)
The datum has not changed, our model has. Therefore under
the new model the likelihood of observing the datum has
changed.
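For a pure base-frequency model, the frequencies that maximize the likelihood are the observed proportions of each base, and counting the bases in gene 1 gives exactly the second model above. A minimal sketch, with `empirical_frequencies` and `sequence_likelihood` as illustrative helper names:

```python
import math
from collections import Counter


def sequence_likelihood(sequence, base_freqs):
    """Likelihood of a sequence under independent base frequencies."""
    return math.prod(base_freqs[base] for base in sequence)


def empirical_frequencies(sequence):
    """Observed proportion of each base; the ML estimate for this simple model."""
    counts = Counter(sequence)
    return {base: counts[base] / len(sequence) for base in "ACGT"}


GENE_1 = "GACTAGCTAGACAGATACGAATTAC"
FIRST_MODEL = {"A": 0.15, "C": 0.2, "G": 0.4, "T": 0.25}

fitted = empirical_frequencies(GENE_1)
print(fitted)
# The fitted model assigns the same data a higher likelihood than the first model.
print(sequence_likelihood(GENE_1, fitted) > sequence_likelihood(GENE_1, FIRST_MODEL))
```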
Increase in model sophistication
It is no longer possible to simply invoke a model that
encompasses base composition; we must also include the
mechanism of sequence change and stasis.
There are two parts to this model: the tree and the process
(the latter is confusingly referred to as the model, although
both parts together really compose the model).
Different Branch Lengths
For very short branch lengths, the probability of a character staying the
same is high and the probability of it changing is low.
For longer branch lengths, the probability of character change becomes
higher and the probability of staying the same is lower.
The previous calculations are based on the assumption that the branch
length describes one Certain Evolutionary Distance (CED).
If we want to consider a branch length that is twice as long (2 CED), then
we can multiply the substitution matrix by itself (matrix²).
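Squaring the matrix can be sketched as follows. The substitution probabilities in `P_1CED` are hypothetical values chosen only for illustration (the slides do not give the matrix here), and `mat_mul` is an illustrative helper.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical substitution probabilities over 1 CED.
# Rows are the starting base, columns the ending base, in the order A, C, G, T.
P_1CED = [
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
    [0.01, 0.01, 0.01, 0.97],
]

# Probabilities over 2 CED: the matrix multiplied by itself.
P_2CED = mat_mul(P_1CED, P_1CED)

print(P_2CED[0][0])  # probability of A staying A over 2 CED
print(P_2CED[0][1])  # probability of A changing to C over 2 CED
```

With the longer branch, the probability of staying the same falls and the probability of change rises, while each row still sums to 1.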
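[Figure: two trees, each a single branch joining sequence I (A) to sequence II (C), one with branch length v = 0.1 and one with v = 1.0]
v = µt, where µ = mutation rate and t = time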
Two trees, each consisting of a single branch
Jukes-Cantor model
[Figure: sequence I (A) and sequence II (C) joined by a single branch, with v = 0.1 in the first tree and v = 1.0 in the second]
I   AACC
II  CACT
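A minimal sketch of how the likelihood of this two-sequence alignment could be evaluated at the two branch lengths. It assumes the standard Jukes-Cantor transition probabilities with v measured in expected substitutions per site and equal base frequencies of 0.25; the slides do not state their parameterization, so the numbers it prints need not match the original figures. The function names are illustrative.

```python
import math


def jc_prob(x, y, v):
    """Jukes-Cantor probability of base x ending as base y over branch length v."""
    decay = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def pair_likelihood(seq1, seq2, v):
    """Likelihood of two aligned sequences joined by one branch of length v."""
    likelihood = 1.0
    for x, y in zip(seq1, seq2):
        # Frequency of the starting base times the probability of the end state.
        likelihood *= 0.25 * jc_prob(x, y, v)
    return likelihood


for v in (0.1, 1.0):
    print(v, pair_likelihood("AACC", "CACT", v))
```

Two of the four sites differ between the sequences, so under these assumptions the longer branch explains the data better than the very short one.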
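Alignment of four sequences (sites 1 … j … N):
1 C G G A C A C G T T T A C
2 C A G A C A C C T C T A C
3 C G G A T A A G T T A A C
4 C G G A T A G C C T A G C
[Figure: unrooted tree of sequences 1 to 4 with internal nodes 5 and 6. At site j the tips carry C (sequence 1), C (sequence 2), A (sequence 3) and G (sequence 4).]
The likelihood of site j is the sum, over all 16 possible pairs of states (x, y) at the two internal nodes, of the probability of the tip states together with those internal states:
L(j) = p(C, C, A, G; A, A) + p(C, C, A, G; C, A) + \cdots + p(C, C, A, G; T, T)
The likelihood of the alignment is the product of the site likelihoods:
L = L(1) \cdot L(2) \cdots L(N) = \prod_{j=1}^{N} L(j)
The log-likelihood is the sum of the site log-likelihoods:
\ln L = \ln L(1) + \ln L(2) + \cdots + \ln L(N) = \sum_{j=1}^{N} \ln L(j)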
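The site-by-site calculation can be sketched for the four-sequence alignment. This is an illustration under stated assumptions, not the slides' own calculation: it assumes the Jukes-Cantor model, the topology ((1,2),(3,4)) with internal node 5 joining sequences 1 and 2 and node 6 joining sequences 3 and 4, and a single branch length v shared by all five branches. All function names are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"

ALIGNMENT = [
    "CGGACACGTTTAC",  # sequence 1
    "CAGACACCTCTAC",  # sequence 2
    "CGGATAAGTTAAC",  # sequence 3
    "CGGATAGCCTAGC",  # sequence 4
]


def jc_prob(x, y, v):
    """Jukes-Cantor probability of base x ending as base y over branch length v."""
    decay = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def site_likelihood(tips, v):
    """Likelihood of one alignment column on the assumed tree ((1,2),(3,4))."""
    t1, t2, t3, t4 = tips
    total = 0.0
    # Sum over all 16 pairs of states at internal nodes 5 (x) and 6 (y).
    for x, y in product(BASES, repeat=2):
        total += (0.25 * jc_prob(x, t1, v) * jc_prob(x, t2, v)
                  * jc_prob(x, y, v)
                  * jc_prob(y, t3, v) * jc_prob(y, t4, v))
    return total


def log_likelihood(alignment, v):
    """Sum of the site log-likelihoods over all columns of the alignment."""
    columns = ["".join(column) for column in zip(*alignment)]
    return sum(math.log(site_likelihood(column, v)) for column in columns)


print(site_likelihood("CCAG", 0.1))    # the column with tips C, C, A, G
print(log_likelihood(ALIGNMENT, 0.1))  # lnL of the whole alignment
```

Summing logarithms rather than multiplying raw likelihoods avoids numerical underflow on long alignments. A real analysis would also optimize each branch length separately instead of fixing one shared value.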
Likelihood of the alignment at various branch lengths
[Figure: likelihood (vertical axis, 0 to 0.0002) plotted against branch length (horizontal axis, 0 to 0.6)]
Strengths of ML
• Does not try to make an observation of sequence change and then a
correction for superimposed substitutions. There is no need to
‘correct’ for anything, the models take care of superimposed
substitutions.
• Accurate branch lengths.
• Each site has a likelihood.
• If the model is correct, we should retrieve the correct tree (if we have
long-enough sequences and a sophisticated-enough model).
• You can use a model that fits the data.
• ML uses all the data (no selection of sites based on informativeness,
all sites are informative).
• ML can not only tell you about the phylogeny of the sequences, but
also the process of evolution that led to the observations of today’s
sequences.
Weaknesses of ML
• Can be inconsistent if we use models that are not accurate.
• Model might not be sophisticated enough
• Very computationally-intensive. Might not be possible to
examine all models (substitution matrices, tree topologies).
Models
• You can use models that:
  - Deal with different transition/transversion ratios.
  - Deal with unequal base composition.
  - Deal with heterogeneity of rates across sites.
  - Deal with heterogeneity of the substitution process (different rates
    across lineages, different rates at different parts of the tree).
• The more free parameters, the better your model fits your data (good).
• The more free parameters, the higher the variance of the estimate (bad).
Choosing a Model
Don’t assume a model, rather find a model that fits your data.
Models often have “free” parameters. These can be fixed to a
reasonable value, or estimated by ML.
The more free parameters, the better the fit (higher the likelihood) of
the model to the data. (Good!)
The more free parameters, the higher the variance, and the less
power to discriminate among competing hypotheses. (Bad!)
We do not want to over-fit the model to the data
What is the best way to fit a line (a model) through these points?
How to tell if adding (or removing) a certain parameter is a good idea?
• Use statistics
• The null hypothesis is that the presence or absence of the parameter makes no difference
• In order to assess significance you need a null distribution
We have some DNA data, and a tree. Evaluate the data with 3 different
models.
model   ln likelihood   ∆ (vs. previous model)
JC      -2348.68
K2P     -2256.73        91.95
GTR     -2254.94         1.79
Evaluations with more complex models have higher likelihoods
The K2P model has 1 more parameter than the JC model
The GTR model has 4 more parameters than the K2P model
Are the extra parameters worth adding?
JC vs K2P
We generated many data sets for which the null hypothesis is true and evaluated them under the JC
model and the K2P model. 95% of the differences are under 2. The statistic for our original
data set was 91.95, so it is highly significant. In this case it is worthwhile to add the extra
parameter (tRatio).
K2P vs GTR
We generated many data sets for which the null hypothesis is true and evaluated them under the K2P
model and the GTR model. The statistic for our original data set was 1.79, so it is not
significant. In this case it is not worthwhile to add the extra parameters.
You can use the χ² approximation to assess the
significance of adding parameters.
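A sketch of the χ² approximation applied to the three log-likelihoods above. It assumes the usual likelihood-ratio statistic, twice the difference in ln likelihood, with degrees of freedom equal to the number of extra parameters quoted in the slides (1 and 4). The helper `chi2_sf` is a hand-written upper-tail probability that covers only 1 or an even number of degrees of freedom, which is enough for this example; it is not a library function.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable (df = 1 or even df only)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df > 0 and df % 2 == 0:
        half = x / 2.0
        term, total = 1.0, 1.0
        # Closed form for even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    raise ValueError("this sketch handles only df = 1 or even df")


def likelihood_ratio_test(lnl_simple, lnl_complex, extra_params):
    """Return the statistic 2 * (lnL_complex - lnL_simple) and its p-value."""
    statistic = 2.0 * (lnl_complex - lnl_simple)
    return statistic, chi2_sf(statistic, extra_params)


LNL = {"JC": -2348.68, "K2P": -2256.73, "GTR": -2254.94}

print(likelihood_ratio_test(LNL["JC"], LNL["K2P"], extra_params=1))
print(likelihood_ratio_test(LNL["K2P"], LNL["GTR"], extra_params=4))
```

Under these assumptions the first comparison gives a vanishingly small p-value and the second a p-value well above 0.05, in line with the conclusions drawn from the simulated null distributions.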
Bayesian Inference
Maximum likelihood
Search for the tree that maximizes the chance of
seeing the data (P (Data | Tree))
Bayesian Inference
Search for the tree that is most probable
given the data (P (Tree | Data))
Bayesian Phylogenetics
Maximize the posterior probability of a tree given the aligned DNA
sequences
Two steps
- Definition of the posterior probabilities of trees (Bayes’ Rule)
- Approximation of the posterior probabilities of trees
Markov chain Monte Carlo (MCMC) methods
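The first step, Bayes' Rule, can be illustrated when the set of candidate trees is small enough to enumerate. The tree labels, likelihoods and priors below are hypothetical numbers chosen only to show the arithmetic; with realistic numbers of trees the normalizing sum cannot be computed directly, which is why the approximation step is needed.

```python
def posterior_probabilities(likelihoods, priors):
    """Bayes' Rule: P(tree | data) = P(data | tree) * P(tree) / P(data)."""
    joint = {tree: likelihoods[tree] * priors[tree] for tree in likelihoods}
    evidence = sum(joint.values())  # P(data), summed over all candidate trees
    return {tree: value / evidence for tree, value in joint.items()}


# Hypothetical likelihoods P(data | tree) for three candidate topologies.
likelihoods = {"tree_1": 2.0e-6, "tree_2": 6.0e-6, "tree_3": 2.0e-6}
# A flat prior: every topology is equally probable before seeing the data.
priors = {"tree_1": 1 / 3, "tree_2": 1 / 3, "tree_3": 1 / 3}

print(posterior_probabilities(likelihoods, priors))
```

With a flat prior the posterior probabilities are simply the likelihoods rescaled to sum to 1.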
Markov Chain Monte Carlo Methods
Posterior probabilities of trees are complex joint probabilities
that cannot be calculated analytically.
Instead, the posterior probabilities of trees are approximated
with Markov Chain Monte Carlo (MCMC) methods that sample
trees from their posterior probability distribution.
MCMC
A way of sampling / touring a set of solutions, biased
by their likelihood
1 Make a random solution N1 the current solution
2 Pick another solution N2
3 If Likelihood (N2) > Likelihood (N1) then replace N1 with N2
4 Else replace N1 with N2 with probability Likelihood (N2) / Likelihood (N1)
5 Sample (record) the current solution
6 Repeat from step 2
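The six steps can be sketched as a Metropolis-style sampler over a small set of solutions. The solution names and their likelihoods are hypothetical, and the proposal simply picks any solution uniformly at random; a phylogenetic sampler would instead propose small changes to the current tree and would weigh likelihoods by priors.

```python
import random
from collections import Counter


def mcmc_sample(likelihood, solutions, n_steps, seed=0):
    """Tour a set of solutions, visiting each in proportion to its likelihood."""
    rng = random.Random(seed)
    current = rng.choice(solutions)                # step 1
    samples = []
    for _ in range(n_steps):                       # step 6: repeat
        proposal = rng.choice(solutions)           # step 2
        ratio = likelihood[proposal] / likelihood[current]
        # Steps 3 and 4: always accept a better solution, otherwise accept
        # with probability Likelihood(N2) / Likelihood(N1).
        if ratio >= 1.0 or rng.random() < ratio:
            current = proposal
        samples.append(current)                    # step 5
    return samples


# Hypothetical likelihoods for three candidate solutions.
likelihood = {"tree_1": 1.0, "tree_2": 2.0, "tree_3": 7.0}
samples = mcmc_sample(likelihood, list(likelihood), n_steps=50000)

counts = Counter(samples)
for tree in likelihood:
    print(tree, counts[tree] / len(samples))
```

Over a long run the fraction of samples spent at each solution approaches its share of the total likelihood, so the better solutions are visited more often without the poorer ones being excluded.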
yesian Inference
yesian Inference

More Related Content

What's hot

NCM RB PAPER
NCM RB PAPERNCM RB PAPER
A Method for Constructing Non-Isosceles Triangular Fuzzy Numbers Using Freque...
A Method for Constructing Non-Isosceles Triangular Fuzzy Numbers Using Freque...A Method for Constructing Non-Isosceles Triangular Fuzzy Numbers Using Freque...
A Method for Constructing Non-Isosceles Triangular Fuzzy Numbers Using Freque...
Journal of Soft Computing in Civil Engineering
 
Random Forest / Bootstrap Aggregation
Random Forest / Bootstrap AggregationRandom Forest / Bootstrap Aggregation
Random Forest / Bootstrap Aggregation
Rupak Roy
 
Quicksort
QuicksortQuicksort
Quicksort
maamir farooq
 
Summary statistics
Summary statisticsSummary statistics
Summary statistics
Rupak Roy
 
Quicksort algorithm
Quicksort algorithmQuicksort algorithm
Quicksort algorithm
Bapan Maity
 
Simplicial closure and higher-order link prediction --- SIAMNS18
Simplicial closure and higher-order link prediction --- SIAMNS18Simplicial closure and higher-order link prediction --- SIAMNS18
Simplicial closure and higher-order link prediction --- SIAMNS18
Austin Benson
 

What's hot (7)

NCM RB PAPER
NCM RB PAPERNCM RB PAPER
NCM RB PAPER
 
A Method for Constructing Non-Isosceles Triangular Fuzzy Numbers Using Freque...
A Method for Constructing Non-Isosceles Triangular Fuzzy Numbers Using Freque...A Method for Constructing Non-Isosceles Triangular Fuzzy Numbers Using Freque...
A Method for Constructing Non-Isosceles Triangular Fuzzy Numbers Using Freque...
 
Random Forest / Bootstrap Aggregation
Random Forest / Bootstrap AggregationRandom Forest / Bootstrap Aggregation
Random Forest / Bootstrap Aggregation
 
Quicksort
QuicksortQuicksort
Quicksort
 
Summary statistics
Summary statisticsSummary statistics
Summary statistics
 
Quicksort algorithm
Quicksort algorithmQuicksort algorithm
Quicksort algorithm
 
Simplicial closure and higher-order link prediction --- SIAMNS18
Simplicial closure and higher-order link prediction --- SIAMNS18Simplicial closure and higher-order link prediction --- SIAMNS18
Simplicial closure and higher-order link prediction --- SIAMNS18
 

Similar to Data mining maximumlikelihood

Into to prob_prog_hari
Into to prob_prog_hariInto to prob_prog_hari
Into to prob_prog_hari
Hariharan Chandrasekaran
 
MyStataLab Assignment Help
MyStataLab Assignment HelpMyStataLab Assignment Help
MyStataLab Assignment Help
Statistics Assignment Help
 
Intro to Model Selection
Intro to Model SelectionIntro to Model Selection
Intro to Model Selection
chenhm
 
Ders 1 mean mod media st dev.pptx
Ders 1 mean mod media st dev.pptxDers 1 mean mod media st dev.pptx
Ders 1 mean mod media st dev.pptx
Ergin Akalpler
 
Statistics
StatisticsStatistics
Statistics
Bob Smullen
 
Probability distribution Function & Decision Trees in machine learning
Probability distribution Function  & Decision Trees in machine learningProbability distribution Function  & Decision Trees in machine learning
Probability distribution Function & Decision Trees in machine learning
Sadia Zafar
 
Explore ml day 2
Explore ml day 2Explore ml day 2
Explore ml day 2
preetikumara
 
Maximum likelihood estimation from uncertain
Maximum likelihood estimation from uncertainMaximum likelihood estimation from uncertain
Maximum likelihood estimation from uncertain
IEEEFINALYEARPROJECTS
 
Cost Optimized Design Technique for Pseudo-Random Numbers in Cellular Automata
Cost Optimized Design Technique for Pseudo-Random Numbers in Cellular AutomataCost Optimized Design Technique for Pseudo-Random Numbers in Cellular Automata
Cost Optimized Design Technique for Pseudo-Random Numbers in Cellular Automata
ijait
 
Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...
Simplilearn
 
03 Data Mining Techniques
03 Data Mining Techniques03 Data Mining Techniques
03 Data Mining Techniques
Valerii Klymchuk
 
Kaggle digits analysis_final_fc
Kaggle digits analysis_final_fcKaggle digits analysis_final_fc
Kaggle digits analysis_final_fc
Zachary Combs
 
ML MODULE 2.pdf
ML MODULE 2.pdfML MODULE 2.pdf
ML MODULE 2.pdf
Shiwani Gupta
 
CPSC 531: System Modeling and Simulation.pptx
CPSC 531:System Modeling and Simulation.pptxCPSC 531:System Modeling and Simulation.pptx
CPSC 531: System Modeling and Simulation.pptx
Farhan27013
 
Into to prob_prog_hari (2)
Into to prob_prog_hari (2)Into to prob_prog_hari (2)
Into to prob_prog_hari (2)
Hariharan Chandrasekaran
 
Data mining Part 1
Data mining Part 1Data mining Part 1
Data mining Part 1
Gautam Kumar
 
report
reportreport
report
Arthur He
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
Derek Kane
 
Other classification methods in data mining
Other classification methods in data miningOther classification methods in data mining
Other classification methods in data mining
Kumar Deepak
 
Bel ventutorial hetero
Bel ventutorial heteroBel ventutorial hetero
Bel ventutorial hetero
Edda Kang
 

Similar to Data mining maximumlikelihood (20)

Into to prob_prog_hari
Into to prob_prog_hariInto to prob_prog_hari
Into to prob_prog_hari
 
MyStataLab Assignment Help
MyStataLab Assignment HelpMyStataLab Assignment Help
MyStataLab Assignment Help
 
Intro to Model Selection
Intro to Model SelectionIntro to Model Selection
Intro to Model Selection
 
Ders 1 mean mod media st dev.pptx
Ders 1 mean mod media st dev.pptxDers 1 mean mod media st dev.pptx
Ders 1 mean mod media st dev.pptx
 
Statistics
StatisticsStatistics
Statistics
 
Probability distribution Function & Decision Trees in machine learning
Probability distribution Function  & Decision Trees in machine learningProbability distribution Function  & Decision Trees in machine learning
Probability distribution Function & Decision Trees in machine learning
 
Explore ml day 2
Explore ml day 2Explore ml day 2
Explore ml day 2
 
Maximum likelihood estimation from uncertain
Maximum likelihood estimation from uncertainMaximum likelihood estimation from uncertain
Maximum likelihood estimation from uncertain
 
Cost Optimized Design Technique for Pseudo-Random Numbers in Cellular Automata
Cost Optimized Design Technique for Pseudo-Random Numbers in Cellular AutomataCost Optimized Design Technique for Pseudo-Random Numbers in Cellular Automata
Cost Optimized Design Technique for Pseudo-Random Numbers in Cellular Automata
 
Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...Data Science Interview Questions | Data Science Interview Questions And Answe...
Data Science Interview Questions | Data Science Interview Questions And Answe...
 
03 Data Mining Techniques
03 Data Mining Techniques03 Data Mining Techniques
03 Data Mining Techniques
 
Kaggle digits analysis_final_fc
Kaggle digits analysis_final_fcKaggle digits analysis_final_fc
Kaggle digits analysis_final_fc
 
ML MODULE 2.pdf
ML MODULE 2.pdfML MODULE 2.pdf
ML MODULE 2.pdf
 
CPSC 531: System Modeling and Simulation.pptx
CPSC 531:System Modeling and Simulation.pptxCPSC 531:System Modeling and Simulation.pptx
CPSC 531: System Modeling and Simulation.pptx
 
Into to prob_prog_hari (2)
Into to prob_prog_hari (2)Into to prob_prog_hari (2)
Into to prob_prog_hari (2)
 
Data mining Part 1
Data mining Part 1Data mining Part 1
Data mining Part 1
 
report
reportreport
report
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
 
Other classification methods in data mining
Other classification methods in data miningOther classification methods in data mining
Other classification methods in data mining
 
Bel ventutorial hetero
Bel ventutorial heteroBel ventutorial hetero
Bel ventutorial hetero
 

More from Harry Potter

How to build a rest api.pptx
How to build a rest api.pptxHow to build a rest api.pptx
How to build a rest api.pptx
Harry Potter
 
Business analytics and data mining
Business analytics and data miningBusiness analytics and data mining
Business analytics and data mining
Harry Potter
 
Big picture of data mining
Big picture of data miningBig picture of data mining
Big picture of data mining
Harry Potter
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Harry Potter
 
Cache recap
Cache recapCache recap
Cache recap
Harry Potter
 
Directory based cache coherence
Directory based cache coherenceDirectory based cache coherence
Directory based cache coherence
Harry Potter
 
How analysis services caching works
How analysis services caching worksHow analysis services caching works
How analysis services caching works
Harry Potter
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
Harry Potter
 
Hardware managed cache
Hardware managed cacheHardware managed cache
Hardware managed cache
Harry Potter
 
Smm & caching
Smm & cachingSmm & caching
Smm & caching
Harry Potter
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithms
Harry Potter
 
Abstract data types
Abstract data typesAbstract data types
Abstract data types
Harry Potter
 
Abstraction file
Abstraction fileAbstraction file
Abstraction file
Harry Potter
 
Object model
Object modelObject model
Object model
Harry Potter
 
Concurrency with java
Concurrency with javaConcurrency with java
Concurrency with java
Harry Potter
 
Encapsulation anonymous class
Encapsulation anonymous classEncapsulation anonymous class
Encapsulation anonymous class
Harry Potter
 
Abstract class
Abstract classAbstract class
Abstract class
Harry Potter
 
Object oriented analysis
Object oriented analysisObject oriented analysis
Object oriented analysis
Harry Potter
 
Api crash
Api crashApi crash
Api crash
Harry Potter
 
Rest api to integrate with your site
Rest api to integrate with your siteRest api to integrate with your site
Rest api to integrate with your site
Harry Potter
 

More from Harry Potter (20)

How to build a rest api.pptx
How to build a rest api.pptxHow to build a rest api.pptx
How to build a rest api.pptx
 
Business analytics and data mining
Business analytics and data miningBusiness analytics and data mining
Business analytics and data mining
 
Big picture of data mining
Big picture of data miningBig picture of data mining
Big picture of data mining
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Cache recap
Cache recapCache recap
Cache recap
 
Directory based cache coherence
Directory based cache coherenceDirectory based cache coherence
Directory based cache coherence
 
How analysis services caching works
How analysis services caching worksHow analysis services caching works
How analysis services caching works
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
Hardware managed cache
Hardware managed cacheHardware managed cache
Hardware managed cache
 
Smm & caching
Smm & cachingSmm & caching
Smm & caching
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithms
 
Abstract data types
Abstract data typesAbstract data types
Abstract data types
 
Abstraction file
Abstraction fileAbstraction file
Abstraction file
 
Object model
Object modelObject model
Object model
 
Concurrency with java
Concurrency with javaConcurrency with java
Concurrency with java
 
Encapsulation anonymous class
Encapsulation anonymous classEncapsulation anonymous class
Encapsulation anonymous class
 
Abstract class
Abstract classAbstract class
Abstract class
 
Object oriented analysis
Object oriented analysisObject oriented analysis
Object oriented analysis
 
Api crash
Api crashApi crash
Api crash
 
Rest api to integrate with your site
Rest api to integrate with your siteRest api to integrate with your site
Rest api to integrate with your site
 

Recently uploaded

Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Webinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data WarehouseWebinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data Warehouse
Federico Razzoli
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
fredae14
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
SitimaJohn
 

Recently uploaded (20)

Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Webinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data WarehouseWebinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data Warehouse
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 

Data mining maximumlikelihood

  • 2. Likelihood: The likelihood is the probability of the data given the model.
  • 3. Likelihood: If we flip a coin and get a head and we think the coin is unbiased, then the probability of observing this head is 0.5. If we think the coin is biased so that we expect to get a head 80% of the time, then the likelihood of observing this datum (a head) is 0.8. The likelihood of making some observation is entirely dependent on the model that underlies our assumption. The datum has not changed; our model has. Therefore under the new model the likelihood of observing the datum has changed.
  • 4. Maximum Likelihood (ML): ML assumes an explicit model of sequence evolution. This is justifiable, since molecular sequence data can be shown to have arisen according to a stochastic process. ML attempts to answer the question: What is the probability that I would observe these data (a multiple sequence alignment) given a particular model of evolution (a tree and a process)?
  • 5. Likelihood calculations: In molecular phylogenetics, the data are an alignment of sequences. We optimize parameters and branch lengths to get the maximum likelihood. Each site has a likelihood. The total likelihood is the product of the site likelihoods. The maximum likelihood tree is the tree topology that gives the highest (optimized) likelihood under the given model. We use reversible models, so the position of the root does not matter.
  • 6. What is the probability of observing a G nucleotide? If we have a DNA sequence of 1 nucleotide in length and the identity of this nucleotide is G, what is the likelihood that we would observe this G? In the same way as the coin-flipping observation, the likelihood of observing this G is dependent on the model of sequence evolution that is thought to underlie the data. Model 1: frequency of G = 0.4 => likelihood(G) = 0.4. Model 2: frequency of G = 0.1 => likelihood(G) = 0.1. Model 3: frequency of G = 0.25 => likelihood(G) = 0.25.
  • 7. What about longer sequences? If we consider a gene of length 2 (gene 1: GA), the probability of observing this gene is the product of the probabilities of observing each character. Model: frequency of G = 0.4 and frequency of A = 0.15, so p(G) = 0.4 and p(A) = 0.15. Likelihood (GA) = 0.4 x 0.15 = 0.06
  • 8. …or even longer sequences? gene 1: GACTAGCTAGACAGATACGAATTAC. Model (simple base frequency model): p(A) = 0.15; p(C) = 0.2; p(G) = 0.4; p(T) = 0.25 (the sum of all probabilities must equal 1). Likelihood (gene 1) = 0.000000000000000018452813 (about 1.845 × 10^-17).
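The product on this slide can be reproduced with a few lines of Python. This is a minimal sketch: the function name `sequence_likelihood` is chosen here for illustration, and the sites are treated as independent draws from the base frequencies, as the slide does.

```python
from math import prod

# Sequence and base frequencies exactly as given on the slide.
gene1 = "GACTAGCTAGACAGATACGAATTAC"
freqs = {"A": 0.15, "C": 0.2, "G": 0.4, "T": 0.25}

def sequence_likelihood(seq, base_freqs):
    """Multiply the model frequency of the base observed at each site."""
    return prod(base_freqs[base] for base in seq)

likelihood = sequence_likelihood(gene1, freqs)
print(f"Likelihood (gene 1) = {likelihood:.3e}")
```

Because the product shrinks quickly as sequences get longer, real programs usually work with the sum of log-probabilities instead.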
  • 9. Note about models: You might notice that our model of base frequency is not the optimal model for our observed data. If we had used the following model, p(A) = 0.4; p(C) = 0.2; p(G) = 0.2; p(T) = 0.2, the likelihood of observing the gene (10 A and 5 each of C, G and T) would be L (gene 1) = 0.4^10 x 0.2^15 = 0.000000000000003435973837 (about 3.436 × 10^-15), compared with L (gene 1) = 0.000000000000000018452813 under the original model. The datum has not changed; our model has. Therefore under the new model the likelihood of observing the datum has changed.
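The alternative frequencies on this slide are the observed base proportions in gene 1 (10 A and 5 each of C, G and T out of 25 sites). For a simple base frequency model, these are the values that maximize the likelihood. The sketch below checks this by comparing log-likelihoods; the helper name `log_likelihood` is an illustrative choice, not something from the slides.

```python
from collections import Counter
from math import log

gene1 = "GACTAGCTAGACAGATACGAATTAC"
counts = Counter(gene1)
n_sites = len(gene1)

# Observed proportions: the maximum likelihood estimate for this simple model.
observed = {base: counts[base] / n_sites for base in "ACGT"}
original = {"A": 0.15, "C": 0.2, "G": 0.4, "T": 0.25}

def log_likelihood(base_counts, base_freqs):
    """Sum of count * ln(frequency) over the four bases."""
    return sum(c * log(base_freqs[b]) for b, c in base_counts.items())

lnL_original = log_likelihood(counts, original)
lnL_observed = log_likelihood(counts, observed)
print(observed)
print(f"lnL original model: {lnL_original:.3f}")
print(f"lnL observed-frequency model: {lnL_observed:.3f}")
```

The model built from the observed proportions gives the higher (less negative) log-likelihood.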
  • 10. Increase in model sophistication: It is no longer enough to invoke a model that encompasses only base composition; we must also include the mechanism of sequence change and stasis. There are two parts to this model: the tree and the process (the latter is confusingly referred to as the model, although both parts really compose the model).
  • 11. Different Branch Lengths: For very short branch lengths, the probability of a character staying the same is high and the probability of it changing is low. For longer branch lengths, the probability of character change becomes higher and the probability of staying the same is lower. The previous calculations are based on the assumption that the branch length describes one Certain Evolutionary Distance or CED. If we want to consider a branch length that is twice as long (2 CED), then we can multiply the substitution matrix by itself (matrix²).
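The matrix-squaring step can be sketched as follows. The substitution matrix used on the slides is not present in this transcript, so `P1` below is a hypothetical 1 CED matrix invented purely for illustration (rows are the starting base, columns the ending base, in the order A, C, G, T).

```python
def matmul(a, b):
    """Plain-Python matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Hypothetical substitution probabilities for one CED (each row sums to 1).
P1 = [
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
    [0.01, 0.01, 0.01, 0.97],
]

# Probabilities for a branch twice as long (2 CED): the matrix times itself.
P2 = matmul(P1, P1)

print("P(A stays A) over 1 CED:", P1[0][0])
print("P(A stays A) over 2 CED:", round(P2[0][0], 4))
print("P(A becomes C) over 2 CED:", round(P2[0][1], 4))
```

On the longer branch the probability of staying the same drops and the probability of change rises, as the slide describes.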
  • 12. Two trees, each consisting of a single branch joining sequence I (A) to sequence II (C): one with branch length v = 0.1 and one with v = 1.0, where v = µt (µ = mutation rate, t = time).
  • 13. Jukes-Cantor model [Figure: the same two single-branch trees, I (A) to II (C), with v = 0.1 and v = 1.0]
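The formulas for this slide did not survive in the transcript, so the sketch below uses the standard Jukes-Cantor transition probabilities. It assumes v is measured in expected substitutions per site and that all four bases have frequency 0.25. The function names are illustrative.

```python
from math import exp

def jc_same(v):
    """Jukes-Cantor probability that a site shows the same base after branch length v."""
    return 0.25 + 0.75 * exp(-4.0 * v / 3.0)

def jc_diff(v):
    """Jukes-Cantor probability of ending at one specific different base."""
    return 0.25 - 0.25 * exp(-4.0 * v / 3.0)

# Likelihood of the single-branch tree I (A) -> II (C):
# frequency of A (0.25) times the probability of A changing to C along the branch.
likelihood_by_v = {v: 0.25 * jc_diff(v) for v in (0.1, 1.0)}

for v, lik in likelihood_by_v.items():
    print(f"v = {v}: P(same) = {jc_same(v):.4f}, "
          f"P(A->C) = {jc_diff(v):.4f}, likelihood = {lik:.4f}")
```

Since the two sequences differ, the longer branch (v = 1.0) explains the observation better than the short one (v = 0.1).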
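  • 15. Alignment of four sequences with sites 1 … j … N (site j is the seventh column, with states C, C, A, G):
          1   C G G A C A C G T T T A C
          2   C A G A C A C C T C T A C
          3   C G G A T A A G T T A A C
          4   C G G A T A G C C T A G C
      [Figure: tree joining tips 1, 2, 3 and 4 through internal nodes 5 and 6; at site j the tips carry 1 = C, 2 = C, 3 = A, 4 = G.]
      The likelihood of site j is the sum, over every possible pair of states at the internal nodes, of the probability of the tip states together with those internal states:
      $L(j) = p(\mathrm{C,C,A,G};\ \mathrm{A,A}) + p(\mathrm{C,C,A,G};\ \mathrm{C,A}) + \ldots + p(\mathrm{C,C,A,G};\ \mathrm{T,T})$
  • 16. $L(j) = p(\mathrm{C,C,A,G};\ \mathrm{A,A}) + p(\mathrm{C,C,A,G};\ \mathrm{C,A}) + \ldots + p(\mathrm{C,C,A,G};\ \mathrm{T,T})$
      $L = L(1) \cdot L(2) \cdot \ldots \cdot L(N) = \prod_{j=1}^{N} L(j)$
      $\ln L = \ln L(1) + \ln L(2) + \ldots + \ln L(N) = \sum_{j=1}^{N} \ln L(j)$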
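The site-by-site calculation can be sketched in Python. The figure's topology and branch lengths are not recoverable from the transcript, so several things here are assumptions: the tree is ((1,2),(3,4)), with node 5 joining tips 1 and 2 and node 6 joining tips 3 and 4; all five branches have length 0.1; and the process is Jukes-Cantor with equal base frequencies.

```python
from itertools import product
from math import exp, log

alignment = [
    "CGGACACGTTTAC",  # sequence 1
    "CAGACACCTCTAC",  # sequence 2
    "CGGATAAGTTAAC",  # sequence 3
    "CGGATAGCCTAGC",  # sequence 4
]
BASES = "ACGT"

def jc_prob(x, y, v):
    """Jukes-Cantor probability of base x becoming base y over branch length v."""
    e = exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def site_likelihood(column, v=0.1):
    """Sum over all 16 state pairs at internal nodes 5 and 6 (assumed tree, see above)."""
    x1, x2, x3, x4 = column
    total = 0.0
    for x5, x6 in product(BASES, repeat=2):
        total += (0.25                      # frequency of the state at node 5
                  * jc_prob(x5, x1, v) * jc_prob(x5, x2, v)
                  * jc_prob(x5, x6, v)      # internal branch between nodes 5 and 6
                  * jc_prob(x6, x3, v) * jc_prob(x6, x4, v))
    return total

columns = list(zip(*alignment))
site_likelihoods = [site_likelihood(col) for col in columns]
lnL = sum(log(l) for l in site_likelihoods)   # ln L = sum of ln L(j)

print("site j (column 7):", columns[6], f"L(j) = {site_likelihoods[6]:.6f}")
print(f"ln L for the whole alignment = {lnL:.3f}")
```

Summing the log site likelihoods gives the same information as multiplying the site likelihoods, without the product underflowing on long alignments.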
  • 17. Likelihood of the alignment at various branch lengths [Plot: likelihood (0 to 0.0002) against branch length (0 to 0.6)]
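A curve of this kind can be produced by evaluating the likelihood over a grid of branch lengths and keeping the best value. The data behind the slide's plot are not in the transcript, so this sketch uses the first two rows of the four-sequence alignment shown earlier, a single branch between them, and the Jukes-Cantor model. All three choices are assumptions made for illustration.

```python
from math import exp, log

seq_a = "CGGACACGTTTAC"
seq_b = "CAGACACCTCTAC"

def pair_log_likelihood(a, b, v):
    """Jukes-Cantor log-likelihood of two aligned sequences joined by one branch of length v."""
    e = exp(-4.0 * v / 3.0)
    same = 0.25 * (0.25 + 0.75 * e)   # base frequency * P(no change)
    diff = 0.25 * (0.25 - 0.25 * e)   # base frequency * P(specific change)
    return sum(log(same if x == y else diff) for x, y in zip(a, b))

# Evaluate branch lengths 0.01, 0.02, ..., 0.60 and keep the best one.
grid = [i / 100 for i in range(1, 61)]
curve = [(v, pair_log_likelihood(seq_a, seq_b, v)) for v in grid]
best_v, best_lnL = max(curve, key=lambda point: point[1])

print(f"best branch length on the grid: {best_v:.2f} (lnL = {best_lnL:.3f})")
```

Real programs replace the grid with a numerical optimizer, but the idea is the same: the branch length is a free parameter chosen to maximize the likelihood.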
  • 18. Strengths of ML
      - Does not try to make an observation of sequence change and then a correction for superimposed substitutions. There is no need to ‘correct’ for anything; the models take care of superimposed substitutions.
      - Accurate branch lengths.
      - Each site has a likelihood.
      - If the model is correct, we should retrieve the correct tree (if we have long-enough sequences and a sophisticated-enough model).
      - You can use a model that fits the data.
      - ML uses all the data (no selection of sites based on informativeness; all sites are informative).
      - ML can not only tell you about the phylogeny of the sequences, but also the process of evolution that led to the observations of today’s sequences.
  • 19. Weaknesses of ML
      - Can be inconsistent if we use models that are not accurate.
      - Model might not be sophisticated enough.
      - Very computationally intensive. Might not be possible to examine all models (substitution matrices, tree topologies).
  • 20. Models
      - You can use models that: deal with different transition/transversion ratios; deal with unequal base composition; deal with heterogeneity of rates across sites; deal with heterogeneity of the substitution process (different rates across lineages, different rates at different parts of the tree).
      - The more free parameters, the better your model fits your data (good).
      - The more free parameters, the higher the variance of the estimate (bad).
  • 21. Choosing a Model: Don’t assume a model; rather, find a model that fits your data. Models often have “free” parameters. These can be fixed to a reasonable value, or estimated by ML. The more free parameters, the better the fit (higher the likelihood) of the model to the data. (Good!) The more free parameters, the higher the variance, and the less power to discriminate among competing hypotheses. (Bad!) We do not want to over-fit the model to the data.
  • 22. What is the best way to fit a line (a model) through these points? How to tell if adding (or removing) a certain parameter is a good idea?
      - Use statistics.
      - The null hypothesis is that the presence or absence of the parameter makes no difference.
      - In order to assess significance you need a null distribution.
  • 23. We have some DNA data, and a tree. Evaluate the data with 3 different models.
          model   ln likelihood   ∆
          JC      -2348.68
          K2P     -2256.73        91.95
          GTR     -2254.94        1.79
      Evaluations with more complex models have higher likelihoods. The K2P model has 1 more parameter than the JC model. The GTR model has 4 more parameters than the K2P model. Are the extra parameters worth adding?
  • 24. [Figure panels: JC vs K2P; K2P vs GTR]
      JC vs K2P: We have generated many data sets for which the null hypothesis is true and evaluated them under the JC model and the K2P model. 95% of the differences are under 2. The statistic for our original data set was 91.95, and so it is highly significant. In this case it is worthwhile to add the extra parameter (tRatio).
      K2P vs GTR: We have generated many data sets for which the null hypothesis is true and evaluated them under the K2P model and the GTR model. The statistic for our original data set was 1.79, and so it is not significant. In this case it is not worthwhile to add the extra parameters.
      You can use the χ² approximation to assess significance of adding parameters.
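The χ² approximation can be sketched as below. It is conventionally applied to twice the difference in ln likelihood, with degrees of freedom equal to the number of extra parameters (1 and 4, as stated on the previous slide). To stay within the standard library, `chi2_sf` is a hand-written helper that only covers 1 degree of freedom and even degrees of freedom. It is an illustrative stand-in, not a general implementation.

```python
from math import erfc, exp, factorial, sqrt

def chi2_sf(x, df):
    """Upper-tail chi-square probability; closed forms for df = 1 and even df only."""
    if df == 1:
        return erfc(sqrt(x / 2.0))
    if df % 2 == 0:
        half = x / 2.0
        return exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))
    raise ValueError("this sketch only handles df = 1 or even df")

# ln likelihoods from the table on the previous slide.
lnL = {"JC": -2348.68, "K2P": -2256.73, "GTR": -2254.94}

def likelihood_ratio_test(simple, complex_, extra_params):
    """Return (2 * delta lnL, approximate p-value) for two nested models."""
    stat = 2.0 * (lnL[complex_] - lnL[simple])
    return stat, chi2_sf(stat, extra_params)

stat_k2p, p_k2p = likelihood_ratio_test("JC", "K2P", 1)
stat_gtr, p_gtr = likelihood_ratio_test("K2P", "GTR", 4)

print(f"JC vs K2P:  2*delta = {stat_k2p:.2f}, p = {p_k2p:.3g}")
print(f"K2P vs GTR: 2*delta = {stat_gtr:.2f}, p = {p_gtr:.3g}")
```

The outcome agrees with the simulation-based conclusion on the slide: the extra K2P parameter is clearly worth adding, while the extra GTR parameters are not.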
  • 26. Maximum likelihood: search for the tree that maximizes the chance of seeing the data, P(Data | Tree). Bayesian inference: search for the tree that maximizes the chance of seeing the tree given the data, P(Tree | Data).
  • 27. Bayesian Phylogenetics: Maximize the posterior probability of a tree given the aligned DNA sequences. Two steps:
      - Definition of the posterior probabilities of trees (Bayes’ Rule)
      - Approximation of the posterior probabilities of trees: Markov chain Monte Carlo (MCMC) methods
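Bayes' Rule, which the first step refers to, can be written out as follows. The symbols are notation chosen here for illustration: τ stands for a tree and D for the aligned DNA sequences. The parameters of the substitution process, which in practice are also part of the model, are left out to keep the expression short.

```latex
P(\tau \mid D) = \frac{P(D \mid \tau)\, P(\tau)}{\sum_{\tau'} P(D \mid \tau')\, P(\tau')}
```

The numerator is the likelihood of the tree times its prior probability. The denominator sums the same quantity over every possible tree, which is the part that cannot be evaluated directly and motivates the approximation in the second step.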
  • 31. Markov Chain Monte Carlo Methods: Posterior probabilities of trees are complex joint probabilities that cannot be calculated analytically. Instead, the posterior probabilities of trees are approximated with Markov Chain Monte Carlo (MCMC) methods that sample trees from their posterior probability distribution.
  • 32. MCMC: A way of sampling (touring) a set of solutions, biased by their likelihood.
      1. Make a random solution N1 the current solution.
      2. Pick another solution N2.
      3. If Likelihood(N2) > Likelihood(N1), replace N1 with N2.
      4. Otherwise, replace N1 with N2 with probability Likelihood(N2) / Likelihood(N1).
      5. Sample (record) the current solution.
      6. Repeat from step 2.
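The six steps can be sketched as a small Metropolis sampler. The solutions here are not trees: `weights` is a hypothetical toy set of five solutions with made-up likelihoods, and the proposal moves one step left or right on a ring, which makes it symmetric (the acceptance rule in step 4 assumes a symmetric proposal). In a Bayesian analysis the quantity compared would be likelihood times prior.

```python
import random

def metropolis(likelihood, propose, start, n_steps, rng):
    """Steps 1 to 6: tour the solutions, recording the current one after every step."""
    current = start                                   # step 1
    samples = []
    for _ in range(n_steps):
        candidate = propose(current, rng)             # step 2
        ratio = likelihood(candidate) / likelihood(current)
        # Steps 3 and 4: always accept an improvement, otherwise accept
        # with probability equal to the likelihood ratio.
        if ratio >= 1.0 or rng.random() < ratio:
            current = candidate
        samples.append(current)                       # step 5
    return samples                                    # loop repeats from step 2 (step 6)

# Hypothetical toy problem: five solutions with unnormalized likelihoods.
weights = [1.0, 2.0, 4.0, 2.0, 1.0]

def toy_likelihood(state):
    return weights[state]

def toy_propose(state, rng):
    """Symmetric proposal: move to a neighbouring solution on a ring."""
    return (state + rng.choice((-1, 1))) % len(weights)

rng = random.Random(0)
samples = metropolis(toy_likelihood, toy_propose, 0, 20000, rng)
visit_freq = [samples.count(s) / len(samples) for s in range(len(weights))]
print([round(f, 3) for f in visit_freq])
```

The fraction of time the chain spends at each solution approaches that solution's share of the total likelihood (0.1, 0.2, 0.4, 0.2, 0.1 in this toy case), which is what "biased by their likelihood" means in practice.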

Editor's Notes

  1. Even though we tend to refer to the tree and the model separately, they are in fact both parts of the model.