DESeq2 is used to analyze differential expression from RNA-seq count data using a generalized linear model. It models counts using a gamma-Poisson distribution and estimates dispersion using empirical Bayes shrinkage. Key steps include normalizing counts, estimating dispersion, fitting the linear model, and using Wald and likelihood ratio tests to identify differentially expressed genes while controlling the false discovery rate. Results can be explored using plots of p-values, mean-variance trends, ordination plots, and heatmaps to visualize sample relationships and differentially expressed genes.
How to analyse bulk transcriptomic data using DESeq2
1. Understanding how to analyse bulk transcriptomic data using DESeq2
Taken largely from Modern Statistics for Modern Biology
https://www.huber.embl.de/msmb/
2. Frequentist statistics
• A type of statistical inference that draws conclusions
from sample data by emphasizing the frequency or
proportion of the data.
3. Generative model
• All the parameters of the model are known
• Given an observable variable X and a target variable Y, a generative model is a statistical model of the joint probability distribution P(X, Y).
• From it we can obtain conditional distributions such as P(X | Y = y): the probability of X given that Y = y has already occurred and has been measured.
4. Probability distributions
• Probability distribution is a mathematical function that
gives the probabilities of occurrence of different possible
outcomes.
• Mathematical description of the probabilities of events
• Determined empirically from a distribution of data.
• There are some commonly observed distributions –
grouped by the process that they are related to
5. Normal (Gaussian) distribution
• Most important distribution – central limit theorem.
𝜇 mean
𝜎 standard deviation
Probability density function: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
6. Log normal distribution
• Probability distribution of a
random variable whose
logarithm is normally
distributed.
• Y = ln(X) has a normal
distribution
7. Poisson distribution
• A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space, if these events occur with a known constant mean rate and independently of the time since the last event
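To make the definition concrete, the probability of seeing exactly k events can be computed directly from the rate. This is a minimal sketch; the rate of 5 events per interval is an invented value, not something from the slides.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the mean rate per interval is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


lam = 5.0  # hypothetical mean number of events per interval
probs = [poisson_pmf(k, lam) for k in range(50)]

# Print the probabilities for small k; they peak close to the rate.
for k in range(11):
    print(k, round(probs[k], 4))

# The mean of the distribution equals the rate parameter.
mean_k = sum(k * p for k, p in enumerate(probs))
```

The same rate parameter controls both the mean and the variance, which is why a Poisson model has no separate knob for extra variability.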
10. Statistical modelling
• Once we have a generative model and the parameters that define its probabilities, we can start decision making.
• Goodness of fit – to identify the distribution
• Statistical detective
We start with data X and use this to estimate the parameters of a distribution. These estimates are denoted by Greek letters with a hat.
11. Rootograms
• Red is the theoretical distribution
• The bottom of each bar should align with the horizontal axis
• Used to assess goodness of fit
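The numbers behind a hanging rootogram can be sketched in a few lines. The counts below are invented for illustration, and a Poisson model is assumed; the rate is estimated by the sample mean, which is the maximum likelihood estimate for a Poisson.

```python
import math
from collections import Counter


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events for a Poisson with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Hypothetical observed counts (invented for this sketch).
data = [0, 1, 1, 2, 2, 2, 3, 3, 4, 1, 0, 2, 3, 1, 2, 5, 2, 1, 3, 2]

lam_hat = sum(data) / len(data)  # estimated rate ("lambda hat")
observed = Counter(data)

rows = []
for k in range(max(data) + 1):
    expected = len(data) * poisson_pmf(k, lam_hat)
    # Each bar hangs from sqrt(expected) and has height sqrt(observed),
    # so its bottom sits at the difference; 0 means a perfect fit.
    bottom = math.sqrt(expected) - math.sqrt(observed[k])
    rows.append((k, observed[k], expected, bottom))

for k, obs, exp_freq, bottom in rows:
    print(k, obs, round(exp_freq, 2), round(bottom, 2))
```

Bar bottoms that stay close to zero across all values of k suggest the chosen distribution is a reasonable fit.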
12. Bayesian statistics
• A method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available.
• A practical approach in which a prior and a posterior distribution are used for modelling.
• Prior – the probability distribution that expresses one's belief about an unknown quantity before evidence is taken into account. The unknown may be a parameter of the model or a latent variable rather than an observable variable.
• Posterior – the distribution of the unknown quantity conditional on the evidence obtained from an experiment, i.e. after the relevant evidence is taken into account.
13. Bayesian statistics
• We use probability distributions to express our
knowledge about the parameters, and then use data to
update our knowledge.
• For example, the data can shift the distributions and make them narrower (more to come later).
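A small worked case of updating a distribution with data is the Beta prior for a proportion. This is only an illustration of the general idea, not part of DESeq2; the prior Beta(2, 2) and the 30 successes in 40 trials are invented numbers.

```python
def beta_mean_var(a: float, b: float) -> tuple[float, float]:
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var


prior = (2.0, 2.0)          # hypothetical prior belief about a proportion
successes, trials = 30, 40  # invented data

# Conjugate update: successes add to a, failures add to b.
posterior = (prior[0] + successes, prior[1] + trials - successes)

prior_mean, prior_var = beta_mean_var(*prior)
post_mean, post_var = beta_mean_var(*posterior)

print("prior    ", prior_mean, prior_var)
print("posterior", post_mean, post_var)
```

The posterior mean moves towards the observed proportion and the variance drops, which is the shifting and narrowing described above.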
14. High-throughput count data
• Challenges:
– Large dynamic range, from 0 to millions, with variance that changes across the range (heteroscedasticity).
– Non-negative integers with uneven distributions – normal or log-normal distributions may not fit.
– We need to understand the sampling biases and correct for them.
– Small sample sizes make estimation of dispersion difficult.
15. Normalisation
• Normalisation can be a misleading term
• It has nothing to do with the normal distribution
• The aim is to identify sources of bias and take them into account
• For RNA-seq that’s usually library size (number of reads for each sample)
17. Normalisation
• Consider this:
– If we estimate the size factor s for each of two samples by the sum of its counts, then the slope of the blue line represents their ratio.
– According to this estimate, gene C is downregulated in sample 2 while the other genes are upregulated.
– If we instead estimate s such that the ratio corresponds to the red line, then only gene C is downregulated in sample 2.
– The slope of the red line is obtained using robust regression. This is what DESeq2 does.
Figure: Size factor estimation. The points correspond to hypothetical genes whose counts in two samples are indicated by their x- and y-coordinates. The lines represent ways of estimating the size factor.
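The contrast between the two lines can be reproduced numerically. The sketch below uses the median-of-ratios estimator, which is DESeq2's default way of computing size factors; the five genes and their counts are invented, and the function name `size_factors` is ours rather than a DESeq2 function.

```python
import math
from statistics import median


def size_factors(counts: list[list[int]]) -> list[float]:
    """Median-of-ratios size factors; counts[g][j] is gene g in sample j."""
    n_samples = len(counts[0])
    # Only genes with non-zero counts in every sample have a usable log mean.
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical counts for two samples (invented for this sketch).
counts = [
    [100, 200],  # gene A
    [50, 100],   # gene B
    [400, 100],  # gene C, strongly down in sample 2
    [30, 60],    # gene D
    [80, 160],   # gene E
]

sf = size_factors(counts)
robust_ratio = sf[1] / sf[0]
total_ratio = sum(r[1] for r in counts) / sum(r[0] for r in counts)

print("ratio from library totals:", round(total_ratio, 3))
print("ratio from median of ratios:", round(robust_ratio, 3))
```

Because one highly expressed gene dominates the totals, the sum-based ratio is below 1, while the median-based ratio recovers the factor of 2 shared by the other four genes.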
18. Dispersion
• Fragments are the molecules being sequenced (they equate to cDNA molecules).
• A sequencing library contains n1 fragments corresponding to gene 1, n2 corresponding to gene 2, and so on.
• The total library size is n = n1 + n2 + …
• We submit the sample for sequencing and determine the identity of r randomly sampled fragments.
19. Dispersion
• The number of genes is in the tens of thousands
• The value of n (fragments) depends on the number of cells that were used to prepare the library, and could potentially be in the billions
• The number of reads r is usually in the tens of millions
20. Dispersion
• A read is the sequence obtained from a fragment.
• The probability that a given read maps to the i-th gene is:
– p_i = n_i / n
• We can model the number of reads for gene i by a Poisson distribution
• The rate of the Poisson process is the product of p_i, the initial proportion of fragments for the i-th gene, and r (the number of reads):
– 𝜆_i = r p_i
– 𝜆_i is the Poisson parameter (lambda usually represents this)
21. Dispersion
• In practice we are usually not interested in modelling the counts of a single library, but in differences between libraries
• For example, the difference between control and treatment
• It turns out that replicates vary more than the Poisson distribution allows
• We need to model this, so we instead use a gamma-Poisson (a.k.a. negative binomial) distribution, which better suits our modelling needs
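The extra variability can be seen in a small simulation. This sketch assumes the usual mean and dispersion parameterisation, in which the gamma-Poisson variance is 𝜇 + 𝛼𝜇²; the values 𝜇 = 20 and 𝛼 = 0.5 are invented, and the Poisson sampler is a simple textbook method because the Python standard library does not ship one.

```python
import math
import random
from statistics import mean, variance


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_gamma_poisson(mu: float, alpha: float, rng: random.Random) -> int:
    """Poisson draw whose rate is gamma distributed (mean mu, variance alpha*mu^2)."""
    lam = rng.gammavariate(1.0 / alpha, alpha * mu)  # shape, scale
    return sample_poisson(lam, rng)


rng = random.Random(1)
mu, alpha, n = 20.0, 0.5, 20000  # hypothetical mean, dispersion, sample count

pois = [sample_poisson(mu, rng) for _ in range(n)]
gp = [sample_gamma_poisson(mu, alpha, rng) for _ in range(n)]

# Method-of-moments estimate of the dispersion from variance = mu + alpha * mu^2.
alpha_hat = (variance(gp) - mean(gp)) / mean(gp) ** 2

print("Poisson       mean, variance:", mean(pois), variance(pois))
print("gamma-Poisson mean, variance:", mean(gp), variance(gp))
print("estimated dispersion:", alpha_hat)
```

Both samples have roughly the same mean, but only the gamma-Poisson draws show a variance far above the mean, which is the behaviour seen between biological replicates.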
23. We are now ready to fit a model (GLM)
But before GLM we need to understand
linear modelling
24. Linear models
• We perform an siRNA knockdown of the CTLA-4 gene. We also want to study the effect of a drug X.
• We treat cells with negative control, siRNA alone, drug X alone, or both:
$y = \beta_0 + x_1\beta_1 + x_2\beta_2 + x_1 x_2\beta_{12}$
y is the experimental measurement of interest, i.e. the transformed expression level of a gene.
The coefficient 𝛽0 is the base level of the measurement in the control (a.k.a. the intercept).
x1 and x2 are binary variables.
x1 takes the value 1 if siRNA is administered.
x2 indicates whether the drug was administered.
25. Linear models
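• If only siRNA is used, x1 = 1 and x2 = 0, and the equation simplifies to:
$y = \beta_0 + \beta_1$
• 𝛽1 represents the difference between treatment and control. If measurements are on a log scale, then:
$\beta_1 = y - \beta_0 = \log_2(\text{expression}_{\text{treated}}) - \log_2(\text{expression}_{\text{untreated}}) = \log_2\frac{\text{expression}_{\text{treated}}}{\text{expression}_{\text{untreated}}}$
• This is the logarithmic fold change due to treatment with siRNA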
26. Linear models
• What if we treat with both drug and siRNA?
• x1 = 1 and x2 = 1, so:
$y = \beta_0 + \beta_1 + \beta_2 + \beta_{12}$
• This means that 𝛽12 is the difference between the observed outcome, y, and the outcome expected from the individual treatments, obtained by adding to the baseline the effect of siRNA alone (𝛽1) and of drug alone (𝛽2).
• 𝛽12 is called the interaction effect of siRNA and drug.
27. Design matrix
• We can encode an experimental design in a matrix:
• The columns represent experimental factors and rows
represent the different experimental conditions
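For the siRNA and drug experiment, the design matrix can be written out explicitly. This is a sketch: the column order (intercept, siRNA, drug, interaction) and the coefficient values are assumptions made for illustration.

```python
# Each condition is described by (x1, x2): siRNA given? drug given?
conditions = {
    "control": (0, 0),
    "siRNA": (1, 0),
    "drug": (0, 1),
    "both": (1, 1),
}

# One row per condition; columns are intercept, x1, x2 and the product x1*x2.
design = {name: [1, x1, x2, x1 * x2] for name, (x1, x2) in conditions.items()}

# Hypothetical coefficients on the log2 scale: beta0, beta1, beta2, beta12.
beta = [10.0, -2.0, 1.0, -0.5]


def predict(row: list[int], coefficients: list[float]) -> float:
    """Expected measurement for one design row: the sum of x_k * beta_k."""
    return sum(x * b for x, b in zip(row, coefficients))


for name, row in design.items():
    print(name, row, predict(row, beta))
```

Reading across a row shows which coefficients contribute to a condition; only the "both" row switches on the interaction term.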
28. Noise and replicates
• To estimate noise you need replicates
• Replicates allow us to assess the uncertainty of our estimated 𝛽s
• Extend the equation:
$y_j = \sum_k x_{jk}\beta_k + \varepsilon_j$
We have added the index j and a new term 𝜀j.
The index j now counts over our individual replicate experiments, e.g. if for each of the four conditions we perform three replicates, then j counts from 1 to 12.
The design matrix has 12 rows, and xjk is the value of the matrix in its j-th row and k-th column.
29. Noise and replicates
• But what is 𝜀j?
• These are what we call the residuals; they absorb the differences between replicates
• But the system of twelve equations (top equation) has more unknowns (twelve epsilons and four betas) than equations
• We can address this by minimizing the sum of the squared residuals:
$\sum_j \Big(y_j - \sum_k x_{jk}\beta_k\Big)^2 \to \min$
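A least-squares fit for the twelve-sample design can be sketched with the normal equations. The expression values are invented, and the small Gaussian-elimination solver is written out only to keep the example free of external libraries; in practice a numerical library would be used.

```python
def solve(a: list[list[float]], b: list[float]) -> list[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


# Design rows: intercept, siRNA, drug, interaction (assumed column order).
rows = {"control": [1, 0, 0, 0], "siRNA": [1, 1, 0, 0],
        "drug": [1, 0, 1, 0], "both": [1, 1, 1, 1]}
# Hypothetical log2 expression values, three replicates per condition.
y_by_condition = {"control": [10.0, 10.2, 9.8], "siRNA": [8.1, 7.9, 8.0],
                  "drug": [10.9, 11.1, 11.0], "both": [8.4, 8.6, 8.5]}

X, y = [], []
for name, row in rows.items():
    for value in y_by_condition[name]:
        X.append(row)
        y.append(value)

# Normal equations (X^T X) beta = X^T y give the least-squares coefficients.
k = 4
xtx = [[sum(r[p] * r[q] for r in X) for q in range(k)] for p in range(k)]
xty = [sum(r[p] * yj for r, yj in zip(X, y)) for p in range(k)]
beta = solve(xtx, xty)

residuals = [yj - sum(x * coef for x, coef in zip(r, beta)) for r, yj in zip(X, y)]
rss = sum(e * e for e in residuals)
print("beta:", [round(v, 3) for v in beta], "RSS:", round(rss, 3))
```

With equal replication, the fitted intercept is the mean of the control replicates and the other coefficients are differences of group means, so the residuals are just each replicate's deviation from its group mean.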
30. Generalized linear model for counts
• The above equation models the expected value of the outcome y as a linear function of the design matrix, and it is fitted to the data by least squares
• We now want to generalize these assumptions
31. Generalized linear model for counts
• Modelling data on a transformed scale:
– It can be more fruitful to consider the data on a transformed scale (e.g. logarithmic) than on their natural scale – this can be generalized
• Error distributions:
– The other generalization concerns the minimization criterion
– Instead of least squares, which corresponds to normally distributed errors, we can use a different probabilistic model – in our case we know that we can model our count data with a gamma-Poisson distribution (negative binomial distribution)
32. Generalized linear model for counts
• DESeq2 uses the following generalized linear model:
$K_{ij} \sim \mathrm{GP}(\mu_{ij}, \alpha_i)$
$\mu_{ij} = s_j q_{ij}$
$\log_2(q_{ij}) = \sum_k x_{jk}\beta_{ik}$
The counts K_ij for gene i and sample j are modelled using a gamma-Poisson (GP) distribution with two parameters, the mean 𝜇ij and the dispersion 𝛼i.
By default the dispersion is different for each gene i but the same across all samples, therefore it has no index j.
33. Generalized linear model for counts
• The second equation states that the mean is composed of a sample-specific size factor s_j and q_ij, which is proportional to the true expected concentration of fragments for gene i in sample j
• q_ij is given by the linear model in the third equation via the link function, log2
34. Generalized linear model for counts
• The design matrix (x_jk) is the same for all genes – its rows (j) correspond to samples, its columns (k) correspond to experimental factors
• The coefficients 𝛽ik give the log2 fold changes for gene i for each column of the design matrix X
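The pieces of the model for counts can be put together for a single gene. This is a sketch under the mean and dispersion parameterisation described above; the coefficients, size factors and dispersion are invented, and the function names are ours, not DESeq2's.

```python
import math


def expected_mean(size_factor: float, design_row: list[int],
                  betas: list[float]) -> float:
    """mu_ij = s_j * q_ij, with log2(q_ij) given by the linear predictor."""
    log2_q = sum(x * b for x, b in zip(design_row, betas))
    return size_factor * 2.0 ** log2_q


def gp_log_pmf(k: int, mu: float, alpha: float) -> float:
    """Log probability of count k under a gamma-Poisson with mean mu, dispersion alpha."""
    r = 1.0 / alpha
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))


# Hypothetical gene: log2 base level 6, log2 fold change 1 for treatment.
betas = [6.0, 1.0]
mu_control = expected_mean(1.0, [1, 0], betas)  # size factor 1.0
mu_treated = expected_mean(1.2, [1, 1], betas)  # size factor 1.2

alpha = 0.1  # hypothetical dispersion, shared by all samples of this gene
observed = {"control": 70, "treated": 140}
log_lik = (gp_log_pmf(observed["control"], mu_control, alpha)
           + gp_log_pmf(observed["treated"], mu_treated, alpha))

print(mu_control, mu_treated, round(log_lik, 3))
```

Fitting the model means searching for the coefficients (and dispersion) that make this log likelihood as large as possible across all samples of the gene.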
35. Sharing dispersion
• In RNA-seq you typically have only a few replicates
– This makes it difficult to estimate within-group variability
• The solution is to pool information across genes that are expressed at a similar level
– This assumes that genes of similar average expression strength have similar dispersion
36. Sharing dispersion info
• Earlier in the presentation we explained Bayesian analysis
• We use additional information to improve our estimates: information we know a priori, or have from our analysis of other, similar data
• This is most useful when the data are noisy
• DESeq2 uses an empirical Bayes approach for the estimation of the dispersion parameters (the 𝛼s) and optionally the logarithmic fold changes (the 𝛽s)
37. Sharing dispersion info
• The priors are taken from the distributions of the maximum likelihood estimates (MLEs) across all genes
• The likelihood function measures how well a model with given parameter values fits the data
• With MLE we select the parameter values under which the observed data are most probable
38. Sharing dispersion info
Figure: Shrinkage estimation of logarithmic fold change estimates by use of an empirical prior in DESeq2.
Two genes with similar mean counts and MLE logarithmic fold changes are shown in blue and green; the blue gene has low dispersion and the green gene has high dispersion.
Lower panel – density plots of the normalized likelihoods (solid lines) and the posteriors (dashed lines). Black shows the prior estimated from the MLEs of all genes.
Because of the higher dispersion of the green gene, its likelihood is wider and less sharp, so the prior has more influence on its posterior than in the blue case.
39. Sharing dispersion info
• This means that the Bayes machinery “shrinks” each per-gene estimate towards the peak of the prior
• The amount depends on the sharpness of the peak
• The mathematics is explained in detail in Love et al. 2014
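The effect of likelihood sharpness on shrinkage can be illustrated with a simplified calculation. This sketch replaces both the likelihood and the prior with normal distributions, which is an assumption made for illustration and not DESeq2's actual computation; all numbers are invented.

```python
def shrink(mle: float, se: float, prior_mean: float, prior_sd: float) -> float:
    """Posterior mean when a normal likelihood meets a normal prior.

    The result is a precision-weighted average of the estimate and the prior mean.
    """
    w_data = 1.0 / se ** 2
    w_prior = 1.0 / prior_sd ** 2
    return (w_data * mle + w_prior * prior_mean) / (w_data + w_prior)


prior_mean, prior_sd = 0.0, 1.0  # hypothetical prior centred on "no change"

# Two genes with the same estimated log2 fold change but different precision.
blue = shrink(mle=2.0, se=0.3, prior_mean=prior_mean, prior_sd=prior_sd)
green = shrink(mle=2.0, se=1.5, prior_mean=prior_mean, prior_sd=prior_sd)

print("blue  (sharp likelihood):", round(blue, 3))
print("green (wide likelihood): ", round(green, 3))
```

The gene with the wide likelihood is pulled much further towards the prior mean, matching the behaviour of the green gene in the figure.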
40. Dispersion
• Estimates gene-wise dispersions using maximum likelihood
• Fits a curve to capture the dependence of these estimates on the average expression strength
• Shrinks the gene-wise values towards the curve using an empirical Bayes approach (more later)
Figure: variability (dispersion) plotted against expression level. Each dot is a gene; genes with low expression have high variability. Blue points are the final estimates after being “pulled” towards the red line.
41. Wald test
• Once a GLM is fitted, a Wald test is performed for the treatment coefficient
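In its basic form the Wald test divides the estimated coefficient by its standard error and compares the result with a standard normal distribution. A minimal sketch, with an invented coefficient and standard error:

```python
import math


def wald_test(beta_hat: float, se: float) -> tuple[float, float]:
    """Wald statistic and two-sided p-value under a standard normal reference."""
    z = beta_hat / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical log2 fold change estimate and its standard error.
z, p = wald_test(beta_hat=1.5, se=0.5)
print("z =", z, "p =", round(p, 5))
```

A large estimate with a large standard error can give the same statistic as a small, precisely estimated one, which is why the standard error matters as much as the fold change.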
42. Likelihood ratio test (LRT)
• Analyzes all levels of a factor at once
• The LRT is used to identify any genes that show a change in expression across the different levels
• This type of test can be especially useful in analyzing time course experiments
50. Exploring the results
• There are four main plots that explain a lot about
your data:
– The histogram of p values
– The MA plot
– An ordination plot
– A heatmap
51. Exploring the results
• The left-hand peak corresponds to differentially expressed genes.
• The background is the flat part on the right-hand side.
• p-value < 0.01: ~990 genes.
• Of these, around 100 genes are expected from the background.
• This suggests an FDR of about 10% (100/990).
• A shifted background distribution could indicate batch effects.
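The histogram-based estimate is simple arithmetic, and the Benjamini-Hochberg adjustment gives a per-gene version of the same idea (DESeq2 reports Benjamini-Hochberg adjusted p-values). The sketch below uses the counts from the slide and four invented p-values.

```python
def bh_adjust(pvalues: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each value is capped by later ones.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Estimate read off the histogram: expected false positives / genes called.
rejected, background = 990, 100
fdr_estimate = background / rejected
print("FDR from histogram:", round(fdr_estimate, 3))

# Hypothetical p-values for four genes.
padj = bh_adjust([0.01, 0.04, 0.03, 0.005])
print("adjusted p-values:", [round(v, 3) for v in padj])
```

Calling genes significant at an adjusted p-value below 0.1 then corresponds to accepting roughly one false positive in ten.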
52. Exploring the results
• The MA plot shows fold changes vs the mean of size-factor normalized counts
• Log scale for both axes
• Blue points are significant genes
55. Shrinkage estimation
• Fit GLMs for all genes without shrinkage
• Estimate a normal empirical Bayes prior from the non-intercept coefficients
• Adding the log prior to the GLMs' log likelihoods results in a ridge penalty
• Fit the GLMs again, now with penalized likelihoods, to get shrunken coefficients
56. Exploring the results
• PCA is a dimensionality reduction technique for high-dimensional data
• The proportion of variance explained is shown for each principal component (PC)
• Further info: https://builtin.com/data-science/step-step-explanation-principal-component-analysis
57. Exploring the results
• Heatmaps can be a powerful way of visualizing a subset of genes.
• The dendrogram is also very useful for understanding associations between samples and metadata.
58. Dealing with outliers
• Sometimes data can contain very large counts that
appear unrelated to the experimental design
• Outliers arise for many reasons, e.g. technical or experimental artefacts
• A diagnostic measure for outliers is Cook’s distance
• Cook’s distance is a measure of how much a single
sample is influencing the fitted coefficients for a gene
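The idea behind Cook's distance can be shown with the classical linear-model formula for the simplest possible model, a single group mean. DESeq2 computes the distance within its GLM fit, so this is only a sketch of the concept; the five values are invented.

```python
def cooks_distance_mean_model(values: list[float]) -> list[float]:
    """Cook's distance of each observation when the model is one group mean.

    Uses D_i = e_i^2 / (p * s^2) * h / (1 - h)^2 with p = 1 and leverage h = 1/n.
    """
    n = len(values)
    p = 1
    fitted = sum(values) / n
    residuals = [v - fitted for v in values]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual mean square
    h = 1.0 / n
    return [(e * e / (p * s2)) * h / (1.0 - h) ** 2 for e in residuals]


# Hypothetical normalized counts for one gene in five replicates.
normalized = [10.0, 12.0, 11.0, 9.0, 250.0]
distances = cooks_distance_mean_model(normalized)

for value, d in zip(normalized, distances):
    print(value, round(d, 3))
```

The single very large count dominates the fitted mean and gets a distance far above the others, which is the pattern used to flag a gene as affected by an outlier.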
Editor's Notes
CLT = when independent random variables are added, their normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed
Can be sharp or broad
Posterior - means after taking into account the relevant evidence related to a particular thing being examined.
Coefficient - a numerical or constant quantity placed before and multiplying the variable in an algebraic expression (e.g. 4 in 4x y).
Beta - A beta weight is a standardized regression coefficient (the slope of a line in a regression equation).
epsilon
Residual is just the error of our result. It is the difference between the observed and the expected value of our quantity of interest
Generalization refers to your model's ability to adapt properly to new, previously unseen data, drawn from the same distribution as the one used to create the model.
In Bayesian statistics, the posterior probability of a random event or an uncertain proposition is the conditional probability that is assigned after the relevant evidence or background is taken into account.