[DigiHealth 22] Budget friendly sample sizes for genomics research - Ognjen Milicevic

Budget friendly sample sizes for genomics research
Ognjen Milicevic, MD
Biostatistician, bioinformatician

Why do you need a biostatistician?

Common biostatistics tasks
● Cleaning and transforming data
● Data description
● Statistical testing
● Tabulation and visualization
● Bioinformatics (applied statistics for genomics)
● Post-hoc power calculations
● ...
● Complain they weren't consulted earlier

Post-hoc sample size / power analysis
● For convenience, we justify choices already made
● Find a similar effect size in the literature
● Use the posterior distribution as prior
● Set the desired power (80-100%)
● Adjust as needed for dropout, loss, margin-of-error
● Obtain the sample size you already have

Dear bioinformatician, how many samples do we need to sequence to investigate...

NO CONVENIENCE!
● Not routinely done
● Effect size unknown
● Literature not helpful
● Multiple unknown genes
● Distribution is complex
● ...

RNA sequencing around the internet

DATA SCIENCE OF RNA SEQUENCING

Natural variability of RNA per gene (20K+ genes)
De Torrente et al. (2020):
"Surprisingly, the expression of less than 50% of all genes was Normally-distributed, with other distributions including Gamma, Bimodal, Cauchy, and Lognormal also represented."
Liu et al. (2019):
"Based on the analysis of a group of real gene expression profiles, this study reveal that the primary density distributions of the real profiles are normal/log-normal and t distributions, accounting for 80% and 19% respectively."

Representing RNAs with fragments
Gamma-Poisson distribution
Count and normalize to quantify (TPM)
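
A minimal sketch of the two ideas on this slide, using only the Python standard library. It assumes the common parameterization in which a Gamma-Poisson (negative binomial) count has variance = mean + dispersion × mean². The function names, counts, and gene lengths are made up for illustration and are not from the study.

```python
import math
import random
import statistics

def gamma_poisson(mean, dispersion, rng):
    """Draw one fragment count whose Poisson rate is itself Gamma-distributed."""
    # A Gamma with shape 1/dispersion and scale mean*dispersion has the requested mean.
    rate = rng.gammavariate(1.0 / dispersion, mean * dispersion)
    # Knuth's Poisson sampler (adequate for the small rates used here).
    limit, k, p = math.exp(-rate), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def tpm(counts, lengths_kb):
    """Transcripts per million: length-normalize, then scale so values sum to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

rng = random.Random(0)
draws = [gamma_poisson(mean=20, dispersion=0.5, rng=rng) for _ in range(5000)]
sample_mean = statistics.mean(draws)
sample_var = statistics.variance(draws)
# Overdispersion: the variance is much larger than the mean, unlike a plain Poisson.
print(f"mean={sample_mean:.1f}, variance={sample_var:.1f}")

counts = [100, 300, 50]        # hypothetical fragment counts for three genes
lengths_kb = [1.0, 3.0, 0.5]   # hypothetical gene lengths in kilobases
values = tpm(counts, lengths_kb)
print(values)
```

The three hypothetical genes end up with the same TPM because their counts are proportional to their lengths, which is the point of normalizing by length before scaling.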

Overview of the pipeline
[Diagram: tissue sample → chemical preparation → sequencing. Sources of variability shown alongside the effect between groups: inter-individual variation in RNA, batch effects, representation variability.]

Count matrix and metadata
Each gene is an independent outcome

LAYERS UPON LAYERS OF VARIABILITY
So, what about those sample sizes?

COVID-19 RNA characterization
Example project

RNA characterization of COVID-19 (2021) - Plan
● Total RNA – virus and host (human)
● Nasopharyngeal swabs and blood samples
● Paired design (on admission and discharge from hospital)
● 18 individuals, total of 72 samples
● Which biological pathways are affected? (DEG)
● What can we say about the viral load? (metagenomics)

Estimating sample size for RNA
● Theoretical models with assumed distributions
● Parameters inferred from previous datasets
● R-packages: RNASeqDesign, PROPER, powsimR, ssizeRNA
● Web tool: RNASeqSampleSize
● Results vary between tools
● If cost is not a concern, choose the most conservative (largest) estimate
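
To give a feel for why the estimates grow so quickly, here is a rough sketch based on the textbook normal approximation for comparing two group means. This is not what the packages listed above do (they model count distributions). The effect size of 1.0 and the Bonferroni-style threshold for 20,000 genes are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def samples_per_group(effect_size, alpha=0.05, power=0.8):
    """Per-group n for a two-sided, two-sample comparison (normal approximation).

    effect_size is the standardized difference (difference / standard deviation).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

# One gene tested on its own.
n_single = samples_per_group(effect_size=1.0)

# Same effect, but with a per-gene threshold of 0.05 / 20000
# (assumed, deliberately conservative correction for 20K genes).
n_genomewide = samples_per_group(effect_size=1.0, alpha=0.05 / 20000)

print(n_single, n_genomewide)
```

The same effect needs several times more samples per group once the significance threshold accounts for testing every gene, which is one reason the tools disagree: each makes different choices about the effect size, dispersion, and correction.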

Proposed approach
● Perform one estimate and use it
● Remove unwanted variability (batch effect)
● Reduce variability with paired design
● Use meaningful metadata
● Filter the genes

Remove unwanted variability
A number of SVD-based methods remove high-level batch effects without specifically tracing them to interpretable variables.
One can use housekeeping or control genes as markers.
● SVA
● RUVseq
These methods produce new surrogate variables.
Colleague quote:
"Once I see batch effects, I can correct them mathematically, but I
never trust that dataset again."

Batch effects work against collaborative science!

Paired design
Control samples are taken from the same patients, after resolution or before the event.
● Increases power
● Not all analysis frameworks can take advantage of it
● Sometimes biologically difficult
● Reduces DF by half
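
The trade-off between higher power and fewer degrees of freedom can be sketched with a small simulation. All parameters below (18 subjects, a large between-subject spread, a small measurement noise, the effect of 0.5) are assumptions picked for illustration, and the plain t statistics stand in for the GLM-based tests used in practice.

```python
import math
import random
import statistics

def simulate(n_subjects, subject_sd, noise_sd, effect, rng):
    """Return (before, after) measurements that share a per-subject baseline."""
    before, after = [], []
    for _ in range(n_subjects):
        baseline = rng.gauss(0.0, subject_sd)  # inter-individual variation
        before.append(baseline + rng.gauss(0.0, noise_sd))
        after.append(baseline + effect + rng.gauss(0.0, noise_sd))
    return before, after

def paired_t(before, after):
    """t statistic on within-subject differences, with n - 1 degrees of freedom."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

def unpaired_t(before, after):
    """Two-sample t statistic for equal group sizes, with 2n - 2 degrees of freedom."""
    n = len(before)
    se = math.sqrt((statistics.variance(before) + statistics.variance(after)) / n)
    t = (statistics.mean(after) - statistics.mean(before)) / se
    return t, 2 * n - 2

rng = random.Random(1)
before, after = simulate(18, subject_sd=2.0, noise_sd=0.5, effect=0.5, rng=rng)
t_paired, df_paired = paired_t(before, after)
t_unpaired, df_unpaired = unpaired_t(before, after)
print(f"paired:   t={t_paired:.2f}, df={df_paired}")
print(f"unpaired: t={t_unpaired:.2f}, df={df_unpaired}")
```

Both tests see the same mean difference. Pairing removes the shared baseline from the standard error, so the statistic grows even though the degrees of freedom drop from 34 to 17.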

Meaningful metadata
Gender and age are almost always relevant.
Collect metrics of sample quality (before and after sequencing).
Disease subtypes can serve as a covariate or a grouping variable.
Good metadata helps choose which samples to sequence when only a subset can be sequenced.

Filter genes
Multiple testing correction for 20K+ genes.
Remove mostly unexpressed genes.
A priori removal is allowed.
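
A small sketch of how filtering and multiple testing correction interact. The Benjamini-Hochberg procedure is used here as one common choice of correction; the slide does not say which method was applied. The gene names, counts, p-values, and the `min_total=10` threshold are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def filter_expressed(counts_by_gene, min_total=10):
    """Keep genes whose total count across samples reaches min_total.

    The filter looks only at overall expression, never at group labels,
    so it can be applied before testing.
    """
    return {g: c for g, c in counts_by_gene.items() if sum(c) >= min_total}

counts_by_gene = {
    "geneA": [50, 60, 40, 70],
    "geneB": [0, 1, 0, 0],     # mostly unexpressed
    "geneC": [12, 9, 15, 20],
    "geneD": [0, 0, 2, 1],     # mostly unexpressed
}
kept = filter_expressed(counts_by_gene)
print(sorted(kept))

adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(adjusted)
```

Every gene removed before testing lowers the number of tests `m`, which makes the correction less severe for the genes that remain.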

Results
● edgeR GLM
● Nasal DEG (p<0.05): 40 (paired) / 51 (unpaired)
● Blood DEG (p<0.05): 76 (paired) / 2 (unpaired)
● Every parameter choice changes the results
● Validation?

Annotation representation testing – Panther.db
● Annotation is a subset of genes
● Multiple available annotation sets (structure, function, pathway...)
● We only use significant genes
● Overrepresentation test – chi-square to compare observed and expected frequencies
● Enrichment test – Mann-Whitney to test randomness of ranks
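
The overrepresentation idea can be sketched as a chi-square goodness-of-fit test with one degree of freedom. This is an illustration of the principle only, not the implementation used by the annotation tool, and every number in the example call is hypothetical.

```python
import math

def overrepresentation_chi2(n_significant, n_significant_in_set,
                            set_size, background_size):
    """Chi-square test (1 df) of observed vs expected hits in an annotation set."""
    # Expected hits if significant genes were spread across the background at random.
    expected_in = n_significant * set_size / background_size
    expected_out = n_significant - expected_in
    observed_out = n_significant - n_significant_in_set
    chi2 = ((n_significant_in_set - expected_in) ** 2 / expected_in
            + (observed_out - expected_out) ** 2 / expected_out)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical: 100 significant genes, 20 of them in an annotation set
# that covers 2,000 of 20,000 background genes (so 10 would be expected).
chi2, p_value = overrepresentation_chi2(100, 20, 2000, 20000)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
```

The chi-square approximation becomes unreliable when the expected count is very small, which is common for small annotation sets and short lists of significant genes.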

Molecular function in blood (PAIRED)
● Increased immunoglobulin binding
● Reduced smell (in blood!)
● Reduced oxygen binding and carrier activity
● We consider the result validated

Takeaways of the study
● Study rescued by pairing
● No batch to correct
● Almost no metadata
● Smaller signal in blood
● Specific tissue (nasal) is more robust

WHAT HAPPENED?
Data science implications

Reduced individual variation
[Pipeline diagram repeated from "Overview of the pipeline", with the added label "Intra".]

Reduced batch effects
[Pipeline diagram repeated from "Overview of the pipeline", with the added label "Intra".]

Easier to control for batches
● Pairing absorbs a proportion of batch effects
● Usually 8 lanes in a flowcell
● Focus on pairs instead of whole samples
● Aggregation of datasets easier
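
One way to read "focus on pairs" is to keep both samples of a pair in the same lane, so that any lane effect cancels in the within-pair comparison. The sketch below assumes that reading, assumes all samples go on a single 8-lane flowcell, and ignores lane capacity. The identifiers and the helper name are hypothetical.

```python
from collections import defaultdict

def assign_pairs_to_lanes(pair_ids, n_lanes=8):
    """Round-robin whole pairs across lanes so both samples of a pair share a lane."""
    lanes = defaultdict(list)
    for index, pair_id in enumerate(pair_ids):
        lane = index % n_lanes
        lanes[lane].append((pair_id, "admission"))
        lanes[lane].append((pair_id, "discharge"))
    return dict(lanes)

# 18 individuals x 2 sample types = 36 pairs, i.e. 72 samples in total.
pair_ids = [
    f"patient{p:02d}_{tissue}"
    for p in range(1, 19)
    for tissue in ("nasal", "blood")
]
layout = assign_pairs_to_lanes(pair_ids)

for lane, samples in sorted(layout.items()):
    print(lane, len(samples))
```

A design that split pairs across lanes would mix lane effects into exactly the difference the paired analysis is trying to measure.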

Technical downsides of pairing
● Loss of half of the degrees of freedom (DF)
● Many frameworks cannot use it as easily as GLM-based ones
● RNA is used for other analyses:
○ SUPPA2 for alternative splicing
○ Building empirical distribution from all pairs of samples
○ If pairing were enforced, the number of observations would drop drastically

SHOULD WE ALWAYS PAIR?
Medical implications

Tissue implications
● Specific tissues have robust signatures without pairing
● Blood reflects many tissues:
○ Weaker signal
○ Local changes reflected
● Systemic effects are found only in blood
● Always available for sampling (minimally invasive)
● Blood analysis benefits from pairing

Utility implications
● Paired designs are easier to aggregate into meta-studies (robust to batch effects)
● Blood controls can be used as unpaired controls for other studies (if healthy enough)
● Solves the problem of finding controls
● If controls are taken after resolution, their health is questionable (long COVID)
● Some chronic diseases cannot be caught early or never resolve, so pairing is impossible

Example – cardiovascular events
● We are interested in markers of plaque progression/instability
● Patient checkup and sampling every X months
● Sequencing is expensive, sampling and storing is not
● Sequence only the last two samples before the event
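
The selection rule in the last bullet can be sketched in a few lines. The dates below are hypothetical (the sampling interval is left as "X months" on the slide), and the helper name is made up for illustration.

```python
from datetime import date

def samples_to_sequence(sample_dates, event_date, n_before=2):
    """Pick the last n_before stored samples taken strictly before the event."""
    earlier = sorted(d for d in sample_dates if d < event_date)
    return earlier[-n_before:]

# Hypothetical stored samples for one patient and a hypothetical event date.
stored = [date(2021, 1, 10), date(2021, 7, 12), date(2022, 1, 9), date(2022, 7, 11)]
chosen = samples_to_sequence(stored, event_date=date(2022, 3, 1))
print(chosen)
```

Only the chosen samples incur sequencing cost; everything else stays in storage, which is the cheap part of the design.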

Example – neurodegenerative disease (ALS)
● We cannot predict the disease (10% familial)
● Patients are available for sampling only once diseased
● Sequence each patient at time points sufficiently far apart
● We cannot find the root cause of ALS, as we are not catching the initial event
● We can find signatures of neuronal suffering and death, which is an actionable point
● Generalizes to all chronic diseases

Example – cancer
● For DNA, the tumor is matched with a blood sample control
● For RNA, we need the normal surrounding tissue
● Sampling the healthy normal target tissue may be problematic
● Tissue margin – potential normal sample
● Admixture of tumor in normal reduces the signal (but not critically for RNA)

Many thanks to...
● Institute for Biocides and Medical Ecology for providing the samples and sequencing
● HTEC Group for providing computational resources and support
● School of Medicine, University of Belgrade for supporting research
● Thanks to DSC organizers for the invite
● Last but not least...

ognjen.milicevic@med.bg.ac.rs
ognjen.milicevic@htecgroup.com
ognjen011@gmail.com
