[DSC Europe 23][DigiHealth] Vesna Pajic - Machine Learning Techniques for omics data analysis
1. Machine Learning Techniques for Omics Data Analysis
Dr Vesna Pajic, Manager in Bioinformatics, Velsera
vesna.pajic@velsera.com
DSC Europe, November 2023
2. Agenda
Introducing the team
What are omics technologies and what are they used for?
What do omics data look like and how are they processed?
• Raw data
• Analysis steps
• Expected outcomes and results
Where does ML fit in?
Real scenario – an example project from Velsera and how we handled it
Conclusion and discussion
3. Introducing the team
Velsera is an international company formed in January 2023 by joining together
UgenTec, Pierian and Seven Bridges.
Focused on powering precision medicine and improving human health.
Offices in Serbia, the US, the UK, Belgium, India and Turkey, with more than 600 employees.
Bioinformatics (BIX) Team
• 40+ members
• Backgrounds in Computer Science, Mathematics,
Engineering, Molecular Biology, Pharmacy and more
4. What are omics technologies and what are they used for?
Omics technologies are used for exploring different aspects of living organisms, including human
health.
Exploring genetic material:
• DNA molecules are present in all living cells, together with other genetic material (e.g. RNA)
• DNA is seen as a recipe for life - instructions for growth, development, functioning and
reproduction of a living organism
Understand DNA code to understand:
• inheritance mechanisms, human origins;
• causes and development of diseases;
• possible treatments and cures, drug design;
• animal and plant breeding;
• environmental aspects;
• biofuels, and more.
5. What are omics technologies and what are they used for?
• Genome – the complete DNA material in one cell or organism
• Transcriptome – mRNA molecules present in a cell, tissue or organism
• Proteome – the set of proteins in a cell, tissue or organism
• Metabolome, Lipidome, Epigenome …
Omics technologies were developed for digitizing and studying these molecules:
genomics, epigenomics, lipidomics, metabolomics, proteomics, transcriptomics.
6. What are omics technologies and what are they used for?
DNA – a polymer consisting of two polynucleotide chains that together form a double helix
Each chain (strand) is composed of nucleotides carrying one of four bases: Adenine (A), Thymine (T),
Cytosine (C), and Guanine (G)
The human genome has about 3 billion nucleotide base pairs
7. Reading DNA with sequencing technology
Figures are taken from the Illumina introduction to sequencing:
https://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf
8. Digitalized genetic material – raw data
Output from a sequencer after reading DNA molecules – a FILE
• The most common format is FASTQ: a text file containing reads (sequenced segments of the genome)
and a quality score for each base call
Raw data specifics:
• A FASTQ file can be several hundred GB in size
• One FASTQ file usually contains billions of reads corresponding to genome segments
• The file contains errors introduced during sequencing
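To make the raw data format concrete, here is a minimal sketch of reading FASTQ records in Python. It assumes the common four-line record layout (header, sequence, '+' separator, quality string) and Phred+33 quality encoding; the function name `read_fastq` and the example read are made up for illustration, and real pipelines would normally use an established parser instead.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality_scores) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33.
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores  # drop the leading '@'


# Hypothetical single-read example instead of a real multi-GB file.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
print(records)
```

A low score (such as the last base here) marks a base call that is more likely to be a sequencing error, which is what the quality control and trimming steps act on.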
9. Processing raw data – common steps in a bioinformatics analysis
Preprocessing
• Quality Control
• Trimming
• Filtering
Secondary Analysis
• Assembly
• Alignment
• Variant Calling
• Gene Expression Analysis
10. Secondary analysis outputs – examples
VCF files - contain detected variants
(differences between DNA from a sample and
the reference genome)
Feature count matrices - contain the number of
transcripts (mRNA molecules) coming from a
gene, per sample
11. Understand differences – get insight into a phenomenon of interest
Tertiary analysis:
• Differential expression
• Cell Clustering
• Cell Composition
• Variant Annotation
• GWAS and more
Manhattan plot showing significant SNP loci from a GWAS study
Volcano plot with significantly down- and up-regulated genes
Clusters of cells with similar gene expression
12. Where does ML fit in?
Analysis step | ML methods used
Sequencing error correction | Tools based on suffix trees, k-mer clustering and deep neural networks; several studies on how to set the parameters of ML algorithms.
Assembly | Neural networks for binning of reads and detection of sequencing errors; SVMs and HMMs for read assembly into contigs; Random Trees and Random Forest for read overlap and assembly evaluation.
Alignment | HMMs for pairwise alignment; RNNs for global alignment.
Variant calling | CNNs as a universal approximator for the identification of variants in NGS reads (GATK, DeepVariant).
Transcript / gene quantification | k-mer clustering for assigning reads to the transcriptome and for quantification.
ML methods have traditionally been used in almost every step of omics data analysis.
Well-established, standardized algorithms already exist for the secondary analysis steps.
Improvements are possible, but usually the existing best-practice guidelines are followed.
13. Where does ML fit in?
Auslander, N., Gussow, A. B., & Koonin, E. V. (2021). Incorporating Machine Learning into Established Bioinformatics Frameworks. International
Journal of Molecular Sciences, 22(6). https://doi.org/10.3390/ijms22062903
14. An example from Velsera’s portfolio
Prediction of a drug response from gene expression profiles
A client is developing oncology drugs that target metabolic pathways in cancer cells.
• Interested in predicting which cancer cell lines are susceptible to their class of drug, based on gene
expression data;
• Reached out to Velsera for a Discovery Partnership type of project.
Core project team from Velsera:
• Jeff Brabec, Scientific Partner
• Nikola Tešić, Responsible Bioinformatics Analyst
• Vojislav Varjačić, Senior Bioinformatics Analyst
• Nevena Nikolić, Bioinformatics Analyst
15. Drug sensitivity prediction
Drug response measuring:
• For a cell line, cell viability is measured at several increasing doses
and compared to an untreated control.
• A single number summarizing drug sensitivity is calculated.
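• Summary metrics used: AAC (area above the dose-response curve), AUC (area under the curve),
IC50 (half-maximal inhibitory concentration)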
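As an illustration of how such a summary number can be derived, the sketch below computes a normalized AUC and AAC from a dose-response curve using the trapezoidal rule over log10 concentration. This is one common convention, not necessarily the one used in the project; the function name `auc_aac` and the example values are invented, and viability is assumed to be expressed as a fraction of the untreated control.

```python
import math


def auc_aac(concentrations, viabilities):
    """Return (AUC, AAC), both normalized to the range 0..1.

    Assumptions: concentrations are positive and increasing, viabilities are
    fractions of the untreated control, and the curve is integrated over
    log10(concentration) with the trapezoidal rule.
    """
    x = [math.log10(c) for c in concentrations]
    area = 0.0
    for i in range(1, len(x)):
        area += (x[i] - x[i - 1]) * (viabilities[i] + viabilities[i - 1]) / 2
    auc = area / (x[-1] - x[0])  # normalize by the tested concentration range
    return auc, 1.0 - auc


# Hypothetical measurements for one cell line.
doses = [0.01, 0.1, 1.0, 10.0]
viability = [1.0, 0.9, 0.5, 0.1]
auc, aac = auc_aac(doses, viability)
print(round(auc, 3), round(aac, 3))
```

Under this convention a higher AAC means a more sensitive cell line, and the value depends on the concentration range that was tested.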
Drug response prediction:
• Choose a representative set of data: gene expression profiles and
drug response metrics for a large cohort
• Build a model (method?)
• Use it for prediction on sample data
17. Clean up and prepare the data
Event | Removed samples | Remaining samples
Gathering all samples that have entries for the drug of interest | / | 686
Removing duplicate entries | 24 | 662
Removing samples with missing values encoded as NA for AAC | 33 | 629
Removing samples that do not have corresponding gene expression data | 26 | 603
Removing samples from cancer types that were of no interest | 87 | 516
Splitting the data 70/15/15 for training, validation, and testing | / | Training - 361, Validation - 77, Testing - 78
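The sample counts in the table can be reproduced with a few lines of Python. This is only a sketch of a shuffled 70/15/15 split: the function name `split_70_15_15`, the seed, and the rounding rule (round down for training and validation, remainder goes to testing) are assumptions, although that rounding rule does give the same 361/77/78 sizes as reported.

```python
import random


def split_70_15_15(sample_ids, seed=0):
    """Shuffle sample IDs and split them 70/15/15 into train/validation/test."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # fixed seed for a reproducible split
    n_train = int(0.70 * len(ids))
    n_val = int(0.15 * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # whatever is left after train and validation
    return train, val, test


# Replay the clean-up steps from the table: start at 686, subtract removals.
remaining = 686
for removed in (24, 33, 26, 87):
    remaining -= removed

# Placeholder integer IDs stand in for the real cell line identifiers.
train, val, test = split_70_15_15(range(remaining))
print(remaining, len(train), len(val), len(test))
```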
Focusing on the drug response at a given concentration instead of the overall drug response:
• The dataset was transformed so that each concentration–response pair became one row, containing
both of those columns as well as the expression data.
• More information is available to the model, as there are ~16 values per cell line
• The user has more control over what to predict – AACs can be very different depending on the concentration range
over which they are calculated
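The transformation described above can be sketched as follows. The data structure, the function name `to_long_format`, the gene names and the values are all hypothetical; the point is only that each concentration-response pair becomes its own row, with the cell line's expression values repeated on every row.

```python
def to_long_format(cell_lines):
    """Turn one record per cell line into one row per concentration-response pair."""
    rows = []
    for line in cell_lines:
        for conc, resp in zip(line["concentrations"], line["responses"]):
            row = {"cell_line": line["name"], "concentration": conc, "response": resp}
            row.update(line["expression"])  # same expression profile on every row
            rows.append(row)
    return rows


# Hypothetical input: two cell lines, three concentrations each (the talk mentions ~16).
cell_lines = [
    {"name": "line_A", "concentrations": [0.1, 1.0, 10.0],
     "responses": [0.95, 0.60, 0.20], "expression": {"GENE1": 5.2, "GENE2": 0.3}},
    {"name": "line_B", "concentrations": [0.1, 1.0, 10.0],
     "responses": [0.99, 0.90, 0.70], "expression": {"GENE1": 1.1, "GENE2": 4.8}},
]
rows = to_long_format(cell_lines)
print(len(rows), rows[0])
```

With this layout, drug concentration becomes an ordinary input feature of the model alongside the gene expression values.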
18. Various feature selection approaches
Initial dataset had ~19,000 features (genes)
We experimented with several feature selection approaches (HSICLasso, Forward Feature Selection,
GAMBoost, Recursive Feature Elimination)
We decided to proceed with:
• Filtering features based on Spearman’s rank correlation
• Performing forward selection
• Trying to narrow the feature set down further with backward selection
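The first of these steps, the correlation filter, might look like the sketch below: Spearman's rank correlation is computed as the Pearson correlation of the ranks, and genes whose absolute correlation with the response falls below a cut-off are dropped. The cut-off of 0.5, the gene names and the values are illustrative assumptions; the talk does not state which threshold was used.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant input: treat as uncorrelated
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


def filter_features(expression, response, threshold=0.5):
    """Keep genes with |rho| >= threshold (threshold is a hypothetical choice)."""
    return [gene for gene, values in expression.items()
            if abs(spearman(values, response)) >= threshold]


# Hypothetical expression values for four genes across five samples.
expression = {
    "GENE1": [1.0, 2.0, 3.0, 4.0, 5.0],  # rises with the response
    "GENE2": [3.0, 1.0, 5.0, 2.0, 4.0],  # weakly related
    "GENE3": [2.0, 2.0, 2.0, 2.0, 2.0],  # constant, carries no signal
    "GENE4": [5.0, 4.0, 3.0, 2.0, 1.0],  # falls as the response rises
}
response = [0.1, 0.2, 0.4, 0.7, 0.9]
selected = filter_features(expression, response)
print(selected)
```

A filter like this is cheap enough to run on all ~19,000 genes, which leaves a much smaller candidate set for the more expensive forward and backward selection steps.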
19. Models used
To assess model accuracy, we used RMSE (root mean squared error) and NRMSE (normalized RMSE).
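For reference, a minimal sketch of both metrics in Python. RMSE is standard; NRMSE has several conventions (dividing by the range, the mean or the standard deviation of the observed values), and the talk does not say which one was used, so normalization by the observed range here is an assumption. The example values are invented.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def nrmse(y_true, y_pred):
    """RMSE divided by the range of the observed values (one of several conventions)."""
    return rmse(y_true, y_pred) / (max(y_true) - min(y_true))


# Hypothetical observed and predicted drug responses.
observed = [0.2, 0.4, 0.6, 0.8]
predicted = [0.3, 0.3, 0.7, 0.7]
print(round(rmse(observed, predicted), 4), round(nrmse(observed, predicted), 4))
```

Normalizing makes errors comparable across response variables measured on different scales.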
Elastic Net
• A form of regularized linear regression
• Tried with various feature selection methods, on both the original dataset and the individual concentration datasets;
this brought improvements, but smaller ones than with the other models
Generalized Additive Models (GAMs)
• Extend traditional linear models by allowing nonlinear relationships between the predictor variables and the response
Extreme Gradient Boosting (XGBoost)
• A gradient-boosted decision tree ML method with built-in regularization that helps limit overfitting
• Tried it with hyperparameter optimization
XGBoost combined with hyperparameter optimization and feature selection showed the best results!
20. Results
The final model had 20 features, which were then checked against the literature: 14 out of 20 were
associated with different types of cancer.
The gene the client was interested in from the beginning, whose activity is associated with several
types of cancer, was present in the chosen set of features.
The most important feature (gene) in the model, after the drug concentration, was one the client had not
been aware of.
The client is now looking into this protein as a possible biomarker in liquid biopsy!