This presentation gives an overview of the work that I did during my summer internship at the University of Tokyo. I worked in FAiS, HGC, IMSUT under Prof. Kenta Nakai.
2. Summary
1. Studied Non-negative matrix factorization (NMF)
2. Applied it to JFCR39 and DepMap-Avana datasets
3. Learnt how to use a supercomputer
4. Made a Python package - BigNMF
3. About NMF
Non-negative matrix factorization (NMF) refers to a set of algorithms which factorize a
matrix X into two matrices W and H such that X ≈ W.H and all three matrices are
non-negative.
This is a very versatile algorithm and finds use in many fields, one of which is
computational biology as the resultant factorized matrices are easy to interpret
and inspect.
There are many variations of this algorithm, such as Sparse NMF and Integrative NMF.
This paper popularized use of NMF.
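To make the factorization concrete, here is a minimal sketch of Standard NMF using multiplicative updates for the squared (Frobenius) error. It assumes NumPy; the function name `nmf`, its parameters and the toy matrix are illustrative choices, not taken from the internship code.

```python
import numpy as np

def nmf(X, rank, n_iter=500, eps=1e-9, seed=0):
    """Approximate non-negative X (m x n) as W (m x rank) @ H (rank x n)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # Random positive starting points (an assumption; other inits exist).
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates: every factor is a ratio of non-negative
        # terms, so W and H stay non-negative throughout.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy matrix, purely illustrative.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [0.0, 1.0, 1.0]])
W, H = nmf(X, rank=2)
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_error:.4f}")
```

The result depends on the random start, which is one reason NMF is usually run several times before the factors are interpreted.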
4. Getting to know NMF
I spent the first few days of my internship reading about NMF and its uses.
Then, I studied different modifications of NMF and why they were made.
After this, I implemented the algorithm on my own, in both Standard and Sparse
variants.
To fine-tune my implementation, I ran the algorithm on simulated/toy data and
made changes as necessary.
This was to ensure that the algorithm would give a correct factorization when I used it
on real data.
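A check on simulated data can be sketched as follows: build a matrix from known block-structured factors, factorize it, and confirm that the planted groups come back. The sparse variant shown here adds an L1 penalty on H to the multiplicative update, which is one common formulation and not necessarily the one used in the internship. All names and parameter values are illustrative assumptions.

```python
import numpy as np

def sparse_nmf(X, rank, l1=0.01, n_iter=1000, eps=1e-9, seed=0):
    """NMF with an L1 penalty on H (one common sparse formulation)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], rank)) + eps
    H = rng.random((rank, X.shape[1])) + eps
    for _ in range(n_iter):
        # The l1 term in the denominator shrinks entries of H toward zero.
        H *= (W.T @ X) / (W.T @ W @ H + l1 + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        # Heuristic rescaling: unit-norm columns in W, compensated in H,
        # so W @ H is unchanged and the penalty cannot be dodged by scaling.
        norms = np.linalg.norm(W, axis=0) + eps
        W /= norms
        H *= norms[:, None]
    return W, H

# Simulated data: two planted groups of rows and of columns, plus small noise.
rng = np.random.default_rng(1)
W_true = np.zeros((6, 2))
W_true[:3, 0] = 1.0
W_true[3:, 1] = 1.0
H_true = np.zeros((2, 8))
H_true[0, :4] = 1.0
H_true[1, 4:] = 1.0
X = W_true @ H_true + 0.01 * rng.random((6, 8))

W, H = sparse_nmf(X, rank=2)
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
groups = W.argmax(axis=1)  # dominant factor for each row
print("relative error:", round(float(rel_error), 4), "row groups:", groups)
```

If the implementation is correct, rows 0 to 2 should share one dominant factor and rows 3 to 5 the other; which factor gets which index is arbitrary.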
5. Project #1 - Biomarker discovery using the JFCR39 dataset
6. About the JFCR39 dataset
I worked on biomarker discovery using NMF.
The JFCR39 dataset contains information about cell
lines: their response to 69 drugs, the mutation
status of 7 genes and the expression of 19
proteins.
1. The drugs were a mix of chemotherapeutic
and molecular targeted drugs.
2. The mutated genes are either oncogenes or
tumor suppressors.
3. The proteins in the database can be
classified into 3 broad pathways - PI3K/Akt
pathway, mTOR pathway and MEK
pathway.
7. Workflow
1. Since this was a multi-omics dataset, I had to use Joint NMF, which is the
combined factorization of multiple matrices that share one dimension.
2. I chose Standard NMF instead of Sparse NMF, as Standard NMF gave better
clustering at all ranks.
3. Then I found the most suitable rank for analysis (3) with the help of
consensus matrices and cophenetic correlation (the errors kept decreasing with
increasing rank).
4. I also read about the different types of drugs and their mechanisms of action,
following which I manually classified the genes and proteins in the data to
verify the NMF results.
5. After fixing the rank at 3, I computed the output matrices and then analysed them.
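The idea of Joint NMF in step 1 can be sketched as a single shared W with one H per data matrix, fitted by multiplicative updates on the summed squared error. This assumes the cell lines form the shared dimension; the function name `joint_nmf` and the toy shapes are placeholders, not the real JFCR39 dimensions or the BigNmf API.

```python
import numpy as np

def joint_nmf(matrices, rank, n_iter=500, eps=1e-9, seed=0):
    """Factorize each X_i (m x n_i) as W @ H_i with one shared W (m x rank)."""
    rng = np.random.default_rng(seed)
    m = matrices[0].shape[0]
    W = rng.random((m, rank)) + eps
    Hs = [rng.random((rank, X.shape[1])) + eps for X in matrices]
    for _ in range(n_iter):
        # Each H_i is updated against its own matrix, as in single NMF.
        for i, X in enumerate(matrices):
            Hs[i] *= (W.T @ X) / (W.T @ W @ Hs[i] + eps)
        # The shared W pools evidence from all matrices.
        numer = sum(X @ H.T for X, H in zip(matrices, Hs))
        denom = W @ sum(H @ H.T for H in Hs) + eps
        W *= numer / denom
    return W, Hs

# Toy stand-ins: 5 cell lines x (4 drugs, 3 mutations, 2 proteins).
rng = np.random.default_rng(42)
drugs = rng.random((5, 4))
mutations = rng.random((5, 3))
proteins = rng.random((5, 2))
data = [drugs, mutations, proteins]

W, Hs = joint_nmf(data, rank=2)
errors = [float(np.linalg.norm(X - W @ H)) for X, H in zip(data, Hs)]
print("per-matrix reconstruction errors:", errors)
```

Because W is shared, each cell line gets one set of factor weights, while drugs, mutations and proteins each get their own loadings on the same factors.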
8. Figures
[Figure: Joint NMF consensus matrices; panels labelled Rank 2, Rank 3, Rank 4]
Consensus matrices for rank 3 were the most clearly clustered.
The 4 matrices (clockwise from top left) are for cell lines, drugs, mutations and proteins.
[Figure: Cophenetic correlation]
Cophenetic correlation (higher is better) was the highest for rank 3 at 0.97.
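For readers unfamiliar with these two quantities, the following sketch shows how a consensus matrix and its cophenetic correlation can be computed from the cluster labels of several NMF runs. It assumes SciPy is available and uses average linkage, which is a common choice but an assumption here; the label arrays are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import squareform

def consensus_matrix(label_runs):
    """Entry (i, j) is the fraction of runs in which items i and j
    were assigned to the same cluster."""
    label_runs = np.asarray(label_runs)
    connectivity = label_runs[:, :, None] == label_runs[:, None, :]
    return connectivity.mean(axis=0)

def cophenetic_correlation(consensus):
    """Correlation between consensus distances (1 - consensus) and the
    distances implied by hierarchical clustering of those distances."""
    distances = squareform(1.0 - consensus, checks=False)
    tree = linkage(distances, method="average")
    coefficient, _ = cophenet(tree, distances)
    return float(coefficient)

# Hypothetical cluster labels for 6 items over 3 runs.
# Label names may swap between runs; only co-membership matters.
runs = [[0, 0, 0, 1, 1, 1],
        [1, 1, 1, 0, 0, 0],
        [0, 0, 1, 1, 1, 1]]
C = consensus_matrix(runs)
coeff = cophenetic_correlation(C)
print(np.round(C, 2))
print("cophenetic correlation:", round(coeff, 3))
```

A value close to 1 means the runs agree on a stable clustering, which is why the rank with the highest value is preferred.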
10. Analysis
1. PTEN mutants are sensitive to PI3K inhibitors.
2. PTEN mutation affects Akt phosphorylation.
3. mTOR is downstream of PI3K and Akt, yet mTOR and downstream proteins are not
affected by PTEN mutation - other proteins?
4. PTEN mutation can be a good biomarker for PI3K inhibitor sensitivity.
5. PI3K inhibitors and mTOR inhibitors cluster together, suggesting similar mechanisms of
action.
6. Chemotherapeutic drugs don’t have any relation to any mutation.
7. Some chemotherapeutic drugs are clustered with Akt inhibitors and PI3K inhibitors -
similar effects?
8. Mutations occur in PIK3C subunits, KRAS and BRAF; they affect a wide range of
pathways.
9. Cell lines in each cluster are sensitive to the drugs in the respective cluster.
12. About the DepMap - CRISPR Avana dataset
DepMap is a repository of data about tumors and their dependencies, intended to support the development of
precision treatments.
The data measured the survival of each cell line after gene knockout using CRISPR.
It contained cell lines and their sensitivities to the knockout of different genes. The more negative
the value, the more dependent the cell line was on the gene.
We wanted to find tumor dependencies and group cancers based on them.
https://depmap.org/portal/
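As a small illustration of how these scores are read, the sketch below picks the strongest dependencies for each cell line by taking the most negative values. The cell line names, gene names and scores are invented placeholders, not values from DepMap.

```python
import numpy as np

# Hypothetical toy scores: rows are cell lines, columns are genes.
cell_lines = ["line_A", "line_B"]
genes = ["gene_1", "gene_2", "gene_3", "gene_4"]
scores = np.array([[-1.2, 0.1, -0.3, 0.0],
                   [0.2, -0.8, -0.1, -1.5]])

def strongest_dependencies(scores, genes, k=2):
    """Return the k genes with the most negative score for each row."""
    order = np.argsort(scores, axis=1)[:, :k]  # ascending: most negative first
    return [[genes[j] for j in row] for row in order]

top = dict(zip(cell_lines, strongest_dependencies(scores, genes)))
print(top)
```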
13. Workflow
1. I initially chose rank 3 again due to good clustering and high cophenetic
correlation.
2. But since it was a very large dataset, I needed a higher rank to clearly
factorize the data.
3. So, I plotted consensus matrices and cophenetic correlation for ranks greater
than 40.
4. I chose rank 50 due to good clustering and high cophenetic correlation.
5. I read about pathway analysis, GO analysis and PANTHER analysis.
6. I plotted pie charts for cell line clusters and the corresponding genes.
7. The genes in each cluster were tumor suppressors for the cell lines in the
corresponding cluster.
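The numbers behind such pie charts can be sketched as follows: assign each cell line to the factor with the largest coefficient, then tally the cancer types in each cluster. This assumes H has one column per cell line and uses argmax assignment, a common convention that may differ from what was actually done; the matrix and labels are invented.

```python
from collections import Counter

import numpy as np

def cluster_composition(H, cancer_types):
    """H is (rank x n_cell_lines). Returns {cluster: Counter of cancer types}."""
    clusters = np.argmax(H, axis=0)  # dominant factor per cell line
    composition = {}
    for cluster, cancer in zip(clusters, cancer_types):
        composition.setdefault(int(cluster), Counter())[cancer] += 1
    return composition

# Hypothetical coefficients for 4 cell lines at rank 2.
H = np.array([[0.9, 0.8, 0.1, 0.2],
              [0.1, 0.3, 0.7, 0.9]])
cancer_types = ["lung", "lung", "breast", "lung"]
composition = cluster_composition(H, cancer_types)
print(composition)
```

Each Counter can be passed to a plotting library to draw one pie chart per cluster.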
14. Figures
[Figure: Pie chart representation of some clusters; labels: P53, PTEN (Tumor Suppressors)]
Pathway analysis of clusters would shed more light on cell line-gene dependencies.
15. Supercomputer experience - Shirokane
1. This was my first time using a supercomputer.
2. I learnt various commands and utilities: qsub, qstat.
3. I also learnt about checking memory usage: qfree, qavail.
4. After basic usage of qsub, I learnt the different parameters with which I could
submit jobs.
5. I made shell scripts which could run multiple programs in parallel with different
parameters for a faster workflow and convenience.
6. I had to customize my environment to work better with the supercomputer.
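The parameter sweep in step 5 was done with shell scripts; a rough Python equivalent that only builds the submission commands, without running them, might look like this. The script name `run_nmf.sh`, the job names and the list of ranks are hypothetical, and `-N` (job name) is assumed to be accepted by the local qsub.

```python
import shlex

def build_qsub_commands(script, ranks):
    """Build one qsub command line per rank (commands are not executed)."""
    commands = []
    for rank in ranks:
        # -N sets the job name; the rank is passed to the script as an argument.
        argv = ["qsub", "-N", f"nmf_rank{rank}", script, str(rank)]
        commands.append(" ".join(shlex.quote(arg) for arg in argv))
    return commands

commands = build_qsub_commands("run_nmf.sh", ranks=[40, 45, 50])
for command in commands:
    print(command)
```

Printing the commands first makes it easy to check them before handing them to the scheduler.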
16. Python Package - BigNmf
I made a package, BigNmf, which is now available to
download from PyPI.
Install it with: pip3 install bignmf
This package implements both single and joint
NMF using Standard, Sparse and
Integrative NMF.
You can get output matrices, consensus matrices,
cophenetic correlations and errors.
Documentation is available at:
https://bignmf.readthedocs.io/en/latest/
17. Learnings and new experiences
1. Great introduction to the world of research
2. Vastly improved knowledge about bioinformatics and biology in general
3. Improved my coding practices by reading other people's code
4. Supercomputer!
5. Interaction with people from different cultures
Always wanted to visit and live in Japan!
Thank you everyone! :)