Topography and sediments of the floor of the Bay of Bengal
Integrative inference of transcriptional networks in Arabidopsis yields novel ROS signalling regulators
1. Unravelling transcription factor functions
through integrative inference of transcriptional
networks in Arabidopsis
KlaasVandepoele
Department of Plant Biotechnology and Bioinformatics, UGent
VIB Center for Plant Systems Biology
Utrecht
Artificial intelligence in plant science and breeding
24 February 2021
plaza_genomics
2. Comparative Network Biology -Vandepoele lab
• Extract biological knowledge from large-scale experimental data sets using data
integration, comparative sequence & expression analysis, and network biology, to improve
our understanding of gene functions and regulation in plants and diatoms.
Plant Gene Regulatory Networks Comparative functional genomics
• PLAZA 4.0: an integrative resource for functional, evolutionary and
comparative plant genomics. Van Bel et al., 2018, Nucleic Acids Res
• Curse: building expression atlases and co-expression networks
from public RNA-Seq data. Vaneechoutte D, Vandepoele K.
Bioinformatics. 2019
• TF2Network: predicting transcription factor regulators and gene
regulatory networks in Arabidopsis using publicly available binding
site information. Kulkarni et al., 2018, Nucleic Acids Res
• Enhanced maps of transcription factor binding sites improve
regulatory networks learned from accessible chromatin data.
Kulkarni et al., Plant Physiol. 2019
3. Mapping of Gene Regulatory Networks (GRNs)
Mejia-Guerra et al., 2012
Arabidopsis
-1,700-2,500 Transcription Factors
- 180-791 miRNA
- 2,708 expressed lncRNA
49MB non-coding DNA
AtRegNet:
17,224 regulatory interactions
4. Experimental characterization of transcriptional activity and
regulatory control
ENCODE
How to integrate the biological knowledge captured by different –omics layers
to build better networks reporting functional regulatory interactions?
5. 1.TF ChIP-Seq
• in vivo method to measure protein-DNA
interactions using chromatin immuno-
precipitation
• Different cellular conditions can be profiled
TF ChIP-Seq
Furey et al., 2012
ChIP
7. 2. in vitroTF binding specificities
Wang et al., 2011
ModelTF binding site as
Position Weight Matrix (PWM)
based on k-mer signals
Protein binding
microarray
target gene
Arabidopsis: PWMs for 990TFs
PWM
8. 3. DNase-seq - Profiling of accesible chromatin
Hesselberth et al, 2009 binding site TF protein
DH footprint
DNase I hypersensitive site
(DHS)
DHS+PWM
9. Map all known PWMs on the
promoters of the Arabidopsis query
gene and its orthologs
Count per PWM position the #species
that support a TF binding site
Significance estimation (FDR<10%)
Van deVelde et al., Plant Phys 2016 -A Collection of Conserved Non-Coding Sequences to StudyGene Regulation in Flowering Plants.
Conserved PWM
4. Detection of conservedTF binding sites using
phylogenetic footprinting
10. 5-6. Network inference based on expression data
TF
target
Expression-based network inference
GENIE3 - Huynh-Thu et al., 2013
GENIE3
COE TF-target
GENIE3
12. Benchmarking of different methods to map gene regulatory
networks
Gold standard: 5.7k interactions covering 522 TFs (AtRegNet)
Test set: 20% of gold standard (80% used for training)
13. Benchmarking of different methods to map gene regulatory
networks
JanVan deVelde
Gold standard: 5.7k interactions covering 522TFs (AtRegNet+literature)
Test set: 20% of gold standard (80% used for training later)
14. Supervised learning: a network-based approach for large-
scale functional data integration
Marbach et al., Genome Research 2012
Gradient Boosting Machine
• 1000 trees (shrinkage of 0.01, interaction depth 3, 10-fold CV training)
• 80% training data withTrue:False sampling ratio of 3:1
• 7 input networks
d Motifs of TFs ChIP Binding of TFs TF Motif Occurence in
Open Chromatin
TF-Gene Co-Expression
targets
1800 TFs
Supervised Network
Known Interactions
Training Data
Test Data
Cross-validation
Input Features
Classifier
Target gene
TF gene
COE TF-target GENIE3 COE+PWM
DHS+PWM Conserved PWM
PWM
Supervised Learning max F1: 1793k interactions – 1766TFs
Gold standard
Max F1 network -Test set
Recall: 46%
Precision: 71%
F1-measure: 57%
ChIP
17. iGRN captures functionalTF – target gene interactions
-- overlap target genes not significant (p-value hypergeometric distribution > 0.05)
--
34/40TFs have significant overlap between predicted target genes and DE
genes afterTF perturbation
--
--
--
--
--
18. iGRN-based functional annotation ofTFs
Recovery of experimental Gene Ontology Biological
Process annotations for TFs with known function
TF
TF function
Target genes
• Recovery of known experimentally-
supported functions for >600 TFs
• Novel functional predictions for 268
unknown TFs
• Highly complementary with AraNet v2
iGRN
23. Phenotypes for predicted ROS-TFs
iGRN identified novel ROSTFs from the GRAS, BES1 and GATA families
24. Expression patterns for novel ROS-TFs
Responsiveness to a wide range of
oxidative stress conditions?
• 14/17 known ROS TFs
• 6/13 novel ROS TFs
Many novel ROS TFs would not have been
predicted solely relying on differential
expression at the whole plant or organ
level!
25. Conclusions
JanVan deVelde Inge De Clercq
Different regulatory –omics data types as well as advanced
computational integration methods contribute significantly to the
improved delineation of high-quality gene regulatory networks
TF binding site-based as well as expression-based regulatory
networks offer a complementary view on functional gene
regulatory interactions
Gene regulatory networks obtained by supervised learning are a
starting point for
the systematic functional/regulatory annotation of all Arabidopsis
genes
new biological discoveries
Li Liu, DriesVaneechoutte
Robin Pottie, Xiaopeng Liu, FrankVan Breusegem
26. Further reading
Curse: Building expression atlases and co-expression networks from public
RNA-Seq data. Vaneechoutte and Vandepoele (2019) Bioinformatics
TF2Network: predicting transcription factor regulators and gene regulatory
networks in Arabidopsis using publicly available binding site information
Kulkarni, Vaneechoutte, Van de Velde and Vandepoele (2018). Nucleic Acids
Research
Editor's Notes
Good afternoon. Thanks to the organizers for the invitation to present our work today.
Through the development and application of various bioinformatics methods, we try to identify new aspects of genome biology, especially in the area of gene function prediction, gene regulation and evolutionary/systems biology.
For today, the playground will be Arabidopsis, a plant model with a fairly simple genome, but still containing thousands of regulators. When mapping gene regulatory networks, the goal is to identify which TF binds to the promoter of which target gene, and to determine what is the functional regulatory consequence for growth, development or stress response. Especially the latter criterion is important, as we know that many in vivo TF binding events do not lead to changes in transcriptional activity of the target gene.
Thus, if we focus on different molecular profiling methods assessing chromatin state, TF binding or transcriptional activity, a major challenge is …
In the next slides, I will very briefly introduce different experimental and computational methods to map GRNs, which will be discussed, compared and integrated using ML. In a second part, I will show how such an integrated network has a good predictive power to study TF functions.
Thus, in total I will present 7 approaches, which will be input networks, which all have their strengths and weakness, for example with respect to completeness and accuracy.
Apart from capturing potential TF binding events, which can be considered as an input signal to transcriptional control, it is also possible to focus on output signals, which is for example the spatial-temporal expression of all genes in the genome.
- GENIE3 decomposes the prediction of a regulatory network between p genes into p different regression problems.
- In each of the regression problems, the expression pattern of one of the genes (target gene) is predicted from the expression patterns of all the other genes (input genes), using tree-based ensemble methods Random Forests or Extra-Trees.
- The importance of an input gene in the prediction of the target gene expression pattern is taken as an indication of a putative regulatory link.
-Putative regulatory links are then aggregated over all genes to provide a ranking of interactions from which the whole network is reconstructed
PWM mapping: FP problem PWM mapping – reduce witthrough integration co-regulatory information
Explain all input networks again + size of the networks. How good do these capture true regulatory events?
Given the strenghts and weaknesses of these different networks, a network-based approach for large-scale functional data integration was applied.
Gradient boosting produces a prediction model for classification using an ensemble of weaker prediction models.
The family of boosting methods is based on a constructive strategy of ensemble formation. The main idea of boosting is to add new models to the ensemble sequentially. At each particular iteration, a new weak, base-learner model is trained with respect to the error of the whole ensemble learnt so far.
Model = Decision trees; Posterior prob to be a true interaction: 0.35
22, we can then show results for recent ChIP-Seq data + TF DE gene sets (incl. significance overlap using hypergeometric test).
70% regulatory interactions confirmed by PWM,
50% by PWM in DHS (accessible TFBS)
50% regulatory interactions confirmed by GENIE3, so also regulated info well integrated in the network!
Also very good overlap with recent TF ChIP-Seq datasets.
>600 known TF-GO BP annotations recovered ; examples cover Abiotic stress, development, hormone signalling
Given the good performance to predict known TFs functions, we next wanted to experimentally evaluate novel functional predictions.
Excessive production of ROS can cause damage and lead to cell death, but on the other hand, ROS can also serve as signaling molecules in oxidative stress responses. We recently obtained a dataset of robust ROS marker genes, called the ROS wheel, and we were interested in identifying the regulators of these ROS genes.
We identified TFs for which the predicted target genes are enriched for ROS wheel markers and oxidative stress functions. We obtained a list of 124 TFs what we call the ROS-TFs and ranked based on ROS wheel target genes enrichment. In dark blue are 17 TFs that have been shown to have a phenotype under oxidative stress conditions. 58 TFs in light blue play a role under other stress conditions. The 49 TFs in orange have no known role in oxidative stress and these are our novel candidate ROS-TFs that we want to functionally validate.
Considering the top 124 TFs,
17 TFs with known oxidative stress function (dark blue) – low ranks!
58 TFs with other stress function
49 TFs no known role in ROS/environmental stress
For experimental validation, we selected the top 32 novel ROS-TFs, for which at least one Arabidopsis loss- or gain-of-function transgenic line was available. Plant performance under oxidative stress was assessed based on rosette growth of the mutant lines compared to the wild type when grown at low and high MV and 3-AT concentrations, respectively, using automated image analysis. In addition, bleaching, indicating chlorosis symptoms, were visually scored.
ROS causing agents:
MV : methyl viologen
3-AT: 3-amino triazole
For 13 of the 32 phenotyped novel ROS TFs we found at least one mutant allele with significant oxidative stress-induced changes in rosette area relative to the wild type, thus indicating a genotype effect dependent on the oxidative stress treatment and/or genotype-by-treatment effect under at least one of the tested stress treatments. Explain figure.
For SCL13, WRKY45, BEH2 and PHL1, experimental evidence is provided by at least two mutant alleles.