1. This document provides a detailed protocol for performing GO and KEGG enrichment analysis on gene lists from rice (Oryza sativa).
2. It describes obtaining GO and KEGG annotations from public databases and an R package, and using the clusterProfiler package in R for enrichment analysis and visualization of results.
3. GO enrichment analysis using self-curated annotation files from rice databases identified 29 enriched GO terms, while KEGG enrichment analysis identified 11 enriched pathways.
Background
RNA-seq data analysis has been streamlined, and functional enrichment analysis is a critical step for drawing
biological insights from the results. Enrichment analysis, or over-representation analysis, examines whether a
gene ontology term or a biological pathway occurs in the target gene list more often than expected by chance. Many
tools bundle annotation files with enrichment test functions to streamline this process.
However, some plant species still lack gene annotation information, which can be an obstacle to
functional enrichment analysis. For instance, only 20 GO annotation databases were available as OrgDb packages from
Bioconductor, and only one of them covers a plant species, Arabidopsis. In this protocol, I focus on performing
functional enrichment analysis on genes of rice, a model organism for the grass family, using clusterProfiler
(Yu et al., 2012), one of the most commonly used R packages for enrichment analysis. I provide step-by-step
instructions using annotation information obtained in two different ways. The scripts are mainly written in R,
with some Bash command lines for curating a GO annotation file.
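The over-representation test described above is commonly computed as a one-sided hypergeometric test. The following is a minimal sketch in base R, not the exact code run by any package. The totals N and n reuse the annotated gene counts reported later in this protocol, while M and k are hypothetical counts for a single GO term, chosen only for illustration.

```r
## Counts for one GO term. M and k are hypothetical (illustration only).
N <- 13440  # background genes that carry at least one GO annotation
n <- 1699   # input genes that carry at least one GO annotation
M <- 120    # hypothetical: background genes annotated with the term
k <- 30     # hypothetical: input genes annotated with the term

## Number of annotated input genes expected by chance alone
expected <- n * M / N

## One-sided p-value, P(X >= k): the chance of drawing k or more annotated
## genes when n genes are sampled from the background without replacement
p_over <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

print(c(expected = expected, p_over = p_over))
```

A term counts as over-represented when k is clearly larger than the expected count and the p-value survives multiple testing correction across all tested terms.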
Software
1. clusterProfiler (Yu et al., 2012; v3.16.1;
https://guangchuangyu.github.io/software/clusterProfiler/documentation/)
2. GO.db (Carlson et al., 2019; v3.11.4;
https://bioconductor.org/packages/release/data/annotation/html/GO.db.html)
3. AnnotationHub (Morgan et al., 2021; v2.20.1;
https://bioconductor.org/packages/release/bioc/vignettes/AnnotationHub/inst/doc/AnnotationHub.html)
4. dplyr (R package, v1.0.7)
5. data.table (R package, v1.14.0)
6. ggplot2 (R package, v3.3.5)
Input data:
1. Target gene list (genes.txt), background gene list (bkgd.txt, optional but recommended). The gene IDs are the
RAP IDs in this protocol, e.g., Os01g0102500, Os01g0106300.
2. The gene annotation file obtained from The Rice Annotation Project (RAP) Database (RAP-DB), including
the GO annotation information and the RAP gene ID to transcript ID conversion information. This file is a large
data table in which each row is an individual transcript ID and each column holds one type of gene annotation.
The "GO" column contains the GO annotations, which are extracted into the self-curated annotation files.
https://rapdb.dna.affrc.go.jp/download/archive/irgsp1/IRGSP-1.0_representative_annotation_2021-11-11.tsv.gz
3. The gene annotation file from the OryzaBase website. This file is also a large data table, in which each row is
a "Trait Gene ID" with "RAP ID" and "Gene Ontology" annotations, which are used for generating the
self-curated annotation files. https://shigen.nig.ac.jp/rice/oryzabase/download/gene
4. RAP ID to Entrez ID conversion table from the He Lab at Fujian Agriculture and Forestry University, China.
http://bioinformatics.fafu.edu.cn/riceidtable/
Procedure
Case study:
1. GO enrichment analysis using self-curated annotation files.
a. Prepare rice gene GO annotation files curated from public annotation databases.
Extract the rice gene GO annotations from two public rice gene annotation databases, RAP-DB and
OryzaBase, and then combine them into a single GO annotation file.
The RAP-DB GO annotation is collected by the following steps. First, download the genome annotation
file from the RAP-DB website and unzip it. Next, extract the GO annotation information and
reshape it into a two-column tabular file (GO ID, Gene ID). The OryzaBase GO annotation is gathered in the
same way. The two files are then combined into one file, and duplicated rows are removed. The
format of the output annotation file is shown in Table 1.
Table 1. Gene annotations table containing the GO ID to gene ID mapping information.
GO ID        Gene ID
GO:0000003   Os02g0242600
GO:0000003   Os02g0268100
GO:0000003   Os02g0281000
GO:0000023   Os02g0729400
GO:0000023   Os04g0459900
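The extraction and merging in Step 1a can be sketched in R as follows. This is a minimal illustration on toy data, not the Bash commands used in the protocol. The column names (Locus_ID, GO, RAP_ID, Gene_Ontology) and the layout of the GO strings are assumptions, so check them against the downloaded files before reusing the code.

```r
## Toy stand-ins for the two downloaded tables. Column names and the layout
## of the GO strings are assumptions made for illustration.
rap <- data.frame(
  Locus_ID = c("Os02g0242600", "Os02g0729400", "Os01g0102500"),
  GO = c("Biological Process: reproduction (GO:0000003)",
         "Biological Process: term name (GO:0000023)",
         ""),
  stringsAsFactors = FALSE
)
oryzabase <- data.frame(
  RAP_ID = c("Os02g0242600", "Os04g0459900"),
  Gene_Ontology = c("GO:0000003 - reproduction", "GO:0000023 - term name"),
  stringsAsFactors = FALSE
)

## Pull every GO ID out of a free-text column and pair it with its gene ID.
## Rows without any GO ID contribute nothing to the output.
extract_go <- function(gene_id, go_text) {
  hits <- regmatches(go_text, gregexpr("GO:[0-9]{7}", go_text))
  data.frame(GO_ID = unlist(hits),
             Gene_ID = rep(gene_id, lengths(hits)),
             stringsAsFactors = FALSE)
}

rap_go <- extract_go(rap$Locus_ID, rap$GO)
oryzabase_go <- extract_go(oryzabase$RAP_ID, oryzabase$Gene_Ontology)

## Combine both sources, drop duplicated rows, and sort as in Table 1
go_anno <- unique(rbind(rap_go, oryzabase_go))
go_anno <- go_anno[order(go_anno$GO_ID, go_anno$Gene_ID), ]
print(go_anno)
```

The regular expression relies only on the fixed GO ID format (GO: followed by seven digits), so it is insensitive to how each database formats the surrounding text.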
b. Prepare the GO ID to GO name mapping files (three in total, one for each GO subcategory). Also,
split the curated GO annotation file into three files based on GO subcategories.
First, obtain the GO ID to GO name mapping information from the “GO.db” R package. This file provides
the TERM2NAME information for the universal enrichment function of the clusterProfiler software.
The format of the file is shown in Table 2.
Next, split the GO annotation file (Step 1a) into three subcategories, i.e., Biological Process (BP),
Molecular Function (MF), and Cellular Component (CC). These tables provide the TERM2GENE
information for the universal enrichment function of clusterProfiler. The format of these tables is
the same as that of the original GO annotation table (Table 1).
Table 2. GO ID to GO name mapping table.
GO ID        GO name
GO:0000001   mitochondrion inheritance
GO:0000002   mitochondrial genome maintenance
GO:0000003   reproduction
GO:0000011   vacuole inheritance
GO:0000012   single strand break repair
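Step 1b can be sketched as below. To keep the example runnable without Bioconductor, a small toy table stands in for the GO.db lookup, and the gene-to-term pairings in go_anno are made up for illustration. The commented select() call shows how the same table could be built from GO.db, assuming that package is installed.

```r
## go_info: one row per GO term with its name and ontology (BP, MF or CC).
## With GO.db installed, an equivalent table could be built with:
##   go_info <- AnnotationDbi::select(GO.db::GO.db,
##                                    keys = AnnotationDbi::keys(GO.db::GO.db),
##                                    columns = c("TERM", "ONTOLOGY"),
##                                    keytype = "GOID")
## A toy table is used here instead.
go_info <- data.frame(
  GOID = c("GO:0000003", "GO:0003677", "GO:0005634"),
  TERM = c("reproduction", "DNA binding", "nucleus"),
  ONTOLOGY = c("BP", "MF", "CC"),
  stringsAsFactors = FALSE
)

## Toy curated annotation in the Table 1 layout (pairings are made up)
go_anno <- data.frame(
  GO_ID = c("GO:0000003", "GO:0003677", "GO:0003677", "GO:0005634"),
  Gene_ID = c("Os02g0242600", "Os01g0102500", "Os01g0106300", "Os01g0102500"),
  stringsAsFactors = FALSE
)

## Attach the ontology to each annotation row, then split into BP, MF and CC
anno <- merge(go_anno, go_info, by.x = "GO_ID", by.y = "GOID")
term2gene <- split(anno[, c("GO_ID", "Gene_ID")], anno$ONTOLOGY)
term2name <- split(go_info[, c("GOID", "TERM")], go_info$ONTOLOGY)

print(term2gene$BP)  # TERM2GENE table for Biological Process
print(term2name$BP)  # matching TERM2NAME table
```

Both lists are indexed by ontology, so term2gene$BP and term2name$BP can be passed together to the enrichment function for a BP-only analysis.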
c. Read the input gene lists and the self-provided GO annotation files. Here, the GO: BP annotation file
is used to perform GO enrichment analysis specifically for the BP category.
d. Run enricher, the universal enrichment function from the clusterProfiler package. Save and visualize the
results.
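Steps 1c and 1d might look like the sketch below. It uses toy gene IDs and a toy annotation so that it runs without the protocol's input files. In practice the vectors would be read from genes.txt and bkgd.txt and the tables from the BP annotation files. The enricher call is skipped when clusterProfiler is not installed, and the argument values shown are the package defaults.

```r
## Toy inputs (hypothetical IDs). In the protocol these come from genes.txt,
## bkgd.txt and the BP annotation files.
bkgd <- sprintf("gene%03d", 1:200)
genes <- bkgd[1:20]

term2gene <- data.frame(
  GO_ID = rep(c("GO:0000003", "GO:0000001"), each = 20),
  Gene_ID = c(bkgd[1:15], bkgd[101:105], bkgd[21:40]),
  stringsAsFactors = FALSE
)
term2name <- data.frame(
  GO_ID = c("GO:0000003", "GO:0000001"),
  GO_name = c("reproduction", "mitochondrion inheritance"),
  stringsAsFactors = FALSE
)

## Only annotated genes can enter the test, which is worth checking up front
n_annotated_input <- sum(genes %in% term2gene$Gene_ID)
n_annotated_bkgd <- sum(bkgd %in% term2gene$Gene_ID)

if (requireNamespace("clusterProfiler", quietly = TRUE)) {
  ego <- clusterProfiler::enricher(
    gene = genes,             # target gene list
    universe = bkgd,          # background gene list
    TERM2GENE = term2gene,    # column 1: term ID, column 2: gene ID
    TERM2NAME = term2name,    # column 1: term ID, column 2: term name
    pvalueCutoff = 0.05,
    pAdjustMethod = "BH",
    qvalueCutoff = 0.2,
    minGSSize = 10,
    maxGSSize = 500
  )
  ego_df <- as.data.frame(ego)  # result table, one row per enriched term
  print(ego_df)
}
```

Counting the annotated genes beforehand makes it easier to interpret the denominators of GeneRatio and BgRatio in the result table.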
2. GO enrichment analysis using annotations from the AnnotationHub package.
a. Read in the required input data, including the input gene lists and the OrgDb object.
First, find the rice OrgDb object in AnnotationHub. There are three databases available for the Oryza
sativa japonica subspecies; I pick the first one because all three are very similar. As the key types in
the OrgDb do not include the “RAP” ID type, the gene IDs are converted into the “Entrez” ID
type using a rice ID mapping table obtained from the He Lab at Fujian Agriculture and Forestry
University, China (Input data 4).
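The ID conversion in this step can be sketched in base R as follows. The mapping table below is a toy stand-in: its column names are assumptions, the Entrez IDs are placeholders rather than real identifiers, and Os09g0999999 is a made-up RAP ID used to show an unmapped gene.

```r
## Toy RAP-to-Entrez mapping table. Column names are assumptions and the
## Entrez IDs are placeholders, not real identifiers.
id_map <- data.frame(
  rap_id = c("Os01g0102500", "Os01g0106300", "Os02g0242600"),
  entrez_id = c("1000001", "1000002", NA),
  stringsAsFactors = FALSE
)

## Convert RAP IDs to Entrez IDs, dropping genes without a mapping.
## IDs are kept as character strings, the type used for OrgDb keys.
rap_to_entrez <- function(rap_ids, map) {
  entrez <- map$entrez_id[match(rap_ids, map$rap_id)]
  unique(entrez[!is.na(entrez)])
}

genes <- c("Os01g0102500", "Os02g0242600", "Os09g0999999")
genes_entrez <- rap_to_entrez(genes, id_map)

## Report how many genes were lost in the conversion
n_lost <- length(genes) - length(genes_entrez)
print(genes_entrez)
print(n_lost)
```

Tracking the number of lost genes is useful here, because genes dropped during conversion cannot contribute to the enrichment test.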
b. Run the enrichment function, enrichGO.
Select the BP ontology through the ont parameter. Here, no enriched GO terms are found in the result.
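Steps 2a and 2b could be wrapped in a helper along the lines below. The function name run_rice_go_bp and the query keywords are assumptions made for illustration. The helper is only defined here, because calling it needs the AnnotationHub and clusterProfiler packages plus an internet connection, and the query may return more records than the three japonica databases mentioned above.

```r
## Hypothetical helper: fetch a rice OrgDb from AnnotationHub and run
## enrichGO for the BP ontology on Entrez IDs.
run_rice_go_bp <- function(genes_entrez, bkgd_entrez) {
  hub <- AnnotationHub::AnnotationHub()
  ## Keyword search; inspect the hits and pick the record you need
  hits <- AnnotationHub::query(hub, c("Oryza sativa", "OrgDb"))
  rice_orgdb <- hits[[1]]  # assumption: the first hit is the wanted database

  clusterProfiler::enrichGO(
    gene = genes_entrez,      # input genes as Entrez IDs
    universe = bkgd_entrez,   # background genes as Entrez IDs
    OrgDb = rice_orgdb,
    keyType = "ENTREZID",
    ont = "BP",               # Biological Process only
    pvalueCutoff = 0.05,
    pAdjustMethod = "BH",
    qvalueCutoff = 0.2,
    minGSSize = 10,
    maxGSSize = 500
  )
}

## Example call (not run here):
## ego_hub <- run_rice_go_bp(genes_entrez, bkgd_entrez)
## as.data.frame(ego_hub)
```

An empty result table from this call is consistent with the outcome reported in this step.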
3. KEGG enrichment analysis
The clusterProfiler package can automatically retrieve KEGG annotation data from the KEGG database
when running the KEGG enrichment analysis function, enrichKEGG. The KEGG database covers a large
number of organisms, including 687 Eukaryotes, 6664 Bacteria, and 369 Archaea, and rice is among them.
The following steps show how to run KEGG enrichment analysis with clusterProfiler using annotations
from the KEGG database.
a. Find the KEGG organism code for rice in the KEGG organism list:
https://www.genome.jp/kegg/catalog/org_list.html. Two codes, “dosa” and “osa”, exist for the rice genome. I
select “dosa” (KEGG Genes Database: T02163, Oryza sativa japonica (Japanese rice) (RAPDB)).
Next, check the ID type used in “dosa”. Here, the ID type is the RAP transcript ID. Accordingly, the input
gene IDs are converted to RAP transcript IDs using the RAP ID mapping table (Input data 2).
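The conversion from RAP gene IDs to RAP transcript IDs can be sketched as below. The table is a toy stand-in for the RAP-DB annotation file, and the column names Locus_ID and Transcript_ID are assumptions to be checked against the downloaded file. The resulting vectors are named genes_transID and bkgd_transID to match the enrichKEGG call in Step 3b.

```r
## Toy stand-in for the RAP-DB annotation table (column names are assumptions)
rap_tab <- data.frame(
  Locus_ID = c("Os01g0102500", "Os01g0106300", "Os02g0242600"),
  Transcript_ID = c("Os01t0102500-01", "Os01t0106300-01", "Os02t0242600-01"),
  stringsAsFactors = FALSE
)

genes <- c("Os01g0102500")
bkgd <- c("Os01g0102500", "Os01g0106300", "Os02g0242600")

## Keep the transcript IDs whose locus is in the gene list. A locus with
## several transcripts would contribute all of them.
genes_transID <- unique(rap_tab$Transcript_ID[rap_tab$Locus_ID %in% genes])
bkgd_transID <- unique(rap_tab$Transcript_ID[rap_tab$Locus_ID %in% bkgd])

print(genes_transID)
print(bkgd_transID)
```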
b. Run the enrichment test function enrichKEGG, and then save and visualize the results.
## R
library(clusterProfiler)  # enrichKEGG, dotplot
library(ggplot2)          # ggsave
## run enrichKEGG
kegg <- enrichKEGG(gene = genes_transID,    # a vector of gene IDs
                   universe = bkgd_transID, # background genes
                   organism = "dosa",       # KEGG organism code
                   keyType = "kegg",        # key type of the input genes
                   pvalueCutoff = 0.05,     # p-value cutoff (default)
                   pAdjustMethod = "BH",    # multiple testing correction method (default)
                   qvalueCutoff = 0.2,      # q-value cutoff (default); q-value: minimum FDR
                                            # at which a term is called significant
                   minGSSize = 10,          # minimal size of gene sets tested (default)
                   maxGSSize = 500          # maximal size of gene sets tested (default)
)
## save results (create the output folders if they do not exist yet)
dir.create("output", showWarnings = FALSE)
dir.create("figures", showWarnings = FALSE)
kegg_df <- as.data.frame(kegg)
write.table(kegg_df, "output/kegg_df.txt", sep = "\t", row.names = FALSE, quote = FALSE)
## dot plot
p1 <- dotplot(kegg, showCategory = 10,
              title = "Top 10 most statistically significant enriched KEGG terms",
              font.size = 10)
ggsave(p1,
       filename = "figures/kegg_dotplot.png",
       height = 12, width = 22, units = "cm")
Results interpretation:
With the input gene list, twenty-nine GO terms are enriched in the GO enrichment analysis using the self-provided
annotation files. The result table of the top 10 most statistically significant enriched GO terms (BP), ranked by
p-values, is shown in Figure 1; the top results are visualized by the dot plot, where the terms are re-ordered by
GeneRatio (Figure 2). Note that not all genes have GO annotations, so fewer genes are used in the test:
the input gene number drops from 3368 to 1699, and the background gene number drops from 24891 to 13440.
Figure 1. Screenshot of the result table showing the top 10 most statistically significant enriched GO terms
(BP).
The terms are ordered by p-values. GeneRatio: the ratio of input genes with the target annotation to all input genes;
BgRatio: the ratio of background genes with the target annotation to all background genes; p.adjust: the p-value
corrected for multiple testing using the method specified by the “pAdjustMethod” argument (“BH” by default);
qvalue: the FDR-based q-value, i.e., the minimum false discovery rate at which the term is called significant.
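The columns described above can be reproduced on a toy result table, as sketched below in base R. The term IDs, counts, and p-values are hypothetical and only illustrate the arithmetic. The ratios are stored as text such as "30/1699", so they are converted to numbers before sorting.

```r
## Toy result table with hypothetical terms, counts and p-values
res <- data.frame(
  ID = c("termA", "termB", "termC"),
  GeneRatio = c("30/1699", "12/1699", "45/1699"),
  BgRatio = c("120/13440", "80/13440", "300/13440"),
  pvalue = c(0.0002, 0.03, 0.04),
  stringsAsFactors = FALSE
)

## Turn a "k/n" string into the numeric ratio k / n
ratio_to_num <- function(x) {
  parts <- strsplit(x, "/", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) / as.numeric(p[2]), numeric(1))
}

res$GeneRatioNum <- ratio_to_num(res$GeneRatio)
res$BgRatioNum <- ratio_to_num(res$BgRatio)

## Benjamini-Hochberg correction across all tested terms
res$p.adjust <- p.adjust(res$pvalue, method = "BH")

## Re-order by GeneRatio, largest first
res_by_ratio <- res[order(res$GeneRatioNum, decreasing = TRUE), ]
print(res_by_ratio)
```

Sorting by GeneRatio instead of p-value changes the order of the terms, which is why a table and a dot plot of the same result can list the terms differently.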
Figure 2. Visualizing the top 10 most statistically significant enriched GO terms (BP) by dot plot.
The terms are ordered by GeneRatio (the ratio of the input genes with the target GO annotation to the total number
of input genes).
In the KEGG enrichment analysis result, eleven KEGG pathways are enriched, and the top 10 most statistically
significant enriched KEGG pathways are listed in the table, ranked by p-values (Figure 3). These enriched KEGG
pathways are visualized by the dot plot, where the terms are re-ordered by GeneRatio (Figure 4).
Figure 3. Screenshot of the result table showing the top 10 most statistically significant enriched KEGG
pathway terms.
The terms are ordered by p-values. GeneRatio: the ratio of input genes with the target annotation to all input genes;
BgRatio: the ratio of background genes with the target annotation to all background genes; p.adjust: the p-value
corrected for multiple testing using the method specified by the “pAdjustMethod” argument (“BH” by default);
qvalue: the FDR-based q-value, i.e., the minimum false discovery rate at which the term is called significant.
Figure 4. Visualizing the top 10 most statistically significant enriched KEGG pathway terms by dot plot.
The terms are ordered by GeneRatio (the ratio of the input genes with the target KEGG annotation to the total
number of input genes).
Discussion:
Here, I provide a detailed protocol to perform GO and KEGG enrichment analysis for rice genes. I
find that the self-collected GO annotation files outperform the annotations from AnnotationHub, so the
self-collected files are the preferred approach here. The self-provided GO annotation files are collected from two
rice annotation databases, RAP-DB and OryzaBase. If other well-maintained sources of annotations exist, they
should be added as well to obtain a more comprehensive result.
Rice is relatively well annotated among plant species, which makes curating annotation files from public sources
possible. However, when no annotation files are available for the target species, researchers can obtain
predicted gene annotations based on protein sequence alignment to annotated species using software
such as eggNOG-mapper or InterProScan. Lastly, it is worth noting that several web applications have been developed
for functional enrichment analysis, with built-in annotation files for some plant species, such as agriGO and
g:Profiler. Researchers can refer to those tools to check whether the plant species under study is covered.
Acknowledgments
The author thanks the He Lab at Fujian Agriculture and Forestry University, China, for the RAP ID to Entrez ID
conversion table used in this protocol.
Competing interests
The author declares no conflict of interest.
References
Yu, G., Wang, L. G., Han, Y. and He, Q. Y. (2012). clusterProfiler: an R package for comparing biological themes
among gene clusters. OMICS 16(5): 284-287.
Carlson, M. (2019). GO.db: A set of annotation maps describing the entire Gene Ontology. R package version 3.8.2.
Morgan, M. and Shepherd, L. (2022). AnnotationHub: Client to access AnnotationHub resources. R package version
3.4.0.
Supplementary information
1. Data and code availability: All data and code have been deposited to GitHub:
https://github.com/Bio-protocol/GO_KEGG_enrichment_analysis_for_rice_genes.git.