Copyright: © 2022 The Authors; exclusive licensee Bio-protocol LLC.
GO/KEGG Enrichment Analysis on Gene Lists from Rice (Oryza sativa)
Yahui Li*
Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, China
*For correspondence: yahuili2009@gmail.com
Abstract
In RNA-seq data analysis, functional enrichment analysis of genes has become routine, and many enrichment analysis
software packages and web applications have emerged. However, gene annotation information is easily accessible only
for the most well-studied organisms, such as human and mouse, and is lacking for some plant species. With poor gene
annotation information, performing a functional enrichment analysis is challenging. Here, I use rice, a model plant
organism, as an example to show how to obtain comprehensive Gene Ontology (GO) and Kyoto Encyclopedia of
Genes and Genomes (KEGG) pathway annotation for enrichment analysis. I obtain the gene annotation
information from two sources: (1) public rice annotation databases, including RAP-DB and OryzaBase; and (2)
AnnotationHub, an R package that provides gene annotation information for various species. I use the clusterProfiler R
package for the enrichment calculation and result visualization. This protocol can be used directly for GO/KEGG
enrichment analysis of gene lists from rice, and can also serve as a reference for similar analyses in other plant
species.
Keywords: GO, KEGG, Functional enrichment analysis, clusterProfiler, Rice, RNA-seq
Background
RNA-seq data analysis has been streamlined, and functional enrichment analysis is a critical step for gaining
biological insight from the results. Enrichment analysis, or over-representation analysis, examines whether a
GO term or a biological pathway occurs in the target gene list more often than expected by chance. Many
tools bundle annotation files with enrichment test functions to streamline this process.
However, some plant species still lack gene annotation information, which can be an obstacle for
functional enrichment analysis. For instance, only 20 GO annotation databases were available as OrgDb packages from
Bioconductor, of which only one covers a plant species, Arabidopsis. In this protocol, I focus on functional
enrichment analysis of genes from rice, a model organism for the grass family, using clusterProfiler (Yu et al., 2012),
one of the most commonly used R packages for enrichment analysis. I provide step-by-step instructions using annotation
information obtained in two different ways. The scripts are mainly R scripts, with some Bash command lines
for curating a GO annotation file.
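The over-representation test described above can be illustrated with a small worked example. This is a minimal sketch in base R, assuming the usual one-sided hypergeometric formulation; the two totals (13440 and 1699) are the annotated gene counts reported in the Results interpretation of this protocol, while the per-term counts `M` and `k` are hypothetical values chosen only for illustration.

```r
# Totals taken from the Results interpretation section of this protocol
N <- 13440  # background genes that carry at least one annotation
n <- 1699   # input genes that carry at least one annotation

# Hypothetical counts for a single term (assumptions, not real data)
M <- 200    # background genes annotated with the term
k <- 45     # input genes annotated with the term

# P(X >= k) when n genes are drawn at random from N, of which M carry the term
p_value <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# The two ratios reported in the result tables
gene_ratio <- k / n   # share of input genes with the term
bg_ratio   <- M / N   # share of background genes with the term

cat(sprintf("GeneRatio = %d/%d, BgRatio = %d/%d, p = %.3g\n", k, n, M, N, p_value))
```

A term is over-represented when GeneRatio is clearly larger than BgRatio and the tail probability is small; the same probability can be obtained from a one-sided Fisher's exact test on the corresponding 2 x 2 table.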
Software
1. clusterProfiler (Yu et al., 2012; v3.16.1;
https://guangchuangyu.github.io/software/clusterProfiler/documentation/)
2. GO.db (Carlson, 2019; v3.11.4;
https://bioconductor.org/packages/release/data/annotation/html/GO.db.html)
3. AnnotationHub (Morgan et al., 2021; v2.20.1;
https://bioconductor.org/packages/release/bioc/vignettes/AnnotationHub/inst/doc/AnnotationHub.html)
4. dplyr (R package, v1.0.7)
5. data.table (R package, v1.14.0)
6. ggplot2 (R package, v3.3.5)
Input data:
1. Target gene list (genes.txt) and background gene list (bkgd.txt; optional but recommended). The gene IDs are
RAP IDs in this protocol, e.g., Os01g0102500, Os01g0106300.
2. The gene annotation file from the Rice Annotation Project Database (RAP-DB), which includes
the GO annotation information and the RAP gene ID to transcript ID conversion information. This file is a large
data table in which each row is an individual transcript ID and each column is one type of gene annotation;
the "GO" column contains the GO annotations, which are extracted into the self-curated annotation
files. https://rapdb.dna.affrc.go.jp/download/archive/irgsp1/IRGSP-1.0_representative_annotation_2021-11-11.tsv.gz
3. The gene annotation file from the OryzaBase website. This file is also a large data table, in which each row is
a "Trait Gene ID" with "RAP ID" and "Gene Ontology" annotations, which are used for generating the
self-curated annotation files. https://shigen.nig.ac.jp/rice/oryzabase/download/gene
4. RAP ID to Entrez ID conversion table from the He Lab at Fujian Agriculture and Forestry University, China.
http://bioinformatics.fafu.edu.cn/riceidtable/
Procedure
Case study:
1. GO enrichment analysis using self-curated annotation files.
a. Prepare rice gene GO annotation files curated from public annotation databases.
Extract the rice gene GO annotations from two public rice gene annotation databases, RAP-DB and
OryzaBase, and combine them into a single GO annotation file.
The RAP-DB GO annotation is collected as follows. First, download the genome annotation
file from the RAP-DB website and unzip it. Next, extract the GO annotation information into
a two-column tabular file (GO ID, Gene ID). The OryzaBase GO annotation is gathered in the
same way. The two files are then combined into one file, with duplicated rows removed. The
format of the output annotation file is shown in Table 1.
Table 1. Gene annotations table containing the GO ID to gene ID mapping information.
GO ID Gene ID
GO:0000003 Os02g0242600
GO:0000003 Os02g0268100
GO:0000003 Os02g0281000
GO:0000023 Os02g0729400
GO:0000023 Os04g0459900
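The curation in Step 1a can be sketched in R as follows. This is not the deposited script (the protocol uses Bash command lines for this step); it is a minimal base-R illustration on two toy tables. The column names `Locus_ID`, `GO`, `RAP ID`, and `Gene Ontology`, the free-text layout of the GO columns, and the helper `extract_go` are assumptions made for illustration and should be checked against the downloaded files.

```r
# Toy stand-ins for the two downloaded tables (real files have many more columns).
# Column names and the text around each GO ID are assumptions about the layout.
rap <- data.frame(
  Locus_ID = c("Os02g0242600", "Os02g0729400", "Os01g0102500"),
  GO = c("Biological Process: example term (GO:0000003)",
         "Biological Process: example term (GO:0000003), Biological Process: example term (GO:0000023)",
         ""),
  stringsAsFactors = FALSE
)
oryza <- data.frame(
  `RAP ID` = c("Os02g0729400", "Os04g0459900"),
  `Gene Ontology` = c("GO:0000023 - example term", "GO:0000023 - example term"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical helper: pull every GO ID out of a free-text column and pair it with the gene ID
extract_go <- function(df, id_col, go_col) {
  hits <- regmatches(df[[go_col]], gregexpr("GO:[0-9]{7}", df[[go_col]]))
  data.frame(GO_ID = unlist(hits),
             Gene_ID = rep(df[[id_col]], lengths(hits)),
             stringsAsFactors = FALSE)
}

# Combine both sources, drop duplicated rows, and sort as in Table 1
go_all <- unique(rbind(extract_go(rap, "Locus_ID", "GO"),
                       extract_go(oryza, "RAP ID", "Gene Ontology")))
go_all <- go_all[order(go_all$GO_ID, go_all$Gene_ID), ]
rownames(go_all) <- NULL
print(go_all)
```

Genes without any GO ID simply contribute no rows, which is why the annotated gene count ends up smaller than the input gene count.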
b. Prepare the GO ID to GO name mapping files (three in total, one for each GO subcategory). Also,
split the curated GO annotation file into three files by GO subcategory.
First, obtain the GO ID to GO name mapping information from the “GO.db” R package. This mapping provides
the TERM2NAME argument of enricher, the universal enrichment function in clusterProfiler.
The format of the file is shown in Table 2.
Next, split the GO annotation file (Step 1a) into the three subcategories, i.e., Biological Process (BP),
Molecular Function (MF), and Cellular Component (CC). These tables provide the TERM2GENE
argument of enricher. The format of these tables is
the same as that of the original GO annotation table (Table 1).
Table 2. GO ID to GO name mapping table.
GO ID GO name
GO:0000001 mitochondrion inheritance
GO:0000002 mitochondrial genome maintenance
GO:0000003 reproduction
GO:0000011 vacuole inheritance
GO:0000012 single strand break repair
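Step 1b can be sketched as below. The sketch assumes that the `GO.db` annotation object exposes the columns `GOID`, `TERM`, and `ONTOLOGY` through `AnnotationDbi::select`; when the packages are not installed it falls back to a three-row toy table so that the splitting logic can still be followed. The gene-to-term pairings in `term2gene` are invented for illustration.

```r
# Toy TERM2GENE table in the Table 1 layout; the pairings are invented
term2gene <- data.frame(
  GO_ID   = c("GO:0006412", "GO:0005515", "GO:0005634"),
  Gene_ID = c("Os01g0102500", "Os01g0106300", "Os02g0242600"),
  stringsAsFactors = FALSE
)

# GO ID -> term name and ontology, taken from GO.db when it is available
if (requireNamespace("GO.db", quietly = TRUE) &&
    requireNamespace("AnnotationDbi", quietly = TRUE)) {
  go_terms <- AnnotationDbi::select(
    GO.db::GO.db,
    keys    = AnnotationDbi::keys(GO.db::GO.db),
    columns = c("TERM", "ONTOLOGY"),
    keytype = "GOID"
  )
} else {
  # Fallback toy table with the same three columns
  go_terms <- data.frame(
    GOID     = c("GO:0006412", "GO:0005515", "GO:0005634"),
    TERM     = c("translation", "protein binding", "nucleus"),
    ONTOLOGY = c("BP", "MF", "CC"),
    stringsAsFactors = FALSE
  )
}

# TERM2NAME: one GO ID to GO name table per subcategory (Table 2 layout)
term2name_by_ont <- split(go_terms[, c("GOID", "TERM")], go_terms$ONTOLOGY)

# TERM2GENE: attach the ontology to each annotation row, then split by it
annotated <- merge(term2gene, go_terms, by.x = "GO_ID", by.y = "GOID")
term2gene_by_ont <- split(annotated[, c("GO_ID", "Gene_ID")], annotated$ONTOLOGY)

str(term2gene_by_ont)
```

Annotation rows whose GO ID is absent from the mapping table are dropped by `merge`, so it is worth comparing row counts before and after the split.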
c. Read the input gene lists and the self-provided GO annotation files. Here, the GO: BP annotation file
is used to perform GO enrichment analysis specifically for the BP category.
d. Run enricher, the universal enrichment function from the clusterProfiler package. Save and visualize the
results.
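Steps 1c and 1d can be sketched with toy inputs as below. The gene IDs, term IDs, and term names are hypothetical placeholders rather than rice annotations, and the call is guarded so that the sketch still runs where clusterProfiler is not installed. The cutoffs shown are the same default values that the enrichKEGG call later in this protocol lists.

```r
# Hypothetical background and input gene lists (in practice read from bkgd.txt and genes.txt)
bkgd  <- sprintf("gene%03d", 1:60)
genes <- bkgd[c(1:8, 20, 50)]

# Hypothetical TERM2GENE (term, gene) and TERM2NAME (term, name) tables
term2gene <- data.frame(
  term = rep(c("TERM_A", "TERM_B"), times = c(15, 30)),
  gene = bkgd[1:45],
  stringsAsFactors = FALSE
)
term2name <- data.frame(
  term = c("TERM_A", "TERM_B"),
  name = c("toy term A", "toy term B"),
  stringsAsFactors = FALSE
)

ego_df <- NULL
if (requireNamespace("clusterProfiler", quietly = TRUE)) {
  ego <- clusterProfiler::enricher(
    gene          = genes,      # input gene IDs
    universe      = bkgd,       # background gene IDs
    TERM2GENE     = term2gene,  # first column term ID, second column gene ID
    TERM2NAME     = term2name,  # first column term ID, second column term name
    pvalueCutoff  = 0.05,
    pAdjustMethod = "BH",
    qvalueCutoff  = 0.2,
    minGSSize     = 10,
    maxGSSize     = 500
  )
  # Save the result table as tab-separated text
  ego_df <- as.data.frame(ego)
  write.table(ego_df, file.path(tempdir(), "go_bp_toy.txt"),
              sep = "\t", row.names = FALSE, quote = FALSE)
}
```

Note that the column order of TERM2GENE matters: the term ID comes first and the gene ID second, which matches the layout of Table 1.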
2. GO enrichment analysis using annotations from the AnnotationHub package.
a. Read in the required input data, including the input gene lists and the OrgDb object.
First, find the rice OrgDb object in AnnotationHub. Three databases are available for the Oryza
sativa japonica subspecies; I pick the first one, as all three are very similar. Because the key types in
the OrgDb do not include the “RAP” ID type, the gene IDs are converted to the “Entrez” ID
type using the rice ID mapping table obtained from the He Lab at Fujian Agriculture and Forestry
University, China (Input data 4).
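The ID conversion in this step amounts to a table lookup. The following base-R sketch shows one way to do it; the Entrez IDs, the column names `RAP` and `ENTREZ`, and the helper `convert_ids` are hypothetical, and the real mapping table should be inspected for its actual column names.

```r
# Toy ID mapping table; the Entrez IDs are hypothetical placeholders
id_table <- data.frame(
  RAP    = c("Os01g0102500", "Os01g0106300", "Os02g0242600"),
  ENTREZ = c("1001", "1002", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up each ID, then drop unmapped and duplicated results
convert_ids <- function(ids, map, from, to) {
  mapped <- map[[to]][match(ids, map[[from]])]
  unique(mapped[!is.na(mapped)])
}

genes <- c("Os01g0102500", "Os02g0242600", "Os09g9999999")
genes_entrez <- convert_ids(genes, id_table, "RAP", "ENTREZ")

# Report how many input IDs survived the conversion
cat(sprintf("%d of %d input IDs mapped\n", length(genes_entrez), length(genes)))
```

Checking the mapped fraction is worthwhile, because IDs lost at this stage reduce the number of genes that enter the enrichment test.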
b. Run the enrichment function, enrichGO.
Select the BP ontology with the ont parameter. Here, no enriched GO terms are found in the result.
3. KEGG enrichment analysis
The clusterProfiler package can automatically retrieve KEGG annotation data from the KEGG database
when running the KEGG enrichment analysis function, enrichKEGG. The KEGG database covers a large
number of organisms, including 687 eukaryotes, 6664 bacteria, and 369 archaea, and rice is among them.
The following steps show how to run KEGG enrichment analysis with clusterProfiler using annotations
from the KEGG database.
a. Find the KEGG organism code in the KEGG database:
https://www.genome.jp/kegg/catalog/org_list.html. Two codes exist for the rice genome, “dosa” and “osa”. I
select “dosa” (KEGG Genes Database: T02163, Oryza sativa japonica (Japanese rice) (RAPDB)).
Next, check the ID type used in “dosa”. Here, the ID type is the RAP transcript ID. Accordingly, the input
gene IDs are converted to RAP transcript IDs using the RAP ID mapping table (Input data 2).
b. Run the enrichment test function enrichKEGG, and then save and visualize the results.
## R
library(clusterProfiler)
library(ggplot2)
## run enrichKEGG
kegg <- enrichKEGG(gene = genes_transID,     # vector of input gene IDs (RAP transcript IDs)
                   universe = bkgd_transID,  # background genes
                   organism = "dosa",        # KEGG organism code
                   keyType = "kegg",         # key type of the input genes
                   pvalueCutoff = 0.05,      # p-value cutoff (default)
                   pAdjustMethod = "BH",     # multiple testing correction method (default)
                   qvalueCutoff = 0.2,       # q-value cutoff (default)
                   minGSSize = 10,           # minimal size of gene sets tested (default)
                   maxGSSize = 500           # maximal size of gene sets tested (default)
)
## save results as a tab-separated file
kegg_df <- as.data.frame(kegg)
write.table(kegg_df, "output/kegg_df.txt", sep = "\t", row.names = FALSE, quote = FALSE)
## dot plot
p1 <- dotplot(kegg, showCategory = 10,
              title = "Top 10 most statistically significant enriched KEGG terms",
              font.size = 10)
ggsave(p1,
       filename = "figures/kegg_dotplot.png",
       height = 12, width = 22, units = "cm")
Results interpretation:
With the input gene list, twenty-nine GO terms are enriched in the GO enrichment analysis using the self-provided
annotation files. The result table of the top 10 most statistically significant enriched GO terms (BP), ranked by
p-value, is shown in Figure 1; the top results are visualized in a dot plot, where the terms are re-ordered by
GeneRatio (Figure 2). Note that not all genes have GO annotations, so fewer genes are used in the test:
the input gene number drops from 3368 to 1699, and the background gene number drops from 24891 to 13440.
Figure 1. Screenshot of the result table showing the top 10 most statistically significant enriched GO terms
(BP).
The terms are ordered by p-value. GeneRatio: the ratio of input genes with the target annotation to all input genes;
BgRatio: the ratio of background genes with the target annotation to all background genes; p.adjust: the multiple
testing corrected p-value, computed with the method specified by the “pAdjustMethod” argument (default: “BH”);
qvalue: the q-value, an FDR-based measure of significance estimated from the p-values.
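The “BH” correction behind the p.adjust column can be reproduced in base R. The sketch below uses five hypothetical raw p-values and compares the built-in `p.adjust` with a manual Benjamini-Hochberg calculation, as an illustration of what the adjusted column contains.

```r
# Hypothetical raw p-values for five terms
p <- c(0.001, 0.008, 0.039, 0.041, 0.20)

# Built-in Benjamini-Hochberg adjustment
p_bh <- p.adjust(p, method = "BH")

# Manual version: scale the i-th smallest p-value by m / i,
# then enforce monotonicity from the largest p-value downwards
m <- length(p)
o <- order(p)
manual <- numeric(m)
manual[o] <- pmin(1, rev(cummin(rev(p[o] * m / seq_len(m)))))

print(data.frame(p = p, p_bh = p_bh, manual = manual))
```

In this toy example the third term is scaled to 0.065 but is reported as 0.05125, because an adjusted value may not exceed that of any term with a larger raw p-value.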
Figure 2. Visualizing the top 10 most statistically significant enriched GO terms (BP) by dot plot.
The terms are ordered by GeneRatio (the ratio of the input genes with the target GO annotation to the total number
of input genes).
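The re-ordering by GeneRatio can be sketched as follows, assuming the result table stores GeneRatio as text of the form "k/n". The table contents and the helper `ratio_to_numeric` are hypothetical.

```r
# Toy result table; descriptions, counts, and p-values are hypothetical
res <- data.frame(
  Description = c("term A", "term B", "term C"),
  GeneRatio   = c("12/1699", "45/1699", "30/1699"),
  pvalue      = c(1e-6, 1e-4, 1e-3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn "k/n" strings into numeric ratios
ratio_to_numeric <- function(x) {
  parts <- strsplit(x, "/", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) / as.numeric(p[2]), numeric(1))
}
res$GeneRatioNum <- ratio_to_numeric(res$GeneRatio)

# Keep the 10 most significant terms, then order them by GeneRatio for plotting
top <- head(res[order(res$pvalue), ], 10)
top <- top[order(top$GeneRatioNum, decreasing = TRUE), ]
print(top[, c("Description", "GeneRatio", "pvalue")])
```

Selecting by p-value first and sorting by ratio second keeps the same set of terms in the table and in the plot while changing only their display order.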
In the KEGG enrichment analysis result, eleven KEGG terms are enriched, and the top 10 most statistically significant
enriched KEGG pathways are listed in the table, ranked by p-value (Figure 3). These enriched KEGG pathways
are visualized in a dot plot, where the terms are re-ordered by GeneRatio (Figure 4).
Figure 3. Screenshot of the result table showing the top 10 most statistically significant enriched KEGG
pathway terms.
The terms are ordered by p-value. GeneRatio: the ratio of input genes with the target annotation to all input genes;
BgRatio: the ratio of background genes with the target annotation to all background genes; p.adjust: the multiple
testing corrected p-value, computed with the method specified by the “pAdjustMethod” argument (default: “BH”);
qvalue: the q-value, an FDR-based measure of significance estimated from the p-values.
Figure 4. Visualizing the top 10 most statistically significant enriched KEGG pathway terms by dot plot.
The terms are ordered by GeneRatio (the ratio of the input genes with the target KEGG annotation to the total
number of input genes).
Discussion:
Here, I provide a detailed protocol for GO and KEGG enrichment analysis of rice genes. I
find that the self-collected GO annotation files outperform the annotation from AnnotationHub, so the
self-collected files are the preferred approach here. The self-provided GO annotation files are collected from two rice
annotation databases, RAP-DB and OryzaBase. If other well-maintained sources of annotation exist, they
should be added as well to obtain a more comprehensive result.
Rice is relatively well annotated among plant species, which makes curating an annotation file from public sources
possible. When no annotation files are available for the target species, researchers can obtain
predicted gene annotations based on protein sequence alignment to annotated species, using software
such as eggNOG-mapper and InterProScan. Lastly, it is worth noting that several web applications have been developed
for functional enrichment analysis with built-in annotation files for some plant species, such as agriGO and
g:Profiler. Researchers can refer to those tools to check whether the plant species under study is covered.
Acknowledgments
The author thanks the He Lab at Fujian Agriculture and Forestry University, China, for the RAP ID to Entrez ID
conversion table used in this protocol.
Competing interests
The author declares no conflict of interest.
References
Yu, G., Wang, L. G., Han, Y. and He, Q. Y. (2012). clusterProfiler: an R package for comparing biological themes
among gene clusters. OMICS 16(5): 284-287.
Carlson, M. (2019). GO.db: A set of annotation maps describing the entire Gene Ontology. R package version 3.8.2.
Morgan, M. and Shepherd, L. (2022). AnnotationHub: Client to access AnnotationHub resources. R package version
3.4.0.
Supplementary information
1. Data and code availability: All data and code have been deposited to GitHub:
https://github.com/Bio-protocol/GO_KEGG_enrichment_analysis_for_rice_genes.git.

More Related Content

What's hot

What's hot (20)

Finding ORF
Finding ORFFinding ORF
Finding ORF
 
Structure of plasma membrane
Structure of plasma membraneStructure of plasma membrane
Structure of plasma membrane
 
An Introduction to "Bioinformatics & Internet"
An Introduction to "Bioinformatics & Internet"An Introduction to "Bioinformatics & Internet"
An Introduction to "Bioinformatics & Internet"
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
Kegg databse
Kegg databseKegg databse
Kegg databse
 
Kegg database resources
Kegg database resources Kegg database resources
Kegg database resources
 
Identification Of Recombinants In Bacterial Cells
Identification Of Recombinants In Bacterial CellsIdentification Of Recombinants In Bacterial Cells
Identification Of Recombinants In Bacterial Cells
 
Genomic databases
Genomic databasesGenomic databases
Genomic databases
 
BioEdit
BioEditBioEdit
BioEdit
 
Intro to databases
Intro to databasesIntro to databases
Intro to databases
 
Multiple Sequence Alignment Tool Using NCBI COBALT
Multiple Sequence Alignment Tool Using NCBI COBALTMultiple Sequence Alignment Tool Using NCBI COBALT
Multiple Sequence Alignment Tool Using NCBI COBALT
 
DNA data bank of japan (DDBJ)
DNA data bank of japan (DDBJ)DNA data bank of japan (DDBJ)
DNA data bank of japan (DDBJ)
 
TOOLS AND DATA BASES OF NCBI
TOOLS AND DATA BASES OF NCBITOOLS AND DATA BASES OF NCBI
TOOLS AND DATA BASES OF NCBI
 
sequence of file formats in bioinformatics
sequence of file formats in bioinformaticssequence of file formats in bioinformatics
sequence of file formats in bioinformatics
 
Tools and database of NCBI
Tools and database of NCBITools and database of NCBI
Tools and database of NCBI
 
Cell wall in plants
Cell wall in plantsCell wall in plants
Cell wall in plants
 
PubChem and Big Data Chemistry
PubChem and Big Data ChemistryPubChem and Big Data Chemistry
PubChem and Big Data Chemistry
 
9. the origin of cells
9. the origin of cells9. the origin of cells
9. the origin of cells
 
CELL CYCLE
CELL CYCLECELL CYCLE
CELL CYCLE
 
The Gene Ontology & Gene Ontology Annotation resources
The Gene Ontology & Gene Ontology Annotation resourcesThe Gene Ontology & Gene Ontology Annotation resources
The Gene Ontology & Gene Ontology Annotation resources
 

Similar to Rice GO/KEGG Enrichment Using ClusterProfiler

Functional annotation
Functional annotationFunctional annotation
Functional annotationRavi Gandham
 
Cool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchCool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchDavid Ruau
 
ICAR2016 TAIR talk
ICAR2016 TAIR talkICAR2016 TAIR talk
ICAR2016 TAIR talkDonghui Li
 
Role of bioinformatics in life sciences research
Role of bioinformatics in life sciences researchRole of bioinformatics in life sciences research
Role of bioinformatics in life sciences researchAnshika Bansal
 
TAIR -Using biological ontologies to accelerate progress in plant biology res...
TAIR -Using biological ontologies to accelerate progress in plant biology res...TAIR -Using biological ontologies to accelerate progress in plant biology res...
TAIR -Using biological ontologies to accelerate progress in plant biology res...Phoenix Bioinformatics
 
BITS: Overview of important biological databases beyond sequences
BITS: Overview of important biological databases beyond sequencesBITS: Overview of important biological databases beyond sequences
BITS: Overview of important biological databases beyond sequencesBITS
 
Automated data pipelines at the rat genome database
Automated data pipelines at the rat genome databaseAutomated data pipelines at the rat genome database
Automated data pipelines at the rat genome databaseJennifer Smith
 
Rap db(rice annotation project data base)
Rap db(rice annotation project data base)Rap db(rice annotation project data base)
Rap db(rice annotation project data base)PrajaktaKale17
 
BioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge
BioThings API: Building a FAIR API Ecosystem for Biomedical KnowledgeBioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge
BioThings API: Building a FAIR API Ecosystem for Biomedical KnowledgeChunlei Wu
 
XMLPipeDB
XMLPipeDBXMLPipeDB
XMLPipeDBbosc
 
LODを用いたバイオインフォマティクスアプリケーション
LODを用いたバイオインフォマティクスアプリケーションLODを用いたバイオインフォマティクスアプリケーション
LODを用いたバイオインフォマティクスアプリケーションKazuki Oshita
 

Similar to Rice GO/KEGG Enrichment Using ClusterProfiler (20)

GoTermsAnalysisWithR
GoTermsAnalysisWithRGoTermsAnalysisWithR
GoTermsAnalysisWithR
 
Functional annotation
Functional annotationFunctional annotation
Functional annotation
 
Cool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchCool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical Research
 
ICAR2016 TAIR talk
ICAR2016 TAIR talkICAR2016 TAIR talk
ICAR2016 TAIR talk
 
Role of bioinformatics in life sciences research
Role of bioinformatics in life sciences researchRole of bioinformatics in life sciences research
Role of bioinformatics in life sciences research
 
TAIR -Using biological ontologies to accelerate progress in plant biology res...
TAIR -Using biological ontologies to accelerate progress in plant biology res...TAIR -Using biological ontologies to accelerate progress in plant biology res...
TAIR -Using biological ontologies to accelerate progress in plant biology res...
 
call for papers, research paper publishing, where to publish research paper, ...
call for papers, research paper publishing, where to publish research paper, ...call for papers, research paper publishing, where to publish research paper, ...
call for papers, research paper publishing, where to publish research paper, ...
 
BITS: Overview of important biological databases beyond sequences
BITS: Overview of important biological databases beyond sequencesBITS: Overview of important biological databases beyond sequences
BITS: Overview of important biological databases beyond sequences
 
Automated data pipelines at the rat genome database
Automated data pipelines at the rat genome databaseAutomated data pipelines at the rat genome database
Automated data pipelines at the rat genome database
 
Rap db(rice annotation project data base)
Rap db(rice annotation project data base)Rap db(rice annotation project data base)
Rap db(rice annotation project data base)
 
Harvester I
Harvester IHarvester I
Harvester I
 
BioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge
BioThings API: Building a FAIR API Ecosystem for Biomedical KnowledgeBioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge
BioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge
 
3302 3305
3302 33053302 3305
3302 3305
 
XMLPipeDB
XMLPipeDBXMLPipeDB
XMLPipeDB
 
Biological databases
Biological databasesBiological databases
Biological databases
 
Major biological nucleotide databases
Major biological nucleotide databasesMajor biological nucleotide databases
Major biological nucleotide databases
 
LODを用いたバイオインフォマティクスアプリケーション
LODを用いたバイオインフォマティクスアプリケーションLODを用いたバイオインフォマティクスアプリケーション
LODを用いたバイオインフォマティクスアプリケーション
 
Rp 3010 5814
Rp 3010 5814Rp 3010 5814
Rp 3010 5814
 
Retrieval and Statistical Analysis of Genbank Data (RASA-GD)
Retrieval and Statistical Analysis of Genbank Data (RASA-GD)Retrieval and Statistical Analysis of Genbank Data (RASA-GD)
Retrieval and Statistical Analysis of Genbank Data (RASA-GD)
 
bioinformatics enabling knowledge generation from agricultural omics data
bioinformatics enabling knowledge generation from agricultural omics databioinformatics enabling knowledge generation from agricultural omics data
bioinformatics enabling knowledge generation from agricultural omics data
 

Recently uploaded

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 

Recently uploaded (20)

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 

Rice GO/KEGG Enrichment Using ClusterProfiler

  • 1. Copyright: © 2022 The Authors; exclusive licensee Bio-protocol LLC. ` 1 GO/KEGG Enrichment Analysis on Gene Lists from Rice (Oryza Sativa) Yahui Li* Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, China *For correspondence: yahuili2009@gmail.com Abstract Keywords: GO, KEGG, Functional enrichment analysis, clusterProfiler, Rice, RNA-seq In RNA-seq data analysis, functional enrichment analysis on genes has become a routine. Many enrichment analysis software and web-applications have emerged. However, gene annotation information is only easily accessible for the most well-studied organisms, such as human and mouse, but is lacking for some plant species. With poor gene annotation information, performing a functional enrichment analysis is challenging. As such, I use rice, a mode plant organism, as an example to show how to obtain comprehensive Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway annotation for the enrichment analysis. I obtain the gene annotation information from two sources, 1. rice public annotation databases, including RAP-DB and OryzaBase; and 2. a R package containing gene annotation information of various species, i.e., AnnotationHub. I utilize clusterProfiler R package for the enrichment calculation and result visualization. This protocol can be directly used for GO/KEGG enrichment analysis on gene lists from rice, and can also be used as a reference for similar analysis on other plant species.
  • 2. Background
      RNA-seq data analysis has been streamlined, and functional enrichment analysis is a critical step for drawing biological insight from the results. Enrichment analysis, also called over-representation analysis, examines whether a gene ontology term or a biological pathway occurs in the target gene list more often than expected by chance. Many tools bundle both annotation files and enrichment test functions to streamline this process. However, some plant species still lack gene annotation information, which can be an obstacle to functional enrichment analysis. For instance, only 20 GO annotation databases were available as OrgDb packages from Bioconductor, and only one of them covers a plant species, Arabidopsis. In this protocol, I focus on functional enrichment analysis of genes from rice, a model organism for the grass family, using clusterProfiler (Yu et al., 2012), one of the most commonly used R packages for enrichment analysis. I provide step-by-step instructions using annotation information obtained in two different ways. The scripts are mainly R scripts, with some Bash command lines for curating a GO annotation file.
      Software
      1. clusterProfiler (Yu et al., 2012; v3.16.1; https://guangchuangyu.github.io/software/clusterProfiler/documentation/)
      2. GO.db (Carlson et al., 2019; v3.11.4; https://bioconductor.org/packages/release/data/annotation/html/GO.db.html)
      3. AnnotationHub (Morgan et al., 2021; v2.20.1; https://bioconductor.org/packages/release/bioc/vignettes/AnnotationHub/inst/doc/AnnotationHub.html)
      4. dplyr (R package, v1.0.7)
      5. data.table (R package, v1.14.0)
      6. ggplot2 (R package, v3.3.5)
      Input data:
      1. Target gene list (genes.txt) and background gene list (bkgd.txt; optional but recommended). The gene IDs in this protocol are RAP IDs, e.g., Os01g0102500, Os01g0106300.
      2. The gene annotation file from The Rice Annotation Project Database (RAP-DB), which includes the GO annotation information and the RAP gene ID to transcript ID conversion information. This file is a large data table in which each row is an individual transcript ID and each column holds one type of gene annotation; the "GO" column contains the GO annotations, which are extracted into the self-curated annotation files. https://rapdb.dna.affrc.go.jp/download/archive/irgsp1/IRGSP-1.0_representative_annotation_2021-11-11.tsv.gz
      3. The gene annotation file from the OryzaBase website. This file is also a large data table, in which each row is a "Trait Gene ID" annotated with a "RAP ID" and a "Gene Ontology" column; these are used to generate the self-curated annotation files. https://shigen.nig.ac.jp/rice/oryzabase/download/gene
      4. RAP ID to Entrez ID conversion table from the He Lab at Fujian Agriculture and Forestry University, China. http://bioinformatics.fafu.edu.cn/riceidtable/
      Procedure
      Case study:
      1. GO enrichment analysis using self-curated annotation files.
         a. Prepare rice gene GO annotation files curated from public annotation databases. Extract the rice gene GO annotations from two public rice gene annotation databases, RAP-DB and OryzaBase, and combine them into a single GO annotation file. The RAP-DB GO annotation is collected as follows. First, download the genome annotation file from the RAP-DB website and unzip it.
  • 3. Next, extract the GO annotation information and write it as a two-column tabular file (GO ID, Gene ID). The OryzaBase GO annotation is gathered in the same way. The two files are then combined into one file, with duplicated rows removed. The format of the output annotation file is shown in Table 1.
      Table 1. Gene annotation table containing the GO ID to gene ID mapping information.
      GO ID        Gene ID
      GO:0000003   Os02g0242600
      GO:0000003   Os02g0268100
      GO:0000003   Os02g0281000
      GO:0000023   Os02g0729400
      GO:0000023   Os04g0459900
         b. Prepare the GO ID to GO name mapping files (three in total, one for each GO subcategory), and split the curated GO annotation file into three files by GO subcategory. First, obtain the GO ID to GO name mapping information from the “GO.db” R package. This file supplies the TERM2NAME information to the universal enrichment function in clusterProfiler. The format of the file is shown in Table 2.
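Step 1a (pulling GO IDs out of each source and merging the two tables with duplicates removed) can be sketched in base R as below. This is a minimal illustration on toy data, not the protocol's own script: the column names `gene_id` and `go`, the free-text layout of the GO field, and the helper `extract_go` are all assumptions made for the example. The gene-to-GO pairs reuse rows from Table 1.

```r
# Toy stand-ins for the two downloaded annotation tables.
# Column names and the layout of the GO text are hypothetical.
rap <- data.frame(
  gene_id = c("Os02g0242600", "Os02g0729400", "Os04g0459900"),
  go      = c("term (GO:0000003)", "term (GO:0000023)", NA),
  stringsAsFactors = FALSE
)
oryza <- data.frame(
  gene_id = c("Os02g0242600", "Os02g0268100", "Os04g0459900"),
  go      = c("GO:0000003", "GO:0000003", "GO:0000023"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return one (GO ID, gene ID) row per GO token found.
extract_go <- function(gene_id, go_text) {
  go_text[is.na(go_text)] <- ""                    # genes without annotation yield no rows
  hits <- regmatches(go_text, gregexpr("GO:[0-9]{7}", go_text))
  data.frame(
    go_id   = as.character(unlist(hits)),
    gene_id = rep(gene_id, lengths(hits)),
    stringsAsFactors = FALSE
  )
}

# Combine both sources and drop duplicated rows.
go_annot <- unique(rbind(extract_go(rap$gene_id, rap$go),
                         extract_go(oryza$gene_id, oryza$go)))
go_annot <- go_annot[order(go_annot$go_id, go_annot$gene_id), ]
rownames(go_annot) <- NULL
print(go_annot)
```

Matching on the `GO:` token pattern rather than on a fixed delimiter keeps the sketch independent of how each database formats its GO field, which should be checked against the actual downloads.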
  • 4. Next, split the GO annotation file (Step 1a) into the three GO subcategories, i.e., Biological Process (BP), Molecular Function (MF), and Cellular Component (CC). These tables supply the TERM2GENE information to the universal enrichment function in clusterProfiler. Their format is the same as that of the original GO annotation table (Table 1).
  • 5. Table 2. GO ID to GO name mapping table.
      GO ID        GO name
      GO:0000001   mitochondrion inheritance
      GO:0000002   mitochondrial genome maintenance
      GO:0000003   reproduction
      GO:0000011   vacuole inheritance
      GO:0000012   single strand break repair
         c. Read the input gene lists and the self-provided GO annotation files. Here, the GO BP annotation file is used, so the GO enrichment analysis is specific to the BP category.
         d. Run the universal enrichment function from the clusterProfiler package, enricher. Save and visualize the results.
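To make the over-representation test in step d concrete, the base-R sketch below computes a hypergeometric tail probability for one term from a TERM2GENE-style table. It is an illustration of the kind of test involved, not a reproduction of the enricher internals; the toy term and gene names and the helper `ora_one_term` are hypothetical. Restricting both lists to annotated genes is an assumption chosen to match the protocol's note that genes without GO annotation are not used in the test.

```r
# Toy TERM2GENE-style table (hypothetical terms and genes).
term2gene <- data.frame(
  term = c(rep("GO:A", 4), rep("GO:B", 3)),
  gene = c("g1", "g2", "g3", "g4", "g4", "g5", "g6"),
  stringsAsFactors = FALSE
)
genes    <- c("g1", "g2", "g3", "g9")   # target list; g9 has no annotation
universe <- paste0("g", 1:12)           # background list

# Hypothetical helper: over-representation test for a single term.
ora_one_term <- function(term, genes, universe, term2gene) {
  annotated <- unique(term2gene$gene)
  universe  <- intersect(universe, annotated)   # keep annotated background genes only
  genes     <- intersect(genes, universe)       # keep annotated target genes only
  in_term   <- intersect(term2gene$gene[term2gene$term == term], universe)
  k <- length(intersect(genes, in_term))        # target genes carrying the term
  n <- length(genes)                            # testable target genes
  M <- length(in_term)                          # background genes carrying the term
  N <- length(universe)                         # testable background genes
  p <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)  # P(X >= k)
  data.frame(term      = term,
             GeneRatio = paste0(k, "/", n),
             BgRatio   = paste0(M, "/", N),
             pvalue    = p,
             stringsAsFactors = FALSE)
}

res <- ora_one_term("GO:A", genes, universe, term2gene)
print(res)
```

In this toy case the unannotated gene g9 drops out, so the test sees 3 target genes against 6 background genes, which is the same kind of shrinkage as the 3368 to 1699 reduction reported for the real data.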
  • 6. 2. GO enrichment analysis using annotations from the AnnotationHub package.
         a. Read in the required input data, including the input gene lists and the OrgDb object. First, find the rice OrgDb object in AnnotationHub. Three databases are available for the Oryza sativa japonica subspecies; I pick the first one, as all three are very similar. Because the key types in the OrgDb do not include the “RAP” ID type, the gene IDs are converted to the “Entrez” ID type using a rice ID mapping table obtained from the He Lab at Fujian Agriculture and Forestry University, China (Input data 4).
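The ID conversion in step 2a amounts to a table lookup. The sketch below shows one way to do it in base R while keeping track of genes that fail to map. The mapping table's column names (`rap`, `entrez`), the Entrez values, and the unmapped gene ID are made-up placeholders, since the real conversion table's layout is not described here.

```r
# Toy ID mapping table; column names and Entrez values are placeholders.
id_map <- data.frame(
  rap    = c("Os01g0102500", "Os01g0106300", "Os02g0242600"),
  entrez = c("1000001", "1000002", "1000003"),
  stringsAsFactors = FALSE
)

# Input gene list in RAP IDs; the second ID is a made-up unmapped example.
genes <- c("Os01g0106300", "Os09g9999999", "Os01g0102500")

# Look up each RAP ID; IDs missing from the table come back as NA.
entrez <- id_map$entrez[match(genes, id_map$rap)]

unmapped     <- genes[is.na(entrez)]          # report these rather than dropping silently
genes_entrez <- unique(entrez[!is.na(entrez)])

print(genes_entrez)
print(unmapped)
```

Counting the unmapped IDs before running the enrichment is worthwhile, because a large loss at this step shrinks the tested gene list and could contribute to an empty result.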
  • 7. b. Run the enrichment function, enrichGO, selecting the BP ontology in the input parameters. Here, no enriched GO terms are found in the result.
  • 8. 3. KEGG enrichment analysis
      The clusterProfiler package can automatically retrieve KEGG annotation data from the KEGG database when running the KEGG enrichment analysis function, enrichKEGG. The KEGG database covers a large number of organisms, including 687 eukaryotes, 6664 bacteria, and 369 archaea, and rice is among them. The following steps show how to run KEGG enrichment analysis with clusterProfiler using annotations from the KEGG database.
         a. Find the KEGG organism code in the KEGG database (https://www.genome.jp/kegg/catalog/org_list.html). There are two entries for the rice genome, “dosa” and “osa”. I select “dosa” (KEGG Genes Database: T02163, Oryza sativa japonica (Japanese rice) (RAPDB)). Next, check the ID type used in “dosa”; here, it is the RAP transcript ID. Accordingly, the input gene IDs are converted to RAP transcript IDs using the RAP ID mapping table (Input data 2).
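Because the annotation table in Input data 2 has one row per transcript, the gene-to-transcript conversion in step 3a can be done by filtering rows on the gene ID. The sketch below assumes column names `Transcript_ID` and `Locus_ID` and uses illustrative transcript IDs; both are assumptions to verify against the downloaded file.

```r
# Toy slice of a per-transcript annotation table.
# Column names and transcript IDs are assumed for illustration.
annot <- data.frame(
  Transcript_ID = c("Os01t0102500-01", "Os01t0106300-01", "Os01t0106300-02"),
  Locus_ID      = c("Os01g0102500", "Os01g0106300", "Os01g0106300"),
  stringsAsFactors = FALSE
)

genes <- c("Os01g0106300", "Os01g0102500")   # input gene list in RAP gene IDs

# Keep every transcript whose parent gene is in the input list.
# A gene with several transcripts contributes several IDs.
genes_transID <- unique(annot$Transcript_ID[annot$Locus_ID %in% genes])

print(genes_transID)
```

The same filter applied to the background list would give the `bkgd_transID` vector. Whether to keep all transcripts of a gene or only one per gene is a design choice that changes the counts entering the test, so it should be applied identically to both lists.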
  • 9. b. Run the enrichment test function enrichKEGG, and then save and visualize the results.
      ## R
      library(clusterProfiler)  # provides enrichKEGG() and dotplot()
      library(ggplot2)          # provides ggsave()
      ## run enrichKEGG
      kegg <- enrichKEGG(gene = genes_transID,     # a vector of gene IDs
                         universe = bkgd_transID,  # background genes
                         organism = "dosa",        # KEGG organism code
                         keyType = "kegg",         # key type of the input genes
                         pvalueCutoff = 0.05,      # p-value cutoff (default)
                         pAdjustMethod = "BH",     # multiple testing correction method for the adjusted p-value (default)
                         qvalueCutoff = 0.2,       # q-value cutoff (default)
                         minGSSize = 10,           # minimal size of genes annotated for testing (default)
                         maxGSSize = 500           # maximal size of genes annotated for testing (default)
      )
      ## save results (the output/ directory must already exist)
      kegg_df <- as.data.frame(kegg)
      write.table(kegg_df, "output/kegg_df.txt", sep = "\t", row.names = FALSE, quote = FALSE)
      ## dot plot (the figures/ directory must already exist)
      p1 <- dotplot(kegg, showCategory = 10, title = "Top 10 most statistically significant enriched KEGG terms", font.size = 10)
      ggsave(p1, filename = "figures/kegg_dotplot.png", height = 12, width = 22, units = "cm")
  • 10. Results interpretation:
      With the input gene list, twenty-nine GO terms are enriched in the GO enrichment analysis using the self-provided annotation files. The result table of the top 10 most statistically significant enriched GO terms (BP), ranked by p-value, is shown in Figure 1; the top results are visualized in a dot plot, where the terms are re-ordered by GeneRatio (Figure 2). Note that not all genes have GO annotations, so fewer genes are used in the test: the input gene number drops from 3368 to 1699, and the background gene number drops from 24891 to 13440.
      Figure 1. Screenshot of the result table showing the top 10 most statistically significant enriched GO terms (BP). The terms are ordered by p-value. GeneRatio: the ratio of input genes with the target annotation to all input genes; BgRatio: the ratio of background genes with the target annotation to all background genes; p.adjust: the multiple-testing-corrected p-value, using the method specified by the “pAdjustMethod” argument (default: “BH”); qvalue: the q-value, an FDR-based significance measure.
      Figure 2. Visualizing the top 10 most statistically significant enriched GO terms (BP) by dot plot. The terms are ordered by GeneRatio (the ratio of the input genes with the target GO annotation to the total number of input genes).
      In the KEGG enrichment analysis, eleven KEGG terms are enriched, and the top 10 most statistically significant enriched KEGG pathways are listed in the result table, ranked by p-value (Figure 3). These enriched KEGG pathways are visualized in a dot plot, where the terms are re-ordered by GeneRatio (Figure 4).
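The p.adjust column and the GeneRatio ordering described above can be reproduced on a small result table with base R alone. The sketch below uses invented terms, ratios, and p-values, and an illustrative 0.05 cutoff on the adjusted p-value. It is meant only to show how BH adjustment and re-ordering by GeneRatio interact, not to mirror the exact filtering inside clusterProfiler.

```r
# Toy enrichment result; all values are invented for illustration.
res <- data.frame(
  ID        = c("term1", "term2", "term3", "term4"),
  GeneRatio = c("12/100", "30/100", "5/100", "8/100"),
  pvalue    = c(0.001, 0.020, 0.030, 0.400),
  stringsAsFactors = FALSE
)

# Benjamini-Hochberg adjustment, the "BH" method named in the figure captions.
res$p.adjust <- p.adjust(res$pvalue, method = "BH")

# GeneRatio is stored as text such as "12/100"; convert it to a number.
ratio_to_num <- function(x) {
  vapply(strsplit(x, "/"),
         function(p) as.numeric(p[1]) / as.numeric(p[2]),
         numeric(1))
}
res$ratio <- ratio_to_num(res$GeneRatio)

# Keep significant terms, then order them by GeneRatio as in the dot plots.
sig <- res[res$p.adjust < 0.05, ]
sig <- sig[order(sig$ratio, decreasing = TRUE), ]

print(sig[, c("ID", "GeneRatio", "pvalue", "p.adjust")])
```

Here the term with the smallest p-value is not the one plotted first, which is why the tables (ranked by p-value) and the dot plots (ordered by GeneRatio) can list the same terms in different orders.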
  • 11. Figure 3. Screenshot of the result table showing the top 10 most statistically significant enriched KEGG pathway terms. The terms are ordered by p-value. GeneRatio: the ratio of input genes with the target annotation to all input genes; BgRatio: the ratio of background genes with the target annotation to all background genes; p.adjust: the multiple-testing-corrected p-value, using the method specified by the “pAdjustMethod” argument (default: “BH”); qvalue: the q-value, an FDR-based significance measure.
      Figure 4. Visualizing the top 10 most statistically significant enriched KEGG pathway terms by dot plot. The terms are ordered by GeneRatio (the ratio of the input genes with the target KEGG annotation to the total number of input genes).
      Discussion:
      Here, I provide a detailed protocol for GO and KEGG enrichment analysis of rice genes. I find that the self-collected GO annotation files outperform the annotation from AnnotationHub, so the self-collected files are the preferred approach here. The self-provided GO annotation files are collected from two rice annotation databases, RAP-DB and OryzaBase. If other well-maintained annotation sources exist, they should be added as well to obtain a more comprehensive result. Rice is relatively well annotated among plant species, which makes it possible to curate an annotation file from public sources. When no annotation files are available for the target species, researchers can instead obtain predicted gene annotations by aligning protein sequences to those of annotated species, using software such as eggNOG-mapper or InterProScan. Lastly, it is worth noting that several web applications for functional enrichment analysis, such as agriGO and g:Profiler, come with built-in annotation files for some plant species.
  • 12. Researchers can refer to those tools to check whether the plant species under study is covered.
      Acknowledgments
      The author thanks the He Lab at Fujian Agriculture and Forestry University, China, for its RAP ID to Entrez ID conversion table, which was used in this protocol.
      Competing interests
      The author declares no conflict of interest.
      References
      Yu, G., Wang, L. G., Han, Y. and He, Q. Y. (2012). clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS 16(5): 284-287.
      Carlson, M. (2019). GO.db: A set of annotation maps describing the entire Gene Ontology. R package version 3.8.2.
      Morgan, M. and Shepherd, L. (2022). AnnotationHub: Client to access AnnotationHub resources. R package version 3.4.0.
      Supplementary information
      1. Data and code availability: All data and code have been deposited to GitHub: https://github.com/Bio-protocol/GO_KEGG_enrichment_analysis_for_rice_genes.git.