Gene identification using bioinformatic tools
Okechukwu Francis
Programme: PhD Biotechnology
SCHOOL OF HEALTH SCIENCE AND TECHNOLOGY (SoHST)
What is a Gene?
A gene is a region of DNA that encodes a functional product
(e.g. a protein, or an RNA such as mRNA or tRNA). Genes may
overlap one another, and they are responsible for the
inheritance of physical features.
What is a Genome?
This is the complete set of genetic material (DNA, or RNA
in some viruses) of an organism (bacterium, human, etc.).
In people, almost every cell in the body contains a
complete copy of the genome. The genome contains
all of the information needed for a person to develop
and grow.
What is RefSeq?
RefSeq (the NCBI Reference Sequence database) is an open-access, annotated and curated
collection of publicly available nucleotide sequences and their protein products. RefSeq was
first introduced in 2000.
Why is RefSeq important?
RefSeq sequences form a foundation for medical, functional, and diversity research. They
provide a stable reference for genome annotation, gene identification and characterization,
mutation and polymorphism analysis (especially RefSeqGene records), expression studies,
and comparative analyses.
For many organisms, a sequenced sample can therefore be compared with the corresponding
RefSeq record, and differences between the two can give important information, such as:
• Mutations in cancer patients
• Identification of the bacteria responsible for an infection
• Comparison between strains of an organism.
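As a rough illustration of comparing a sample with a reference, the Python sketch below lists the point differences between two sequences. It assumes the two sequences are already aligned and of equal length; the function name and the example sequences are hypothetical and are not taken from RefSeq.

```python
def point_differences(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for every mismatch.

    Assumption: both sequences are already aligned and have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be pre-aligned to the same length")
    pairs = zip(reference.upper(), sample.upper())
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(pairs, start=1)
        if ref_base != sample_base
    ]


# Hypothetical toy sequences, for illustration only.
reference = "ATGGCCATTGTAATGGGCCGC"
sample = "ATGGCCATTGTTATGGGCCGA"

for position, ref_base, sample_base in point_differences(reference, sample):
    print(f"position {position}: {ref_base} -> {sample_base}")
```

Real comparisons normally start from an alignment (for example with BLAST), because insertions and deletions shift positions and cannot be detected by a base-by-base comparison like this one.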
How to identify genes using bioinformatics tools
1. Sequencing: Obtain DNA or RNA sequence data from the organism of interest through
wet-lab work, using a sequencing technology such as Sanger sequencing or next-generation
sequencing (NGS).
2. Preprocessing: Clean and process the sequence data to remove errors and artefacts
introduced during sequencing (for example low-quality bases), as well as any sequences
that do not align with the reference genome or transcriptome.
3. Gene identification and annotation: Annotate the genome or transcriptome to identify
potential genes and their features, such as coding regions, exons, introns, promoters, and
regulatory elements. This can be done using resources such as Ensembl, NCBI's RefSeq, or
the UCSC Genome Browser.
4. Gene prediction: Use computational methods to predict the location and structure of
genes within the genome or transcriptome. This can be done using tools such as Augustus,
GeneMark, or Glimmer.
5. Validation: Validate the predicted genes by comparing them to known genes in related
organisms, or by using experimental techniques such as RNA sequencing or reverse
transcription polymerase chain reaction (RT-PCR) to confirm their expression.
6. Functional analysis: Once genes have been identified, further analysis can be done to
determine their functions, interactions, and pathways. This can be done using resources
such as the Gene Ontology (GO) and KEGG Pathway databases.
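The preprocessing step can be sketched in Python as a simple 3' quality trim. This is a minimal sketch, assuming FASTQ-style quality strings with the common Phred+33 encoding and a hypothetical threshold of Q20. Dedicated read-trimming tools do considerably more than this.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string to Phred scores (assumes Phred+33 encoding)."""
    return [ord(character) - offset for character in quality]


def trim_3prime(sequence: str, quality: str, min_q: int = 20) -> tuple[str, str]:
    """Remove bases from the 3' end for as long as their quality is below min_q."""
    scores = phred_scores(quality)
    end = len(sequence)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return sequence[:end], quality[:end]


# Hypothetical read: eight high-quality bases followed by two poor ones.
read_sequence = "ACGTACGTAC"
read_quality = "IIIIIIII#!"

trimmed_sequence, trimmed_quality = trim_3prime(read_sequence, read_quality)
print(trimmed_sequence, trimmed_quality)
```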
What is Genscan?
GENSCAN is a bioinformatics program for identifying complete gene structures in genomic DNA. It was
developed by Chris Burge and Samuel Karlin at Stanford University, and its web server has been hosted at
MIT. It is a Generalized Hidden Markov Model (GHMM) based program that can be used to predict the
location of genes and their exon-intron boundaries in genomic sequences from a variety of organisms
(mostly vertebrates, Arabidopsis and maize).
Promoter region identification
Promoter regions are DNA sequences, typically located just upstream of a gene's
transcription start site, that regulate gene expression by binding transcription factors and
RNA polymerase to initiate transcription.
Promoter region identification can be done using various bioinformatics tools and approaches, such as:
1. Promoter prediction tools: These are software tools that predict promoter regions based on sequence
features such as GC content, TATA boxes, and transcription factor binding sites. Examples of promoter
prediction tools include PromoterScan, Neural Network Promoter Prediction (NNPP), and CpG Island
Searcher.
2. Comparative genomics: This involves comparing the genomes of related organisms to identify
conserved promoter regions. Promoter regions are often more conserved across species than other
non-coding regions of the genome, which can help identify potential promoter regions. Comparative
genomics can be done using tools such as BLAST and ClustalW.
3. Chromatin immunoprecipitation (ChIP): This is an experimental technique that can be used to identify
protein-DNA interactions in vivo, including transcription factor binding to promoter regions. ChIP can be
used to identify known and novel promoter regions in a specific cell type or tissue.
4. Gene expression analysis: Promoter activity is reflected in gene expression levels, and genes with
similar expression patterns may share common promoter motifs. Gene expression analysis, such as
RNA sequencing or microarray analysis, can be used to identify potential promoter regions based on
co-expression patterns.
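Two of the sequence features used by promoter prediction tools, GC content and TATA boxes, can be illustrated with a short Python sketch. It assumes a simplified TATA-box consensus (TATA, then A or T, then A, then A or T) and a hypothetical upstream sequence. Real promoter predictors combine many more signals and should be preferred for actual analysis.

```python
import re

# Simplified TATA-box consensus (an assumption for illustration): TATA[A/T]A[A/T]
TATA_BOX = re.compile(r"TATA[AT]A[AT]")


def gc_content(sequence: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def find_tata_boxes(upstream: str) -> list[int]:
    """0-based start positions of non-overlapping TATA-box matches."""
    return [match.start() for match in TATA_BOX.finditer(upstream.upper())]


# Hypothetical upstream region, for illustration only.
upstream_region = "GCGCGGCTATAAAAGGCGCCGCGGATCC"

print(f"GC content: {gc_content(upstream_region):.2f}")
print("TATA box starts:", find_tata_boxes(upstream_region))
```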
How to identify repeats in a genome
Identifying repeats in a genome is an important task in genomic analysis. Several methods can be used, including:
1. Sequence Alignment: Repeats in a genome can be identified by aligning the genome sequence to itself. This
can be done using tools such as BLAST or the Smith-Waterman algorithm. Repeats appear as regions
with high sequence similarity to other parts of the same genome.
2. K-mer Analysis: K-mers are short DNA sequences of a fixed length k (for genome-scale repeat analysis, k is
often in the range of roughly 11 to 31 nucleotides). By counting the occurrences of each k-mer in the genome,
we can identify regions that are highly repetitive.
3. De Novo Assembly: De novo assembly is the process of reconstructing the genome sequence from a set of
short reads. Repeats can be identified by examining the assembly graph and identifying regions where the
reads form loops or bubbles.
4. RepeatMasker: RepeatMasker is a tool that identifies and masks repetitive elements in genomic sequences. It
compares the genome sequence to a library of known repetitive elements and identifies regions that match.
5. RepeatExplorer: RepeatExplorer is a tool that uses clustering algorithms to identify and classify repetitive
elements in genomic sequences. It can be used to visualize the repeat landscape of a genome and identify
novel repeat families.
These methods can be used individually or in combination to identify and annotate repeats in a genome.
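The k-mer analysis approach can be sketched in Python with a simple counter. This is a minimal sketch on a hypothetical toy sequence with a small k chosen only so the example stays readable. Genome-scale work would use a dedicated k-mer counting tool.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer in the sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def repeated_kmers(sequence: str, k: int, min_count: int = 2) -> dict[str, int]:
    """Return the k-mers that occur at least min_count times."""
    return {
        kmer: count
        for kmer, count in count_kmers(sequence, k).items()
        if count >= min_count
    }


# Hypothetical toy sequence containing the 8-mer ACGTTGCA twice.
toy_genome = "ACGTTGCAACGTTGCATTTT"

print(repeated_kmers(toy_genome, k=8))
```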
ORF Prediction
ORF (Open Reading Frame) prediction is the process of identifying potential
protein-coding regions within a genomic sequence. Here are some common methods for
predicting ORF sequences:
1. Start and Stop Codon Detection: ORFs typically begin with a start codon (ATG in DNA, AUG in
mRNA; less commonly GTG or TTG in prokaryotes) and end with a stop codon (TAA, TAG, or TGA).
One approach to ORF prediction is to scan the genomic sequence in all six reading frames for
potential start and stop codons and identify all ORFs that are flanked by these codons.
2. Codon Usage Bias: ORFs in prokaryotic genomes often show a strong bias towards certain
codons. This bias can be used to predict ORFs by identifying regions of the genome with
codon usage patterns consistent with protein-coding regions.
3. Comparative Genomics: ORF prediction can also be aided by comparing the genome
sequence to related genomes. ORFs that are conserved between species are more likely to be
protein-coding and can be used to guide ORF prediction in the target genome.
4. Machine Learning: Machine learning algorithms, such as Hidden Markov Models (HMMs)
and neural networks, can be trained on known protein-coding regions to predict ORFs in a
genome. These methods can also incorporate information from other genomic features, such
as codon usage and RNA secondary structure.
5. Gene Finding Software: There are many software tools available that use a combination of
the above methods to predict ORFs in a genome, such as Glimmer, GeneMark, and
Augustus.
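Start and stop codon detection, the first method in the list above, can be sketched in Python as a six-frame scan. This is a simplified sketch: it assumes ATG as the only start codon, takes the first ATG after the previous stop in each frame, and reports reverse-strand coordinates relative to the reverse-complemented sequence. The function names and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Reverse complement of a DNA sequence containing only A, C, G and T."""
    return sequence.translate(COMPLEMENT)[::-1]


def orfs_in_strand(sequence: str, min_codons: int = 3):
    """Yield (start, end, orf) for each ATG...stop ORF in the three forward frames.

    start is 0-based and end is exclusive; min_codons includes the stop codon.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame ATG since the previous stop
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    yield start, i + 3, sequence[start:i + 3]
                start = None


def find_orfs(sequence: str, min_codons: int = 3) -> list[tuple[str, int, int, str]]:
    """Scan both strands; '-' coordinates refer to the reverse-complemented sequence."""
    sequence = sequence.upper()
    forward = [("+", s, e, orf) for s, e, orf in orfs_in_strand(sequence, min_codons)]
    reverse = [
        ("-", s, e, orf)
        for s, e, orf in orfs_in_strand(reverse_complement(sequence), min_codons)
    ]
    return forward + reverse


# Hypothetical toy sequence with one short ORF on the forward strand.
toy_sequence = "AAATGGCCTGATT"
print(find_orfs(toy_sequence))
```

A scan like this reports every ORF above the length cutoff, so most hits in a real genome are false positives. That is why gene finders add codon usage models and other evidence on top of it.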
Glimmer (Gene Locator and Interpolated Markov ModelER)
Glimmer is a popular software tool used in bioinformatics for gene prediction in bacterial and archaeal genomes.
Here are the steps for using Glimmer for gene prediction:
1. Input Sequence: The first step is to provide the genomic sequence to Glimmer in FASTA format.
2. Training: An interpolated Markov model is trained on a set of known or likely genes. The training data can be
obtained from long ORFs in the genome itself, from a related genome, or from experimental data such as
RNA-seq or proteomics.
3. Running Glimmer: Glimmer is then run with the trained model to predict genes in the genome. Its output
lists each predicted gene with information such as its start and end coordinates, reading frame, and score.
4. Post-Processing: The predicted genes can be further processed to remove false positives and to annotate the
genes with additional information such as gene ontology (GO) terms and functional annotations. This can be
done using tools such as BLAST and InterProScan.
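A first post-processing pass over gene predictions can be sketched in Python as follows. The sketch assumes a whitespace-separated prediction table with five columns (identifier, start, end, reading frame, score) preceded by a FASTA-style header line, which is how Glimmer3 prediction files are commonly laid out. Check this against the output of your own Glimmer version. The sample text, class name, and score cutoff are hypothetical and are not real Glimmer output.

```python
from dataclasses import dataclass


@dataclass
class PredictedGene:
    gene_id: str
    start: int
    end: int
    frame: int
    score: float

    @property
    def strand(self) -> str:
        # Assumption: a negative frame marks a gene on the reverse strand.
        return "+" if self.frame > 0 else "-"

    @property
    def length(self) -> int:
        # Ignores genes that wrap around the origin of a circular genome.
        return abs(self.end - self.start) + 1


def parse_predictions(text: str) -> list[PredictedGene]:
    """Parse the assumed five-column prediction table, skipping header lines."""
    genes = []
    for line in text.splitlines():
        fields = line.split()
        if line.startswith(">") or len(fields) != 5:
            continue
        gene_id, start, end, frame, score = fields
        genes.append(
            PredictedGene(gene_id, int(start), int(end), int(frame), float(score))
        )
    return genes


# Hypothetical sample in the assumed layout; not real Glimmer output.
SAMPLE_PREDICTIONS = """>example_sequence
orf00001      337     2799  +1     9.39
orf00002     3733     2801  -2    11.84
orf00003     3800     3901  +2     1.20
"""

genes = parse_predictions(SAMPLE_PREDICTIONS)
# Hypothetical cutoff: keep predictions with a score of at least 5.
confident_genes = [gene for gene in genes if gene.score >= 5.0]
for gene in confident_genes:
    print(gene.gene_id, gene.strand, gene.length, gene.score)
```

Filtering by score only removes the weakest predictions. The similarity searches named in step 4 (BLAST, InterProScan) are still needed to annotate what remains.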
CONCLUSION
Identifying genes using bioinformatics tools requires a combination of computational and
experimental techniques, as well as expertise in genomics, molecular biology, and
bioinformatics.
No single method can accurately predict all ORFs, so multiple approaches should be used
together to increase the accuracy of ORF prediction.
Additionally, experimental validation, such as transcriptome sequencing or proteomics, is
necessary to confirm predicted ORFs.
Finally, mastering bioinformatics tools requires regular practice, both to understand how
they work and to interpret their results.
