Genes, Genomes, and Genomics
Bioinformatics in the Classroom
plagiarized from:
http://www.dnalc.org/bioinformatics/presentations/hhmi_2003/2003_3.ppt
June, 2003
2
Two. Again …
Francis Collins, HGP
Craig Venter, Celera Inc.
3
What’s in a chromosome?
4
Hierarchical vs. whole-genome shotgun sequencing
5
The value of genome sequences lies in their annotation
• Annotation – Characterizing genomic features using computational and experimental methods
• Genes: Four levels of annotation
  • Gene Prediction – Where are genes?
  • What do they look like?
  • Domains – What do the proteins do?
  • Role – What pathway(s) are they involved in?
6
How many genes?
Consortium: 35,000 genes?
Celera: 30,000 genes?
Affymetrix: 60,000 human genes on GeneChips?
Incyte and HGS: over 120,000 genes?
GenBank: 49,000 unique gene coding sequences?
UniGene: > 89,000 clusters of unique ESTs?
7
Current consensus (in flux …)
• 15,000 known genes (similarity to previously isolated genes and expressed sequences from a large variety of different organisms)
• 17,000 predicted genes (GenScan, GeneFinder, GRAIL)
• Based on and limited to previous knowledge
8
How do we get from here …
9
to here,
10
What are genes? - 1
• Complete DNA segments responsible for making functional products
• Products
  • Proteins
  • Functional RNA molecules
    • Interfering RNA (RNAi)
    • rRNA (ribosomal RNA)
    • snRNA (small nuclear RNA)
    • snoRNA (small nucleolar RNA)
    • tRNA (transfer RNA)
11
What are genes? - 2
• Definition vs. dynamic concept
• Consider
  • Prokaryotic vs. eukaryotic gene models
  • Introns/exons
  • Posttranscriptional modifications
  • Alternative splicing
  • Differential expression
  • Genes-in-genes
  • Genes-ad-genes
  • Posttranslational modifications
  • Multi-subunit proteins
12
Prokaryotic gene model: ORF-genes
• “Small” genomes, high gene density
  • Haemophilus influenzae genome 85% genic
• Operons
  • One transcript, many genes
• No introns
  • One gene, one protein
• Open reading frames
  • One ORF per gene
  • ORFs begin with a start codon and end with a stop codon (by definition)
TIGR: http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.spl
NCBI: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html
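As a rough illustration of the "start codon to stop codon" definition of an ORF, the following Python sketch scans the three forward reading frames of a DNA string. The function name `find_orfs`, the `min_codons` threshold, and the toy sequence are assumptions made for this example. Real ORF detectors also scan the reverse strand and may accept alternative start codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    start is the 0-based index of the A in ATG; end is the index just past
    the stop codon. min_codons counts the start codon but not the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the ATG that opened the current ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next ATG
    return orfs

# Hypothetical toy sequence: one ORF (ATG AAA TTT GGG TAA) in frame 2.
toy = "CCATGAAATTTGGGTAACC"
print(find_orfs(toy))
```

Because each frame is walked codon by codon, a stop codon in a different frame never interrupts an ORF, which matches the "start, triplets, stop" picture of a prokaryotic gene.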
13
Eukaryotic gene model: spliced genes
• Posttranscriptional modification
  • 5’ cap, polyA tail, splicing
• Open reading frames
  • Mature mRNA contains the ORF
  • All internal exons contain an open “read-through” frame
  • Pre-start and post-stop sequences are UTRs
• Multiple translation products
  • One gene – many proteins via alternative splicing
14
Expansions and Clarifications
• ORFs
  • Start – triplets – stop
  • Prokaryotes: gene = ORF
  • Eukaryotes: spliced genes or ORF genes
• Exons
  • Remain after introns have been removed
  • Flanking parts contain non-coding sequence (5’- and 3’-UTRs)
15
Where do genes live?
• In genomes
• Example: human genome
  • Ca. 3,200,000,000 base pairs
  • 25 chromosomes: 1-22, X, Y, mt
  • 28,000-45,000 genes (current estimate)
  • Gene size: 128 nucleotides (RNA gene) to 2,800 kb (DMD)
  • Ca. 25% of the genome is genes (introns, exons)
  • Ca. 1% of the genome codes for amino acids (CDS)
  • 30 kb gene length (average)
  • 1.4 kb ORF length (average)
  • 3 transcripts per gene (average)
16
Sample genomes
Species          Size (Mb)  Genes   Genes/Mb
H. sapiens       3,200      35,000  11
D. melanogaster  137        13,338  97
C. elegans       85.5       18,266  214
A. thaliana      115        25,800  224
S. cerevisiae    15         6,144   410
E. coli          4.6        4,300   934
List of 68 eukaryotes, 141 bacteria, and 17 archaea at
http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/links2a.html
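The Genes/Mb column is simply the gene count divided by the genome size in megabases. The short Python sketch below recomputes it from the sizes and gene counts in the table. The dictionary `GENOMES` and the helper `genes_per_mb` are names invented for this example.

```python
# Genome size in Mb and gene count, taken from the table above.
GENOMES = {
    "H. sapiens": (3200.0, 35000),
    "D. melanogaster": (137.0, 13338),
    "C. elegans": (85.5, 18266),
    "A. thaliana": (115.0, 25800),
    "S. cerevisiae": (15.0, 6144),
    "E. coli": (4.6, 4300),
}

def genes_per_mb(size_mb, genes):
    """Gene density: number of genes per megabase of genome."""
    return genes / size_mb

for species, (size_mb, genes) in GENOMES.items():
    print(f"{species:16s} {genes_per_mb(size_mb, genes):7.1f} genes/Mb")
```

On these figures the E. coli genome is roughly 85 times as gene-dense as the human genome.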
17
So much DNA – so “few” genes …
[Figure: genic vs. intergenic sequence]
18
Genomic sequence features
• Repeats (“Junk DNA”)
  • Transposable elements, simple repeats
  • RepeatMasker
• Genes
  • Vary in density, length, structure
  • Identification depends on evidence and methods, and may require the concerted application of bioinformatics methods and lab research
• Pseudogenes
  • Look-alikes of genes that obstruct gene-finding efforts
• Non-coding RNAs (ncRNA)
  • tRNA, rRNA, snRNA, snoRNA, miRNA
  • tRNAscan-SE, COVE
19
Gene identification
• Homology-based gene prediction
  • Similarity searches (e.g. BLAST, BLAT)
  • Genome browsers
  • RNA evidence (ESTs)
• Ab initio gene prediction
  • Gene prediction programs
  • Prokaryotes
    • ORF identification
  • Eukaryotes
    • Promoter prediction
    • PolyA-signal prediction
    • Splice site, start/stop-codon predictions
20
Gene prediction through comparative genomics
• Highly similar (conserved) regions between two genomes are likely functional; otherwise they would have diverged
• If genomes are too closely related, all regions are similar, not just genes
• If genomes are too far apart, analogous regions may be too dissimilar to be found
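One simple way to picture "conserved regions" is percent identity in fixed-size windows along a pairwise alignment. The Python sketch below assumes the two sequences are already aligned to equal length, with "-" for gaps. The function names, the window size, the 90% threshold, and the toy alignment are all assumptions for illustration, not values from the slides.

```python
def window_identity(aln_a, aln_b, window=10):
    """Percent identity in each non-overlapping window of a pairwise alignment.

    aln_a and aln_b must be the same length; '-' marks a gap and never
    counts as a match.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for start in range(0, len(aln_a) - window + 1, window):
        pairs = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        scores.append((start, 100.0 * matches / window))
    return scores

def conserved_windows(aln_a, aln_b, window=10, min_identity=90.0):
    """Start positions of windows at or above the identity threshold."""
    return [s for s, pid in window_identity(aln_a, aln_b, window)
            if pid >= min_identity]

# Toy alignment: the first 10 columns are identical, the last 10 mostly differ.
a = "ATGGCCAAGTTTACGGATCA"
b = "ATGGCCAAGTCATCGAAGTC"
print(window_identity(a, b))
print(conserved_windows(a, b))
```

If the two genomes were nearly identical, every window would pass the threshold and the conserved window would no longer stand out. If they were very distant, no window would pass. This mirrors the two caveats above.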
21
Genome Browsers
Generic Genome Browser (CSHL)
www.wormbase.org/db/seq/gbrowse
NCBI Map Viewer
www.ncbi.nlm.nih.gov/mapview/
Ensembl Genome Browser
www.ensembl.org/
Apollo Genome Browser
www.bdgp.org/annot/apollo/
UCSC Genome Browser
genome.ucsc.edu/cgi-bin/hgGateway?org=human
22
Gene discovery using ESTs
• Expressed Sequence Tags (ESTs) represent sequences from expressed genes.
• If a region matches an EST with high stringency, then the region is probably a gene or pseudogene.
• An EST overlapping an exon boundary gives an accurate prediction of that boundary.
23
Ab initio gene prediction
• Prokaryotes
  • ORF detectors
• Eukaryotes
  • Position, extent & direction: through promoter and polyA-signal predictors
  • Structure: through splice site predictors
  • Exact location of coding sequences: through determination of relationships between potential start codons, splice sites, ORFs, and stop codons
24
Tools
• ORF detectors
  • NCBI: http://www.ncbi.nih.gov/gorf/gorf.html
• Promoter predictors
  • CSHL: http://rulai.cshl.org/software/index1.htm
  • BDGP: fruitfly.org/seq_tools/promoter.html
  • ICG: TATA-Box predictor
• PolyA signal predictors
  • CSHL: argon.cshl.org/tabaska/polyadq_form.html
• Splice site predictors
  • BDGP: http://www.fruitfly.org/seq_tools/splice.html
• Start-/stop-codon identifiers
  • DNALC: Translator/ORF-Finder
  • BCM: Searchlauncher
25
How it works I – Motif identification
Exon-Intron Borders = Splice Sites
Exon Intron Exon
~~gaggcatcag|gtttgtagac~~~~~~~~~~~tgtgtttcag|tgcacccact~~
~~ccgccgctga|gtgagccgtg~~~~~~~~~~~tctattctag|gacgcgcggg~~
~~tgtgaattag|gtaagaggtt~~~~~~~~~~~atatctccag|atggagatca~~
~~ccatgaggag|gtgagtgcca~~~~~~~~~~~ttatttccag|gtatgagacg~~
Splice site Splice site
Exon Intron Exon
~~gaggcatcag|GTttgtagac~~~~~~~~~~~tgtgtttcAG|tgcacccact~~
~~ccgccgctga|GTgagccgtg~~~~~~~~~~~tctattctAG|gacgcgcggg~~
~~tgtgaattag|GTaagaggtt~~~~~~~~~~~atatctccAG|atggagatca~~
~~ccatgaggag|GTgagtgcca~~~~~~~~~~~ttatttccAG|gtatgagacg~~
Splice site Splice site
Motif Extraction Programs at http://www-btls.jst.go.jp/
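The GT...AG pattern highlighted in the examples can be checked mechanically. The Python sketch below parses the four example lines from this slide, confirms that each intron begins with GT and ends with AG, and tallies base counts per position at the donor site, which is the raw material for a position weight matrix. The function names and the choice of six donor positions are assumptions for illustration.

```python
from collections import Counter

# Example lines copied from the slide; '|' marks exon-intron borders,
# '~' stands for sequence that is not shown.
EXAMPLES = [
    "~~gaggcatcag|gtttgtagac~~~~~~~~~~~tgtgtttcag|tgcacccact~~",
    "~~ccgccgctga|gtgagccgtg~~~~~~~~~~~tctattctag|gacgcgcggg~~",
    "~~tgtgaattag|gtaagaggtt~~~~~~~~~~~atatctccag|atggagatca~~",
    "~~ccatgaggag|gtgagtgcca~~~~~~~~~~~ttatttccag|gtatgagacg~~",
]

def intron_ends(line):
    """Return (first two, last two) intron bases of one example line."""
    _exon1, intron, _exon2 = line.split("|")
    shown = intron.replace("~", "")  # drop the placeholder for hidden sequence
    return shown[:2].upper(), shown[-2:].upper()

def follows_gt_ag_rule(line):
    """True if the intron starts with GT and ends with AG."""
    return intron_ends(line) == ("GT", "AG")

def position_counts(seqs):
    """Per-column base counts across equal-length sequences."""
    return [Counter(col) for col in zip(*seqs)]

donors = [line.split("|")[1][:6] for line in EXAMPLES]  # first 6 intron bases
print(all(follows_gt_ag_rule(line) for line in EXAMPLES))
print(position_counts(donors))
```

Positions that are invariant across the examples (here the leading G and T) are the ones a motif extraction program would report as the core of the signal.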
26
How it works II - Movies
Pribnow-Box Finder 0/1
Pribnow-Box Finder all
27
How it works III – The (ugly) truth
28
Gene prediction programs
• Rule-based programs
  • Use explicit set of rules to make decisions.
  • Example: GeneFinder
• Neural network-based programs
  • Use data set to build rules.
  • Examples: Grail, GrailEXP
• Hidden Markov Model-based programs
  • Use probabilities of states and transitions between these states to predict features.
  • Examples: Genscan, GenomeScan
29
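Evaluating prediction programs
• Sensitivity vs. Specificity
• Sensitivity
  • How many genes were found out of all present?
  • Sn = TP/(TP+FN)
• Specificity
  • How many predicted genes are indeed genes?
  • Sp = TP/(TP+FP)
  • (TP = true positives, FN = false negatives, FP = false positives; this Sp is what other fields call precision)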
30
Gene prediction accuracies
• Nucleotide level: 95% Sn, 90% Sp (lows less than 50%)
• Exon level: 75% Sn, 68% Sp (lows less than 30%)
• Gene level: 40% Sn, 30% Sp (lows less than 10%)
• Programs that combine statistical evaluations with similarity searches are most powerful.
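Percentages like these come from the definitions Sn = TP/(TP+FN) and Sp = TP/(TP+FP). The Python sketch below applies them at the nucleotide level, treating the annotated and predicted coding positions as sets. The function name `sn_sp` and the toy coordinates are invented for this example.

```python
def sn_sp(true_positions, predicted_positions):
    """Nucleotide-level Sn = TP/(TP+FN) and Sp = TP/(TP+FP)."""
    true_set, pred_set = set(true_positions), set(predicted_positions)
    tp = len(true_set & pred_set)   # coding and predicted coding
    fn = len(true_set - pred_set)   # coding but missed by the prediction
    fp = len(pred_set - true_set)   # predicted coding but not coding
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp

# Toy example: the true exon covers positions 100-199, the prediction 120-219.
truth = range(100, 200)
prediction = range(120, 220)
print(sn_sp(truth, prediction))
```

In the toy case the prediction is shifted by 20 bases, so 80 of the 100 true positions are recovered and 80 of the 100 predicted positions are correct. At the exon or gene level the same shift would count as a complete miss, which is one reason the numbers drop so sharply from nucleotide to gene level.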
31
Common difficulties
• First and last exons are difficult to annotate because they contain UTRs.
• Smaller genes are not statistically significant, so they are thrown out.
• Algorithms are trained with sequences from known genes, which biases them against genes about which nothing is known.
• Masking repeats frequently removes potentially indicative chunks from the untranslated regions of genes that contain repetitive elements.
32
The annotation pipeline
• Mask repeats using RepeatMasker.
• Run the sequence through several gene prediction programs.
• Take the predicted genes and do a similarity search against ESTs and genes from other organisms.
• Do a similarity search for non-coding sequences to find ncRNA.
33
Annotation nomenclature
• Known Gene – Predicted gene matches the entire length of a known gene.
• Putative Gene – Predicted gene contains a region conserved with a known gene. Also referred to as “like” or “similar to”.
• Unknown Gene – Predicted gene matches a gene or EST of which the function is not known.
• Hypothetical Gene – Predicted gene that does not contain significant similarity to any known gene or EST.
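The four labels can be read as a small decision rule over the results of a similarity search. The Python sketch below is one possible reading. The function name, its boolean parameters, and the order in which the checks are applied are assumptions for illustration, since the slide does not say how overlapping cases are resolved.

```python
def annotation_label(has_hit, full_length_match=False, hit_function_known=True):
    """Map similarity-search evidence for a predicted gene to a label.

    has_hit: the prediction has significant similarity to a known gene or EST.
    full_length_match: the match spans the entire length of a known gene.
    hit_function_known: the function of the matched gene or EST is known.
    """
    if not has_hit:
        return "Hypothetical Gene"
    if not hit_function_known:  # assumed to take precedence over match length
        return "Unknown Gene"
    if full_length_match:
        return "Known Gene"
    return "Putative Gene"

print(annotation_label(has_hit=True, full_length_match=True))
print(annotation_label(has_hit=True, full_length_match=False))
print(annotation_label(has_hit=True, hit_function_known=False))
print(annotation_label(has_hit=False))
```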
