FBW
1-12-2015
Wim Van Criekinge
BPC 2015
*** ERGRO ***
1. Longest English word where first three
letters are identical to the last three
2. English word with the longest stretch of letters
that is identical at the beginning and at the end
3. In Dutch ?
4. Any other language
5. Biological relevance ?
Send before the 1st of December to
wim.vancriekinge@gmail.com
Longest one wins, if same size first to submit
Dries Godderis
1. Longest English word where the first 3 letters = the last 3 letters: antipredeterminant
(18)
2. Longest English word with the longest identical stretch = benzeneazobenzene (17)
3. In Dutch, longest word where the first 3 letters = the last 3 letters:
tentoonstellingsprojecten (25)
In Dutch, longest word with the longest identical stretch = dierentuindieren (16)
4. In Portuguese, longest word where the first 3 letters = the last 3 letters:
desconstitucionalizardes (24)
In Portuguese, longest word with the longest identical stretch = reassenhoreasse (15)
(= conjugated form of the verb reassenhorear)
5. Biological relevance: in a hairpin loop (stem-loop) model, the stability and
formation of the structure are determined by the stability of the helix and of the
loop regions formed. An identical stretch at the beginning and end of the sequence will
play an important role in good base-pair formation
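The puzzle answers can be checked mechanically. The sketch below is only an illustration, not the method used by the submitter: it looks for the longest non-overlapping stretch shared by the start and the end of a word. The function name is made up for this example.

```python
def longest_shared_affix(word):
    """Return the longest stretch that both starts and ends the word.

    Only non-overlapping stretches are considered, so the stretch is
    at most half the word length.
    """
    word = word.lower()
    for size in range(len(word) // 2, 0, -1):
        if word[:size] == word[-size:]:
            return word[:size]
    return ""


for candidate in ("antipredeterminant", "benzeneazobenzene", "dierentuindieren"):
    stretch = longest_shared_affix(candidate)
    print(candidate, len(candidate), stretch, len(stretch))
```

Running the same function over a word list and keeping the longest hit per stretch length would answer questions 1 to 4 for any language.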
Exams
• Dates ?
• 1st question
Gene Prediction, HMM & ncRNA
What to do with an unknown
sequence ?
Gene Ontologies
Gene Prediction
Composite Gene Prediction
Non-coding RNA
HMM
UNKNOWN PROTEIN SEQUENCE
LOOK FOR:
• Similar sequences in databases ((PSI)
BLAST)
• Distinctive patterns/domains associated
with function
• Functionally important residues
• Secondary and tertiary structure
• Physical properties (hydrophobicity, isoelectric
point (IEP), etc.)
BASIC INFORMATION COMES FROM SEQUENCE
• One sequence- can get some information eg
amino acid properties
• More than one sequence- get more info on
conserved residues, fold and function
• Multiple alignments of related sequences-
can build up consensus sequences of known
families, domains, motifs or sites.
• Sequence alignments can give information
on loops, families and function from
conserved regions
Additional analysis of protein sequences
• transmembrane regions
• signal sequences
• localisation signals
• targeting sequences
• GPI anchors
• glycosylation sites
• hydrophobicity
• amino acid composition
• molecular weight
• solvent accessibility
• antigenicity
FINDING CONSERVED PATTERNS IN PROTEIN SEQUENCES
• Pattern - short, simplest, but limited
• Motif - conserved element of a sequence
alignment, usually predictive of structural or
functional region
To get more information across whole
alignment:
• Profile
• HMM
PATTERNS
• Small, highly conserved regions
• Shown as regular expressions
Example:
[AG]-x-V-x(2)-x-{YW}
– [] shows either amino acid
– X is any amino acid
– X(2) any amino acid in the next 2 positions
– {} shows any amino acid except these
BUT- limited to near exact match in small
region
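The pattern notation above maps almost one-to-one onto regular expressions. The following sketch, which assumes only the elements listed on this slide (it ignores other PROSITE features such as anchors), converts such a pattern into a Python regex; `prosite_to_regex` is a hypothetical helper name.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        # Split an element such as "x(2)" into its core "x" and repeat "2".
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core.lower() == "x":
            core = "."                        # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any amino acid except these
        # "[AG]" and single letters are already valid regex syntax.
        if repeat:
            core += "{" + repeat + "}"
        parts.append(core)
    return "".join(parts)


regex = prosite_to_regex("[AG]-x-V-x(2)-x-{YW}")
print(regex)
print(bool(re.search(regex, "MKGSVTTAKLL")))
```

As the slide notes, such a pattern only matches near-exact occurrences in a small region, which is the motivation for profiles and HMMs.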
PROFILES
• Table or matrix containing comparison
information for aligned sequences
• Used to find sequences similar to
alignment rather than one sequence
• Contains same number of rows as
positions in sequences
• Row contains score for alignment of
position with each residue
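A profile of this kind can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: an ungapped alignment, a uniform background distribution, and a pseudocount of 1, none of which come from the slides. Each row holds a log-odds score for every residue at that position.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one row per alignment position: residue -> log2-odds score."""
    background = 1.0 / len(AMINO_ACIDS)           # assumed uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for position in range(len(alignment[0])):
        counts = Counter(seq[position] for seq in alignment)
        row = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        profile.append(row)
    return profile


def score_segment(profile, segment):
    """Sum the position-specific scores of a segment as long as the profile."""
    return sum(row[aa] for row, aa in zip(profile, segment))


profile = build_profile(["ACD", "ACE", "ACD"])
print(score_segment(profile, "ACD"), score_segment(profile, "WWW"))
```

Sliding `score_segment` along a database sequence finds regions that resemble the whole alignment rather than a single sequence.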
HIDDEN MARKOV MODELS (HMM)
• An HMM is a large-scale profile with gaps,
insertions and deletions allowed in the
alignments, and built around probabilities
• Package used: HMMER (http://hmmer.wusd.edu/)
• Start with one sequence or an alignment and build the model
with hmmbuild, then calibrate it with hmmcalibrate and search
the database with the HMM
• E-value: number of false matches expected with
a certain score
• Assume an extreme value distribution (EVD) for noise;
calibrate by searching random sequences with the HMM
to build up the noise curve
HMM
Sequence
Gene Prediction, HMM & ncRNA
What to do with an unknown
sequence ?
Gene Ontologies
Gene Prediction
HMM
Composite Gene Prediction
Non-coding RNA
What is an ontology?
• An ontology is an explicit
specification of a conceptualization.
• A conceptualization is an abstract,
simplified view of the world that we
want to represent.
• If the specification medium is a
formal representation, the ontology
defines the vocabulary.
Why Create Ontologies?
• to enable data exchange among
programs
• to simplify unification (or translation)
of disparate representations
• to employ knowledge-based services
• to embody the representation of a
theory
• to facilitate communication among
people
Summary
• Ontologies are what they do:
artifacts to help people and their
programs communicate, coordinate,
collaborate.
• Ontologies are essential elements in
the technological infrastructure of
the Knowledge Age
The Three Ontologies
• http://www.geneontology.org/
• Molecular Function: elemental activity or task
(nuclease, DNA binding, transcription factor)
• Biological Process: broad objective or goal
(mitosis, signal transduction, metabolism)
• Cellular Component: location or complex
(nucleus, ribosome, origin recognition complex)
DAG Structure
Directed acyclic graph: each child
may have one or more parents
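Because a term may have several parents, collecting all ancestors of a term is a graph traversal rather than a walk up a tree. The toy graph below is invented for illustration and is not the real GO graph; only the term names nuclease-style examples from the slide inspired it.

```python
# Hypothetical child -> parents mapping (illustrative, not actual GO content).
PARENTS = {
    "transcription factor": ["DNA binding", "transcription regulator"],
    "DNA binding": ["nucleic acid binding"],
    "nucleic acid binding": ["binding"],
    "binding": ["molecular function"],
    "transcription regulator": ["molecular function"],
    "molecular function": [],
}


def ancestors(term, parents=PARENTS):
    """Collect every term reachable by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("transcription factor")))
```

A gene annotated with a term is implicitly annotated with all of that term's ancestors, which matters when counting genes per term.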
Example - Molecular Function
Example - Biological Process
Example - Cellular Location
AmiGO browser
GO: Applications
• Eg. chip-data analysis: Overrepresented item
can provide functional clues
• Overrepresentation check: contingency table
– Chi-square test (or Fisher's exact test if a frequency is < 5)
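The overrepresentation check can be sketched as a one-sided Fisher's exact test computed from the hypergeometric distribution. The argument names and the numbers in the usage line are assumptions for illustration only.

```python
from math import comb


def enrichment_p_value(hits, selected, annotated, population):
    """One-sided Fisher's exact test.

    Probability of seeing at least `hits` annotated genes when `selected`
    genes are drawn from `population` genes, of which `annotated` carry
    the GO term of interest.
    """
    total = comb(population, selected)
    upper = min(selected, annotated)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, selected - i)
        for i in range(hits, upper + 1)
    )
    return favourable / total


# Hypothetical chip experiment: 5 of 5 selected genes carry a term that
# annotates 10 of the 100 genes on the chip.
print(enrichment_p_value(5, 5, 10, 100))
```

When many GO terms are tested at once, the resulting p-values still need a multiple-testing correction.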
Gene Prediction, HMM & ncRNA
What to do with an unknown
sequence ?
Web applications
Gene Ontologies
Gene Prediction
HMM
Composite Gene Prediction
Non-coding RNA
Problem:
Given a very long DNA sequence, identify coding
regions (including intron splice sites) and their
predicted protein sequences
Computational Gene Finding
Eukaryotic gene structure
Computational Gene Finding
• There is no (yet known) perfect method
for finding genes. All approaches rely on
combining various “weak signals”
together
• Find elements of a gene
– coding sequences (exons)
– promoters and start signals
– poly-A tails and downstream signals
• Assemble into a consistent gene model
Computational Gene Finding
Genefinder
GENE STRUCTURE INFORMATION - POSITION ON PHYSICAL MAP
This gene structure corresponds to the position on the physical map
GENE STRUCTURE INFORMATION - ACTIVE ZONE
This gene structure shows the Active Zone
The Active Zone limits the extent of
analysis, genefinder & fasta dumps
A blue line within the yellow box
indicates regions outside of the active
zone
The active zone is set by entering
coordinates in the active zone (yellow
box)
GENE STRUCTURE INFORMATION - POSITION
This gene structure relates to the Position:
Change origin of
this scale by
entering a
number in the
green 'origin'
box
GENE STRUCTURE INFORMATION - PREDICTED GENE STRUCTURE
This gene structure relates to the predicted gene structures
Boxes are Exons,
thin lines (or
springs) are Introns
Find the open reading frames
GAAAAAGCTCCTGCCCAATCTGAAATGGTTAGCCTATCTTTCCACCGT
Any sequence has 3 potential reading frames (+1, +2, +3)
Its complement also has three potential reading frames (-1, -2, -3)
6 possible reading frames
The triplet, non-punctuated nature of the genetic code helps us out
64 potential codons
61 true codons
3 stop codons (TGA, TAA, TAG)
With a random base distribution, approximately 1 in 21 codons (3/64) will be a stop
E K A P A Q S E M V S L S F H R
K K L L P N L K W L A Y L S T
K S S C P I * N G * P I F P P
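The three forward-frame translations shown above can be reproduced with a short script. This is a minimal sketch using the standard genetic code; the helper name is made up, and the reverse strand (frames -1 to -3) is left out for brevity.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_forward_frames(dna):
    """Translate reading frames +1, +2 and +3; '*' marks a stop codon."""
    dna = dna.upper()
    frames = []
    for offset in range(3):
        codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
        frames.append("".join(CODON_TABLE[codon] for codon in codons))
    return frames


sequence = "GAAAAAGCTCCTGCCCAATCTGAAATGGTTAGCCTATCTTTCCACCGT"
for frame in translate_forward_frames(sequence):
    print(frame)
```

Frame +3 contains two stops within 15 codons, while frames +1 and +2 stay open, which is the kind of signal an ORF scan exploits.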
GENE STRUCTURE INFORMATION - OPEN READING FRAMES
This gene structure relates to Open reading Frames
There is one column
for each frame
Small horizontal
lines represent stop
codons
They have one
column for each
frame
The size indicates
relative score for the
particular start site
GENE STRUCTURE INFORMATION - START CODONS
This gene structure represents Start Codons
• Amino acid distributions are biased
e.g. p(A) > p(C)
• Pairwise distributions also biased
e.g. p(AT)/[p(A)*p(T)] > p(AC)/[p(A)*p(C)]
• Nucleotides that code for preferred amino
acids (and AA pairs) occur more frequently in
coding regions than in non-coding regions.
• Codon biases (per amino acid)
• Hexanucleotide distributions that reflect those
biases indicate coding regions.
Computational Gene Finding: Hexanucleotide frequencies
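One way to turn hexanucleotide bias into a score is a log-odds ratio between frequencies learned from coding and from non-coding training sequences. The sketch below assumes a pseudocount of 1 and toy training data; it is not the scoring scheme of any particular gene finder.

```python
import math
from collections import Counter


def kmer_frequency_model(sequences, k=6, pseudocount=1.0):
    """Return a function giving the smoothed frequency of any k-mer."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values()) + pseudocount * 4 ** k
    return lambda kmer: (counts[kmer] + pseudocount) / total


def coding_log_odds(seq, coding_model, noncoding_model, k=6):
    """Positive scores mean the window looks more coding than non-coding."""
    return sum(
        math.log(coding_model(seq[i:i + k]) / noncoding_model(seq[i:i + k]))
        for i in range(len(seq) - k + 1)
    )


# Toy training sets, invented for the example.
coding_model = kmer_frequency_model(["ATGGCCGCCGCCGCC" * 3])
noncoding_model = kmer_frequency_model(["ATATATATATATATAT" * 3])
print(coding_log_odds("GCCGCCGCC", coding_model, noncoding_model))
```

In practice the score is computed in a sliding window along the genomic sequence and, for coding models, separately per reading frame.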
Gene prediction
Generation of datasets (Ensmart@Ensembl):
Dataset 1 (http://biobix.ugent.be/txt/coding.txt) consists of >900
coding regions (DNA):
Dataset 2 (http://biobix.ugent.be/txt/noncoding.txt) consists of
>900 non-coding regions
Distance Array: Calculate for every base all the distances (in
bp) to the same nucleotide (focus on the first 1000 bp of the
coding region and limit the distance array to a window of
1000 bp)
Do you see a difference in this “distance array” between coding
and noncoding sequence ?
Could it be used to predict genes ?
Write a program to predict genes in the following genomic
sequence (http://biobix.ugent.be/txt/genomic.txt)
What else could help in finding genes in raw genomic
sequences ?
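As a starting point for the exercise, the distance array can be sketched as a histogram of distances between identical nucleotides. This is one possible reading of the assignment, with made-up function and parameter names; reading the files from the URLs above is left out.

```python
from collections import Counter


def distance_array(seq, region=1000, max_distance=1000):
    """Histogram of distances (in bp) between identical nucleotides.

    Only the first `region` bases are used and only distances up to
    `max_distance` are counted.
    """
    seq = seq[:region].upper()
    positions = {}
    for index, base in enumerate(seq):
        positions.setdefault(base, []).append(index)

    histogram = Counter()
    for indices in positions.values():
        for a in range(len(indices)):
            for b in range(a + 1, len(indices)):
                distance = indices[b] - indices[a]
                if distance > max_distance:
                    break  # indices are sorted, later ones are farther away
                histogram[distance] += 1
    return histogram


histogram = distance_array("ACGACGACG")
print(histogram[3], histogram[6], histogram[1])
```

Summing the histograms over all sequences of each dataset and comparing the counts at multiples of 3 with their neighbours is one way to look for a difference between coding and non-coding DNA.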
GENE STRUCTURE INFORMATION - CODING POTENTIAL
This gene structure corresponds to the Coding Potential
The grey boxes indicate
regions where the codon
frequencies match those of
known C. elegans genes.
the larger the grey box the
more this region resembles a
C. elegans coding element
blastn (EST)
For raw DNA sequence analysis blastx is
extremely useful
Will probe your DNA sequence against the protein database
A match (homolog) gives you some ideas regarding function
One problem is the large number of genome-derived sequences in the databases:
you will get matches to entries that were themselves annotated strictly by
sequence homology, so experimental evidence is often still needed
GENE STRUCTURE INFORMATION - SEQUENCE SIMILARITY
This feature shows protein sequence similarity
The blue boxes indicate
regions of sequence which
when translated have
similarity to previously
characterised proteins.
To view the alignment,
select the right mouse
button whilst over the blue
box.
GENE STRUCTURE INFORMATION - EST MATCHES
This gene structure relates to EST Matches
The yellow boxes represent
DNA matches (Blast) to C.
elegans Expressed Sequence
Tags (ESTs)
To view the alignment use the
right mouse button whilst
over the yellow box to invoke
Blixem
Borodovsky et al., 1999, Organization of the Prokaryotic Genome (Charlebois, ed) pp. 11-34
New generation of programs to predict gene coding
sequences based on a non-random repeat pattern
(eg. Glimmer, GeneMark) – actually pretty good
• CpG islands are regions of sequence that
have a high proportion of CG dinucleotide
pairs (p is the phosphodiester bond linking
them)
– CpG islands are present in the promoter and
exonic regions of approximately 40% of
mammalian genes
– Other regions of the mammalian genome contain
few CpG dinucleotides and these are largely
methylated
• Definition: sequences of >500 bp with
– G+C > 55%
– Observed(CpG)/Expected(CpG) > 0.65
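The definition above translates directly into a check on a sequence window. The sketch below uses the thresholds from the slide and the usual expected count, (number of C × number of G) / length; the function names are invented for the example.

```python
def cpg_statistics(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    length = len(seq)
    if length == 0:
        return 0.0, 0.0
    c_count, g_count = seq.count("C"), seq.count("G")
    observed = seq.count("CG")
    expected = c_count * g_count / length
    ratio = observed / expected if expected else 0.0
    return (c_count + g_count) / length, ratio


def looks_like_cpg_island(seq, min_length=500, min_gc=0.55, min_ratio=0.65):
    """Apply the length, G+C and observed/expected thresholds."""
    gc_fraction, ratio = cpg_statistics(seq)
    return len(seq) > min_length and gc_fraction > min_gc and ratio > min_ratio


print(looks_like_cpg_island("CG" * 300), looks_like_cpg_island("AT" * 300))
```

Scanning a genome would apply the same test to overlapping windows and merge adjacent windows that pass.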
Computational Gene Finding
GENE STRUCTURE INFORMATION - REPEAT FAMILIES
This gene structure corresponds to Repeat Families
This column shows
matches to members of a
number of repeat families
Currently a hidden markov
model is used to detect
these
GENE STRUCTURE INFORMATION - REPEATS
This gene structure relates to Repeats
This column shows regions
of localised repeats both
tandem and inverted
Clicking on the boxes will
show the complete repeat
information in the blue line
at the top end of the screen
Exon/intron boundaries
• Most Eukaryotic introns have a
consensus splice signal: GU at the
beginning (“donor”), AG at the end
(“acceptor”).
• Variation does occur in the splice sites
• Many AGs and GTs are not splice sites.
• Database of experimentally validated
human splice sites:
http://www.ebi.ac.uk/~thanaraj/splice.html
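A quick scan shows why the GT/AG consensus alone is a weak signal. The sketch below simply lists every GT and AG position in a DNA string (GU appears as GT at the DNA level); the function name and the test sequence are invented for illustration.

```python
def candidate_splice_sites(dna):
    """Return 0-based positions of all GT (donor) and AG (acceptor) dinucleotides."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return donors, acceptors


donors, acceptors = candidate_splice_sites("AAGGTAAGTCCAGG")
print(donors, acceptors)
```

Even this 14-base string yields two candidate donors and three candidate acceptors, so real gene finders score the surrounding sequence context rather than the dinucleotide alone.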
Computational Gene Finding: Splice junctions
GENE STRUCTURE INFORMATION - PUTATIVE SPLICE SITES
This gene structure shows putative splice sites
The Splice Sites are shown
'Hooked'
The Hook points in the
direction of splicing, therefore
3' splice sites point up and 5'
Splice sites point down
The colour of the Splice Site
indicates the position at which
it interrupts the Codon
The height of the Splices is
proportional to the Genefinder
score of the Splice Site
Gene Prediction, HMM & ncRNA
What to do with an unknown
sequence ?
Web applications
Gene Ontologies
Gene Prediction
HMM
Composite Gene Prediction
Non-coding RNA
• Recall that profiles are matrices that
identify the probability of seeing an
amino acid at a particular location in a
motif.
• What about motifs that allow insertions
or deletions (together, called indels)?
• Patterns and regular expressions can
handle these easily, but profiles are
more flexible.
• Can indels be integrated into profiles?
Towards profiles (PSSM) with indels – insertions and/or deletions
• Need a representation that allows
specification of the probability of
introducing (and/or extending) a gap in
the profile.
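[Figure: three profile positions with residue probabilities, separated by gap states carrying "delete" and "continue" transitions]
Position 1: A 0.1, C 0.05, D 0.2, E 0.08, F 0.01
Position 2: A 0.04, C 0.1, D 0.01, E 0.2, F 0.02
Position 3: A 0.2, C 0.01, D 0.05, E 0.1, F 0.06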
Hidden Markov Models: Graphical models of sequences
• A sequence is said to be Markovian if the
probability of the occurrence of an element in
a particular position depends only on the
previous elements in the sequence.
• Order of a Markov chain depends on how
many previous elements influence probability
– 0th order: probability at each position is independent of the previous positions
– 1st order: probability depends only on immediately
previous position.
• 1st order Markov chains are good for proteins.
Hidden Markov Chain
Markov Chain for DNA
Markov chain with begin and end
• Consists of states (boxes) and transitions
(arcs) labeled with probabilities
• States have probability(s) of “emitting” an
element of a sequence (or nothing).
• Arcs have probability of moving from one
state to another.
– Sum of probabilities of all out arcs must be 1
– Self-loops (e.g. gap extend) are OK.
Markov Models: Graphical models of sequences
• Simplest example: Each state emits (or,
equivalently, recognizes) a particular
element with probability 1, and each
transition is equally likely.
Example sequences: 1234 234 14 121214 2123334
[Figure: state diagram with the states Begin, Emit 1, Emit 2, Emit 3, Emit 4 and End]
Markov Models
• Now, add probabilities to each transition (let
emission remain a single element)
• We can calculate the probability of any sequence given this
model by multiplying
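[Figure: state diagram with the states Begin, Emit 1, Emit 2, Emit 3, Emit 4 and End; the arcs carry the transition probabilities 0.5, 0.5, 0.25, 0.75, 0.9, 0.1, 0.2, 0.8 and 1.0]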
p(1234) = 0.5 * 0.1 * 0.75 * 0.8 = 0.03
p(14) = 0.5 * 0.9 = 0.45
p(2334)= 0.5 * 0.75 * 0.2 * 0.8 = 0.06
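The three products above can be reproduced by walking the chain and multiplying the arc probabilities. The assignment of probabilities to arcs below is an assumption: it was inferred from the worked products and from the example sequences (for instance, 2 to 1 at 0.25 is implied by "121214"), because the diagram itself did not survive extraction.

```python
# Assumed arc probabilities, inferred from the worked examples on the slide.
TRANSITIONS = {
    "Begin": {"1": 0.5, "2": 0.5},
    "1": {"2": 0.1, "4": 0.9},
    "2": {"1": 0.25, "3": 0.75},
    "3": {"3": 0.2, "4": 0.8},
    "4": {"End": 1.0},
}


def sequence_probability(symbols):
    """Multiply transition probabilities along Begin -> symbols -> End."""
    path = ["Begin", *symbols, "End"]
    probability = 1.0
    for source, target in zip(path, path[1:]):
        probability *= TRANSITIONS.get(source, {}).get(target, 0.0)
    return probability


for example in ("1234", "14", "2334"):
    print(example, sequence_probability(example))
```

Since every state emits exactly one symbol, each sequence corresponds to exactly one path, so no summation over paths is needed here.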
Hidden Markov Models: Probabilistic Markov Models
• If we let the states define a set of emission
probabilities for elements, we can no longer be
sure which state we are in given a particular
element of a sequence
BCCD or BCCD ?
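[Figure: state diagram with transition probabilities 0.5, 0.5, 0.25, 0.75, 0.9, 0.1, 0.2, 0.8 and 1.0; between Begin and End, the four states emit A (0.8) or B (0.2), B (0.7) or C (0.3), C (0.1) or D (0.9), and C (0.6) or A (0.4)]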
Hidden Markov Models: Probabilistic Emission
• Emission uncertainty means the sequence doesn't
identify a unique path. The states are “hidden”
• Probability of a sequence is sum of all paths that can
produce it:
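[Figure: state diagram with transition probabilities 0.5, 0.5, 0.25, 0.75, 0.9, 0.1, 0.2, 0.8 and 1.0; between Begin and End, the four states emit A (0.8) or B (0.2), B (0.7) or C (0.3), C (0.1) or D (0.9), and C (0.6) or A (0.4)]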
p(bccd) = 0.5 * 0.2 * 0.1 * 0.3 * 0.75 * 0.6 * 0.8 * 0.9
+ 0.5 * 0.7 * 0.75 * 0.6 * 0.2 * 0.6 * 0.8 * 0.9
= 0.000972 + 0.013608 = 0.01458
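Summing over all paths by hand does not scale, so the sum is normally computed with the forward algorithm. The sketch below reproduces p(bccd) for this small model. The state names S1 to S4 and the wiring of the arcs are assumptions reconstructed from the two products above, since the diagram itself is not available in the text.

```python
# Assumed topology: same arcs as the Markov chain example, states renamed.
TRANSITIONS = {
    "Begin": {"S1": 0.5, "S2": 0.5},
    "S1": {"S2": 0.1, "S4": 0.9},
    "S2": {"S1": 0.25, "S3": 0.75},
    "S3": {"S3": 0.2, "S4": 0.8},
    "S4": {"End": 1.0},
}
EMISSIONS = {
    "S1": {"A": 0.8, "B": 0.2},
    "S2": {"B": 0.7, "C": 0.3},
    "S3": {"C": 0.6, "A": 0.4},
    "S4": {"C": 0.1, "D": 0.9},
}


def forward_probability(sequence):
    """Probability of the sequence summed over every path from Begin to End."""
    # forward[state] = probability of emitting the prefix and ending in state
    forward = {"Begin": 1.0}
    for symbol in sequence.upper():
        following = {}
        for state, probability in forward.items():
            for target, transition in TRANSITIONS.get(state, {}).items():
                emission = EMISSIONS.get(target, {}).get(symbol, 0.0)
                if emission:
                    following[target] = (
                        following.get(target, 0.0) + probability * transition * emission
                    )
        forward = following
    return sum(
        probability * TRANSITIONS.get(state, {}).get("End", 0.0)
        for state, probability in forward.items()
    )


print(forward_probability("bccd"))
```

The work grows linearly with sequence length instead of exponentially with the number of paths.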
Hidden Markov Models
Hidden Markov Models: The occasionally dishonest casino
• The HMM must first be “trained” using a training set
– Eg. database of known genes.
– Consensus sequences for all signal sensors are needed.
– Compositional rules (i.e., emission probabilities) and
length distributions are necessary for content sensors.
• Transition probabilities between all connected
states must be estimated.
• Estimate the probability of sequence s, given model
m, P(s|m)
– Multiply probabilities along most likely path
(or add logs – less numeric error)
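Finding the most likely path while adding logs is what the Viterbi algorithm does. The sketch below applies it to an occasionally dishonest casino. All parameter values are assumptions (commonly used textbook-style numbers), since the slides do not give them: a fair die, a loaded die that shows a six half of the time, and rare switches between the two.

```python
import math

STATES = ("F", "L")  # fair die, loaded die
START = {"F": 0.5, "L": 0.5}
TRANSITIONS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMISSIONS = {
    "F": {roll: 1 / 6 for roll in "123456"},
    "L": {**{roll: 0.1 for roll in "12345"}, "6": 0.5},
}


def viterbi(rolls):
    """Return the most likely state path for a non-empty string of rolls."""
    scores = {s: math.log(START[s]) + math.log(EMISSIONS[s][rolls[0]]) for s in STATES}
    backpointers = []
    for roll in rolls[1:]:
        new_scores, pointers = {}, {}
        for state in STATES:
            # Best predecessor for this state, compared in log space.
            previous = max(
                STATES, key=lambda p: scores[p] + math.log(TRANSITIONS[p][state])
            )
            new_scores[state] = (
                scores[previous]
                + math.log(TRANSITIONS[previous][state])
                + math.log(EMISSIONS[state][roll])
            )
            pointers[state] = previous
        backpointers.append(pointers)
        scores = new_scores

    # Trace back from the best final state.
    state = max(STATES, key=scores.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("6" * 30))
print(viterbi("12345" * 6))
```

Working with sums of logs avoids the numerical underflow that products of many small probabilities would cause on long sequences.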
Use of Hidden Markov Models
• HMMs are effectively profiles with gaps, and
have applications throughout Bioinformatics
• Protein sequence applications:
– MSAs and identifying distant homologs
E.g. Pfam uses HMMs to define its MSAs
– Domain definitions
– Used for fold recognition in protein structure
prediction
• Nucleotide sequence applications:
– Models of exons, genes, etc. for gene
recognition.
Applications of Hidden Markov Models
• UC Santa Cruz (David Haussler group)
– SAM-02 server. Returns alignments, secondary
structure predictions, HMM parameters, etc. etc.
– SAM HMM building program
(requires free academic license)
• Washington U. St. Louis (Sean Eddy group)
– Pfam. Large database of precomputed HMM-based
alignments of proteins
– HMMer, program for building HMMs
• Gene finders and other HMMs (more later)
Hidden Markov Models Resources
Example TMHMM
Beyond Kyte-Doolittle …
HMM in protein analysis
• http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html
Hidden Markov model for gene structure
• A representation of the linguistic rules for what features might follow
what other features when parsing a sequence consisting of a multiple
exon gene.
• A candidate gene structure is created by tracing a path from B to F.
• A hidden Markov model (or hidden semi-Markov model) is defined by
attaching stochastic models to each of the arcs and nodes.
Signals (blue nodes):
• begin sequence (B)
• start translation (S)
• donor splice site (D)
• acceptor splice site (A)
• stop translation (T)
• end sequence (F)
Contents (red arcs):
• 5’ UTR (J5’)
• initial exon (EI)
• exon (E)
• intron (I)
• final exon (EF)
• single exon (ES)
• 3’ UTR (J3’)
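The "linguistic rules" can be written down as a table of allowed arcs between signals. The arc list below is an assumption based on ordinary eukaryotic gene structure, since the figure with the actual arcs is not part of the text; a candidate gene structure is then a path from B to F whose every step is an allowed arc.

```python
# Assumed arcs: (signal, next signal) -> content type between them.
ARCS = {
    ("B", "S"): "5' UTR",
    ("S", "D"): "initial exon",
    ("S", "T"): "single exon",
    ("D", "A"): "intron",
    ("A", "D"): "exon",
    ("A", "T"): "final exon",
    ("T", "F"): "3' UTR",
}


def parse_gene_structure(signals):
    """Return the content types along a path of signals from B to F."""
    if not signals or signals[0] != "B" or signals[-1] != "F":
        raise ValueError("a gene structure must run from B to F")
    contents = []
    for step in zip(signals, signals[1:]):
        if step not in ARCS:
            raise ValueError(f"signal {step[1]} cannot follow {step[0]}")
        contents.append(ARCS[step])
    return contents


print(parse_gene_structure("BSDADATF"))
print(parse_gene_structure("BSTF"))
```

Attaching a stochastic model to each signal and each content type turns this grammar into the hidden (semi-)Markov model described above.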
Classic Programs for gene finding
Some of the best programs are HMM based:
• GenScan – http://genes.mit.edu/GENSCAN.html
• GeneMark – http://opal.biology.gatech.edu/GeneMark/
Other programs
• AAT, EcoParse, Fexeh, Fgeneh, Fgenes, Finex, GeneHacker, GeneID-3,
GeneParser 2, GeneScope, Genie, GenLang, Glimmer, GlimmerM, Grail
II, HMMgene, Morgan, MZEF, Procrustes, SORFind, Veil, Xpound
GENSCAN
not to be confused with GeneScan, a commercial product
• A Semi-Markov Model
– Explicit model of how long
to stay in a state (rather
than just self-loops, which
must be exponentially
decaying)
• Tracks “phase” of exon or
intron (0 coincides with codon
boundary, or 1 or 2)
• Tracks strand (and direction)
Hidden Markov Models: Gene Finding Software
Conservation of Gene Features
Conservation pattern across 3165 mappings of human
RefSeq mRNAs to the genome. A program sampled 200
evenly spaced bases across 500 bases upstream of
transcription, the 5’ UTR, the first coding exon, introns,
middle coding exons, introns, the 3’ UTR and 500 bases
after polyadenylation. There are peaks of conservation at the
transition from one region to another.
[Figure: conservation plot; y-axis: aligning identity, 50% to 100%]
Composite Approaches
• Use EST info to constrain HMMs (Genie)
• Use protein homology info on top of HMMs
(fgenesh++, GenomeScan)
• Use cross species genomic alignments on top
of HMMs (twinscan, fgenesh2, SLAM, SGP)
Gene Prediction: more complex …
1. Species specific
2. Splicing enhancers found in coding regions
3. Trans-splicing
4. …
Length preference
5’ ss intcomp branch 3’ ss
Contents-Schedule
RNA genes
Besides the 6000 protein-coding genes, there are:
140 ribosomal RNA genes
275 transfer RNA genes
40 small nuclear RNA genes
>100 small nucleolar genes
?
pRNA in 29 rotary packaging motor (Simpson
et el. Nature 408:745-750,2000)
Cartilage-hair hypoplasmia mapped to an RNA
(Ridanpoa et al. Cell 104:195-203,2001)
The human Prader-Willi ciritical region (Cavaille
et al. PNAS 97:14035-7, 2000)
RNA genes can be hard to detects
UGAGGUAGUAGGUUGUAUAGU
C.elegans let-27; 21 nt
(Pasquinelli et al. Nature 408:86-89,2000)
Often small
Sometimes multicopy and redundant
Often not polyadenylated
(not represented in ESTs)
Immune to frameshift and nonsense mutations
No open reading frame, no codon bias
Often evolving rapidly in primary sequence
miRNA genes
• Lin-4 was identified in a screen for mutations that affect the timing and
sequence of postembryonic development in C. elegans. Mutants reiterate
L1 instead of later stages of development
• Gene positionally cloned by isolating a 693-bp DNA fragment that
can rescue the phenotype of mutant animals
• No protein found but 61-nucleotide precursor RNA with stem-loop
structure which is processed to 22-mer ncRNA
• Genetically lin-4 acts as negative regulator of lin-14 and lin-28
• The 3’ UTR of the target genes have short stretches of
complementarity to lin-4
• Deletion of these lin-4 target sequences causes an unregulated gain-of-function (gof) phenotype
• Lin-4 RNA inhibits accumulation of LIN-14 and LIN-28 proteins
although the target mRNA
Lin-4
Let-7 (lethal-7) was also mapped to a ncRNA gene with a 21-
nucleotide product
The small let-7 RNA is also thought to be a post-transcriptional
negative regulator for lin-41 and lin-42
100% conserved in all bilaterally symmetrical animals (not
jellyfish and sponges)
Sometimes called stRNAs, small temporal RNAs
Let-7
(Pasquinelli et al. Nature 408:86-89,2000)
Two computational analysis problems
• Similarity search (eg BLAST), I give you a query,
you find sequences in a database that look like the
query (note: SW/Blat)
– For RNA, you want to take the secondary structure of
the query into account
• Genefinding. Based solely on a priori knowledge
of what a “gene” looks like, find genes in a
genome sequence
– For RNA, with no open reading frame and no codon
bias, what do you look for ?
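Context-free grammars
Basic CFG “production rules” (written here for a and the a-u pair; the derivation
uses the same forms for the other bases and base pairs, e.g. S -> gSc)
S -> aS
S -> Sa
S -> aSu
S -> SS
A CFG “derivation”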
S -> aS
S -> aaS
S -> aaSS
S -> aagScuS
S -> aagaSucugSc
S -> aagaSaucuggScc
S -> aagacSgaucuggcgSccc
S -> aagacuSgaucuggcgSccc
S -> aagacuuSgaucuggcgaSccc
S -> aagacuucSgaucuggcgacSccc
S -> aagacuucgSgaucuggcgacaSccc
S -> aagacuucggaucuggcgacaccc
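The four rule shapes (unpaired base on the left, unpaired base on the right, a base pair around S, and two S side by side) are the same four cases used by Nussinov-style base-pair maximisation. The sketch below is that related dynamic programme, not a CFG parser from the slides; it counts Watson-Crick pairs only and, as a simplification, places no minimum on hairpin loop length.

```python
from functools import lru_cache

WATSON_CRICK = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g")}


def max_base_pairs(rna):
    """Maximum number of nested Watson-Crick pairs in the sequence."""
    rna = rna.lower()

    @lru_cache(maxsize=None)
    def best(i, j):
        if i >= j:
            return 0
        # S -> aS and S -> Sa: leave one end unpaired.
        score = max(best(i + 1, j), best(i, j - 1))
        # S -> aSu: pair the two ends if they are complementary.
        if (rna[i], rna[j]) in WATSON_CRICK:
            score = max(score, best(i + 1, j - 1) + 1)
        # S -> SS: split into two independent substructures.
        for k in range(i + 1, j - 1):
            score = max(score, best(i, k) + best(k + 1, j))
        return score

    return best(0, len(rna) - 1)


print(max_base_pairs("aagacuucggaucuggcgacaccc"))
```

Replacing the pair count with probabilities on the production rules gives a stochastic CFG, which plays the same role for RNA structure that an HMM plays for linear sequence.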
[Figure: secondary structure corresponding to the derived sequence aagacuucggaucuggcgacaccc]
The power of comparative analysis
• Comparative genome analysis is an indispensable means of
inferring whether a locus produces a ncRNA as opposed to
encoding a protein.
• For a small gene to be called a protein-coding gene, one
excellent line of evidence is that the ORF is significantly
conserved in another related species.
• It is more difficult to positively corroborate a ncRNA by
comparative analysis but, in at least some cases, a ncRNA
might conserve an intramolecular secondary structure and
comparative analysis can show compensatory base
substitutions.
• With comparative genome sequence data now
accumulating in the public domain for most if not all
important genetic systems, comparative analysis can (and
should) become routine.
Compensatory substitutions
that maintain the structure
U U
C G
U A
A U
G C
A UCGAC 3’
G C
5’
Evolutionary conservation of RNA molecules can be revealed
by identification of compensatory substitutions
• Manual annotation of 60,770 full-length mouse complementary
DNA sequences, clustered into 33,409 ‘transcriptional units’,
contributing 90.1% of a newly established mouse transcriptome
database.
• Of these transcriptional units, 4,258 are new protein-coding and
11,665 are new non-coding messages, indicating that non-coding
RNA is a major component of the transcriptome.
Function of ncRNAs
ncRNAs & RNAi
Therapeutic Applications
• Shooting millions of tiny RNA molecules into a
mouse’s bloodstream can protect its liver from the
ravages of hepatitis, a new study shows. In this
case, they blunt the liver’s self-destructive
inflammatory response, which can be triggered by
agents such as the hepatitis B or C viruses.
(Harvard University immunologists Judy
Lieberman and Premlata Shankar)
• In a series of experiments published online this
week by Nature Medicine, Lieberman’s team gave
mice injections of siRNAs designed to shut down a
gene called Fas. When overactivated during an
inflammatory response, it induces liver cells to
self-destruct. The next day, the animals were given
an antibody that sends Fas into hyperdrive. Control
mice died of acute liver failure within a few days,
but 82% of the siRNA-treated mice remained free
of serious disease and survived. Between 80% and
90% of their liver cells had incorporated the
siRNAs.
2015 bioinformatics go_hmm_wim_vancriekinge

Tha price of a g.pt.3.newer.html.doc
 
Tha price of health.pt.3.newer.html.doc
Tha price of health.pt.3.newer.html.docTha price of health.pt.3.newer.html.doc
Tha price of health.pt.3.newer.html.doc
 
Tha price of wisdom.pt.3.newer.html.doc
Tha price of wisdom.pt.3.newer.html.docTha price of wisdom.pt.3.newer.html.doc
Tha price of wisdom.pt.3.newer.html.doc
 
Art romànic i gòtic
Art romànic i gòticArt romànic i gòtic
Art romànic i gòtic
 
Ibèria entre els segles VIII-XI
Ibèria entre els segles VIII-XIIbèria entre els segles VIII-XI
Ibèria entre els segles VIII-XI
 

Similar to 2015 bioinformatics go_hmm_wim_vancriekinge

Bioinformatics t8-go-hmm wim-vancriekinge_v2013
Bioinformatics t8-go-hmm wim-vancriekinge_v2013Bioinformatics t8-go-hmm wim-vancriekinge_v2013
Bioinformatics t8-go-hmm wim-vancriekinge_v2013Prof. Wim Van Criekinge
 
Apollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research CommunityApollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research CommunityMonica Munoz-Torres
 
Apollo - A webinar for the Phascolarctos cinereus research community
Apollo - A webinar for the Phascolarctos cinereus research communityApollo - A webinar for the Phascolarctos cinereus research community
Apollo - A webinar for the Phascolarctos cinereus research communityMonica Munoz-Torres
 
Apollo : A workshop for the Manakin Research Coordination Network
Apollo: A workshop for the Manakin Research Coordination NetworkApollo: A workshop for the Manakin Research Coordination Network
Apollo : A workshop for the Manakin Research Coordination NetworkMonica Munoz-Torres
 
Bioinformaatics for M.Sc. Biotecchnology.pptx
Bioinformaatics for M.Sc. Biotecchnology.pptxBioinformaatics for M.Sc. Biotecchnology.pptx
Bioinformaatics for M.Sc. Biotecchnology.pptxRanjan Jyoti Sarma
 
Bioinformatics.Practical Notebook
Bioinformatics.Practical NotebookBioinformatics.Practical Notebook
Bioinformatics.Practical NotebookNaima Tahsin
 
Tag snp selection using quine mc cluskey optimization method-2
Tag snp selection using quine mc cluskey optimization method-2Tag snp selection using quine mc cluskey optimization method-2
Tag snp selection using quine mc cluskey optimization method-2IAEME Publication
 
Dna data compression algorithms based on redundancy
Dna data compression algorithms based on redundancyDna data compression algorithms based on redundancy
Dna data compression algorithms based on redundancyijfcstjournal
 
AI 바이오 (4일차).pdf
AI 바이오 (4일차).pdfAI 바이오 (4일차).pdf
AI 바이오 (4일차).pdfH K Yoon
 
Prediction of protein function
Prediction of protein functionPrediction of protein function
Prediction of protein functionLars Juhl Jensen
 
Databases_CSS2.pptx
Databases_CSS2.pptxDatabases_CSS2.pptx
Databases_CSS2.pptxSilpa87
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012Dan Gaston
 

Similar to 2015 bioinformatics go_hmm_wim_vancriekinge (20)

Bioinformatica 08-12-2011-t8-go-hmm
Bioinformatica 08-12-2011-t8-go-hmmBioinformatica 08-12-2011-t8-go-hmm
Bioinformatica 08-12-2011-t8-go-hmm
 
Bioinformatics t8-go-hmm v2014
Bioinformatics t8-go-hmm v2014Bioinformatics t8-go-hmm v2014
Bioinformatics t8-go-hmm v2014
 
Bioinformatics t8-go-hmm wim-vancriekinge_v2013
Bioinformatics t8-go-hmm wim-vancriekinge_v2013Bioinformatics t8-go-hmm wim-vancriekinge_v2013
Bioinformatics t8-go-hmm wim-vancriekinge_v2013
 
Bioinformatica t8-go-hmm
Bioinformatica t8-go-hmmBioinformatica t8-go-hmm
Bioinformatica t8-go-hmm
 
Apollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research CommunityApollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research Community
 
Apollo - A webinar for the Phascolarctos cinereus research community
Apollo - A webinar for the Phascolarctos cinereus research communityApollo - A webinar for the Phascolarctos cinereus research community
Apollo - A webinar for the Phascolarctos cinereus research community
 
Apollo : A workshop for the Manakin Research Coordination Network
Apollo: A workshop for the Manakin Research Coordination NetworkApollo: A workshop for the Manakin Research Coordination Network
Apollo : A workshop for the Manakin Research Coordination Network
 
Apolo Taller en BIOS
Apolo Taller en BIOS Apolo Taller en BIOS
Apolo Taller en BIOS
 
Bioinformaatics for M.Sc. Biotecchnology.pptx
Bioinformaatics for M.Sc. Biotecchnology.pptxBioinformaatics for M.Sc. Biotecchnology.pptx
Bioinformaatics for M.Sc. Biotecchnology.pptx
 
1 md2016 homology
1 md2016 homology1 md2016 homology
1 md2016 homology
 
Gene prediction strategies
Gene prediction strategies Gene prediction strategies
Gene prediction strategies
 
Bioinformatics.Practical Notebook
Bioinformatics.Practical NotebookBioinformatics.Practical Notebook
Bioinformatics.Practical Notebook
 
Tag snp selection using quine mc cluskey optimization method-2
Tag snp selection using quine mc cluskey optimization method-2Tag snp selection using quine mc cluskey optimization method-2
Tag snp selection using quine mc cluskey optimization method-2
 
Dna data compression algorithms based on redundancy
Dna data compression algorithms based on redundancyDna data compression algorithms based on redundancy
Dna data compression algorithms based on redundancy
 
AI 바이오 (4일차).pdf
AI 바이오 (4일차).pdfAI 바이오 (4일차).pdf
AI 바이오 (4일차).pdf
 
Prediction of protein function
Prediction of protein functionPrediction of protein function
Prediction of protein function
 
Bioinformatica t4-alignments
Bioinformatica t4-alignmentsBioinformatica t4-alignments
Bioinformatica t4-alignments
 
Database Searching
Database SearchingDatabase Searching
Database Searching
 
Databases_CSS2.pptx
Databases_CSS2.pptxDatabases_CSS2.pptx
Databases_CSS2.pptx
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012
 

More from Prof. Wim Van Criekinge

2019 03 05_biological_databases_part5_v_upload
2019 03 05_biological_databases_part5_v_upload2019 03 05_biological_databases_part5_v_upload
2019 03 05_biological_databases_part5_v_uploadProf. Wim Van Criekinge
 
2019 03 05_biological_databases_part4_v_upload
2019 03 05_biological_databases_part4_v_upload2019 03 05_biological_databases_part4_v_upload
2019 03 05_biological_databases_part4_v_uploadProf. Wim Van Criekinge
 
2019 03 05_biological_databases_part3_v_upload
2019 03 05_biological_databases_part3_v_upload2019 03 05_biological_databases_part3_v_upload
2019 03 05_biological_databases_part3_v_uploadProf. Wim Van Criekinge
 
2019 02 21_biological_databases_part2_v_upload
2019 02 21_biological_databases_part2_v_upload2019 02 21_biological_databases_part2_v_upload
2019 02 21_biological_databases_part2_v_uploadProf. Wim Van Criekinge
 
2019 02 12_biological_databases_part1_v_upload
2019 02 12_biological_databases_part1_v_upload2019 02 12_biological_databases_part1_v_upload
2019 02 12_biological_databases_part1_v_uploadProf. Wim Van Criekinge
 
Bio ontologies and semantic technologies[2]
Bio ontologies and semantic technologies[2]Bio ontologies and semantic technologies[2]
Bio ontologies and semantic technologies[2]Prof. Wim Van Criekinge
 
2018 03 27_biological_databases_part4_v_upload
2018 03 27_biological_databases_part4_v_upload2018 03 27_biological_databases_part4_v_upload
2018 03 27_biological_databases_part4_v_uploadProf. Wim Van Criekinge
 
2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_upload2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_uploadProf. Wim Van Criekinge
 
2018 02 20_biological_databases_part1_v_upload
2018 02 20_biological_databases_part1_v_upload2018 02 20_biological_databases_part1_v_upload
2018 02 20_biological_databases_part1_v_uploadProf. Wim Van Criekinge
 

More from Prof. Wim Van Criekinge (20)

2020 02 11_biological_databases_part1
2020 02 11_biological_databases_part12020 02 11_biological_databases_part1
2020 02 11_biological_databases_part1
 
2019 03 05_biological_databases_part5_v_upload
2019 03 05_biological_databases_part5_v_upload2019 03 05_biological_databases_part5_v_upload
2019 03 05_biological_databases_part5_v_upload
 
2019 03 05_biological_databases_part4_v_upload
2019 03 05_biological_databases_part4_v_upload2019 03 05_biological_databases_part4_v_upload
2019 03 05_biological_databases_part4_v_upload
 
2019 03 05_biological_databases_part3_v_upload
2019 03 05_biological_databases_part3_v_upload2019 03 05_biological_databases_part3_v_upload
2019 03 05_biological_databases_part3_v_upload
 
2019 02 21_biological_databases_part2_v_upload
2019 02 21_biological_databases_part2_v_upload2019 02 21_biological_databases_part2_v_upload
2019 02 21_biological_databases_part2_v_upload
 
2019 02 12_biological_databases_part1_v_upload
2019 02 12_biological_databases_part1_v_upload2019 02 12_biological_databases_part1_v_upload
2019 02 12_biological_databases_part1_v_upload
 
P7 2018 biopython3
P7 2018 biopython3P7 2018 biopython3
P7 2018 biopython3
 
P6 2018 biopython2b
P6 2018 biopython2bP6 2018 biopython2b
P6 2018 biopython2b
 
P4 2018 io_functions
P4 2018 io_functionsP4 2018 io_functions
P4 2018 io_functions
 
P3 2018 python_regexes
P3 2018 python_regexesP3 2018 python_regexes
P3 2018 python_regexes
 
T1 2018 bioinformatics
T1 2018 bioinformaticsT1 2018 bioinformatics
T1 2018 bioinformatics
 
P1 2018 python
P1 2018 pythonP1 2018 python
P1 2018 python
 
Bio ontologies and semantic technologies[2]
Bio ontologies and semantic technologies[2]Bio ontologies and semantic technologies[2]
Bio ontologies and semantic technologies[2]
 
2018 05 08_biological_databases_no_sql
2018 05 08_biological_databases_no_sql2018 05 08_biological_databases_no_sql
2018 05 08_biological_databases_no_sql
 
2018 03 27_biological_databases_part4_v_upload
2018 03 27_biological_databases_part4_v_upload2018 03 27_biological_databases_part4_v_upload
2018 03 27_biological_databases_part4_v_upload
 
2018 03 20_biological_databases_part3
2018 03 20_biological_databases_part32018 03 20_biological_databases_part3
2018 03 20_biological_databases_part3
 
2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_upload2018 02 20_biological_databases_part2_v_upload
2018 02 20_biological_databases_part2_v_upload
 
2018 02 20_biological_databases_part1_v_upload
2018 02 20_biological_databases_part1_v_upload2018 02 20_biological_databases_part1_v_upload
2018 02 20_biological_databases_part1_v_upload
 
P7 2017 biopython3
P7 2017 biopython3P7 2017 biopython3
P7 2017 biopython3
 
P6 2017 biopython2
P6 2017 biopython2P6 2017 biopython2
P6 2017 biopython2
 

Recently uploaded

Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfUmakantAnnand
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 

Recently uploaded (20)

Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 

2015 bioinformatics go_hmm_wim_vancriekinge

  • 6. BPC 2015 *** ERGRO *** 1. Longest English word where the first three letters are identical to the last three 2. English word with the longest stretch of letters that is identical at the beginning and at the end 3. In Dutch? 4. Any other language 5. Biological relevance? Send before 1 December to wim.vancriekinge@gmail.com. Longest one wins; if same size, first to submit.
  • 7. Dries Godderis 1. Longest English word where the first 3 letters = the last 3 letters: antipredeterminant (18) 2. Longest English word with the longest identical stretch = benzeneazobenzene (17) 3. In Dutch, longest word with the first 3 letters = the last 3 letters: tentoonstellingsprojecten (25). In Dutch, longest word with the longest identical stretch = dierentuindieren (16) 4. In Portuguese, longest word with the first 3 letters = the last 3 letters: desconstitucionalizardes (24). In Portuguese, longest word with the longest identical stretch = reassenhoreasse (15) (a conjugated form of the verb reassenhorear) 5. Biological relevance: in a hairpin loop (stem-loop) model, the stability and formation of the structure are determined by the stability of the helix and of the loop regions that form. An identical stretch at the beginning and end of the sequence will play an important role in good base pairing.
  • 8. Exams • Dates ? • 1st question
  • 9. Gene Prediction, HMM & ncRNA What to do with an unknown sequence ? Gene Ontologies Gene Prediction Composite Gene Prediction Non-coding RNA HMM
  • 10. UNKNOWN PROTEIN SEQUENCE LOOK FOR: • Similar sequences in databases ((PSI) BLAST) • Distinctive patterns/domains associated with function • Functionally important residues • Secondary and tertiary structure • Physical properties (hydrophobicity, IEP etc)
  • 11. BASIC INFORMATION COMES FROM SEQUENCE • One sequence: can get some information, e.g. amino acid properties • More than one sequence: get more information on conserved residues, fold and function • Multiple alignments of related sequences: can build up consensus sequences of known families, domains, motifs or sites • Sequence alignments can give information on loops, families and function from conserved regions
  • 12. Additional analysis of protein sequences • transmembrane regions • signal sequences • localisation signals • targeting sequences • GPI anchors • glycosylation sites • hydrophobicity • amino acid composition • molecular weight • solvent accessibility • antigenicity
  • 13. FINDING CONSERVED PATTERNS IN PROTEIN SEQUENCES • Pattern - short, simplest, but limited • Motif - conserved element of a sequence alignment, usually predictive of structural or functional region To get more information across whole alignment: • Profile • HMM
  • 14. PATTERNS • Small, highly conserved regions • Shown as regular expressions Example: [AG]-x-V-x(2)-x-{YW} – [] shows either amino acid – x is any amino acid – x(2) is any amino acid in each of the next 2 positions – {} shows any amino acid except these BUT: limited to near-exact matches in a small region
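The pattern notation on this slide maps closely onto ordinary regular expressions. The following is a minimal sketch, assuming only the elements shown above (bracketed alternatives, `x`, braces for exclusions, and a repeat count in parentheses); the function name `prosite_to_regex` and the test peptides are illustrative and do not come from the slides.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        # Separate an optional repeat count, e.g. x(2) or x(2,4), from the residue part.
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        residue, count = match.group(1), match.group(2)
        if residue.lower() == "x":
            regex = "."                          # any amino acid
        elif residue.startswith("{"):
            regex = "[^" + residue[1:-1] + "]"   # any amino acid except these
        else:
            regex = residue                      # a literal residue or an [..] alternative
        if count:
            regex += "{" + count + "}"
        parts.append(regex)
    return "".join(parts)

regex = prosite_to_regex("[AG]-x-V-x(2)-x-{YW}")
print(regex)                                # [AG].V.{2}.[^YW]
print(bool(re.search(regex, "AKVLLKA")))    # True: last residue is not Y or W
print(bool(re.search(regex, "AKVLLKY")))    # False: last residue is Y
```

The sketch ignores the anchoring symbols that the full PROSITE syntax also allows, which fits the slide's point that patterns are simple but limited.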
  • 15. PROFILES • Table or matrix containing comparison information for aligned sequences • Used to find sequences similar to alignment rather than one sequence • Contains same number of rows as positions in sequences • Row contains score for alignment of position with each residue
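A profile as described here can be sketched as a list with one row per alignment position, each row holding a score for every residue. The example below is an illustration under stated assumptions: a toy four-letter alphabet, a uniform background frequency, a pseudocount of 1 and log-odds scores. None of these choices come from the slides.

```python
import math
from collections import Counter

def build_profile(alignment, alphabet, pseudocount=1.0):
    """Return one row (dict residue -> log-odds score) per alignment column."""
    background = 1.0 / len(alphabet)           # assumed uniform background
    total = len(alignment) + pseudocount * len(alphabet)
    profile = []
    for pos in range(len(alignment[0])):
        counts = Counter(seq[pos] for seq in alignment)
        row = {
            res: math.log2(((counts[res] + pseudocount) / total) / background)
            for res in alphabet
        }
        profile.append(row)
    return profile

def score(profile, segment):
    """Sum the per-position scores of a segment of the same length as the profile."""
    return sum(row[res] for row, res in zip(profile, segment))

toy_alignment = ["ACD", "ACE", "ACD"]          # hypothetical gap-free alignment
profile = build_profile(toy_alignment, alphabet="ACDE")
print(len(profile))                            # 3 rows, one per position
print(score(profile, "ACD") > score(profile, "DEA"))   # True
```

Scoring a candidate against the whole alignment, rather than against one sequence, is the point made on the slide.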
  • 16. HIDDEN MARKOV MODELS (HMM) • An HMM is a large-scale profile with gaps, insertions and deletions allowed in the alignments, and built around probabilities • Package used: HMMER (http://hmmer.wustl.edu/) • Start with one sequence or alignment: build the model with hmmbuild, calibrate it with hmmcalibrate, then search a database with the HMM • E-value: number of false matches expected with a certain score • Assume an extreme value distribution (EVD) for noise; calibrate by searching random sequences with the HMM to build up the curve of noise
  • 18. Gene Prediction, HMM & ncRNA What to do with an unknown sequence ? Gene Ontologies Gene Prediction HMM Composite Gene Prediction Non-coding RNA
  • 19. What is an ontology? • An ontology is an explicit specification of a conceptualization. • A conceptualization is an abstract, simplified view of the world that we want to represent. • If the specification medium is a formal representation, the ontology defines the vocabulary.
  • 20. Why Create Ontologies? • to enable data exchange among programs • to simplify unification (or translation) of disparate representations • to employ knowledge-based services • to embody the representation of a theory • to facilitate communication among people
  • 21. Summary • Ontologies are what they do: artifacts to help people and their programs communicate, coordinate, collaborate. • Ontologies are essential elements in the technological infrastructure of the Knowledge Age • http://www.geneontology.org/
  • 22. The Three Ontologies • Molecular Function: elemental activity or task (nuclease, DNA binding, transcription factor) • Biological Process: broad objective or goal (mitosis, signal transduction, metabolism) • Cellular Component: location or complex (nucleus, ribosome, origin recognition complex)
  • 23. DAG Structure Directed acyclic graph: each child may have one or more parents
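Because a child term may have more than one parent, collecting every ancestor of a term is a graph traversal rather than a walk up a tree. The sketch below uses a small hypothetical parent map; the term names are chosen for illustration and are not an excerpt of the actual Gene Ontology.

```python
def ancestors(term, parents):
    """Return all terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:      # a term reached via two paths is visited once
                seen.add(parent)
                stack.append(parent)
    return seen

# Hypothetical DAG: "nuclear chromosome" has two parents.
toy_parents = {
    "nuclear chromosome": ["nucleus", "chromosome"],
    "nucleus": ["intracellular"],
    "chromosome": ["intracellular"],
    "intracellular": ["cell"],
}
print(sorted(ancestors("nuclear chromosome", toy_parents)))
# ['cell', 'chromosome', 'intracellular', 'nucleus']
```

A gene annotated to a term is implicitly annotated to all of that term's ancestors, which is why this kind of traversal usually precedes any counting of genes per term.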
  • 26. Example - Cellular Location
  • 28. GO: Applications • E.g. chip-data analysis: overrepresented items can provide functional clues • Overrepresentation check: contingency table – chi-square test (or Fisher's exact test if a frequency is < 5)
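The overrepresentation check can be sketched with the one-sided Fisher's exact test, which for a 2x2 contingency table equals the upper tail of a hypergeometric distribution. The function name, argument names and example counts below are assumptions made for illustration.

```python
from math import comb

def enrichment_p_value(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K carry the GO term.

    k: selected genes annotated with the term
    n: selected genes in total (e.g. differentially expressed on the chip)
    K: all genes annotated with the term
    N: all genes on the chip
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Hypothetical counts: 8 of 50 selected genes carry a term that
# annotates 100 of the 10000 genes on the chip.
print(enrichment_p_value(8, 50, 100, 10000))
```

When many GO terms are tested at once, the resulting p-values would normally need a multiple-testing correction, which this sketch leaves out.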
  • 29. Gene Prediction, HMM & ncRNA What to do with an unknown sequence ? Web applications Gene Ontologies Gene Prediction HMM Composite Gene Prediction Non-coding RNA
  • 30. Computational Gene Finding Problem: given a very long DNA sequence, identify coding regions (including intron splice sites) and their predicted protein sequences
  • 32. Computational Gene Finding • There is no (yet known) perfect method for finding genes. All approaches rely on combining various “weak signals” • Find elements of a gene – coding sequences (exons) – promoters and start signals – poly-A tails and downstream signals • Assemble into a consistent gene model
  • 35. GENE STRUCTURE INFORMATION - POSITION ON PHYSICAL MAP This gene structure corresponds to the position on the physical map.
  • 36. GENE STRUCTURE INFORMATION - ACTIVE ZONE This gene structure shows the Active Zone. The Active Zone limits the extent of analysis, genefinder and fasta dumps. A blue line within the yellow box indicates regions outside of the active zone. The active zone is set by entering coordinates in the active zone (yellow box).
  • 37. GENE STRUCTURE INFORMATION - POSITION This gene structure relates to the position. Change the origin of this scale by entering a number in the green 'origin' box.
  • 38. GENE STRUCTURE INFORMATION - PREDICTED GENE STRUCTURE This gene structure relates to the predicted gene structures. Boxes are exons; thin lines (or springs) are introns.
  • 39. Find the open reading frames GAAAAAGCTCCTGCCCAATCTGAAATGGTTAGCCTATCTTTCCACCGT Any sequence has 3 potential reading frames (+1, +2, +3). Its reverse complement also has three potential reading frames (-1, -2, -3), giving 6 possible reading frames. The triplet, non-punctuated nature of the genetic code helps us out: 64 potential codons, 61 coding codons, 3 stop codons (TGA, TAA, TAG). In a random sequence, approximately 1 in 21 codons (3/64) will be a stop. Frame +1: E K A P A Q S E M V S L S F H R Frame +2: K K L L P N L K W L A Y L S T Frame +3: K S S C P I * N G * P I F P P
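The three forward translations on this slide can be reproduced with a short six-frame translation. This is a sketch that assumes the standard genetic code and an uppercase sequence containing only A, C, G and T; the helper names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna, offset):
    """Translate complete codons starting at `offset`; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)

def six_frames(dna):
    rc = reverse_complement(dna)
    frames = {f"+{k + 1}": translate(dna, k) for k in range(3)}
    frames.update({f"-{k + 1}": translate(rc, k) for k in range(3)})
    return frames

seq = "GAAAAAGCTCCTGCCCAATCTGAAATGGTTAGCCTATCTTTCCACCGT"
for name, protein in six_frames(seq).items():
    print(name, protein)
# +1 EKAPAQSEMVSLSFHR
# +2 KKLLPNLKWLAYLST
# +3 KSSCPI*NG*PIFPP
# (followed by the three reverse-strand frames)
```

Frame +3 contains two stops within 15 codons, while frame +1 stays open across the whole fragment. This contrast is what an ORF display makes visible.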
  • 40. GENE STRUCTURE INFORMATION - OPEN READING FRAMES This gene structure relates to open reading frames. There is one column for each frame. Small horizontal lines represent stop codons.
  • 41. GENE STRUCTURE INFORMATION - START CODONS This gene structure represents start codons. There is one column for each frame. The size indicates the relative score for the particular start site.
  • 42. Computational Gene Finding: Hexanucleotide frequencies • Amino acid distributions are biased, e.g. p(A) > p(C) • Pairwise distributions are also biased, e.g. p(AT)/[p(A)*p(T)] > p(AC)/[p(A)*p(C)] • Nucleotides that code for preferred amino acids (and amino acid pairs) occur more frequently in coding regions than in non-coding regions • Codon biases (per amino acid) • Hexanucleotide distributions that reflect those biases indicate coding regions
  • 43. Gene prediction Generation of datasets (Ensmart@Ensembl): Dataset 1 (http://biobix.ugent.be/txt/coding.txt) consists of >900 coding regions (DNA). Dataset 2 (http://biobix.ugent.be/txt/noncoding.txt) consists of >900 non-coding regions. Distance array: for every base, calculate all the distances (in bp) to the same nucleotide (focus on the first 1000 bp of the coding region and limit the distance array to a window of 1000 bp). Do you see a difference in this “distance array” between coding and non-coding sequence? Could it be used to predict genes? Write a program to predict genes in the following genomic sequence (http://biobix.ugent.be/txt/genomic.txt). What else could help in finding genes in raw genomic sequences?
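One possible reading of the distance array is a histogram that counts, for each distance d, how many pairs of positions d bp apart carry the same nucleotide. The sketch below follows that interpretation, which is an assumption rather than the official solution to the exercise; the toy sequence stands in for the datasets named on the slide.

```python
def distance_array(seq, window=1000):
    """counts[d] = number of position pairs (i, i + d) with the same nucleotide."""
    counts = [0] * (window + 1)
    for i, base in enumerate(seq):
        # Only look ahead as far as the window and the sequence end allow.
        for d in range(1, min(window, len(seq) - 1 - i) + 1):
            if seq[i + d] == base:
                counts[d] += 1
    return counts

toy = "ACGACGACG"                 # perfectly periodic toy sequence
counts = distance_array(toy, window=6)
print(counts[1:])                 # [0, 0, 6, 0, 0, 3]
```

In real coding sequence one would look for a weaker version of the same effect: an excess of matches at distances that are multiples of three, reflecting codon structure. Whether that signal is strong enough to predict genes is the question the exercise poses.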
  • 44. GENE STRUCTURE INFORMATION - CODING POTENTIAL This gene structure corresponds to the coding potential. The grey boxes indicate regions where the codon frequencies match those of known C. elegans genes. The larger the grey box, the more this region resembles a C. elegans coding element.
  • 45. blastn (EST) For raw DNA sequence analysis blastx is extremely useful: it probes your DNA sequence against the protein database. A match (homolog) gives you some ideas regarding function. One problem is the large number of genome sequences: you will get matches to genome database entries whose annotation rests strictly on sequence homology, so you often need some experimental evidence.
  • 46. GENE STRUCTURE INFORMATION - SEQUENCE SIMILARITY This feature shows protein sequence similarity. The blue boxes indicate regions of sequence which, when translated, have similarity to previously characterised proteins. To view the alignment, select the right mouse button whilst over the blue box.
  • 47. GENE STRUCTURE INFORMATION - EST MATCHES This gene structure relates to EST matches. The yellow boxes represent DNA matches (BLAST) to C. elegans Expressed Sequence Tags (ESTs). To view the alignment, use the right mouse button whilst over the yellow box to invoke Blixem.
  • 48. Borodovsky et al., 1999, Organization of the Prokaryotic Genome (Charlebois, ed) pp. 11-34 New generation of programs to predict gene coding sequences based on a non-random repeat pattern (eg. Glimmer, GeneMark) – actually pretty good
  • 49. Computational Gene Finding • CpG islands are regions of sequence that have a high proportion of CG dinucleotides (the p is the phosphodiester bond linking them) – CpG islands are present in the promoter and exonic regions of approximately 40% of mammalian genes – Other regions of the mammalian genome contain few CpG dinucleotides and these are largely methylated • Definition: sequences of >500 bp with – G+C > 55% – Observed(CpG)/Expected(CpG) > 0.65
  • 50. GENE STRUCTURE INFORMATION - REPEAT FAMILIES This gene structure corresponds to repeat families. This column shows matches to members of a number of repeat families. Currently a hidden Markov model is used to detect these.
  • 51. GENE STRUCTURE INFORMATION - REPEATS This gene structure relates to repeats. This column shows regions of localised repeats, both tandem and inverted. Clicking on the boxes will show the complete repeat information in the blue line at the top end of the screen.
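The three criteria in this definition can be checked directly on a candidate region. The sketch below assumes the commonly used estimate Expected(CpG) = (number of C × number of G) / length, which the slide does not spell out; the function names and test sequences are illustrative.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    observed = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0      # assumed definition of Expected(CpG)
    ratio = observed / expected if expected else 0.0
    return gc_fraction, ratio

def looks_like_cpg_island(seq, min_length=500, min_gc=0.55, min_ratio=0.65):
    """Apply the thresholds from the slide to a single candidate region."""
    gc_fraction, ratio = cpg_stats(seq)
    return len(seq) > min_length and gc_fraction > min_gc and ratio > min_ratio

print(cpg_stats("CGCG"))                      # (1.0, 2.0)
print(looks_like_cpg_island("CG" * 300))      # True
print(looks_like_cpg_island("AT" * 300))      # False
```

Scanning a genome would apply the same test in a sliding window and merge neighbouring windows that pass. This sketch leaves that step out.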
  • 53. Computational Gene Finding: Splice junctions • Most eukaryotic introns have a consensus splice signal: GU at the beginning (“donor”), AG at the end (“acceptor”); in the DNA sequence the donor reads GT. • Variation does occur in the splice sites. • Many AGs and GTs are not splice sites. • Database of experimentally validated human splice sites: http://www.ebi.ac.uk/~thanaraj/splice.html
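To see why the GT...AG consensus alone is a weak signal, the sketch below lists every GT and AG in a DNA string and pairs them into candidate introns. The function names and the `min_length` parameter are hypothetical; real gene finders score the sequence context around each site rather than the bare dinucleotide.

```python
def dinucleotide_positions(seq, dinucleotide):
    """Return the 0-based start index of every occurrence of a dinucleotide."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinucleotide]


def candidate_introns(seq, min_length=4):
    """Pair each GT (donor) with each downstream AG (acceptor).

    Returns half-open (start, end) spans so that seq[start:end] begins with
    GT and ends with AG. min_length is an illustrative cutoff only.
    """
    donors = dinucleotide_positions(seq, "GT")
    acceptors = dinucleotide_positions(seq, "AG")
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_length]


if __name__ == "__main__":
    dna = "AAGTAAGCAGTT"
    print(dinucleotide_positions(dna, "GT"))
    print(dinucleotide_positions(dna, "AG"))
    print(candidate_introns(dna))
```

Even this 12-base toy sequence yields several candidate sites, which illustrates the point that many AGs and GTs are not splice sites.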
  • 54. GENE STRUCTURE INFORMATION - PUTATIVE SPLICE SITES This gene structure track shows putative splice sites. The splice sites are shown 'hooked'. The hook points in the direction of splicing, so 3' splice sites point up and 5' splice sites point down. The colour of a splice site indicates the position at which it interrupts the codon. The height of a splice site is proportional to its Genefinder score.
  • 55. Gene Prediction, HMM & ncRNA What to do with an unknown sequence ? Web applications Gene Ontologies Gene Prediction HMM Composite Gene Prediction Non-coding RNA
  • 57. Towards profiles (PSSM) with indels – insertions and/or deletions • Recall that profiles are matrices that give the probability of seeing an amino acid at a particular location in a motif. • What about motifs that allow insertions or deletions (together called indels)? • Patterns and regular expressions can handle these easily, but profiles are more flexible. • Can indels be integrated into profiles?
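  • 58. Hidden Markov Models: Graphical models of sequences • Need a representation that allows specification of the probability of introducing (and/or extending) a gap in the profile. [Diagram: three profile positions separated by gap states, with arcs labelled “delete” and “continue”. Position 1: A 0.1, C 0.05, D 0.2, E 0.08, F 0.01. Position 2: A 0.04, C 0.1, D 0.01, E 0.2, F 0.02. Position 3: A 0.2, C 0.01, D 0.05, E 0.1, F 0.06.]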
  • 59. Hidden Markov Chain • A sequence is said to be Markovian if the probability of an element occurring at a particular position depends only on the previous elements in the sequence. • The order of a Markov chain is the number of previous elements that influence the probability – 0th order: the probability does not depend on any previous position – 1st order: the probability depends only on the immediately previous position. • 1st order Markov chains are good for proteins.
  • 61. Markov chain with begin and end
  • 62. Markov Models: Graphical models of sequences • Consist of states (boxes) and transitions (arcs) labeled with probabilities • States have probabilities of “emitting” an element of a sequence (or nothing). • Arcs have the probability of moving from one state to another. – The probabilities of all out arcs of a state must sum to 1 – Self-loops (e.g. gap extend) are OK.
  • 63. Markov Models • Simplest example: each state emits (or, equivalently, recognizes) a particular element with probability 1, and each transition is equally likely. Example sequences: 1234, 234, 14, 121214, 2123334. [Diagram: Begin, Emit 1, Emit 2, Emit 3, Emit 4, End]
  • 64. Hidden Markov Models: Probabilistic Markov Models • Now, add probabilities to each transition (let emission remain a single element). • We can calculate the probability of any sequence given this model by multiplying the transition probabilities along its path. [Diagram: Begin, Emit 1, Emit 2, Emit 3, Emit 4, End, with arc probabilities 0.5, 0.5, 0.25, 0.75, 0.9, 0.1, 0.2, 0.8, 1.0] p(1234) = 0.5 * 0.1 * 0.75 * 0.8 = 0.03; p(14) = 0.5 * 0.9 = 0.45; p(2334) = 0.5 * 0.75 * 0.2 * 0.8 = 0.06
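The three worked products can be reproduced with a few lines of code. The arc assignment below is not stated in the transcript; it is inferred from the worked examples (and the 2 → 1 arc with probability 0.25 from the rule that out arcs sum to 1), so treat the `TRANSITIONS` table as an assumption.

```python
# ASSUMPTION: arc layout reconstructed from the slide's worked examples.
TRANSITIONS = {
    "Begin": {"1": 0.5, "2": 0.5},
    "1": {"2": 0.1, "4": 0.9},
    "2": {"1": 0.25, "3": 0.75},
    "3": {"3": 0.2, "4": 0.8},
    "4": {"End": 1.0},
}


def sequence_probability(symbols):
    """Multiply the transition probabilities along Begin -> symbols -> End.

    Each state emits exactly one symbol, so the sequence fixes the path.
    A missing arc contributes probability 0.
    """
    path = ["Begin", *symbols, "End"]
    prob = 1.0
    for current, nxt in zip(path, path[1:]):
        prob *= TRANSITIONS.get(current, {}).get(nxt, 0.0)
    return prob


if __name__ == "__main__":
    for s in ("1234", "14", "2334"):
        print(s, sequence_probability(s))
```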
  • 65. Hidden Markov Models: Probabilistic Emission • If we let the states define a set of emission probabilities for elements, we can no longer be sure which state we are in given a particular element of a sequence: BCCD or BCCD? [Diagram: Begin, four emitting states, End, with arc probabilities 0.5, 0.5, 0.25, 0.75, 0.9, 0.1, 0.2, 0.8, 1.0 and state emissions A (0.8) B (0.2); B (0.7) C (0.3); C (0.1) D (0.9); C (0.6) A (0.4)]
  • 66. Hidden Markov Models • Emission uncertainty means the sequence doesn't identify a unique path. The states are “hidden”. • The probability of a sequence is the sum over all paths that can produce it: [Diagram: Begin, four emitting states, End, with arc probabilities 0.5, 0.5, 0.25, 0.75, 0.9, 0.1, 0.2, 0.8, 1.0 and state emissions A (0.8) B (0.2); B (0.7) C (0.3); C (0.1) D (0.9); C (0.6) A (0.4)] p(bccd) = 0.5 * 0.2 * 0.1 * 0.3 * 0.75 * 0.6 * 0.8 * 0.9 + 0.5 * 0.7 * 0.75 * 0.6 * 0.2 * 0.6 * 0.8 * 0.9 = 0.000972 + 0.013608 = 0.01458
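Summing over all paths by hand becomes impractical for longer sequences; the forward algorithm does the same sum one symbol at a time. The sketch below uses the transition and emission values from the slide, with the state layout inferred from the two worked products (an assumption, as is the name `forward_probability`).

```python
# ASSUMPTION: state layout inferred from the two paths worked out on the slide.
TRANSITIONS = {
    "Begin": {"1": 0.5, "2": 0.5},
    "1": {"2": 0.1, "4": 0.9},
    "2": {"1": 0.25, "3": 0.75},
    "3": {"3": 0.2, "4": 0.8},
    "4": {"End": 1.0},
}
EMISSIONS = {
    "1": {"A": 0.8, "B": 0.2},
    "2": {"B": 0.7, "C": 0.3},
    "3": {"C": 0.6, "A": 0.4},
    "4": {"C": 0.1, "D": 0.9},
}


def forward_probability(observed):
    """Sum the probability of every state path that emits `observed`."""
    # forward[state] = probability of emitting the prefix so far and ending in state
    forward = {"Begin": 1.0}
    for symbol in observed:
        updated = {}
        for state, prob in forward.items():
            for target, t in TRANSITIONS.get(state, {}).items():
                e = EMISSIONS.get(target, {}).get(symbol, 0.0)
                if e > 0.0:
                    updated[target] = updated.get(target, 0.0) + prob * t * e
        forward = updated
    # Finish by moving into End from whichever states can reach it.
    return sum(p * TRANSITIONS.get(s, {}).get("End", 0.0) for s, p in forward.items())


if __name__ == "__main__":
    print(forward_probability("BCCD"))
```

For BCCD only two paths have non-zero probability, so the result matches the two-term sum on the slide.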
  • 68. Hidden Markov Models: The occasionally dishonest casino
  • 70. Use of Hidden Markov Models • The HMM must first be “trained” using a training set – E.g. a database of known genes. – Consensus sequences for all signal sensors are needed. – Compositional rules (i.e., emission probabilities) and length distributions are necessary for content sensors. • Transition probabilities between all connected states must be estimated. • Estimate the probability of sequence s given model m, P(s|m) – Multiply probabilities along the most likely path (or add logs, which gives less numeric error)
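Finding the most likely path and adding logs instead of multiplying is what the Viterbi algorithm does. The sketch below applies it to the occasionally dishonest casino. The transcript gives no numbers for that model, so every parameter here is an assumption, taken from the usual textbook version of the example: a fair die, a loaded die that rolls a 6 half of the time, and rare switches between them.

```python
import math

# ASSUMPTION: parameters of the usual textbook casino model, not from the slides.
STATES = ("F", "L")  # F = fair die, L = loaded die
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {face: 1 / 6 for face in "123456"},
    "L": {**{face: 0.1 for face in "12345"}, "6": 0.5},
}


def viterbi(rolls):
    """Return (most likely state path, its log probability) for a non-empty roll string."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    backpointers = []
    for roll in rolls[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            # Adding logs replaces multiplying probabilities and avoids underflow.
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[best_prev] + math.log(TRANS[best_prev][s])
                            + math.log(EMIT[s][roll]))
            pointer[s] = best_prev
        backpointers.append(pointer)
        score = new_score
    last = max(STATES, key=lambda s: score[s])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return "".join(reversed(path)), score[last]


if __name__ == "__main__":
    print(viterbi("6" * 20)[0])
    print(viterbi("1234512345")[0])
```

With products of hundreds of probabilities the plain multiplication underflows to zero in floating point, which is the practical reason for working with log scores.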
  • 71. • HMMs are effectively profiles with gaps, and have applications throughout Bioinformatics • Protein sequence applications: – MSAs and identifying distant homologs E.g. Pfam uses HMMs to define its MSAs – Domain definitions – Used for fold recognition in protein structure prediction • Nucleotide sequence applications: – Models of exons, genes, etc. for gene recognition. Applications of Hidden Markov Models
  • 72. • UC Santa Cruz (David Haussler group) – SAM-02 server. Returns alignments, secondary structure predictions, HMM parameters, etc. etc. – SAM HMM building program (requires free academic license) • Washington U. St. Louis (Sean Eddy group) – Pfam. Large database of precomputed HMM-based alignments of proteins – HMMer, program for building HMMs • Gene finders and other HMMs (more later) Hidden Markov Models Resources
  • 74. HMM in protein analysis • http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html
  • 76. Hidden Markov model for gene structure • A representation of the linguistic rules for what features might follow what other features when parsing a sequence consisting of a multiple exon gene. • A candidate gene structure is created by tracing a path from B to F. • A hidden Markov model (or hidden semi-Markov model) is defined by attaching stochastic models to each of the arcs and nodes. Signals (blue nodes): • begin sequence (B) • start translation (S) • donor splice site (D) • acceptor splice site (A) • stop translation (T) • end sequence (F) Contents (red arcs): • 5’ UTR (J5’) • initial exon (EI) • exon (E) • intron (I) • final exon (EF) • single exon (ES) • 3’ UTR (J3’)
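The "linguistic rules" on this slide can be written down as a table of allowed arcs between signals. The slide lists signals and contents but not which content connects which pair of signals, so the `ARCS` table below is an assumed reading (for example, an intron runs from a donor to an acceptor), and `parse_gene_structure` is a hypothetical helper.

```python
# ASSUMPTION: which content arc joins which pair of signals is inferred from
# the feature names, not read from the slide's diagram.
ARCS = {
    ("B", "S"): "J5'",  # 5' UTR: begin sequence -> start translation
    ("S", "D"): "EI",   # initial exon: start translation -> donor
    ("D", "A"): "I",    # intron: donor -> acceptor
    ("A", "D"): "E",    # internal exon: acceptor -> donor
    ("A", "T"): "EF",   # final exon: acceptor -> stop translation
    ("S", "T"): "ES",   # single exon: start -> stop translation
    ("T", "F"): "J3'",  # 3' UTR: stop translation -> end sequence
}


def parse_gene_structure(signals):
    """Map a path of signals from B to F onto content labels.

    Returns the list of contents, or None if the path breaks a rule.
    """
    if not signals or signals[0] != "B" or signals[-1] != "F":
        return None
    contents = []
    for current, nxt in zip(signals, signals[1:]):
        label = ARCS.get((current, nxt))
        if label is None:
            return None
        contents.append(label)
    return contents


if __name__ == "__main__":
    print(parse_gene_structure(list("BSTF")))
    print(parse_gene_structure(list("BSDADATF")))
    print(parse_gene_structure(list("BSATF")))
```

In an actual gene finder each arc and node would carry a stochastic model that scores the underlying sequence; this sketch only checks that a candidate structure is grammatical.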
  • 77. Classic Programs for gene finding Some of the best programs are HMM based: • GenScan – http://genes.mit.edu/GENSCAN.html • GeneMark – http://opal.biology.gatech.edu/GeneMark/ Other programs • AAT, EcoParse, Fexeh, Fgeneh, Fgenes, Finex, GeneHacker, GeneID-3, GeneParser 2, GeneScope, Genie, GenLang, Glimmer, GlimmerM, Grail II, HMMgene, Morgan, MZEF, Procrustes, SORFind, Veil, Xpound
  • 78. Hidden Markov Models: Gene Finding Software GENSCAN (not to be confused with GeneScan, a commercial product) • A semi-Markov model – Explicit model of how long to stay in a state (rather than just self-loops, whose length distributions must decay exponentially) • Tracks the “phase” of an exon or intron (0 coincides with a codon boundary, otherwise 1 or 2) • Tracks strand (and direction)
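The remark about self-loops can be made concrete. If a state is left with probability 1 − p at every step, the time spent in it follows a geometric distribution, which decays by a constant factor per step. A minimal sketch, with hypothetical function names and an arbitrary example value of p:

```python
def self_loop_length_pmf(p_stay, length):
    """P(exactly `length` steps in a state whose self-loop has probability p_stay)."""
    if length < 1:
        return 0.0
    # Stay (length - 1) times, then leave once.
    return p_stay ** (length - 1) * (1.0 - p_stay)


def expected_length(p_stay):
    """Mean of the geometric length distribution."""
    return 1.0 / (1.0 - p_stay)


if __name__ == "__main__":
    p = 0.9  # illustrative value only
    for k in (1, 2, 5, 10, 50):
        print(k, self_loop_length_pmf(p, k))
    print("mean length:", expected_length(p))
```

The most probable length is always 1 under this model, which is a poor fit for features such as exons that have a preferred size; an explicit length distribution removes that constraint.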
  • 79. Conservation of Gene Features Conservation pattern across 3165 mappings of human RefSeq mRNAs to the genome. A program sampled 200 evenly spaced bases across 500 bases upstream of transcription, the 5’ UTR, the first coding exon, introns, middle coding exons, introns, the 3’ UTR and 500 bases after polyadenylation. There are peaks of conservation at the transition from one region to another. [Plot: aligning identity, axis from 50% to 100%]
  • 80. Composite Approaches • Use EST info to constrain HMMs (Genie) • Use protein homology info on top of HMMs (fgenesh++, GenomeScan) • Use cross species genomic alignments on top of HMMs (twinscan, fgenesh2, SLAM, SGP)
  • 81. Gene Prediction: more complex … 1. Species specific 2. Splicing enhancers found in coding regions 3. Trans-splicing 4. …
  • 82. [Figure labels: length preference, 5’ ss, intcomp, branch, 3’ ss]
  • 84. Contents-Schedule RNA genes Besides the 6000 protein-coding genes, there are: 140 ribosomal RNA genes; 275 transfer RNA genes; 40 small nuclear RNA genes; >100 small nucleolar genes; ? Other examples: pRNA in the φ29 rotary packaging motor (Simpson et al. Nature 408:745-750, 2000); cartilage-hair hypoplasia mapped to an RNA (Ridanpää et al. Cell 104:195-203, 2001); the human Prader-Willi critical region (Cavaille et al. PNAS 97:14035-7, 2000)
  • 89. miRNA genes RNA genes can be hard to detect. Example: UGAGGUAGUAGGUUGUAUAGU, C. elegans let-7; 21 nt (Pasquinelli et al. Nature 408:86-89, 2000) • Often small • Sometimes multicopy and redundant • Often not polyadenylated (not represented in ESTs) • Immune to frameshift and nonsense mutations • No open reading frame, no codon bias • Often evolving rapidly in primary sequence
  • 90. Lin-4 • Lin-4 was identified in a screen for mutations that affect the timing and sequence of postembryonic development in C. elegans. Mutants reiterate L1 instead of later stages of development • The gene was positionally cloned by isolating a 693-bp DNA fragment that can rescue the phenotype of mutant animals • No protein was found, but a 61-nucleotide precursor RNA with a stem-loop structure, which is processed to a 22-mer ncRNA • Genetically, lin-4 acts as a negative regulator of lin-14 and lin-28 • The 3’ UTRs of the target genes have short stretches of complementarity to lin-4 • Deletion of these lin-4 target sequences causes an unregulated gain-of-function (gof) phenotype • Lin-4 RNA inhibits accumulation of LIN-14 and LIN-28 proteins although the target mRNA
  • 91. Let-7 (Pasquinelli et al. Nature 408:86-89, 2000) • Let-7 (lethal-7) was also mapped to a ncRNA gene with a 21-nucleotide product • The small let-7 RNA is also thought to be a post-transcriptional negative regulator of lin-41 and lin-42 • 100% conserved in all bilaterally symmetrical animals (not jellyfish and sponges) • Sometimes called stRNAs, small temporal RNAs
  • 93. Two computational analysis problems • Similarity search (eg BLAST), I give you a query, you find sequences in a database that look like the query (note: SW/Blat) – For RNA, you want to take the secondary structure of the query into account • Genefinding. Based solely on a priori knowledge of what a “gene” looks like, find genes in a genome sequence – For RNA, with no open reading frame and no codon bias, what do you look for ?
  • 109. Context-free grammars Basic CFG “production rules”: S -> aS; S -> Sa; S -> aSu; S -> SS. A CFG “derivation”: S -> aS; S -> aaS; S -> aaSS; S -> aagScuS; S -> aagaSucugSc; S -> aagaSaucuggScc; S -> aagacSgaucuggcgSccc; S -> aagacuSgaucuggcgSccc; S -> aagacuuSgaucuggcgaSccc; S -> aagacuucSgaucuggcgacSccc; S -> aagacuucgSgaucuggcgacaSccc; S -> aagacuucggaucuggcgacaccc. [Diagram: secondary structure of the derived sequence]
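The four production rules map directly onto the cases of a Nussinov-style dynamic program, which finds the parse with the most base pairs. This is offered as an illustration of the grammar, not as the method used in the lecture. The allowed pairs (Watson-Crick plus G-U wobble) and the `min_loop` constraint are assumptions added here.

```python
# ASSUMPTION: pairing rules and minimum hairpin loop size are not on the slide.
PAIRS = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g"), ("g", "u"), ("u", "g")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in an RNA string."""
    seq = seq.lower()
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]  # best[i][j] = max pairs within seq[i..j]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j], best[i][j - 1])       # S -> aS, S -> Sa
            if (seq[i], seq[j]) in PAIRS:
                score = max(score, best[i + 1][j - 1] + 1)    # S -> aSu
            for k in range(i + 1, j):
                score = max(score, best[i][k] + best[k + 1][j])  # S -> SS
            best[i][j] = score
    return best[0][n - 1]


if __name__ == "__main__":
    print(max_base_pairs("gggaaaccc"))
    print(max_base_pairs("aagacuucggaucuggcgacaccc"))
```

Stochastic context-free grammars extend this idea by attaching probabilities to the rules, in the same way that an HMM attaches probabilities to the arcs of a regular grammar.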
  • 112. The power of comparative analysis • Comparative genome analysis is an indispensable means of inferring whether a locus produces a ncRNA as opposed to encoding a protein. • For a small gene to be called a protein-coding gene, one excellent line of evidence is that the ORF is significantly conserved in another related species. • It is more difficult to positively corroborate a ncRNA by comparative analysis but, in at least some cases, a ncRNA might conserve an intramolecular secondary structure and comparative analysis can show compensatory base substitutions. • With comparative genome sequence data now accumulating in the public domain for most if not all important genetic systems, comparative analysis can (and should) become routine.
  • 113. Compensatory substitutions that maintain the structure [Diagram: RNA stem-loop with labelled 5’ and 3’ ends]
  • 114. Evolutionary conservation of RNA molecules can be revealed by identification of compensatory substitutions
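A compensatory substitution is a pair of changes at two base-paired positions that keeps the pair intact, for example G-C in one species and A-U in another. The sketch below counts such events for two aligned sequences, assuming the paired columns are already known. The function name, the input format and the toy sequences are all hypothetical.

```python
# ASSUMPTION: Watson-Crick and G-U wobble pairs count as "maintaining" a pair.
CAN_PAIR = {"AU", "UA", "GC", "CG", "GU", "UG"}


def compensatory_pairs(seq_a, seq_b, paired_positions):
    """Return the (i, j) column pairs where both bases differ between the two
    aligned sequences but a valid base pair is preserved in each."""
    hits = []
    for i, j in paired_positions:
        pair_a = seq_a[i] + seq_a[j]
        pair_b = seq_b[i] + seq_b[j]
        both_changed = seq_a[i] != seq_b[i] and seq_a[j] != seq_b[j]
        if both_changed and pair_a in CAN_PAIR and pair_b in CAN_PAIR:
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    stem = [(0, 9), (1, 8), (2, 7)]   # toy hairpin: columns 0-2 pair with 9-7
    species_1 = "GGCAAAAGCC"
    species_2 = "GACAAAAGUC"          # G-C at columns (1, 8) became A-U
    print(compensatory_pairs(species_1, species_2, stem))
```

Finding many such events in an alignment is evidence that selection is acting on the structure rather than on the primary sequence.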
  • 116. • Manual annotation of 60,770 full-length mouse complementary DNA sequences, clustered into 33,409 ‘transcriptional units’, contributing 90.1% of a newly established mouse transcriptome database. • Of these transcriptional units, 4,258 are new protein-coding and 11,665 are new non-coding messages, indicating that non-coding RNA is a major component of the transcriptome.
  • 119. Therapeutic Applications • Shooting millions of tiny RNA molecules into a mouse’s bloodstream can protect its liver from the ravages of hepatitis, a new study shows. In this case, they blunt the liver’s self-destructive inflammatory response, which can be triggered by agents such as the hepatitis B or C viruses. (Harvard University immunologists Judy Lieberman and Premlata Shankar) • In a series of experiments published online this week by Nature Medicine, Lieberman’s team gave mice injections of siRNAs designed to shut down a gene called Fas. When overactivated during an inflammatory response, it induces liver cells to self-destruct. The next day, the animals were given an antibody that sends Fas into hyperdrive. Control mice died of acute liver failure within a few days, but 82% of the siRNA-treated mice remained free of serious disease and survived. Between 80% and 90% of their liver cells had incorporated the siRNAs.