Outline
 What is Bioinformatics?
 Basic molecular biology
 Public databases
 Sequence analysis
 The scales of bioinformatics
 Biological data mining
What is Bioinformatics?
 Several definitions exist. Michael Liebman proposed a
quite elegant definition:
 “The study of the information content and information flow in
biological systems and processes” (Michael Liebman)
 Information content: genome project
 Information flow in biological systems: molecular transport
 Biological systems: cells, organisms, …
 Biological processes: metabolic networks
 Bioinformatics is the science of using information to
understand aspects of Biology. That is, a discipline where
techniques such as applied mathematics, computer science,
statistics, artificial intelligence, etc. are integrated to solve
biological problems
Information, information, information
 As we know there have been major advances in the
field of molecular biology
 These have been coupled with advances in
laboratory (post)genomic technology
 This has led to an explosive growth in the
collection of biological information
 This deluge of information has led to an absolute
requirement for
1. Computerized databases to store, organize and index the
data
2. Specialized tools to view and analyze the data
3. Specialized tools to infer new knowledge from the data
Areas of research (taxonomy of the
Bioinformatics journal)
 Genome Analysis
 Sequence Analysis
 Phylogenetics
 Structural Bioinformatics
 Gene Expression
 Genetics and Population Analysis
 Systems Biology
 Data and Text Mining
 Databases
 Bioimage Informatics
Life begins with the cell
 A cell is the smallest structural unit of an organism that is capable of
sustained independent functioning
 All cells have some common features
 What is Life? Can we create it in the lab? Read:
The imitation game—a computational chemical approach to
recognizing life. Nature Biotechnology, 24:1203-1206, 2006
2 types of cells:
Prokaryotes & Eukaryotes
Example of cell signaling
Terminology
 The genome is an organism’s complete set of DNA.
 a small bacterial genome contains about 600,000 DNA base pairs
 human and mouse genomes have some 3 billion.
 the human genome is organized into 23 pairs of chromosomes.
 Each chromosome contains many genes.
 Gene
 basic physical and functional units of heredity.
 specific sequences of DNA bases that encode
instructions on how and when to make proteins.
 Proteins
 Make up the cellular structure
 large, complex molecules made up of smaller subunits
called amino acids.
All Life depends on 3 critical molecules
 DNAs
 Hold information on how cell works
 RNAs
 Act to transfer short pieces of information to different parts of cell
 Provide templates to synthesize into protein
 Proteins
 Form enzymes that send signals to other cells and regulate gene
activity
 Form body’s major components (e.g. hair, skin, etc.)
 Are life’s laborers!
 Computationally, all three can be represented as
sequences of a certain 4-letter (DNA/RNA) or 20-letter
(Proteins) alphabet
DNA, RNA, and the Flow of Information
[Figure: the Central Dogma of Molecular Biology (Weismann barrier): replication, transcription and translation]
Overview of DNA to RNA to Protein
 A gene is expressed in two steps
1) Transcription: RNA synthesis
2) Translation: Protein synthesis
DNA: The Basis of Life
 Deoxyribonucleic Acid (DNA)
 Double stranded with complementary strands A-T, C-G
 DNA is a polymer
 Sugar-Phosphate-Base
 Bases held together by H bonding to the opposite strand
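The complementary pairing above (A-T, C-G) can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the function names are made up for this example, and it assumes an upper- or lower-case string containing only the four DNA letters.

```python
# Watson-Crick pairing from the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return complement(strand)[::-1]


print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```

Reversing the complement reflects the fact that the two strands run in opposite directions, which is why most tools report the reverse complement rather than the plain complement.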
RNA
 RNA is chemically similar to DNA. It is usually
only a single strand, and T(hymine) is replaced by
U(racil)
 Some forms of RNA can form secondary
structures by “pairing up” with themselves. This can
change their properties dramatically.
 DNA and RNA can pair with each other.
tRNA linear and 3D view:
http://www.cgl.ucsf.edu/home/glasfeld/tutorial/trna/trna.gif
RNA, continued
Several types exist, classified by function:
 hnRNA (heterogeneous nuclear RNA): Eukaryotic mRNA primary
transcripts with introns that have not yet been excised (pre-mRNA).
 mRNA: this is what a bioinformatician usually means by
“RNA”. It carries a gene’s
message out of the nucleus.
 tRNA (transfer RNA): matches each mRNA codon to its amino acid,
so that the mRNA message is built into a protein
 rRNA: ribosomal RNA. Part of the ribosome which is involved in
translation.
Transcription Transcription is highly regulated. Most DNA is in a
dense form where it cannot be transcribed.
 To start, transcription requires a promoter, a small
specific sequence of DNA to which polymerase can
bind (~40 base pairs “upstream” of gene)
 Finding these promoter regions is only a partially
solved problem that is related to motif finding.
 There can also be repressors and inhibitors acting in
various ways to stop transcription. This makes
regulation of gene transcription complex to
understand.
Definition of a Gene
 Regulatory regions: up to 50 kb upstream of +1 site
 Exons: protein coding and untranslated regions (UTR)
1 to 178 exons per gene (mean 8.8)
8 bp to 17 kb per exon (mean 145 bp)
 Introns: splice acceptor and donor sites, junk DNA
average 1 kb – 50 kb per intron
 Gene size: Largest – 2.4 Mb (Dystrophin). Mean – 27 kb.
Splicing
Splicing and other RNA processing
 In Eukaryotic cells, RNA is processed between
transcription and translation.
 This complicates the relationship between a DNA
gene and the protein it codes for.
 Sometimes alternative RNA processing leads to an
alternative protein (splice variants). This happens,
for example, in the immune system.
Proteins: Crucial molecules
for the functioning of life
• Structural proteins: the organism's basic building blocks, e.g. collagen,
nails, hair, etc.
• Enzymes: biological engines which mediate a multitude of biochemical
reactions. Usually enzymes are very specific and catalyze only a single type
of reaction, but they can play a role in more than one pathway.
• Transmembrane proteins: they are the cell’s housekeepers, e.g. by
regulating cell volume, extracting and concentrating small molecules from
the extracellular environment and generating ionic gradients essential for
muscle and nerve cell function (the sodium/potassium pump is an example)
• Proteins are polypeptide chains, built by joining amino acids
through peptide bonds in a linear way
• The chain of amino acids, however folds to create very complex 3D
structures
Translation
 The process of going
from RNA to
polypeptide.
 Three bases of
mRNA (called a codon)
correspond to one
amino acid based on
a fixed table.
 Always starts with
Methionine and ends
with a stop codon
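The two steps of gene expression can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, transcription is modelled as simply rewriting the coding strand with U instead of T, translation uses the standard genetic code, and reading starts at the first AUG found in the string.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code with codons enumerated in UCAG order; '*' is a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(coding_dna: str) -> str:
    """Simplified transcription: copy the coding strand, replacing T with U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG (Methionine) up to the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: release the polypeptide
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate(transcribe("ATGGCCTAA")))  # MA
```

Real mRNA is processed before translation (see the splicing slides), so this lookup only illustrates the fixed codon table, not the whole pathway.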
Amino Acids
Protein Structure: Introduction
 Different amino acids
have different properties
 These properties will
affect the protein
structure and function
 Hydrophobicity, for
instance, is the main
driving force (but not
the only one) of the
folding process
Protein Structure: Hierarchical nature of protein
structure
MKYNNHDKIRDFIIIEAYMFRFKKKVKPEVDMTIKEFILLTYLFHQQENTL
PFKKIVSDLCYKQSDLVQHIKVLVKHSYISKVRSKIDERNTYISISEEQRE
KIAERVTLFDQIIKQFNLADQSESQMIPKDSKEFLNLMMYTMYFKNIIKK
HLTLSFVEFTILAIITSQNKNIVLLKDLIETIHHKYPQTVRALNNLKKQGYL
IKERSTEDERKILIHMDDAQQDHAEQLLAQVNQLLADKDHLHLVFE
Primary Structure = Sequence of amino acids
Secondary structure: local interactions
Tertiary structure: global interactions
Protein Structure: Why is structure
important?
 The function of a protein depends greatly on its
structure
 The structure that a protein adopts is vital to its
chemistry
 Its structure determines which of its amino acids are
exposed to carry out the protein’s function
 Its structure also determines what substrates it can
react with
Protein Structure: Mostly lacking
information
 Therefore, it is clear that knowing the structure of a
protein is crucial for many tasks
 However, we only know the structure for a very small
fraction of all the proteins that we are aware of
 The UniProtKB/TrEMBL archive contains 23165610
(16886838) sequences
 The PDB archive of protein structure contains only
84223(76669) structures
 In the native state, proteins fold on their own as soon as
they are generated, amino acid by amino acid (with
few exceptions, e.g. chaperone-assisted folding) → can we predict this
process so as to close the gap between protein sequences
and their 3D structures?
Central Dogma of Biology: A Bioinformatics
Perspective
The information for making proteins is stored in DNA. There is
a process (transcription and translation) by which DNA is
converted to protein. By understanding this process and how it
is regulated we can make predictions and models of cells.
[Figure: computational problems along the central dogma: sequence analysis, gene finding, assembly, protein sequence/structure analysis]
Information flow in bioinformatics
 Data enters the “bioinformatics scope” when a scientist
deposits an experimental result in an appropriate archive
 The archive curates and annotates the data
 The data is released to the public
 Afterwards, the data may be retrieved/analysed:
 Integrating the new entry into a search engine
 Extracting useful subsets of the data
 Deriving new types of information from the data
 Aggregating the data, by homology, function, structure
 Reannotating the data with new discovered/inferred info.
 Quality of data depends on many factors: the techniques used
to experimentally create the data, the degree of inference and
prediction involved in the annotation process, etc.
 Many publicly available databases:
http://en.wikipedia.org/wiki/List_of_biological_databases
NCBI’s Entrez system
http://www.ncbi.nlm.nih.gov/
Entrez is a search and retrieval system that integrates
information from databases at NCBI (National Center for
Biotechnology Information).
Uniprot http://www.uniprot.org
 The Universal Protein Resource (UniProt) is a collaboration between
the European Bioinformatics Institute (EBI), the SIB Swiss Institute of
Bioinformatics and the Protein Information Resource (PIR)
KEGG - http://www.genome.jp/kegg/
 Not just about
genes/proteins but
also pathways, that is,
their interactions
DAVID - http://david.abcc.ncifcrf.gov/
Sequences
 Be it DNA, RNA or proteins, much of our data
can be represented as sequences over a certain alphabet
 Many generic algorithms to deal with biological
sequences exist
 Sequence alignment
 Motif representation
Sequence Alignment Is the assignment of residue-residue correspondences
between nucleotide/proteomic sequences
Query 1 MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRY 60
MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRY
Sbjct 1 MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRY 60
Query 61 YETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSV 120
YETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLL
Sbjct 61 YETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLL------------- 107
...
Query 301 QPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQTSSYSCMLPT 360
QPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQ + C P+
Sbjct 281 QPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQ-----DSFHLECQFPS 335
Query 361 S-PSVN 365
P VN
Sbjct 336 KFPGVN 341
[Figure annotations on the alignment above: gap, matches, mismatches]
Motivation
 Similarity is expected among biomolecules that are
descended from a common ancestor.
 Mutations cause differences, but survival of the organism requires that
mutations occur in regions that are less critical to function
 Important catalytic, regulatory or structural regions remain similar
 An alignment between two or more gene or protein
sequences represents an explicit hypothesis about their
evolutionary histories.
 Thus comparison of related gene/protein sequences has
been instrumental in shedding light on the information
content of these sequences and their biological functions.
Definition and aims
 Why align sequences?
1. Start with a query sequence with unknown properties
and search within a database of millions of sequences to
find those which share similarity with the query.
2. Start with a small set of sequences and identify
similarities and differences among them.
3. In many sequences or very long sequences, detect
commonly occurring patterns
Similarity vs. Homology
 Similarity is the observation or measurement of resemblance
and difference, independent of the source of resemblance.
 There are many examples of different organisms with
functionally similar organs that came from distinct evolutionary
origins
 When similarity is due to a common ancestry, we call it
homology.
 Sequence alignment helps to infer homology hypotheses:
 If two sequences are very similar, it is probable that there is a common
origin
 Therefore, if we know some information (structure, function) from
sequence X, and sequence X is similar to sequence Y, it is probable that
the same information applies to Y
Metrics of similarity: Definitions
 Gap: a break in the alignment, in either one of the
sequences.
 For nucleotides, a consequence of an insertion or deletion
mutation.
 For proteins, it’s more difficult to say.
 Regions of matching residues.
 Indicate parts of a sequence that are well conserved
 Mismatched residues.
 For nucleotides, a consequence of a substitution mutation
 Less conserved regions
Metrics of similarity: Distance scoring
 Distance scoring
 Given an alignment with matches, mismatches and
gaps, we compute a score following:
 For each mismatch, score is increased by 2
 For each gap, score is increased by 4
 For each match, no increase in score
 Higher score, less similarity
 Equivalent metrics exist for similarity (not
distance) where higher score means good
similarity
A - G C C G T A T
A C G A - - T - T
0 4 0 2 4 4 0 4 0   = 18
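The scoring rule above is easy to check in code. The sketch below assumes the two rows of an alignment are given as equal-length strings with `-` marking a gap; the function name and parameter names are illustrative only.

```python
def distance_score(row_a: str, row_b: str, mismatch: int = 2, gap: int = 4) -> int:
    """Distance score of a given alignment: higher means less similar."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap        # a gap in either sequence
        elif x != y:
            score += mismatch   # two different residues
        # a match adds nothing
    return score


# The example alignment from the slide.
print(distance_score("A-GCCGTAT", "ACGA--T-T"))  # 18
```

Changing the `mismatch` and `gap` arguments is the simplest way to see how the choice of penalties affects which alignments look good.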
Metrics of similarity: Mismatches and gaps
 Are all mismatches equally bad?
 For protein sequences, there are several subgroups of amino
acids with similar properties. Mismatches within a group have
less impact
 For nucleotide sequences, transition mutations (a↔g and t↔c)
are more common than transversions (a or g ↔ t or c) mutations
 Distance scoring of mismatches could be smarter → substitution
matrices
 Using statistical analysis on a large corpus of real sequences to generate better
scores
 How to penalize gaps
 Each gap slot gets equal distance score
 One score to open a gap, another (smaller) score to extend the
same gap
Global vs Local alignment
 We know how to score good or bad alignments
 How to find the optimal one?
 Two classes of alignment methods
 Global alignment
 Finds the best alignment of one entire sequence with another
entire sequence
 Local alignment
 Find the best alignment of one segment of a sequence against
another segment of another sequence
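One exact way to find the optimal global alignment is dynamic programming, usually credited to Needleman and Wunsch. The sketch below is a minimal version that minimizes the distance score introduced earlier (mismatch 2, gap 4) rather than maximizing a similarity score; the function name is hypothetical and no substitution matrix or affine gap penalty is used.

```python
def global_alignment(a: str, b: str, mismatch: int = 2, gap: int = 4):
    """Return (distance, aligned_a, aligned_b) for an optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = best distance for aligning a[:i] with b[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * gap
    for j in range(1, cols):
        d[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            d[i][j] = min(diagonal, d[i - 1][j] + gap, d[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        step = 0 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + step:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return d[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


score, row_a, row_b = global_alignment("ACGT", "AGT")
print(score)   # 4
print(row_a)   # ACGT
print(row_b)   # A-GT
```

The table has (len(a)+1) x (len(b)+1) cells, which is why this approach is fine for a handful of sequences but too slow for scanning a database of millions. A local alignment variant (Smith-Waterman) changes the initialization and lets scores reset, and is not shown here.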
Exact vs. Approximate methods
 Exact methods for both global and local alignment exist, based
on dynamic programming, but are slow
 Good enough when there are few sequences
 Not so good when comparing a target sequence to a database of
millions of known sequences
 Approximate methods have been used for many years for large-
scale alignment tasks
 They use some kind of heuristic to speed up the alignment process
 BLAST (Basic Local Alignment Search Tool) is the most famous
approximate method
 It identifies potential hits by looking for perfect matches of very small sub-sequences
(seeds)
 It only tries to create a full alignment for sequences where several seeds are identified
 PSI-BLAST: an iterative version that exploits the multiple hits identified. It
constructs a tailored (position-specific) scoring matrix from the hits and then refines the search with it
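The seed idea can be illustrated with a toy filter. This is only a sketch of the first stage described above, not BLAST itself: real BLAST also uses neighbourhood words, seed extension and significance statistics. The function names, the word size `k` and the `min_seeds` threshold are assumptions chosen for the example.

```python
from collections import defaultdict


def build_kmer_index(database: dict, k: int) -> dict:
    """Map every length-k word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def seed_hits(query: str, index: dict, k: int) -> dict:
    """Collect exact word matches (seeds) between the query and each database sequence."""
    hits = defaultdict(list)
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            hits[name].append((qpos, dpos))
    return hits


def candidates(query: str, database: dict, k: int = 3, min_seeds: int = 2) -> list:
    """Names of sequences with enough seeds to deserve a full alignment."""
    hits = seed_hits(query, build_kmer_index(database, k), k)
    return sorted(name for name, found in hits.items() if len(found) >= min_seeds)


db = {"s1": "ACGTACGT", "s2": "TTTTTTTT", "s3": "GGACGTGG"}
print(candidates("ACGTA", db))               # ['s1', 's3']
print(candidates("ACGTA", db, min_seeds=3))  # ['s1']
```

Only the sequences returned by `candidates` would then be passed to a slower, exact alignment routine, which is where the speed-up comes from.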
Multiple Sequence Alignment
 When we have to align more than two sequences
 Progressive methods (e.g. ClustalW)
 Start with seed alignment
 Iteratively incorporate other alignments to seed, without
modifying what is aligned so far
 ClustalW uses phylogenetic trees (representations of the
evolutionary relationship between sequences) to
progressively construct MSA
 Iterative methods (e.g. MUSCLE)
 Can re-edit the partial MSA based on the newly
incorporated alignments
[Figure: ClustalW interface in UniProt]
Motifs
 When visualising a MSA we can see regions of high
agreement and regions of low agreement.
 The high agreement regions define that a certain
protein belongs to a family
 What if we concentrate on modelling and identifying
these regions instead of the whole sequences? → Motif
finding
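One simple way to locate regions of high agreement is to compute, for every column of an MSA, the frequency of each residue and keep the columns dominated by a single residue. The sketch below assumes the MSA is a list of equal-length strings with `-` for gaps; the function names and the 0.8 threshold are illustrative choices, not part of any specific motif-finding tool.

```python
from collections import Counter


def column_profile(msa: list) -> list:
    """Per-column residue frequencies for a list of equal-length aligned sequences."""
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*msa):
        counts = Counter(column)
        profile.append({res: n / len(msa) for res, n in counts.items()})
    return profile


def conserved_columns(msa: list, threshold: float = 0.8) -> list:
    """Indices of columns whose most frequent non-gap residue reaches the threshold."""
    result = []
    for i, freqs in enumerate(column_profile(msa)):
        best = max((f for res, f in freqs.items() if res != "-"), default=0.0)
        if best >= threshold:
            result.append(i)
    return result


msa = ["ACDEF", "ACDQF", "AC-EF", "GCDEF"]
print(conserved_columns(msa))  # [1, 4]
```

The per-column frequency table is also the starting point for profile-style motif representations, where each position stores a distribution over residues instead of a single letter.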
DNA -> RNA -> Proteins
DNA
 Coding/non coding
 SNPs
 Copy number variation
 Assembly
 Methylation
 Primer design
Coding/Non Coding
 Identifying the regions from an organism’s genome
that contain genes
 Many different factors involved in this identification
 Promoter identification
 Long enough Open Reading Frames (ORF)
 Splice variants
 Introns/Exons (in Eukaryotes)
 Statistical properties of gene-coding DNA
 Hidden Markov Models (HMMs) are also used for gene finding
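Of the factors listed above, the search for sufficiently long Open Reading Frames is the easiest to sketch. The code below scans the three forward reading frames for an ATG followed in frame by a stop codon. It is a deliberately naive illustration: the function name and the `min_codons` cut-off are assumptions, the reverse strand is ignored, and introns are not handled.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list:
    """Return (start, end, frame) for forward-strand ORFs; end is exclusive and includes the stop."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return orfs


dna = "CCATGAAACCCGGGTAACC"
print(find_orfs(dna))  # [(2, 17, 2)]
```

Practical gene finders combine this kind of signal with promoter detection, splice site models and the statistical properties of coding DNA mentioned above.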
Single Nucleotide Polymorphisms
(SNPs)
 One base-pair variation in DNA
 In most cases in non-coding regions of DNA, but not
always
 When frequent enough in a population they can be
linked to specific traits, e.g. a disease
 SNP microarrays can be used to probe hundreds of
thousands of SNPs in parallel
 In reality few SNPs act on their own
 Genome-Wide Association Studies identify groups of
SNPs linked to a certain condition
Copy Number Variation
 In general two copies of each gene exist in a genome
 It may be the case that more or fewer than two copies
of a certain gene exist in a specific sub-population
 It has been suggested that certain CNVs can be linked
to specific diseases
Genome assembly
 Sequencing technologies are able to read (sequence) a
complete genome as a series of short overlapping
fragments
 How to assemble back all these fragments?
 Greedy approach
 Pair-wise alignments of all fragments
 Merge fragments of largest overlap
 Keep iterating until all segments are merged
 Worked more or less well on old sequencing technologies,
not so well on next-generation sequencing data, due to
smaller fragment sizes and larger error rate
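The greedy approach above can be sketched as follows. This toy version assumes error-free fragments from a single strand and uses exact suffix-prefix overlaps instead of pair-wise alignments; the function names and the `min_len` parameter are made up for the example.

```python
def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list, min_len: int = 1) -> list:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing overlaps any more: several contigs remain
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)] + [merged]
    return frags


reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
print(greedy_assemble(reads))  # ['ATTAGACCTGCCGGAATAC']
```

With short, error-prone reads the largest exact overlap is often misleading (for instance inside repeats), which is one reason this strategy degrades on next-generation sequencing data.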
Genome mapping
 Given a large set of short fragments, as a result of next-
generation sequencing, map them to a reference
genome
 This differs from genome assembly. We do not want to
reconstitute a complete genome, just identify to which
genes each fragment belongs (among other
applications).
 Speed is an issue
 Modern methods (e.g. SOAP2) compress the genome
and are able to align the fragments in the compressed
space
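A widely used index for this kind of search is the Burrows-Wheeler transform (BWT), which supports exact matching directly on the transformed text. The sketch below shows the idea on a tiny reference. It is an illustration under assumptions, not a description of any particular mapper's internals: the function names are hypothetical, the index is built naively by sorting suffixes, occurrence counts are recomputed on the fly, nothing is actually compressed, and only exact matches are found.

```python
def bwt_index(reference: str):
    """Build a naive suffix array, the BWT string and the first-column offsets."""
    text = reference + "$"  # '$' terminates the text and sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    first, total = {}, 0    # first[c] = number of characters smaller than c
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)
    return suffix_array, bwt, first


def find_read(read: str, index) -> list:
    """Return the sorted start positions of exact occurrences of read (backward search)."""
    suffix_array, bwt, first = index
    lo, hi = 0, len(bwt)    # half-open range of matching suffix-array rows
    for c in reversed(read):
        if c not in first:
            return []
        lo = first[c] + bwt[:lo].count(c)
        hi = first[c] + bwt[:hi].count(c)
        if lo >= hi:
            return []
    return sorted(suffix_array[k] for k in range(lo, hi))


index = bwt_index("ACGTACGTTAGC")
print(find_read("ACGT", index))  # [0, 4]
print(find_read("TAG", index))   # [8]
```

Production tools replace the naive counting with precomputed, sampled tables and allow a few mismatches, which is what makes mapping millions of reads feasible.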
Methylation
 It is a chemical reaction that can block a certain region
of a chromosome, preventing its transcription
 The process can be reversed, so essentially it is an
on/off switch for the affected gene
 Specialised microarrays exist for the high-throughput
detection of methylated genes
 Afterwards, data analysis can take place
DNA library specification
• A DNA library is a combinatorial set of DNA sequences suited to
manufacture via DNA reuse
• The first stage towards the creation of a DNA library is the formal
specification of the target DNA molecules that comprise it
• A set of sequences does not convey the intention behind the library
Key challenge: to enable precise editing of DNA sequences in an
extensible and reproducible manner whilst avoiding manual
handling of these unwieldy objects
DNALD library format
 A DNALD library consists of three sets of definitions:
inputs, intermediates and outputs, with different
semantics
 Inputs: existing DNA sequences to be provided with the design
 Intermediates: conceptual means of factoring out common sequences
 Outputs: sequences to be produced through DNA reuse
DNALD expressions
 A DNALD expression is a combination of explicit sequences,
definition names, operators and functions that are interpreted
according to rules of precedence and association ("evaluated") to
produce a set of DNA sequences.
 Definitions bind names to the results of expressions.
Workbench interface
• text editor with syntax highlighting, auto-completion, code folding, etc.
• manage projects
• viewed from different perspectives
CADMAD’s DNALD (DNA Library
Design)
A specification language that
produces a set of target DNA
sequences as a function of
operations on a set of inputs
To maximise CADMAD's impact the specification process must be:
 user friendly and debuggable
 but expressively powerful enough to:
 define non-trivial combinatorial constructs
 communicate degrees of freedom
>Ret_human
GGCCTCTACTTCTCGAGGGATGCTTACTGGGAGAAGCTGTATGTGGACCAGGCGGCCGGCA
CGCCCTTGCTGTACGTCCATGCCCTGCGGGACGCCCCTGAGGAGGTGCCCAGCTTCCGCCT
GGGCCAGCATCTCTACGGCACGTACCGCACACGGCTGCATGAGAACAACTGGATCTGCATC
CAGGAGGACACCGGCCTCCTCTACCTTAACCGGAGCCTGGACCATAGCTCCTGGGAGAAGC
TCAGTGTCCGCAACCGCGGCTTTCCCCTGCTCACCGTCTACCTCAAGGTCTTCCTGTCACC
CACATCCCTTCGTGAGGGCGAGTGCCAGTGGCCAGGCTGTGCCCGCGTATACTTCTCCTTC
TTCAACACCTCCTTTCCAGCCTGCAGCTCCCTCAAGCCCCGGGAGCTCTGCTTCCCAGAGA
CAAGGCCCTCCTTCCGCATTCGGGAGAACCGACCCCCAGGCACCTTCCACCAGTTCCGCCT
GCTGCCTGTGCAGTTCTTGTGCCCCAACATCAGCGTGGCCTACAGGCTCCTGGAGGGTGAG
GGTCTGCCCTTCCGCTGCGCCCCGGACAGCCTGGAGGTGAGCACGCGCTGGGCCCTGGACC
GCGAGCAGCGGGAGAAGTACGAGCTGGTGGCCGTGTGCACCGTGCACGCCGGCGCGCGCGA
GGAGGTGGTGATGGTGCCCTTCCCGGTGACCGTGTACGACGAGGACGACTCGGCGCCCACC
TTCCCCGCGGGCGTCGACACCGCCAGCGCCGTGGTGGAGTTC
>Ret_mouse
GGCCTCTATTTCTCAAGGGATGCTTACTGGGAGAGGCTGTATGTAGACCAGCCAGCTGGCA
CACCTCTGCTCTATGTCCATGCCCTACGGGATGCCCCTGGAGAAGTGCCGAGCTTCCGCCT
GGGCCAGCATCTCTATGGCGTCTACCGTACACGGCTGCATGAGAATGACTGGATCCGCATC
AATGAGACTACTGGCCTTCTCTACCTCAATCAGAGCCTGGACCACAGTTCCTGGGAACAGC
TCAGCATCCGCAATGGTGGTTTCCCCCTGCTCACCATCTTCCTCCAGGTCTTTCTGGTGGA
AAACTGCCAGGAGTTCAGCGGTGTCTCCATCCAGTACAAGCTGCAGCCTTCCAGCATCAAC
TGCACTGCCCTAGGTGTGGTCACCTCACCCGAGGACACCTCGGGGACCCTATTTGTAAATG
ACACAGAGGCCCTGCGGCGACCTGAGTGCACCAAGCTTCAGTACACGGTGGTAGCCACTGA
CCGGCAGACCCGCAGACAGACCCAGGCTTCGCTAGTGGTCACTGTGGAGGGGACATCCATT
ACTGAAGAAGTAGGCT
>Ret_zebrafish
GGGCTGTATTTTCCTCAAAGGCTTTACACAGAGAACATCTACGTGGGTCAGCAGCAGGGAT
CACCGTTGCTTCAGGTCATTTCAATGCGGGAATTCCCTACAGAGAGGCCTTATTTCTTCCT
GTGCTCGCACAGAGACGCTTTTACATCATGGTTTCACATAGATGAGGCGTCCGGAGTTCTT
TATCTCAACAAAACCCTGGAGTGGAGCGACTTCAGTAGTTTACGCAGCGGCTCAGTTCGCT
CCCCGAAGGATCTCTGACCTATCAGTTAGAGATTGTCGACAGGAACATCACTGCTGAAGCT
CAGTCCTGTTACTGGGCGGTTAGTCTTGCACAAAACCCGAATGATAATACAGGCGTTCTCT
ATGTGAACGACACCAAAGTGTTACGCAGACCAGAGTGCCAAGAGCTGGAGTATGTGGTCAT
TGCCCAGGAGCAGCAGAACAAGCTTCAGGCCAAGACACAGCTCACCGTCAGTTTTCAAGGC
GAAGCAGATTCACTGAAAACGGATG
>Ret_chicken
GGTCTGTACTTCCCCAGAAAGGAGTACTCAGAGAACGTCTACATTGACCAGCCAGCAGGTG
CGCCGCTCCTACGCATCCACGCCTTGAGGGATTCACATGGGAAACAGCCCACTTTCATCTG
TGCCAGAAGTCTCATCATTTCTCGAGCAAGATCCCATGAAAATCACTGGTTTCAAATCAGA
GAAAAAATGGGACTTCTCTACCTCAGCAAGAGCCTAGATAGAGAAGACTTTAACATGCTGT
CTGTAGGAAACTGGATGCCATTATCAAAGGTGATGCTGTATGTCTTCCTCTCATCTCACCC
TTTCCAAGAGAAGGAATGTGACTCTGCTACTCGTACCACAGTCGTCCTCTCTTTGATCAAT
GCTACTGCACCAGCTTGCAGTTCACTGTCAGCAAGGCAGCTTTGCTTCACAGAAATGGATC
TCTCCTTTCACATCAAGGAGAATAAACCCCCTGGTACATTTCATCAGCTCCAGTTACCCTC
AGTTCATCATCTGTGTCAGAATCTCAGCATTACCTACAAACTGTTGGCAGCCGAAGGCCTG
CCTTTTCGGTACAATGAGAACACCACTGGTGTGAGTGTAACACAGCGCCTAGATCGAGAGG
AGAGAGAGAGATATGAGCTGATCGCCAAATGCACCGTGAGAGAAGGCTTCAGGGAAATGGA
GGTTGAGGTGCCCTTCCTCGTCAACGTGTTAGATGAAGATGACTCTCCTCCCTTCCTTCCC
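The sequences above follow the layout of the FASTA format: a header line starting with `>` followed by one or more lines of sequence. A minimal reader could look like the sketch below; the function name is hypothetical and the whole header text is used as the record name.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()       # header line opens a new record
            records[name] = []
        elif name is not None:
            records[name].append(line)    # sequence lines are concatenated
    return {header: "".join(parts) for header, parts in records.items()}


example = ">Ret_a\nGGCCTC\nTACTTC\n\n>Ret_b\nGGGCTG\n"
for header, seq in parse_fasta(example).items():
    print(header, len(seq))
```

The example uses short made-up records (`Ret_a`, `Ret_b`); the same function would apply to the four Ret sequences listed above once each header sits on its own line.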
RNA
 Expression
 Structure prediction
RNA expression
 Not all genes are transcribed/translated into proteins all
the time
 The expression of genes is highly sophisticated and
depends on many factors
 Identifying the genes being expressed in a given point of
time in a specific tissue provides crucial information about
the roles and interactions of such genes
 Compare the genes expressed between different groups of
samples to identify those that are differentially expressed
 Identify co-expressed genes, that present patterns of
correlation
Measuring RNA expression
 RT-PCR (Real-time reverse transcription polymerase chain reaction)
 Measures accurately the expression of a pre-determined
gene
 RNA Microarrays
 Measures, in parallel, the expression of tens of
thousands of genes
 RNA-Seq
 The next-generation sequencing variant for measuring
gene expression
RNA Structure prediction
 An RNA sequence can bind with itself to create complex
shapes with a certain pattern of loops
 Can we predict, from a given sequence, the structural
shape of the RNA?
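A classic simplified answer to this question is the Nussinov dynamic programming algorithm, which finds the maximum number of nested base pairs a sequence can form. It is not mentioned in these slides and ignores energies entirely, so the sketch below is only a first approximation. Assumptions: the function name is hypothetical, G-U wobble pairs are allowed alongside A-U and G-C, and a hairpin loop must contain at least `min_loop` unpaired bases.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs (Nussinov-style recursion)."""
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within rna[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                 # case 1: base j stays unpaired
            for k in range(i, j - min_loop):       # case 2: base j pairs with base k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCC"))  # 3
print(max_base_pairs("GAAAC"))      # 1
```

Energy-based methods refine this idea by scoring stacks and loops instead of simply counting pairs.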
Proteins
 Protein classification
 Structure prediction
 Structure comparison
 Function and interaction
Protein classification
 Proteins can be annotated in many different ways
 Function
 DNA-binding? Enzyme?
 Tissue/Cellular/Sub-cellular localisation
 Interacting with other proteins?
 Can we predict this annotation using ML?
 We need to transform the protein sequence into a uniform
representation of equal size for all proteins
 Many different representations exist
 Several of these problems can be modelled as a hierarchical
classification problem
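One of the simplest fixed-size representations is the amino acid composition: a vector of 20 numbers giving the fraction of each amino acid in the sequence, whatever its length. The sketch below is just one example of such an encoding; the function name is hypothetical and characters outside the 20 standard letters are ignored.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, alphabetical


def composition_vector(protein: str) -> list:
    """Fraction of each standard amino acid: always 20 values, for any sequence length."""
    protein = protein.upper()
    counts = [protein.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


short = composition_vector("MKYNNHDKIRDF")
longer = composition_vector("MKYNNHDKIRDFIIIEAYMFRFKKKVKPEVDMTIKEFILLTYLFHQQENTL")
print(len(short), len(longer))  # 20 20
```

Because both vectors have the same length, they can be fed to a standard classifier. The price is that the order of residues is lost, which is why richer representations are also used.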
Protein Structure Prediction PSP aims to predict the 3D structure of a protein
based on its primary sequence
Protein Structure Prediction
 PSP is an open problem. The 3D structure
depends on many variables
 It has been one of the main holy grails of
computational biology for many decades
 The applications of better protein structure models
are countless
 Genetic therapy
 Synthesis of drugs for incurable diseases
 Improved crops
 Environmental remediation
Prediction types of PSP There are several kinds of prediction problems within
the scope of PSP
 The main one, of course, is to predict the 3D coordinates of
all atoms of a protein (or at least the backbone) based on
its primary sequence
 There are many structural properties of individual residues
within a protein that can be predicted, for instance:
 The secondary structure state of the residue
 If a residue is buried in the core of the protein or exposed in the
surface
 Accurate predictions of these sub-problems can simplify the
general 3D PSP problem
3D Protein Structure Prediction
 Some PSP methods try to find similar proteins and then
adapt the structure of the homolog (template) to the
target protein → Homology Modeling
 Other methods try to find the structure of the protein
from scratch (Ab Initio Modelling), optimizing some
energy function that models the stability of the protein,
in case no homolog can be identified
 In between there are other kinds of methods, for varying
degrees of homology to our target, for instance,
Fold Recognition or Threading
• These methods identify a template based on more than
homology (i.e. sequence alignment).
Coordination Number PredictionTwo residues of a chain are said to be in contact if their
distance is less than a certain threshold (e.g. 8Å)
CN of a residue : count of contacts that a certain
residue has
CN gives us a simplified profile of the density of packing
of the protein
ContactPrimary
Sequence
Native State
Contact Map prediction Prediction, given two residues
from a chain, whether these two
residues are in contact or not
 This problem can be represented
by a binary matrix. 1= contact, 0 =
non contact
 Plotting this matrix reveals many
characteristics from the protein
structure
 Very sparse: fewer
than 2% of residue pairs are in contact in native
structures
[Figure: contact map showing helices and sheets]
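When the 3D structure is known, both the contact map and the coordination number (CN) follow directly from the definitions above; this is how the training targets for these prediction problems can be derived. The sketch assumes one 3D coordinate per residue (which atom represents the residue is a modelling choice not fixed here) and uses the 8Å threshold from the text. The function names and the `min_separation` parameter are hypothetical.

```python
import math


def contact_map(coords: list, threshold: float = 8.0, min_separation: int = 1) -> list:
    """Binary matrix: 1 if two residues are closer than threshold (in Å), else 0."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        # min_separation lets callers skip trivially close chain neighbours
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap


def coordination_numbers(cmap: list) -> list:
    """CN of each residue: the number of contacts in its row of the contact map."""
    return [sum(row) for row in cmap]


# Four residues on a straight line, 3.8 Å apart (made-up coordinates).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cmap = contact_map(coords)
print(coordination_numbers(cmap))  # [2, 3, 3, 2]
```

Raising `min_separation` removes the near-diagonal contacts between chain neighbours, which carry little structural information.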
Other predictions Other kinds of residue
structural aspects that can be
predicted
 Solvent accessibility: Amount of
surface of each residue that is
exposed to solvent
 Recursive Convex Hull: A metric
that models a protein as an
onion, and assigns each residue
to a layer. Formally, each layer is
a convex hull of points
 These features (and
others) are predicted in a
similar way as for secondary structure (SS)
Protein Structure Comparison
 Protein Structure Comparison (PSC) aims to
 Assess the degree of similarity between protein structures
 Given a query structure, identify other proteins with similar
structure
 Why?
 Group proteins by structural similarities
 Determine the impact of individual residues on the protein
structure
 Identify distant homologues of protein families
 Predict function of proteins with low degree of primary structure
(i.e. sequence) similarity with other proteins
 Engineer new proteins for specific functions
 Assess ab-initio predictions
Protein Structure Comparison
 Sequence-Structure-Function relationships
1) Conserved 1º sequences similar structures
2) Similar structures conserved 1º sequences
3) Similar structures conserved function
 PSC shares many similarities with sequence alignment.
Our aim is to infer new knowledge from the
comparison process
?
Protein Structure Comparison Existing Approaches
 SSAP (Orengo & Taylor, 96)
 ProSup (Feng & Sippl, 96)
 DALI (Holm & Sander, 93)
 CE (Shindyalov & Bourne, 98)
 LGA (Zemla, 2003)
 SCOP (Murzin, Brenner, Hubbard & Chothia, 95)
 CATH (Orengo, Michie, Jones, Jones, Swindells &
Thornton, 97)
 ProCKSI – Consensus of multiple PSC methods
Prediction of Protein Function
 In an ideal world, the cascade of inference should flow
from sequence → structure → function
 That is, if we can identify sequences or structures similar
to our query target we can (at varying degrees of
certainty) infer that they have similar function
Prediction of Protein Function
 As proteins evolve, they may
 Retain function and specificity
 Retain function but alter specificity
 Change to a related function, or a similar function in a
different metabolic context
 Change to a completely unrelated function
 How much must a protein change before the
function changes?
 Sometimes, not at all. There are many cases of
proteins with different functions in different
environments
Prediction of Protein Function
 Thus, sequence or structure similarity is not always
reliable to assign function
 Other ways of determining protein function
 By identifying patterns of co-regulated genes
 Using data from Microarray experiments
 By identifying protein-protein interactions
Prediction of Protein Function
 A related question is: where is the function of a protein
taking place? → active site
 Several methods exist to predict active/binding sites of
proteins from local patterns of sequence or structure
 A crude way of doing this prediction is to look at the
conserved residues of a sequence → they may be related
to either the core of the protein (structural stability) or
the function of the protein (a change of function is a risk for
survival)
 More sophisticated methods exist to learn how to
predict active sites. They use ML, in a similar way to that used to
predict residue structural features in PSP
 Still, it is a very tough problem, and ML methods are not
much better than BLAST-based methods
Three case studies
 Mining –omics data
 Predicting structural aspects of protein residues
 Automated alphabet reduction for protein datasets
 In all these three case studies we use the same
evolutionary learning system: BioHEL [Bacardit et al.,
09]
BioHEL BioHEL [Bacardit et al., 09] is an evolutionary
learning system that applies the Iterative Rule
Learning (IRL) approach
 Designed explicitly to deal with noisy large-scale
datasets
 IRL was first used in EC by the SIA system
[Venturini, 93]
BioHEL’s learning paradigm
 IRL has been used for many years in the ML community,
under the name of separate-and-conquer
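The separate-and-conquer loop itself is short: learn one rule, remove the examples it covers, and repeat on what is left. The sketch below shows only that outer loop. It is not BioHEL: BioHEL evolves each rule with a genetic algorithm, whereas `learn_one_rule` here is a greedy stand-in that picks the best single test of the form `attribute == value`, and the stopping criterion is simplified. All function names are hypothetical.

```python
from collections import Counter


def learn_one_rule(examples: list) -> tuple:
    """Stand-in rule learner: best single test `attribute == value -> class`."""
    best_rule, best_key = None, None
    for attr in range(len(examples[0][0])):
        for value in sorted({x[attr] for x, _ in examples}):
            labels = Counter(y for x, y in examples if x[attr] == value)
            cls, hits = labels.most_common(1)[0]
            covered = sum(labels.values())
            key = (hits / covered, covered)  # accuracy first, then coverage
            if best_key is None or key > best_key:
                best_rule, best_key = (attr, value, cls), key
    return best_rule


def separate_and_conquer(examples: list, max_rules: int = 10) -> tuple:
    """Learn an ordered rule list plus a default class."""
    if not examples:
        raise ValueError("at least one training example is required")
    remaining, rules = list(examples), []
    while remaining and len(rules) < max_rules:
        attr, value, cls = learn_one_rule(remaining)
        covered = [ex for ex in remaining if ex[0][attr] == value]
        if len(covered) == len(remaining):
            break  # the rule separates nothing: leave the rest to the default
        rules.append((attr, value, cls))                                   # conquer
        remaining = [ex for ex in remaining if ex[0][attr] != value]       # separate
    pool = remaining or list(examples)
    default = Counter(y for _, y in pool).most_common(1)[0][0]
    return rules, default


def predict(rules: list, default, x):
    """Apply the first matching rule, falling back to the default class."""
    for attr, value, cls in rules:
        if x[attr] == value:
            return cls
    return default


data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rain", "mild"), "yes"), (("rain", "hot"), "yes"),
        (("overcast", "hot"), "yes")]
rules, default = separate_and_conquer(data)
print(rules, default)
```

Because rules are tried in the order they were learnt, the result is a decision list: later rules only need to be correct on examples not matched by earlier ones.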
BioHEL’s objective function An objective function based on the Minimum-
Description-Length (MDL) (Rissanen,1978) principle
that tries to promote rules with
 High accuracy: not making mistakes
 High coverage: covering as much examples as possible
without sacrificing accuracy. Recall (TP/(TP+FN)) will be
used to define coverage
 Low complexity: rules as simple and general as possible
 The objective function is a linear combination of the three
objectives above
BioHEL’s objective function
 Intuitively, we would like to have accurate rules covering
as many examples as possible.
 However, in complex and inconsistent domains it is rare
to obtain such rules
 In these cases, the easier path for the evolutionary search is to
maximize accuracy at the expense of coverage
 Therefore, we need to enforce that the evolved rules cover
enough examples
BioHEL’s objective function
 Three parameters define the shape of the function
 The choice of the coverage break is crucial for the proper performance of
the system
 Also, coverage term penalizes rules that do not cover a minimum
percentage of examples or that cover too many
BioHEL’s characteristics
 Attribute list rule representation
 Automatically identifying the relevant attributes for a given rule and
discarding all the other ones
 The ILAS windowing scheme
 Efficiency enhancement method: not all training points are used for
each fitness computation
 An explicit default rule mechanism
 Generating more compact rule sets
 Iterative process terminates when it is impossible to evolve a rule
where the associated class is the majority class among the matched
examples
 At this point, all remaining training instances are assigned to the default
class
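The iterative (separate-and-conquer) loop with a default rule can be sketched as follows. This is not BioHEL: the evolutionary search is replaced by a toy exhaustive search over single-threshold rules (`learn_rule`, a hypothetical stand-in), so that only the outer loop and the stopping condition are illustrated.

```python
from collections import Counter


def learn_rule(examples, default_class):
    """Toy stand-in for the evolutionary search: best rule 'attr > threshold'.

    Returns (attr, threshold, label), or None when no rule exists whose
    class is a non-default strict majority among the examples it matches.
    """
    best_key, best_rule = None, None
    for attr in range(len(examples[0][0])):
        for x, _ in examples:
            threshold = x[attr]
            matched = [y for xx, y in examples if xx[attr] > threshold]
            if not matched:
                continue
            label, hits = Counter(matched).most_common(1)[0]
            if label == default_class or 2 * hits <= len(matched):
                continue
            key = (hits / len(matched), len(matched))  # accuracy, then coverage
            if best_key is None or key > best_key:
                best_key, best_rule = key, (attr, threshold, label)
    return best_rule


def iterative_rule_learning(examples, default_class):
    """Learn one rule, remove the examples it covers, repeat."""
    rules, remaining = [], list(examples)
    while remaining:
        rule = learn_rule(remaining, default_class)
        if rule is None:          # nothing left to learn: default rule takes over
            break
        rules.append(rule)
        attr, threshold, _ = rule
        remaining = [(x, y) for x, y in remaining if not x[attr] > threshold]
    return rules


def predict(rules, default_class, x):
    """Rules are tried in order; unmatched instances get the default class."""
    for attr, threshold, label in rules:
        if x[attr] > threshold:
            return label
    return default_class


data = [((1.0,), "a"), ((2.0,), "a"), ((8.0,), "b"), ((9.0,), "b")]
rules = iterative_rule_learning(data, default_class="a")
```

Because class "a" is handled by the default rule, the learned rule set only needs rules for class "b", which is what makes default-rule sets more compact.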
Mining –omics data
 Biological data can be generated at many different
levels
 Genomics (DNA)
 Transcriptomics (RNA)
 Proteomics (proteins)
 Metabolomics (small compounds)
 Lipidomics (lipids)
 Hundreds of –omics have been catalogued
What does an –omics dataset look like?
 In most cases datasets present a similar structure
 Each sample is characterised by a large number of
variables (RNA, proteins, lipids, etc.)
 Each variable indicates (usually quantitatively) the
presence of that element in the sample
 Due to the high cost of most –omics technologies,
variables >> samples
 Problems of over-fitting
What can we do with the dataset?
 In most cases, samples are annotated with a
qualitative label
 Cancer/Non-cancer patients
 Samples of seed tissue for which it is known if the seed
germinated or not
 Age of the sample
 Therefore, we can treat these datasets as
classification problems, and generate prediction
models from the data
 Not just as classification problems
 Clustering/Biclustering
 Association Rule Mining
 Regression
But in most cases, domain experts
are not (only) interested in
predictions
 Biomarker identification
 Identify the key variables
 Most strongly associated to each outcome
 Using e.g. t-tests to identify those
 Presenting higher prediction capacity
 As identified by ML methods
 Identify interactions between variables
 By presenting very high (anti)correlation between them
 By acting together to generate predictions
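As a minimal sketch of the t-test route to biomarker identification, the code below ranks variables by the absolute value of Welch's t statistic between two groups of samples. The gene names and values are invented for illustration, and no p-values are computed.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent groups (unequal variances)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        return 0.0 if mean(a) == mean(b) else math.inf
    return (mean(a) - mean(b)) / se


def rank_variables(samples, labels, positive):
    """Order variables from most to least separated between the two groups.

    samples: list of dicts mapping variable name -> measured value
    labels:  one class label per sample
    """
    scores = {}
    for var in samples[0]:
        pos = [s[var] for s, y in zip(samples, labels) if y == positive]
        neg = [s[var] for s, y in zip(samples, labels) if y != positive]
        scores[var] = abs(welch_t(pos, neg))
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: geneA separates the classes, geneB does not.
samples = [
    {"geneA": 10.0, "geneB": 5.0},
    {"geneA": 11.0, "geneB": 7.0},
    {"geneA": 1.0, "geneB": 6.0},
    {"geneA": 2.0, "geneB": 5.0},
]
labels = ["germinated", "germinated", "dormant", "dormant"]
ranking = rank_variables(samples, labels, positive="germinated")
```

With variables >> samples, many variables will look significant by chance, so a ranking like this normally needs a multiple-testing correction before any variable is called a biomarker.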
Functional Network Reconstruction
for seed germination
 Microarray data obtained from seed tissue of Arabidopsis
thaliana
 122 samples represented by the expression level of
almost 14000 genes
 It had been experimentally determined whether each of
the seeds had germinated or not
 Can we learn to predict germination/dormancy from the
microarray data?
 [Bassel et al., 2011]
Generating rule sets
 BioHEL was able to predict the
outcome of the samples with
93.5% accuracy (10 x 10-fold
cross-validation)
 Learning from a scrambled dataset
(labels randomly assigned to
samples) produced ~50% accuracy
If At1g27595>100.87 and At3g49000>68.13 and At2g40475>55.96 → Predict germination
If At4g34710>349.67 and At4g37760>150.75 and At1g30135>17.66 → Predict germination
If At3g03050>37.90 and At2g20630>96.01 and At3g02885>9.66 → Predict germination
If At5g54910>45.03 and At4g18975>16.74 and At3g28910>52.76 and At1g48320>56.80 → Predict germination
Everything else → Predict dormancy
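The rule set above is an ordered decision list with a default rule, and it can be applied to a sample as in the sketch below. The gene identifiers and thresholds are copied from the slide. The sample values are invented, and treating a missing gene as expression 0.0 is an assumption of this sketch.

```python
# Each rule is a conjunction: every listed gene must exceed its threshold.
RULES = [
    {"At1g27595": 100.87, "At3g49000": 68.13, "At2g40475": 55.96},
    {"At4g34710": 349.67, "At4g37760": 150.75, "At1g30135": 17.66},
    {"At3g03050": 37.90, "At2g20630": 96.01, "At3g02885": 9.66},
    {"At5g54910": 45.03, "At4g18975": 16.74, "At3g28910": 52.76,
     "At1g48320": 56.80},
]


def predict_germination(expression):
    """expression: dict mapping gene identifier -> expression level."""
    for rule in RULES:
        if all(expression.get(gene, 0.0) > threshold
               for gene, threshold in rule.items()):
            return "germination"
    return "dormancy"  # default rule: everything else


# Hypothetical sample that satisfies the first rule.
sample = {"At1g27595": 120.0, "At3g49000": 70.0, "At2g40475": 60.0}
outcome = predict_germination(sample)
```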
Identifying regulators
 Rule building process is stochastic
 Generates different rule sets each time the system is run
 But if we run the system many times, we can see some
patterns in the rule sets
 Genes appearing much more frequently than the rest
 Some associated to dormancy
 Some associated to germination
Known regulators appear with high
frequency in the rules
Generating co-prediction networks of
interactions
• For each of the rules shown before to be
true, all of the conditions in it need to be
true at the same time
– Each rule expresses an interaction between
certain genes
• From a high number of rule sets we can
identify pairs of genes that co-occur with
high frequency and generate functional
networks
• The network shows a different topology
from those produced by other network
construction methods (e.g. gene
co-expression)
• Different regions in the network contain
the germination and dormancy genes
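Counting how often genes appear, and how often pairs of genes appear in the same rule, across many rule sets can be sketched as follows. The representation of a rule as a list of gene names and the `min_count` cut-off are assumptions of this sketch; the slides do not say how the threshold on co-occurrence was chosen.

```python
from collections import Counter
from itertools import combinations


def gene_statistics(rule_sets):
    """rule_sets: list of rule sets; each rule is a list of gene names."""
    gene_counts, pair_counts = Counter(), Counter()
    for rule_set in rule_sets:
        for rule in rule_set:
            genes = sorted(set(rule))             # canonical order for pairs
            gene_counts.update(genes)
            pair_counts.update(combinations(genes, 2))
    return gene_counts, pair_counts


def network_edges(pair_counts, min_count):
    """Keep only the gene pairs that co-occur at least min_count times."""
    return [pair for pair, count in pair_counts.items() if count >= min_count]


# Two hypothetical runs of the learner.
runs = [
    [["g1", "g2", "g3"], ["g1", "g4"]],
    [["g2", "g1"]],
]
gene_counts, pair_counts = gene_statistics(runs)
edges = network_edges(pair_counts, min_count=2)
```

The per-gene counts support the regulator ranking, and the retained pairs are the edges of the co-prediction network.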
Experimental validation
 We have experimentally verified this analysis
 By ordering and planting knockouts for the highly ranked
genes
 We have been able to identify four new regulators of
germination, with different phenotype from the wild type
Prediction of structural aspects of protein
residues
 Many of these features are due to local interactions of an amino
acid and its immediate neighbours
 Can they be predicted using information from the closest
neighbours in the chain?
 In this simplified example, to predict the SS state of residue i we
would use information from residues i-1, i and i+1. That is, a
window of ±1 residues around the target
[Figure: a window of ±1 residues slides along the chain (residues Ri-5 ... Ri+5, states SSi-5 ... SSi+5)]
Ri-1 Ri Ri+1 → SSi
Ri Ri+1 Ri+2 → SSi+1
Ri+1 Ri+2 Ri+3 → SSi+2
ARFF file for a simple PSP dataset
@relation AA+CN_Q2
@attribute AA_-4 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA_-3 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA_-2 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA_-1 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
@attribute AA_1 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA_2 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA_3 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA_4 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute class {0,1}
@data
X,X,X,X,A,E,I,K,H,0
X,X,X,A,E,I,K,H,Y,0
X,X,A,E,I,K,H,Y,Q,0
X,A,E,I,K,H,Y,Q,F,0
A,E,I,K,H,Y,Q,F,N,0
E,I,K,H,Y,Q,F,N,V,0
I,K,H,Y,Q,F,N,V,V,0
K,H,Y,Q,F,N,V,V,M,1
H,Y,Q,F,N,V,V,M,T,0
Y,Q,F,N,V,V,M,T,C,1
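The attribute part of the rows above can be regenerated with a sliding window, as in this sketch. The sequence fragment is the one implied by the data rows, `X` pads positions beyond the chain ends, and the window of ±4 matches the nine AA attributes. The class column comes from the structure and is not produced here.

```python
def windows(sequence, half_width=4, pad="X"):
    """One window per residue, centred on it and padded at the chain ends."""
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [list(padded[i:i + size]) for i in range(len(sequence))]


fragment = "AEIKHYQFNVVMTC"  # residues visible in the @data rows above
rows = windows(fragment)
first_row = ",".join(rows[0])
```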
What information do we include for each
residue?
 Early prediction methods used just the primary sequence
 the AA types of the residues in the window
 However, the primary sequence contains a limited amount of
information
 It does not contain any evolutionary information: it does not
say which residues are conserved and which are not
 Where can we obtain this information?
 Position-Specific Scoring Matrices, which are a product of a
Multiple Sequence Alignment
Position-Specific Scoring Matrices (PSSM)
– For each residue in the query sequence compute
the distribution of amino acids of the corresponding
residues in all aligned sequences (discarding those
too similar to the query)
– These distributions will tell us which mutations are
likely and which mutations are less likely for each
residue in the query sequence
– In essence it’s similar to a substitution matrix but
tailored for the sequence that we are aligning
– A PSSM profile will also tell us which residues are
more conserved and which residues are more
subject to insertions or deletions
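The idea can be sketched as a per-column log-odds profile. This is a deliberately simplified illustration and not the procedure used by PSI-BLAST: it assumes a uniform background distribution, a flat pseudocount and no sequence weighting, and it expects an alignment whose columns already correspond to the query residues.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def column_profile(alignment, pseudocount=1.0):
    """Per alignment column, a dict of log2-odds scores per amino acid."""
    background = 1.0 / len(AMINO_ACIDS)       # assumption: uniform background
    profile = []
    for col in range(len(alignment[0])):
        # Gaps and unknown symbols are ignored.
        counts = Counter(seq[col] for seq in alignment
                         if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        profile.append(scores)
    return profile


# Hypothetical toy alignment: column 0 is fully conserved, column 1 is not.
toy_alignment = ["MKV", "MRV", "MKI"]
profile = column_profile(toy_alignment)
```

Positive scores mark residues seen more often than the background would predict at that position, and negative scores mark unlikely substitutions.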
PSSM for the first 10 residues of 1n7lA
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A:  4 -1 -2 -2  0 -1 -1  0 -2 -1 -2 -1 -1 -2 -1  1  0 -3 -2  0
M: -1 -2 -3 -4 -2 -1 -2 -3 -2  1  2 -2  7  0 -3 -2 -1 -2 -1  1
E: -1  0  0  2 -4  2  6 -2  0 -4 -3  1 -2 -4 -1  0 -1 -3 -2 -3
K: -1  2  0 -1 -4  1  1 -2 -1 -3 -3  5 -2 -4 -1  0 -1 -3 -2 -3
V:  0 -3 -3 -4 -1 -3 -3 -4 -4  3  1 -3  1 -1 -3 -2  0 -3 -1  5
Q: -1  1  0  0 -3  6  2 -2  0 -3 -3  1 -1 -4 -2  0 -1 -2 -2 -3
Y: -2 -1 -1 -3 -3 -1 -1 -3  6 -2 -2 -2 -1  2 -3 -2 -2  1  7 -2
L: -2 -3 -4 -4 -2 -3 -3 -4 -3  2  5 -3  2  0 -3 -3 -1 -2 -1  1
T:  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  2  5 -3 -2  0
R: -2  6 -1 -2 -4  1  0 -3  0 -3 -3  2 -2 -3 -2 -1 -1 -3 -2 -3
Secondary Structure Prediction
– The most usual way is to predict whether a
residue belongs to an α helix, a β sheet, or is in coil
state
– Several programs can determine the actual SS
state of a protein from a PDB file. The most
common of them is DSSP
– Typically, a window of ±7 amino acids (15 in total)
is used. This means 300 attributes (when using
PSSM).
– A dataset with 1000 proteins with
~250AA/protein would have ~250000 instances
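The attribute and instance counts follow from the window size: 15 positions times 20 PSSM columns gives 300 attributes, and one instance is produced per residue. A minimal sketch of the flattening is shown below; padding positions beyond the chain ends with zeros is an assumption of this sketch.

```python
def pssm_window_features(pssm, i, half_width=7):
    """Concatenate the PSSM rows in a window centred on residue i."""
    n_cols = len(pssm[0])
    features = []
    for j in range(i - half_width, i + half_width + 1):
        if 0 <= j < len(pssm):
            features.extend(pssm[j])
        else:
            features.extend([0] * n_cols)   # assumption: zero padding
    return features


# Hypothetical 30-residue profile: row k is filled with the value k + 1.
toy_pssm = [[k + 1] * 20 for k in range(30)]
features = pssm_window_features(toy_pssm, i=0)

n_attributes = (2 * 7 + 1) * 20   # 300
n_instances = 1000 * 250          # 250000
```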
Secondary Structure Prediction
[Figure: prediction pipeline. Primary sequence (R1 R2 R3 ... Rn-1 Rn) → MSA → PSSM profile of the sequence (PSSM1 PSSM2 PSSM3 ... PSSMn-1 PSSMn) → windows generation → window of PSSM profiles (PSSMi-1 PSSMi PSSMi+1) → prediction method → prediction (SSi?)]
Other prediction problems
 This same structure of prediction can be applied to most
1D structural aspects
 However, many of these features are natively continuous
measures (or integer)
 To treat these problems as classification problems, we
need to discretise the output
 Unsupervised methods are applied
 Uniform-length (UL) and uniform-frequency (UF) discretisation
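The two unsupervised schemes can be sketched as follows: uniform length splits the value range into equally wide intervals, while uniform frequency places the cut points so that each interval receives roughly the same number of training values. The function names and the toy values are illustrative only.

```python
def uniform_length_cuts(values, n_bins):
    """Cut points that split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + step * k for k in range(1, n_bins)]


def uniform_frequency_cuts(values, n_bins):
    """Cut points that put about the same number of values in each interval."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * k) // n_bins] for k in range(1, n_bins)]


def discretise(value, cuts):
    """Index of the interval that the value falls into."""
    return sum(value >= c for c in cuts)


# A skewed toy feature: one outlier stretches the range.
values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 100]
ul_cuts = uniform_length_cuts(values, 2)
uf_cuts = uniform_frequency_cuts(values, 2)
ul_classes = [discretise(v, ul_cuts) for v in values]
uf_classes = [discretise(v, uf_cuts) for v in values]
```

On skewed features, uniform length tends to give unbalanced classes and uniform frequency balanced ones, which is one way these datasets can be tuned as benchmarks.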
PSP datasets are good ML benchmarks
 These problems can be modelled in many ways:
 Regression or classification problems
 Low/high number of classes
 Balanced/unbalanced classes
 Adjustable number of attributes
 Ideal benchmarks !!
 http://icos.cs.nott.ac.uk/datasets/psp_benchmark.html
Contact Map Prediction
 We participated in the CASP9 competition
 CASP = Critical Assessment of Techniques for Protein Structure Prediction.
Biennial competition (held every two years)
 Every day, for about three months, the organizers release some protein
sequences for which nobody knows the structure (129 sequences were
released in CASP9, in 2010)
 Each prediction group is given three weeks to return their predictions
 If the machinery is not well oiled, it is not feasible to participate !!
 For CM, prediction groups have to return a list of predicted contacts (they
are not interested in non-contacts) and, for each predicted pair of
contacting residues, a confidence level
Contact Map prediction Prediction given two residues
from a chain whether these
two residues are in contact or
not
 This problem can be
represented by a binary matrix.
1= contact 0 = non contact
 Plotting this matrix reveals
many characteristics from the
protein structure
helices sheets
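For a protein of known structure, the binary matrix can be built from residue coordinates as in this sketch. The slides do not state the contact definition, so the 8 Å distance threshold and the use of one representative point per residue are assumptions here, and the coordinates are invented.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary, symmetric matrix: 1 if two residues are within the threshold.

    coords: one (x, y, z) point per residue, in angstroms.
    The diagonal is left at 0 (a residue is not in contact with itself).
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap


# Hypothetical four-residue chain laid out on a line.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
cmap = contact_map(toy_coords)
```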
Steps for CM prediction (Nottingham
method)
1. Prediction of
 Secondary structure (using PSIPRED)
 Solvent Accessibility
 Recursive Convex Hull
 Coordination Number
2. Integration of all these predictions plus other sources of
information
3. Final CM prediction (using BioHEL)
Using BioHEL [Bacardit et al., 09]
Prediction of RCH, SA and CN
 We selected a set of 3262 protein chains from PDB-REPRDB
with:
 A resolution of less than 2 Å
 Less than 30% sequence identity
 No chain breaks and no non-standard residues
 90% of this set was used for training (~490000 residues)
 10% for test
Prediction of RCH, SA and CN
 All three features were predicted based on a window of
±4 residues around the target
 Evolutionary information (as a Position-Specific Scoring
Matrix) is the basis of this local information
 Each residue is characterised by a vector of 180 values
 The domain for all three features was partitioned into 5
states
Characterisation of the contact map
problem
 Three types of input information were used
1. Detailed information of three different windows of
residues centered around
 The two target residues (2x)
 The middle point between them
2. Information about the connecting segment between the
two target residues and
3. Global protein information.
Contact Map dataset
 From the original set of 3262 proteins we kept all that
had <250 AA and a randomly selected 20% of larger
proteins
 Still, the resulting training set contained 32 million pairs
of AA and 631 attributes
 Less than 2% of those are actual contacts
 More than 60 GB of disk space
Samples and ensembles
 50 samples of 660K examples are
generated from the training set with a
ratio of 2:1 non-contacts/contacts
 BioHEL is run 25 times for each sample
 Prediction is done by a consensus of
1250 rule sets
 Confidence of prediction is computed
based on the votes distribution in the
ensemble.
 Whole training process took about 25K
CPU hours
[Figure: training set → (x50) samples → (x25) rule sets → consensus → predictions]
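The sampling and voting scheme can be sketched as below. The 2:1 ratio and the 50 x 25 = 1250 rule sets come from the slide. Using a simple majority vote and taking the fraction of positive votes as the confidence are assumptions of this sketch, since the slides only say that confidence is based on the vote distribution.

```python
import random


def balanced_sample(contacts, non_contacts, size, ratio=2, rng=random):
    """Draw a sample with `ratio` non-contacts for every contact."""
    n_pos = size // (ratio + 1)
    n_neg = size - n_pos
    return rng.sample(contacts, n_pos) + rng.sample(non_contacts, n_neg)


def consensus(votes):
    """votes: one 0/1 prediction per rule set in the ensemble."""
    positive = sum(votes)
    label = 1 if 2 * positive > len(votes) else 0
    confidence = positive / len(votes)    # assumption: fraction of 'contact' votes
    return label, confidence


n_rule_sets = 50 * 25

# Hypothetical pools of example identifiers, scaled down from the slide.
sample = balanced_sample(list(range(300)), list(range(1000, 2000)), size=660)
label, confidence = consensus([1] * 900 + [0] * 350)
```

Sampling at 2:1 counters the fact that fewer than 2% of residue pairs are contacts, and the ensemble vote supplies the per-contact confidence that CASP asks for.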
Contact Map prediction in CASP
 Predictor groups are asked to submit a list of
predicted contacts and a confidence level for each
prediction
 The assessors then rank the predictions for each
protein and take a look at the top L/x ones, where L is
the length of the protein and x={5,10}
 From these L/x top ranked contacts two measures are
computed
 Accuracy: TP/(TP+FP)
 Xd: difference between the distribution of predicted
distances and a random distribution
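The accuracy measure over the top L/x predictions can be sketched as follows. Only the ranking and the TP/(TP+FP) computation are shown. The data layout (tuples of residue indices plus a confidence) is an assumption of this sketch, and any filtering by sequence separation that the assessors apply is left out.

```python
def top_l_accuracy(predictions, true_contacts, length, x=5):
    """Accuracy TP/(TP+FP) over the L/x most confident predicted contacts.

    predictions:   list of (i, j, confidence)
    true_contacts: set of (i, j) pairs with i < j
    length:        number of residues L in the protein
    """
    k = max(1, length // x)
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)[:k]
    if not ranked:
        return 0.0
    tp = sum((min(i, j), max(i, j)) in true_contacts for i, j, _ in ranked)
    return tp / len(ranked)


# Hypothetical protein of length 10: L/5 = 2 predictions are evaluated.
preds = [(1, 5, 0.9), (2, 8, 0.8), (3, 9, 0.1)]
truth = {(1, 5), (3, 9)}
acc = top_l_accuracy(preds, truth, length=10, x=5)
```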
CASP9 results
[Figure: CASP9 contact prediction results. Two of the groups shown derived their contact predictions from 3D models]
http://www.predictioncenter.org/casp9/doc/presentations/CASP9_RR.pdf
Understanding the rule sets
 Each rule set has on average 135 rules
 We have a total of 168470 rules
 Impossible to read all of them individually, but we can
extract useful statistics
 For instance, how often was each attribute used in the
rules?
 Full analysis
Distribution of frequency of use of
attributes
 All 631 attributes are
actually used (min
frequency=429)
 However, some of
them are used much
more frequently than
others
Top 10 attributes
Attribute     Frequency  Counts
PredSS_r1_1   1.48%      18141
PredCN_r1     1.66%      20336
propensity    1.74%      21288
PredSS_r2     1.75%      21350
PredSS_r1     1.82%      22205
PredRCH_r2    1.87%      22856
PredRCH_r1    2.04%      24961
PredSA_r2     2.12%      25891
PredSA_r1     2.39%      29246
separation    4.17%      50951
The four kinds of residue predictions are highly ranked
Motivation PSP is a very costly process
 As an example, one of the best PSP methods CASP8,
Rosetta@Home could dedicate up to 104 computing
years to predict a single protein’s 3D structure
 One of the possible ways to alleviate this
computational cost is to simplify the representation
used to model the proteins
Target for reduction: the primary sequence
 The primary sequence of a protein is
a usual target for such simplification
 It is composed of a quite high cardinality
alphabet of 20 symbols, which share
commonalities between them
 One example of reduction widely used in
the community is the hydrophobic-polar
(HP) alphabet, reducing these 20 symbols
to just two
 The HP representation is usually too simple:
too much information is lost in the
reduction process [Stout et al., 06]
 Can we automatically generate these
reduced alphabets and tailor them to
the specific problem at hand?
Automated Alphabet Reduction
[Bacardit et al., 09]
• We will use an automated information theory-driven
method to optimize alphabet reduction policies for PSP
datasets
• An optimization algorithm will cluster the AA alphabet into
a predefined number of new letters
• Fitness function of optimization is based on the Mutual
Information (MI) metric. A metric that quantifies the
interrelationship between two discrete variables
– Aim is to find the reduced representation that maintains as much
relevant information as possible for the feature being predicted
• Afterwards we will feed the reduced dataset into a
learning method to verify if the reduction was proper
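The fitness idea can be sketched as the mutual information between the reduced letters and the class being predicted. This shows only the scoring of one candidate grouping on toy data; the optimisation algorithm that searches over groupings is not shown, and the residues, labels and groupings below are invented for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete variables."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def reduce_sequence(residues, grouping):
    """Map each amino acid to the letter of its group."""
    return [grouping[r] for r in residues]


# Toy data: six residues and a binary feature to predict.
residues = list("AVLKDE")
labels = [1, 1, 1, 0, 0, 0]

# Two hypothetical two-letter alphabets.
informative = {"A": "h", "V": "h", "L": "h", "K": "p", "D": "p", "E": "p"}
mixed = {"A": "a", "V": "a", "K": "a", "L": "b", "D": "b", "E": "b"}

mi_informative = mutual_information(reduce_sequence(residues, informative), labels)
mi_mixed = mutual_information(reduce_sequence(residues, mixed), labels)
```

A grouping that keeps the information relevant to the predicted feature scores higher, so the optimiser is pushed towards alphabets tailored to that feature.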
Alphabet Reduction protocol
[Figure: protocol diagram. Dataset (Card=20) and Size = N → ECGA with Mutual Information → Dataset (Card=N) → BioHEL → ensemble of rule sets → test set accuracy]
Automated Alphabet Reduction
 Competent 5-letter alphabet (similar performance to
the AA alphabet)
 Different alphabets for CN and SA domains
 Unexpected explanations: Alphabet reduction
clustered AA types that experts did not expect
Automated Alphabet Reduction
 Our method produces better reduced alphabets than those
from the literature and than expert-designed ones
Alphabet    Letters  CN acc.   SA acc.   Diff. (CN/SA)  Ref.
AA          20       74.0±0.6  70.7±0.4  ---            ---
Our method  5        73.3±0.5  70.3±0.4  0.7/0.4        [Bacardit et al., 07]
Alphabets from the literature
WW5         6        73.1±0.7  69.6±0.4  0.9/1.1        [Wang & Wang, 99]
SR5         6        73.1±0.7  69.6±0.4  0.9/1.1        [Solis & Rackovsky, 00]
MU4         5        72.6±0.7  69.4±0.4  1.4/1.3        [Murphy et al., 00]
MM5         6        73.1±0.6  69.3±0.3  0.9/1.4        [Melo & Marti-Renom, 06]
Expert designed alphabets
HD1         7        72.9±0.6  69.3±0.4  1.1/1.4        [Bacardit et al., 07]
HD2         9        73.0±0.6  69.3±0.4  1.0/1.4        [Bacardit et al., 07]
HD3         11       73.2±0.6  69.9±0.4  0.8/0.8        [Bacardit et al., 07]
Efficiency gains from the alphabet
reduction
 We have extrapolated the reduced alphabet to the much
larger and richer Position-Specific Scoring Matrices (PSSM)
representation
 Accuracy difference is still less than 1%
 Obtained rule sets are simpler and training process is much
faster
 Performance levels are similar to recent works in the literature
[Kinjo et al., 05][Dor and Zhou, 07]
 Won the bronze medal of the 2007 Humies awards
Conclusions
 Bioinformatics contains many challenges that computer
science can tackle
 Optimisation
 Machine learning
 Software engineering
 Evolutionary computation has been shown to be very
competitive across a large range of bioinformatics
problems
 For EC, facing these challenges has led to the
development of many new methods
References/Bibliography Journals
 The Bioinformatics Journal
 BMC Bioinformatics
 BMC Biodata Mining
 Bioinformatics books
 Introduction to Bioinformatics by Arthur Lesk, Oxford University Press.
 Introduction to Bioinformatics. A. Tramontano, Chapman and
Hall/CRC
 Specialised topics
 Bioinformatics for –omics data. Methods and Protocols. Bernd
Mayer (ed). Springer
 Next-Generation Sequencing special issue of the Bioinformatics
Journal:
http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html
References/Bibliography
 J. Bacardit, M. Stout, J.D. Hirst, N. Krasnogor and J. Blazewicz, Coordination number
prediction using Learning Classifier Systems: Performance and interpretability. In
Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation
(GECCO2006), pp. 247-254, ACM Press, 2006
 Stout, M., Bacardit, J., Hirst, J.D. and Krasnogor, N. Prediction of Recursive Convex Hull
Class Assignments for Protein Residues. Bioinformatics, 24(7):916-923, 2008
 Stout, M., Bacardit, J., Hirst, J.D., Smith, R.E. and Krasnogor, N. Prediction of Topological
Contacts in Proteins Using Learning Classifier Systems. Soft Computing Journal,
13(3):245-258, 2009
 J. Bacardit, E.K. Burke and N. Krasnogor. Improving the scalability of rule-based
evolutionary learning. Memetic Computing journal 1(1):55-67, 2009
 J. Bacardit, M. Stout, J.D. Hirst, A. Valencia, R.E. Smith and N. Krasnogor. Automated
Alphabet Reduction for Protein Datasets. BMC Bioinformatics 10:6, 2009
 George W. Bassel, Enrico Glaab, Julietta Marquez, Michael J. Holdsworth and Jaume
Bacardit. Functional Network Construction in Arabidopsis Using Rule-Based Machine
Learning on Large-Scale Data Sets. The Plant Cell, 23(9):3101-3116, 2011
 J. Bacardit, P. Widera, A. Márquez-Chamorro, F. Divina, J.S. Aguilar-Ruiz and Natalio
Krasnogor. Contact map prediction using a large-scale ensemble of rule sets and the
fusion of multiple predicted structural features. Bioinformatics first published online
July 25, 2012 doi:10.1093/bioinformatics/bts472
References/Bibliography
 Jason H. Moore et al., Bioinformatics challenges for genome-wide association studies
Bioinformatics (2010) 26(4): 445-455
 Loris Nanni, Sheryl Brahnam, Alessandra Lumini, High performance set of PseAAC and
sequence based descriptors for protein classification, Journal of Theoretical Biology
266(1):1-10, 2010
 Fernando Otero et al., A hierarchical multi-label classification ant colony algorithm for
protein function prediction, Memetic Computing 2(3):165-181, 2010
 Daniel Barthel et al., Procksi: a decision support system for protein (structure)
comparison, knowledge, similarity and information. BMC Bioinformatics, 8:416, 2007
 http://omics.org/index.php/Alphabetically_ordered_list_of_omes_and_omics
 Federico Divina and Jesus S. Aguilar-Ruiz. 2006. Biclustering of Expression Data with
Evolutionary Computation. IEEE Trans. on Knowl. and Data Eng. 18, 5 (May 2006), 590-
602.
 Martinez-Ballesteros, M Nepomuceno-Chamorro, J C Riquelme (2011) Inferring gene-
gene associations from Quantitative Association Rules In: 11th International Conference
on Intelligent Systems Design and Applications (ISDA 2011 ) 1241 – 1246
 Rubén Armañanzas, Iñaki Inza, Roberto Santana, Yvan Saeys, Jose Flores, Jose Lozano,
Yves Peer, Rosa Blanco, Víctor Robles, Concha Bielza, Pedro Larrañaga. A review of
Acknowledgements
• Prof. Natalio Krasnogor
• Prof. Michael Holdsworth
• Prof. Jonathan Hirst
• Dr. Michael Stout
• Dr. George Bassel
• Dr. Enrico Glaab
• Dr. Pawel Widera
• EPSRC GR/T07534/01 & EP/H016597/1
• EU FP7 CADMAD project
Dr. Jaume Bacardit
Interdisciplinary Computing and Complex Systems
(ICOS) research group
University of Nottingham
jaume.bacardit@nottingham.ac.uk

Introduction

  • 2.
    Outline  What isBioinformatics?  Basic molecular biology  Public databases  Sequence analysis  The scales of bioinformatics  Biological data mining
  • 4.
    What is Bioinformatics? Several definitions exist. Michael Liebman proposed a quite elegant definition:  “The study of the information content and information flow in biological systems and processes” (Michael Liebman)  Information content: genome project  Information flow in biological systems: molecular transport  Biological systems: cells, organisms, …  Biological processes: metabolic networks  Bioinformatics is the science of using information to understand aspects of Biology. That is, a discipline where techniques such as applied mathematics, computer science, statistics, artificial intelligence, etc. are integrated to solve biological problems
  • 5.
    Information, information, information As we know there have been major advances in the field of molecular biology  These have been coupled with advances in laboratory (post)genomic technology  This has led to an explosive growth in the collection of biological information  This deluge of information has led to an absolute requirement for 1. Computerized databases to store, organize and index the data 2. For specialized tools to view and analyze the data 3. Specialized tools to infer new knowledge from the data
  • 6.
    Areas of research(taxonomyof the Bioinformatics Journal)  Genome Analysis  Sequence Analysis  Phylogenetics  Structural Bioinformatics  Gene Expression  Genetics and Population Analysis  Systems Biology  Data and Text Mining  Databases  Bioimage Informatics
  • 8.
    Life begins withCell  A cell is the smallest structural unit of an organism that is capable of sustained independent functioning  All cells have some common features  What is Life? Can we create it in the lab? Read: The imitation game—a computational chemical approach to recognizing life. Nature Biotechnology, 24:1203-1206, 2006
  • 9.
    2 types ofcells: Prokaryotes & Eukaryotes
  • 10.
    Example of cellsignaling
  • 11.
    Terminology  The genomeis an organism’s complete set of DNA.  a bacteria contains about 600,000 DNA base pairs  human and mouse genomes have some 3 billion.  human genome has 23 distinct chromosomes.  Each chromosome contains many genes.  Gene  basic physical and functional units of heredity.  specific sequences of DNA bases that encode instructions on how and when to make proteins.  Proteins  Make up the cellular structure  large, complex molecules made up of smaller subunits called amino acids.
  • 12.
    All Life dependson 3 critical molecules  DNAs  Hold information on how cell works  RNAs  Act to transfer short pieces of information to different parts of cell  Provide templates to synthesize into protein  Proteins  Form enzymes that send signals to other cells and regulate gene activity  Form body’s major components (e.g. hair, skin, etc.)  Are life’s laborers!  Computationally, all three can be represented as sequences of a certain 4-letter (DNA/RNA) or 20-letter (Proteins) alphabet
  • 13.
    DNA, RNA, andthe Flow of Information TranslationTranscription Replication Weismann Barrier / Central Dogma of Molecular Biology
  • 14.
    Overview of DNAto RNA to Protein  A gene is expressed in two steps 1) Transcription: RNA synthesis 2) Translation: Protein synthesis
  • 15.
    DNA: The Basisof Life  Deoxyribonucleic Acid (DNA)  Double stranded with complementary strands A-T, C-G  DNA is a polymer  Sugar-Phosphate-Base  Bases held together by H bonding to the opposite strand
  • 16.
    RNA  RNA issimilar to DNA chemically. It is usually only a single strand. T(hyamine) is replaced by U(racil)  Some forms of RNA can form secondary structures by“pairing up” with itself. This can have impact on its properties dramatically. DNA and RNA can pair with each other.http://www.cgl.ucsf.edu/home/glasfeld/tutorial/trna/trna.giftRNA linear and 3D view:
  • 17.
    RNA, continued Several typesexist, classified by function:  hnRNA (heterogeneous nuclear RNA): Eukaryotic mRNA primary transcipts with introns that have not yet been excised (pre-mRNA).  mRNA: this is what is usually being referred to when a Bioinformatician says “RNA”. This is used to carry a gene’s message out of the nucleus.  tRNA: transfers genetic information from mRNA to an amino acid sequence as to build a protein  rRNA: ribosomal RNA. Part of the ribosome which is involved in translation.
  • 18.
    Transcription Transcription ishighly regulated. Most DNA is in a dense form where it cannot be transcribed.  To start, transcription requires a promoter, a small specific sequence of DNA to which polymerase can bind (~40 base pairs “upstream” of gene)  Finding these promoter regions is only a partially solved problem that is related to motif finding.  There can also be repressors and inhibitors acting in various ways to stop transcription. This makes regulation of gene transcription complex to understand.
  • 19.
    Definition of aGene  Regulatory regions: up to 50 kb upstream of +1 site  Exons: protein coding and untranslated regions (UTR) 1 to 178 exons per gene (mean 8.8) 8 bp to 17 kb per exon (mean 145 bp)  Introns: splice acceptor and donor sites, junk DNA average 1 kb – 50 kb per intron  Gene size: Largest – 2.4 Mb (Dystrophin). Mean – 27 kb.
  • 20.
  • 21.
    Splicing and otherRNA processing  In Eukaryotic cells, RNA is processed between transcription and translation.  This complicates the relationship between a DNA gene and the protein it codes for.  Sometimes alternate RNA processing can lead to an alternate protein (splice variants) as a result. This is true in the immune system.
  • 22.
    Proteins: Crucial molecules forthe functioning of life • Structural Proteins: the organism's basic building blocks, eg. collagen, nails, hair, etc. • Enzymes: biological engines which mediate multitude of biochemical reactions. Usually enzymes are very specific and catalyze only a single type of reaction, but they can play a role in more than one pathway. • Transmembrane proteins: they are the cell’s housekeepers, eg. By regulating cell volume, extraction and concentration of small molecules from the extracellular environment and generation of ionic gradients essential for muscle and nerve cell function (sodium/potasium pump is an example) • Proteins are polypeptide chains, constructed by joining a certain kind of peptides, amino acids, in a linear way • The chain of amino acids, however folds to create very complex 3D structures
  • 23.
    Translation  The processof going from RNA to polypeptide.  Three base pairs of RNA (called a codon) correspond to one amino acid based on a fixed table.  Always starts with Methionine and ends with a stop codon
  • 24.
  • 25.
    Protein Structure: Introduction Different amino acids have different properties  These properties will affect the protein structure and function  Hydrophobicity, for instance, is the main driving force (but not the only one) of the folding process
  • 26.
    Protein Structure: Hierarchicalnature of protein structure MKYNNHDKIRDFIIIEAYMFRFKKKVKPEVDMTIKEFILLTYLFHQQENTL PFKKIVSDLCYKQSDLVQHIKVLVKHSYISKVRSKIDERNTYISISEEQRE KIAERVTLFDQIIKQFNLADQSESQMIPKDSKEFLNLMMYTMYFKNIIKK HLTLSFVEFTILAIITSQNKNIVLLKDLIETIHHKYPQTVRALNNLKKQGYL IKERSTEDERKILIHMDDAQQDHAEQLLAQVNQLLADKDHLHLVFE Primary Structure = Sequence of amino acids Secondary Structure Tertiary Local Interactions Global Interactions
  • 27.
    Protein Structure: Whyis structure important?  The function of a protein depends greatly on its structure  The structure that a protein adopts is vital to it’s chemistry  Its structure determines which of its amino acids are exposed to carry out the protein’s function  Its structure also determines what substrates it can react with
  • 28.
    Protein Structure: Mostlylacking information  Therefore, it is clear that knowing the structure of a protein is crucial for many tasks  However, we only know the structure for a very small fraction of all the proteins that we are aware of  The UniProtKB/TrEMBL archive contains 23165610 (16886838) sequences  The PDB archive of protein structure contains only 84223(76669) structures  In the native state, proteins fold on its own as soon as they are generated, amino-acid by amino-acid (with few exceptions e.g. chaperones)  can we predict this process as to close the gap between protein sequences and their 3D structures?
  • 29.
    Central Dogma ofBiology: A Bioinformatics Perspective The information for making proteins is stored in DNA. There is a process (transcription and translation) by which DNA is converted to protein. By understanding this process and how it is regulated we can make predictions and models of cells. Sequence analysis Gene Finding Protein Sequence/Stru cture Analysis Assembly Computational Problems
  • 31.
    Information flow inbioinformatics  Data enters the “bioinformatics scope” when a scientist deposits an experimental result in an appropriate archive  The archive curates and annotates the data  The data is released to the public  Afterwards, the data may be retrieved/analysed:  Integrating the new entry into a search engine  Extracting useful subsets of the data  Deriving new types of information from the data  Aggregating the data, by homology, function, structure  Reannotating the data with new discovered/inferred info.  Quality of data depends on many factors, the techniques used to experimentally create the data, degree of inference and prediction involved in the annotation process, etc.  Many publicly available databases: http://en.wikipedia.org/wiki/List_of_biological_databases
  • 32.
    NCBI’s Entrez system http://www.ncbi.nlm.nih.gov/ Entrezis a search and retrieval system that integrates information from databases at NCBI (National Center for Biotechnology Information).
  • 33.
    Uniprot http://www.uniprot.org  TheUniversal Protein Resource (UniProt) is a collaboration between the European Bioinformatics Institute (EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR)
  • 34.
    KEGG - http://www.genome.jp/kegg/ Not just about genes/proteins but also pathways, that is, their interactions
  • 35.
  • 37.
    Sequences  Be itDNA, RNA or proteins we have many data that can be represented as sequences of a certain alphabet  Many generic algorithms to deal with biological sequences exist  Sequence alignment  Motif representation
  • 38.
    Sequence Alignment Isthe assignment of residue-residue correspondences between nucleotide/proteomic sequences Query 1 MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRY 60 MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRY Sbjct 1 MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRY 60 Query 61 YETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSV 120 YETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLL Sbjct 61 YETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLL------------- 107 ... Query 301 QPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQTSSYSCMLPT 360 QPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQ + C P+ Sbjct 281 QPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQ-----DSFHLECQFPS 335 Query 361 S-PSVN 365 P VN Sbjct 336 KFPGVN 341 gap matches mismatches
  • 39.
    Motivation  Similarity isexpected among biomolecules that are descended from a common ancestor.  Mutations cause differences, but survival of the organism requires that mutations occur in regions that are less critical to function  Important catalytic, regulatory or structural regions remain similar  An alignment between two or more genetic or proteomic sequences represents an explicit hypothesis via their evolutionary histories.  Thus comparison of related gene/protein sequences have been instrumental in shedding light into the information content of these sequences and their biological functions.
  • 40.
    Definition and aims Why align sequences? 1. Start with a query sequence with unknown properties and search within a database of millions of sequences to find those which share similarity with the query. 2. Start with a small set of sequences and identify similarities and differences among them. 3. In many sequences or very long sequences, detect commonly occurring patterns
  • 41.
    Similarity vs. Homology  Similarity is the observation or measurement of resemblance and difference, independent of the source of resemblance.  There are many examples of different organisms with functionally similar organs that came from distinct evolutionary origins  When similarity is due to a common ancestry, we call it homology.  Sequence alignment helps to infer homology hypotheses:  If two sequences are very similar, it is probable that they share a common origin  Therefore, if we know some information (structure, function) about sequence X, and sequence X is similar to sequence Y, it is probable that the same information applies to Y
  • 42.
    Metrics of similarity: Definitions  Gap: a break in the alignment, in either one of the sequences.  For nucleotides, a consequence of an insertion or deletion mutation.  For proteins, it’s more difficult to say.  Regions of matching residues.  Indicate parts of a sequence that are well conserved  Mismatched residues.  For nucleotides, a consequence of a substitution mutation  Less conserved regions
  • 43.
    Metrics of similarity: Distance scoring  Distance scoring  Given an alignment with matches, mismatches and gaps, we compute a score as follows:  For each mismatch, score is increased by 2  For each gap, score is increased by 4  For each match, no increase in score  Higher score, less similarity  Equivalent metrics exist for similarity (not distance) where higher score means good similarity
    Example (total distance = 18):
    Sequence 1:  A  -  G  C  C  G  T  A  T
    Sequence 2:  A  C  G  A  -  -  T  -  T
    Score:       0  4  0  2  4  4  0  4  0
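    The scoring scheme on this slide can be sketched in a few lines of Python. The function name `distance_score` and the use of `-` as the gap symbol are choices made for this illustration only; the penalties (2 per mismatch, 4 per gap) are the ones stated on the slide.

```python
# Distance scoring with the penalties stated on the slide:
# match = 0, mismatch = 2, gap = 4.
MISMATCH = 2
GAP = 4


def distance_score(row_a: str, row_b: str) -> int:
    """Sum the per-column penalties of two aligned rows ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += GAP
        elif a != b:
            score += MISMATCH
    return score


# The alignment shown on the slide
print(distance_score("A-GCCGTAT", "ACGA--T-T"))  # 18
```

    The result reproduces the total of 18 given on the slide: one mismatch (2) plus four gap positions (4 x 4).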
  • 44.
    Metrics of similarity: Mismatches and gaps  Are all mismatches equally bad?  For protein sequences, there are several subgroups of amino acids with similar properties. Mismatches within a group have less impact  For nucleotide sequences, transitions (a↔g and t↔c) are more common than transversions (a or g ↔ t or c)  Distance scoring of mismatches could be smarter: substitution matrices  Using statistical analysis on a large corpus of real sequences to generate better scores  How to penalize gaps  Each gap slot gets equal distance score  One score to open a gap, another (smaller) score to extend the same gap
  • 45.
    Global vs Local alignment  We know how to score good or bad alignments  How to find the optimal one?  Two classes of alignment methods  Global alignment  Finds the best alignment of one entire sequence with another entire sequence  Local alignment  Finds the best alignment of one segment of a sequence against a segment of another sequence
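    As an illustration of how an optimal global alignment score can be found, here is a minimal dynamic-programming sketch in the style of the Needleman-Wunsch algorithm. The slides do not name a specific algorithm, so this choice is an assumption; the penalties reuse the distance scheme from the earlier slide (mismatch 2, gap 4), and the function name `global_distance` is hypothetical. Only the score is computed, not the alignment itself.

```python
def global_distance(a: str, b: str, mismatch: int = 2, gap: int = 4) -> int:
    """Minimum distance score over all global alignments of a and b."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            curr.append(min(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_distance("ACGT", "AGGT"))  # 2 (one mismatch)
print(global_distance("ACGT", "ACT"))   # 4 (one gap)
```

    A local variant would differ mainly in letting an alignment start and end anywhere in either sequence, which is why it is normally formulated with similarity scores rather than distances.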
  • 46.
    Exact vs. Approximate methods  Exact methods for both global and local alignment exist, based on dynamic programming, but are slow  Good enough when there are few sequences  Not so good when comparing a target sequence to a database of millions of known sequences  Approximate methods have been used for many years for large-scale alignment tasks  They use some kind of heuristic to speed up the alignment process  BLAST (Basic Local Alignment Search Tool) is the most famous approximate method  It identifies potential hits by looking for perfect matches of very small sub-sequences (seeds)  It only tries to create a full alignment for sequences where several seeds are identified  PSI-BLAST: a version that takes into account that multiple hits are identified. It constructs a tailored substitution matrix based on the hits and then refines the alignment
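    The seed idea can be illustrated with a toy exact k-mer index. This is only a sketch of the filtering principle described on the slide, not BLAST itself (which also uses neighbourhood words, seed extension and statistics). The function names, the toy database and the `min_seeds` threshold are all assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(database: dict[str, str], k: int) -> dict[str, list[tuple[str, int]]]:
    """Map every length-k substring to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def candidate_hits(query: str, index: dict, k: int, min_seeds: int = 2) -> list[str]:
    """Ids of database sequences with at least min_seeds exact seed matches to the query."""
    seeds = defaultdict(int)
    for pos in range(len(query) - k + 1):
        for seq_id, _ in index.get(query[pos:pos + k], []):
            seeds[seq_id] += 1
    return sorted(seq_id for seq_id, n in seeds.items() if n >= min_seeds)


# Hypothetical toy database
db = {"s1": "MQNSHSGVNQ", "s2": "AAAAAAAAAA", "s3": "GVNQLGG"}
idx = build_kmer_index(db, k=3)
print(candidate_hits("SHSGVNQL", idx, k=3))
```

    Only the sequences returned by `candidate_hits` would then be passed to a full (and slower) alignment step.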
  • 47.
    Multiple Sequence Alignment When we have to align more than two sequences  Progressive methods (e.g. ClustalW)  Start with seed alignment  Iteratively incorporate other alignments to seed, without modifying what is aligned so far  ClustalW uses phylogenetic trees (representations of the evolutionary relationship between sequences) to progressively construct MSA  Iterative methods (e.g. MUSCLE)  Can re-edit the partial MSA based on the newly incorporated alignments
  • 48.
  • 49.
    Motifs  When visualising an MSA we can see regions of high agreement and regions of low agreement.  The high agreement regions indicate that a certain protein belongs to a family  What if we concentrate on modelling and identifying these regions instead of the whole sequences?  Motif finding
  • 50.
    DNA -> RNA -> Proteins
  • 51.
    DNA  Coding/non coding SNPs  Copy number variation  Assembly  Methylation  Primer design
  • 52.
    Coding/Non Coding  Identifying the regions of an organism’s genome that contain genes  Many different factors are involved in this identification  Promoter identification  Long enough Open Reading Frames (ORF)  Splice variants  Introns/Exons (in Eukaryotes)  Statistical properties of gene-coding DNA  Hidden Markov Models (HMMs) are also used for gene finding
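    One of the factors listed, sufficiently long open reading frames, is easy to sketch. The example below scans only the three forward-strand reading frames for an ATG followed in frame by a stop codon. The function name `find_orfs` and the `min_codons` threshold are assumptions for illustration; real gene finders combine this signal with the other factors on the slide.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Forward-strand ORFs as (start, end) offsets, end exclusive and including the stop codon.

    An ORF is kept only if it has at least min_codons codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return orfs


# Hypothetical toy sequence: ATG AAA TTT GGG TAA embedded at offset 2
print(find_orfs("CCATGAAATTTGGGTAACC"))  # [(2, 17)]
```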
  • 53.
    Single Nucleotide Polymorphisms (SNPs) One base-pair variation in DNA  In most cases in non-coding regions of DNA, but not always  When frequent enough in a population they can be linked to specific traits, e.g. a disease  SNP microarrays can be used to probe hundreds of thousands of SNPs in parallel  In reality few SNPs act on their own  Genome-Wide Association Studies identify groups of SNPs linked to a certain condition
  • 54.
    Copy Number Variation  In general two copies of each gene exist in a genome  It may be the case that more or fewer than two copies of a certain gene exist in a specific sub-population  It has been suggested that certain CNVs can be linked to specific diseases
  • 55.
    Genome assembly  Sequencing technologies are able to read (sequence) a complete genome as a series of short overlapping fragments  How to assemble back all these fragments?  Greedy approach  Pair-wise alignments of all fragments  Merge the fragments with the largest overlap  Keep iterating until all fragments are merged  Worked more or less well on old sequencing technologies, not so well on next-generation sequencing data, due to smaller fragment sizes and larger error rates
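    The greedy approach can be sketched as follows. For simplicity this illustration uses exact suffix/prefix overlaps instead of pair-wise alignments, so it assumes error-free fragments; the function names and the toy fragments are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 1) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: the remaining contigs stay separate
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)] + [merged]
    return frags


print(greedy_assemble(["ATGCGT", "CGTACC", "ACCGGA"]))  # ['ATGCGTACCGGA']
```

    The all-against-all comparison in every iteration is quadratic in the number of fragments, which hints at why this strategy scales poorly to the very large read sets mentioned on the slide.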
  • 56.
    Genome mapping  Given a large set of short fragments, as a result of next-generation sequencing, map them to a reference genome  Different from assembly: we do not want to reconstitute a complete genome, just identify to which genes each fragment belongs (among other applications).  Speed is an issue  Modern methods (e.g. SOAP2) compress the genome and are able to align the fragments in the compressed space
  • 57.
    Methylation  It is a chemical reaction that can block a certain region of a chromosome, preventing its transcription  The process can be reversed, so essentially it is an on/off switch of the affected gene  Specialised microarrays exist for the high-throughput detection of methylated genes  Afterwards, data analysis can take place
  • 58.
    DNA library specification • A DNA library is a combinatorial set of DNA sequences suited to manufacture via DNA reuse • The first stage towards the creation of a DNA library is the formal specification of the target DNA molecules that comprise it • A set of sequences does not convey the intention behind the library • The key challenge is to enable precise editing of DNA sequences in an extensible and reproducible manner whilst avoiding manual handling of these unwieldy objects
  • 59.
    DNALD library format  A DNALD library consists of three sets of definitions: inputs, intermediates and outputs, with different semantics  Inputs: existing DNA sequences to be provided with the design  Intermediates: conceptual means of factoring out common sequences  Outputs: to be produced through DNA reuse
  • 60.
    DNALD expressions  A DNALD expression is a combination of explicit sequences, definition names, operators and functions that are interpreted according to rules of precedence and association ("evaluated") to produce a set of DNA sequences.  Definitions bind names to the results of expressions.
  • 61.
    Workbench interface  Text editor with: • syntax highlighting • auto-completion • code folding • etc.  Manage projects viewed from different perspectives
  • 62.
    CADMAD’s DNALD (DNA Library Design)  A specification language that produces a set of target DNA sequences as a function of operations on a set of inputs  To maximise CADMAD's impact the specification process must be:  user friendly and debuggable  but expressively powerful enough to:  define non-trivial combinatorial constructs  communicate degrees of freedom
    >Ret_human
    GGCCTCTACTTCTCGAGGGATGCTTACTGGGAGAAGCTGTATGTGGACCAGGCGGCCGGCA
    CGCCCTTGCTGTACGTCCATGCCCTGCGGGACGCCCCTGAGGAGGTGCCCAGCTTCCGCCT
    GGGCCAGCATCTCTACGGCACGTACCGCACACGGCTGCATGAGAACAACTGGATCTGCATC
    CAGGAGGACACCGGCCTCCTCTACCTTAACCGGAGCCTGGACCATAGCTCCTGGGAGAAGC
    TCAGTGTCCGCAACCGCGGCTTTCCCCTGCTCACCGTCTACCTCAAGGTCTTCCTGTCACC
    CACATCCCTTCGTGAGGGCGAGTGCCAGTGGCCAGGCTGTGCCCGCGTATACTTCTCCTTC
    TTCAACACCTCCTTTCCAGCCTGCAGCTCCCTCAAGCCCCGGGAGCTCTGCTTCCCAGAGA
    CAAGGCCCTCCTTCCGCATTCGGGAGAACCGACCCCCAGGCACCTTCCACCAGTTCCGCCT
    GCTGCCTGTGCAGTTCTTGTGCCCCAACATCAGCGTGGCCTACAGGCTCCTGGAGGGTGAG
    GGTCTGCCCTTCCGCTGCGCCCCGGACAGCCTGGAGGTGAGCACGCGCTGGGCCCTGGACC
    GCGAGCAGCGGGAGAAGTACGAGCTGGTGGCCGTGTGCACCGTGCACGCCGGCGCGCGCGA
    GGAGGTGGTGATGGTGCCCTTCCCGGTGACCGTGTACGACGAGGACGACTCGGCGCCCACC
    TTCCCCGCGGGCGTCGACACCGCCAGCGCCGTGGTGGAGTTC
    >Ret_mouse
    GGCCTCTATTTCTCAAGGGATGCTTACTGGGAGAGGCTGTATGTAGACCAGCCAGCTGGCA
    CACCTCTGCTCTATGTCCATGCCCTACGGGATGCCCCTGGAGAAGTGCCGAGCTTCCGCCT
    GGGCCAGCATCTCTATGGCGTCTACCGTACACGGCTGCATGAGAATGACTGGATCCGCATC
    AATGAGACTACTGGCCTTCTCTACCTCAATCAGAGCCTGGACCACAGTTCCTGGGAACAGC
    TCAGCATCCGCAATGGTGGTTTCCCCCTGCTCACCATCTTCCTCCAGGTCTTTCTGGTGGA
    AAACTGCCAGGAGTTCAGCGGTGTCTCCATCCAGTACAAGCTGCAGCCTTCCAGCATCAAC
    TGCACTGCCCTAGGTGTGGTCACCTCACCCGAGGACACCTCGGGGACCCTATTTGTAAATG
    ACACAGAGGCCCTGCGGCGACCTGAGTGCACCAAGCTTCAGTACACGGTGGTAGCCACTGA
    CCGGCAGACCCGCAGACAGACCCAGGCTTCGCTAGTGGTCACTGTGGAGGGGACATCCATT
    ACTGAAGAAGTAGGCT
    >Ret_zebrafish
    GGGCTGTATTTTCCTCAAAGGCTTTACACAGAGAACATCTACGTGGGTCAGCAGCAGGGAT
    CACCGTTGCTTCAGGTCATTTCAATGCGGGAATTCCCTACAGAGAGGCCTTATTTCTTCCT
    GTGCTCGCACAGAGACGCTTTTACATCATGGTTTCACATAGATGAGGCGTCCGGAGTTCTT
    TATCTCAACAAAACCCTGGAGTGGAGCGACTTCAGTAGTTTACGCAGCGGCTCAGTTCGCT
    CCCCGAAGGATCTCTGACCTATCAGTTAGAGATTGTCGACAGGAACATCACTGCTGAAGCT
    CAGTCCTGTTACTGGGCGGTTAGTCTTGCACAAAACCCGAATGATAATACAGGCGTTCTCT
    ATGTGAACGACACCAAAGTGTTACGCAGACCAGAGTGCCAAGAGCTGGAGTATGTGGTCAT
    TGCCCAGGAGCAGCAGAACAAGCTTCAGGCCAAGACACAGCTCACCGTCAGTTTTCAAGGC
    GAAGCAGATTCACTGAAAACGGATG
    >Ret_chicken
    GGTCTGTACTTCCCCAGAAAGGAGTACTCAGAGAACGTCTACATTGACCAGCCAGCAGGTG
    CGCCGCTCCTACGCATCCACGCCTTGAGGGATTCACATGGGAAACAGCCCACTTTCATCTG
    TGCCAGAAGTCTCATCATTTCTCGAGCAAGATCCCATGAAAATCACTGGTTTCAAATCAGA
    GAAAAAATGGGACTTCTCTACCTCAGCAAGAGCCTAGATAGAGAAGACTTTAACATGCTGT
    CTGTAGGAAACTGGATGCCATTATCAAAGGTGATGCTGTATGTCTTCCTCTCATCTCACCC
    TTTCCAAGAGAAGGAATGTGACTCTGCTACTCGTACCACAGTCGTCCTCTCTTTGATCAAT
    GCTACTGCACCAGCTTGCAGTTCACTGTCAGCAAGGCAGCTTTGCTTCACAGAAATGGATC
    TCTCCTTTCACATCAAGGAGAATAAACCCCCTGGTACATTTCATCAGCTCCAGTTACCCTC
    AGTTCATCATCTGTGTCAGAATCTCAGCATTACCTACAAACTGTTGGCAGCCGAAGGCCTG
    CCTTTTCGGTACAATGAGAACACCACTGGTGTGAGTGTAACACAGCGCCTAGATCGAGAGG
    AGAGAGAGAGATATGAGCTGATCGCCAAATGCACCGTGAGAGAAGGCTTCAGGGAAATGGA
    GGTTGAGGTGCCCTTCCTCGTCAACGTGTTAGATGAAGATGACTCTCCTCCCTTCCTTCCC
  • 63.
  • 64.
    RNA expression  Not all genes are transcribed/translated into proteins all the time  The expression of genes is highly sophisticated and depends on many factors  Identifying the genes being expressed at a given point in time in a specific tissue provides crucial information about the roles and interactions of such genes  Compare the genes expressed between different groups of samples to identify those that are differentially expressed  Identify co-expressed genes, which present patterns of correlation
  • 65.
    Measuring RNA expression  RT-PCR (real-time reverse transcription polymerase chain reaction)  Accurately measures the expression of a pre-determined gene  RNA Microarrays  Measure, in parallel, the expression of tens of thousands of genes  RNA-Seq  The next-generation sequencing variant for measuring gene expression
  • 66.
    RNA Structure prediction  An RNA sequence can bind with itself to create complex shapes with a certain pattern of loops  Can we predict, from a given sequence, the structural shape of the RNA?
  • 67.
    Proteins  Protein classification Structure prediction  Structure comparison  Function and interaction
  • 68.
    Protein classification  Proteins can be annotated in many different ways  Function  DNA-binding? Enzyme?  Tissue/Cellular/Sub-cellular localisation  Interacting with other proteins?  Can we predict this annotation using ML?  We need to transform the protein sequence into a uniform representation of equal size for all proteins  Many different representations exist  Several of these problems can be modelled as a hierarchical classification problem
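    One simple example of a uniform, equal-size representation is the amino-acid composition: whatever the length of the protein, it yields a vector of 20 fractions. The slides do not say which representations were used, so this choice and the function name `aa_composition` are assumptions for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def aa_composition(sequence: str) -> list[float]:
    """Fraction of each standard amino acid; always a vector of length 20."""
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)  # non-standard symbols are ignored
    return [c / total if total else 0.0 for c in counts]


short = aa_composition("AAKK")
longer = aa_composition("MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSG")
print(len(short), len(longer))  # 20 20
```

    Such a vector discards the order of the residues, which is one reason why many alternative representations exist.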
  • 69.
    Protein Structure PredictionPSP aims to predict the 3D structure of a protein based on its primary sequence
  • 70.
    Protein Structure Prediction PSP is an open problem. The 3D structure depends on many variables  It has been one of the main holy grails of computational biology for many decades  Impact of having better protein structure models are countless  Genetic therapy  Synthesis of drugs for incurable diseases  Improved crops  Environmental remediation
  • 71.
    Prediction types of PSP  There are several kinds of prediction problems within the scope of PSP  The main one, of course, is to predict the 3D coordinates of all atoms of a protein (or at least the backbone) based on its primary sequence  There are many structural properties of individual residues within a protein that can be predicted, for instance:  The secondary structure state of the residue  If a residue is buried in the core of the protein or exposed on the surface  Accurate predictions of these sub-problems can simplify the general 3D PSP problem
  • 72.
    3D Protein Structure Prediction  Some PSP methods try to find similar proteins and then adapt the structure of the homolog (template) to the target protein  Homology Modeling  Other methods try to find the structure of the protein from scratch (Ab Initio Modelling), optimizing some energy function that models the stability of the protein, in case no homolog can be identified  In between there are other kinds of methods, for varying degrees of homology to our target, for instance, Fold Recognition or Threading • These methods identify a template based on more than homology (i.e. sequence alignment).
  • 73.
    Coordination Number PredictionTworesidues of a chain are said to be in contact if their distance is less than a certain threshold (e.g. 8Å) CN of a residue : count of contacts that a certain residue has CN gives us a simplified profile of the density of packing of the protein ContactPrimary Sequence Native State
  • 74.
    Contact Map predictionPrediction, given two residues from a chain, whether these two residues are in contact or not  This problem can be represented by a binary matrix. 1= contact, 0 = non contact  Plotting this matrix reveals many characteristics from the protein structure  Very sparse characteristic: Less than 2% of contacts in native structures helices sheets
  • 75.
    Other predictions  Other kinds of residue structural aspects that can be predicted  Solvent accessibility: Amount of surface of each residue that is exposed to solvent  Recursive Convex Hull: A metric that models a protein as an onion, and assigns each residue to a layer. Formally, each layer is a convex hull of points  These features (and others) are predicted in a similar way as done for SS
  • 76.
  • 77.
    Protein Structure Comparison  Protein Structure Comparison (PSC) aims to  Assess the degree of similarity between protein structures  Given a query structure, identify other proteins with similar structure  Why?  Group proteins by structural similarities  Determine the impact of individual residues on the protein structure  Identify distant homologues of protein families  Predict function of proteins with low degree of primary structure (i.e. sequence) similarity with other proteins  Engineer new proteins for specific functions  Assess ab-initio predictions
  • 78.
    Protein Structure Comparison Sequence-Structure-Function relationships 1) Conserved 1º sequences similar structures 2) Similar structures conserved 1º sequences 3) Similar structures conserved function  PSC shares many similarities with sequence alignment. Our aim is to infer new knowledge from the comparison process ?
  • 79.
    Protein Structure ComparisonExisting Approaches  SSAP (Orengo & Taylor, 96)  ProSup (Feng & Sippl, 96)  DALI (Holm & Sander, 93)  CE (Shindyalov & Bourne, 98)  LGA (Zemla, 2003)  SCOP (Murzin, Brenner, Hubbard & Chothia, 95)  CATH (Orengo, Mithie, Jones, Jones, Swindells & Thornton, 97)  ProCKSI – Consensus of multiple PSC methods
  • 80.
    Prediction of Protein Function  In an ideal world, the cascade of inference should flow from sequence → structure → function  That is, if we can identify similar sequences or structures to our query target we can (at varying degrees of certainty) infer that they have similar function
  • 81.
    Prediction of Protein Function  As proteins evolve, they may  Retain function and specificity  Retain function but alter specificity  Change to a related function, or a similar function in a different metabolic context  Change to a completely unrelated function  How much must a protein change before the function changes?  Sometimes, not at all. There are many cases of proteins with different functions in different environments
  • 82.
    Prediction of Protein Function  Thus, sequence or structure similarity is not always reliable to assign function  Other ways of determining protein function  By identifying patterns of co-regulated genes  Using data from Microarray experiments  By identifying protein-protein interactions
  • 83.
    Prediction of Protein Function  A related question is: where is the function of a protein taking place? → active site  Several methods exist to predict active/binding sites of proteins from local patterns of sequence or structure  A crude way of doing this prediction is to look at the conserved residues of a sequence: they may be related to either the core of the protein (structural stability) or the function of a protein (a change of function is a risk for survival)  More sophisticated methods exist to learn how to predict active sites. They use ML, in a similar way to that used to predict residue structural features in PSP  Still, it is a very tough problem, and ML methods are not much better than BLAST-based methods
  • 85.
    Three case studies Mining –omics data  Predicting structural aspects of protein residues  Automated alphabet reduction for protein datasets  In all these three case studies we use the same evolutionary learning system: BioHEL [Bacardit et al., 09]
  • 86.
    BioHEL BioHEL [Bacarditet al., 09] is an evolutionary learning system that applies the Iterative Rule Learning (IRL) approach  Designed explicitly to deal with noisy large-scale datasets  IRL was first used in EC by the SIA system [Venturini, 93]
  • 87.
    BioHEL’s learning paradigm  IRL has been used for many years in the ML community, under the name of separate-and-conquer
  • 88.
    BioHEL’s objective function  An objective function based on the Minimum-Description-Length (MDL) (Rissanen, 1978) principle that tries to promote rules with  High accuracy: not making mistakes  High coverage: covering as many examples as possible without sacrificing accuracy. Recall (TP/(TP+FN)) will be used to define coverage  Low complexity: rules as simple and general as possible  The objective function is a linear combination of the three objectives above
  • 89.
    BioHEL’s objective function  Intuitively, we would like to have accurate rules covering as many examples as possible.  However, in complex and inconsistent domains it is rare to obtain such rules  In these cases, the easier path for evolutionary search is to maximize accuracy at the expense of coverage  Therefore, we need to enforce that the evolved rules cover enough examples
  • 90.
    BioHEL’s objective function Three parameters define the shape of the function  The choice of the coverage break is crucial for the proper performance of the system  Also, coverage term penalizes rules that do not cover a minimum percentage of examples or that cover too many
  • 91.
    BioHEL’s characteristics  Attribute list rule representation  Automatically identifying the relevant attributes for a given rule and discarding all the other ones  The ILAS windowing scheme  Efficiency enhancement method, not all training points are used for each fitness computation  An explicit default rule mechanism  Generating more compact rule sets  Iterative process terminates when it is impossible to evolve a rule where the associated class is the majority class among the matched examples  At this point, all remaining training instances are assigned to the default class
  • 93.
    Mining –omics data Biological data can be generated at many different levels  Genomics (DNA)  Transcriptomics (RNA)  Proteomics (proteins)  Metabolomics (small compounds)  Lipidomics (lipids)  Hundreds of –omics have been catalogued
  • 94.
    What does an –omics dataset look like?  In most cases datasets present a similar structure  Each sample is characterised by a large number of variables (RNA, Proteins, lipids, etc.)  Each variable indicates (usually quantitatively) the presence of that element in the sample  Due to the high cost of most –omics technologies, variables >> samples  Problems of over-fitting
  • 95.
    What can we do with the dataset?  In most cases, samples are annotated with a qualitative label  Cancer/Non-cancer patients  Samples of seed tissue for which it is known if the seed germinated or not  Age of the sample  Therefore, we can treat these datasets as classification problems, and generate prediction models from the data  Not just as classification problems  Clustering/Biclustering  Association Rule Mining  Regression
  • 96.
    But in most cases, domain experts are not (only) interested in predictions  Biomarker identification  Identify the key variables  Most strongly associated to each outcome  Using e.g. t-tests to identify those  Presenting higher prediction capacity  As identified by ML methods  Identify interactions between variables  By presenting very high (anti)correlation between them  By acting together to generate predictions
  • 97.
    Functional Network Reconstruction for seed germination  Microarray data obtained from seed tissue of Arabidopsis thaliana  122 samples represented by the expression level of almost 14000 genes  It had been experimentally determined whether each of the seeds had germinated or not  Can we learn to predict germination/dormancy from the microarray data?  [Bassel et al., 2011]
  • 98.
    Generating rule sets  BioHEL was able to predict the outcome of the samples with 93.5% accuracy (10 x 10-fold cross-validation)  Learning from a scrambled dataset (labels randomly assigned to samples) produced ~50% accuracy
    If At1g27595>100.87 and At3g49000>68.13 and At2g40475>55.96 → Predict germination
    If At4g34710>349.67 and At4g37760>150.75 and At1g30135>17.66 → Predict germination
    If At3g03050>37.90 and At2g20630>96.01 and At3g02885>9.66 → Predict germination
    If At5g54910>45.03 and At4g18975>16.74 and At3g28910>52.76 and At1g48320>56.80 → Predict germination
    Everything else → Predict dormancy
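    To make the semantics of such a rule set concrete, the sketch below applies the rules shown on this slide to a sample: rules are checked in order, the first one whose conditions all hold fires, and the default rule predicts dormancy otherwise. The data structure, the function name `predict` and the expression values in the usage example are assumptions for illustration; treating a missing gene as not satisfying its condition is also an assumption.

```python
# Rules copied from the slide: every (gene, threshold) condition of a rule
# must hold (expression > threshold) for the rule to predict germination.
RULES = [
    [("At1g27595", 100.87), ("At3g49000", 68.13), ("At2g40475", 55.96)],
    [("At4g34710", 349.67), ("At4g37760", 150.75), ("At1g30135", 17.66)],
    [("At3g03050", 37.90), ("At2g20630", 96.01), ("At3g02885", 9.66)],
    [("At5g54910", 45.03), ("At4g18975", 16.74), ("At3g28910", 52.76), ("At1g48320", 56.80)],
]


def predict(sample: dict[str, float]) -> str:
    """Return the class of the first matching rule, or the default class."""
    for rule in RULES:
        if all(sample.get(gene, float("-inf")) > threshold for gene, threshold in rule):
            return "germination"
    return "dormancy"  # default rule: "everything else"


# Hypothetical expression values, not taken from the real dataset
print(predict({"At1g27595": 120.0, "At3g49000": 70.0, "At2g40475": 60.0}))  # germination
print(predict({"At1g27595": 120.0, "At3g49000": 10.0, "At2g40475": 60.0}))  # dormancy
```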
  • 99.
    Identifying regulators  The rule building process is stochastic  Generates different rule sets each time the system is run  But if we run the system many times, we can see some patterns in the rule sets  Some genes appear much more frequently than the rest  Some associated to dormancy  Some associated to germination
  • 100.
    Known regulators appear with high frequency in the rules
  • 101.
    Generating co-prediction networks of interactions • For each of the rules shown before to be true, all of the conditions in it need to be true at the same time – Each rule is expressing an interaction between certain genes • From a high number of rule sets we can identify pairs of genes that co-occur with high frequency and generate functional networks • The network shows different topology when compared to other types of network construction methods (e.g. by gene co-expression) • Different regions in the network contain the germination and dormancy genes
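    The counting step behind such a network can be sketched as follows: for every rule in every rule set, each pair of genes appearing together is counted, and pairs seen often enough become edges. The function name, the `min_count` threshold and the gene names in the example are hypothetical; the slides do not state how the frequency cut-off was chosen.

```python
from collections import Counter
from itertools import combinations


def co_prediction_edges(rule_sets: list[list[list[str]]], min_count: int = 2) -> dict[tuple[str, str], int]:
    """Count gene pairs that appear in the same rule; keep pairs seen at least min_count times.

    rule_sets is a list of rule sets, each rule set a list of rules,
    each rule the list of genes used in its conditions.
    """
    counts = Counter()
    for rule_set in rule_sets:
        for rule in rule_set:
            for pair in combinations(sorted(set(rule)), 2):
                counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Two hypothetical rule sets from two runs of the learner
runs = [
    [["g1", "g2", "g3"], ["g2", "g4"]],
    [["g1", "g2"], ["g3", "g4"]],
]
print(co_prediction_edges(runs))  # {('g1', 'g2'): 2}
```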
  • 102.
    Experimental validation  We have experimentally verified this analysis  By ordering and planting knockouts for the highly ranked genes  We have been able to identify four new regulators of germination, with different phenotype from the wild type
  • 104.
    Prediction of structural aspects of protein residues  Many of these features are due to local interactions of an amino acid and its immediate neighbours  Can it be predicted using information from the closest neighbours in the chain?  In this simplified example, to predict the SS state of residue i we would use information from residues i-1, i and i+1. That is, a window of ±1 residues around the target
    (Figure: a chain of residues Ri-5 ... Ri+5 with their secondary structure states SSi-5 ... SSi+5; each window of three consecutive residues maps to the state of its central residue)
    Ri-1 Ri Ri+1 → SSi
    Ri Ri+1 Ri+2 → SSi+1
    Ri+1 Ri+2 Ri+3 → SSi+2
  • 105.
    ARFF file for a simple PSP dataset
    @relation AA+CN_Q2
    @attribute AA_-4 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
    @attribute AA_-3 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
    @attribute AA_-2 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
    @attribute AA_-1 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
    @attribute AA {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
    @attribute AA_1 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
    @attribute AA_2 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
    @attribute AA_3 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
    @attribute AA_4 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
    @attribute class {0,1}
    @data
    X,X,X,X,A,E,I,K,H,0
    X,X,X,A,E,I,K,H,Y,0
    X,X,A,E,I,K,H,Y,Q,0
    X,A,E,I,K,H,Y,Q,F,0
    A,E,I,K,H,Y,Q,F,N,0
    E,I,K,H,Y,Q,F,N,V,0
    I,K,H,Y,Q,F,N,V,V,0
    K,H,Y,Q,F,N,V,V,M,1
    H,Y,Q,F,N,V,V,M,T,0
    Y,Q,F,N,V,V,M,T,C,1
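    The instances in this file are sliding windows of ±4 residues, padded with X beyond the ends of the chain. A minimal sketch of the window generation is shown below; the function name `windows` is hypothetical, and the sequence fragment is only the part of the chain that can be read off the data rows above. Class labels are not generated here, since they come from the known structure.

```python
def windows(sequence: str, half_width: int = 4, pad: str = "X") -> list[list[str]]:
    """One window of 2*half_width + 1 residues per position, padded at the chain ends."""
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [list(padded[i:i + size]) for i in range(len(sequence))]


fragment = "AEIKHYQFNVVMTC"  # residues visible in the data rows of the slide
for row in windows(fragment)[:3]:
    print(",".join(row))
# X,X,X,X,A,E,I,K,H
# X,X,X,A,E,I,K,H,Y
# X,X,A,E,I,K,H,Y,Q
```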
  • 106.
    What information do we include for each residue?  Early prediction methods used just the primary sequence: the AA types of the residues in the window  However the primary sequence has a limited amount of information  It does not contain any evolutionary information: it does not say which residues are conserved and which are not  Where can we obtain this information?  Position-Specific Scoring Matrices, which are a product of a Multiple Sequence Alignment
  • 107.
    Position-Specific Scoring Matrices (PSSM) – For each residue in the query sequence compute the distribution of amino acids of the corresponding residues in all aligned sequences (discarding those too similar to the query) – These distributions will tell us which mutations are likely and which mutations are less likely for each residue in the query sequence – In essence it’s similar to a substitution matrix but tailored for the sequence that we are aligning – A PSSM profile will also tell us which residues are more conserved and which residues are more subject to insertions or deletions
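    The core idea, a per-position distribution of amino acids turned into scores, can be sketched as below. This is a deliberately simplified illustration: the uniform background of 1/20, the pseudocount and the function names are assumptions, and tools that build real PSSMs additionally weight sequences and use empirical background frequencies.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(aligned: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Per-column amino-acid frequencies of an MSA (rows of equal length, '-' = gap).

    Assumes each column contains at least one residue or that pseudocount > 0.
    """
    profile = []
    for column in zip(*aligned):
        residues = [r for r in column if r in AMINO_ACIDS]
        total = len(residues) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (residues.count(aa) + pseudocount) / total for aa in AMINO_ACIDS})
    return profile


def log_odds(profile: list[dict[str, float]], background: float = 1 / 20) -> list[dict[str, float]]:
    """Score = log2(observed frequency / background); positive means favoured at that position."""
    return [{aa: math.log2(p / background) for aa, p in col.items()} for col in profile]


msa = ["MKV", "MRV", "M-L"]  # hypothetical toy alignment
scores = log_odds(column_profile(msa))
print(round(scores[0]["M"], 2), round(scores[0]["A"], 2))
```

    In the toy alignment the first column is fully conserved, so M gets a positive score there while every other amino acid gets a negative one.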
  • 108.
    PSSM for the first 10 residues of 1n7lA
         A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
    A:   4 -1 -2 -2  0 -1 -1  0 -2 -1 -2 -1 -1 -2 -1  1  0 -3 -2  0
    M:  -1 -2 -3 -4 -2 -1 -2 -3 -2  1  2 -2  7  0 -3 -2 -1 -2 -1  1
    E:  -1  0  0  2 -4  2  6 -2  0 -4 -3  1 -2 -4 -1  0 -1 -3 -2 -3
    K:  -1  2  0 -1 -4  1  1 -2 -1 -3 -3  5 -2 -4 -1  0 -1 -3 -2 -3
    V:   0 -3 -3 -4 -1 -3 -3 -4 -4  3  1 -3  1 -1 -3 -2  0 -3 -1  5
    Q:  -1  1  0  0 -3  6  2 -2  0 -3 -3  1 -1 -4 -2  0 -1 -2 -2 -3
    Y:  -2 -1 -1 -3 -3 -1 -1 -3  6 -2 -2 -2 -1  2 -3 -2 -2  1  7 -2
    L:  -2 -3 -4 -4 -2 -3 -3 -4 -3  2  5 -3  2  0 -3 -3 -1 -2 -1  1
    T:   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  2  5 -3 -2  0
    R:  -2  6 -1 -2 -4  1  0 -3  0 -3 -3  2 -2 -3 -2 -1 -1 -3 -2 -3
  • 109.
    Secondary Structure Prediction – The most usual way is to predict whether a residue belongs to an α helix, a β sheet or is in coil state – Several programs can determine the actual SS state of a protein from a PDB file. The most common of them is DSSP – Typically, a window of ±7 amino acids (15 in total) is used. This means 300 attributes (when using PSSM). – A dataset with 1000 proteins with ~250AA/protein would have ~250000 instances
  • 110.
    Secondary Structure Prediction  (Figure: prediction pipeline) Primary sequence (R1, R2, R3, ..., Rn-1, Rn) → MSA → PSSM profile of sequence (PSSM1, PSSM2, PSSM3, ..., PSSMn-1, PSSMn) → Windows generation: window of PSSM profiles (PSSMi-1, PSSMi, PSSMi+1) → Prediction method → Prediction: SSi?
  • 111.
    Other prediction problems  This same structure of prediction can be applied to most 1D structural aspects  However, many of these features are natively continuous (or integer) measures  To treat these problems as classification problems, we need to discretise the output  Unsupervised methods are applied  Uniform length (UL) and uniform frequency (UF) discretisation
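    The difference between the two unsupervised discretisation schemes can be shown with a small sketch: uniform length splits the value range into equal-width intervals, while uniform frequency chooses cut points so that each interval holds roughly the same number of values. Function names and the sample values are made up for the illustration.

```python
def uniform_length_cuts(values: list[float], bins: int) -> list[float]:
    """Cut points that split [min, max] into bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + width * k for k in range(1, bins)]


def uniform_frequency_cuts(values: list[float], bins: int) -> list[float]:
    """Cut points chosen so that each interval holds about the same number of values."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // bins] for k in range(1, bins)]


def discretise(value: float, cuts: list[float]) -> int:
    """Index of the interval (class) that value falls into."""
    return sum(value >= c for c in cuts)


values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 100]  # hypothetical skewed feature
ul = uniform_length_cuts(values, 2)      # [50.0]
uf = uniform_frequency_cuts(values, 2)   # [5]
print([discretise(v, ul) for v in values])  # nine values in class 0, one in class 1
print([discretise(v, uf) for v in values])  # five values in each class
```

    On skewed data, uniform length tends to produce unbalanced classes while uniform frequency produces balanced ones, which connects to the balanced/unbalanced distinction made for these benchmarks.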
  • 112.
    PSP datasets are good ML benchmarks  These problems can be modelled in many ways:  Regression or classification problems  Low/high number of classes  Balanced/unbalanced classes  Adjustable number of attributes  Ideal benchmarks !!  http://icos.cs.nott.ac.uk/datasets/psp_benchmark.html
  • 113.
    Contact Map Prediction  We participated in the CASP9 competition  CASP = Critical Assessment of Techniques for Protein Structure Prediction. Biennial competition  Every day, for about three months, the organizers release some protein sequences for which nobody knows the structure (129 sequences were released in CASP9, in 2010)  Each prediction group is given three weeks to return their predictions  If the machinery is not well oiled, it is not feasible to participate !!  For CM, prediction groups have to return a list of predicted contacts (they are not interested in non-contacts) and, for each predicted pair of contacting residues, a confidence level
  • 114.
    Contact Map predictionPrediction given two residues from a chain whether these two residues are in contact or not  This problem can be represented by a binary matrix. 1= contact 0 = non contact  Plotting this matrix reveals many characteristics from the protein structure helices sheets
  • 115.
    Steps for CM prediction (Nottingham method)  1. Prediction of  Secondary structure (using PSIPRED)  Solvent Accessibility  Recursive Convex Hull  Coordination Number  2. Integration of all these predictions plus other sources of information  3. Final CM prediction (using BioHEL [Bacardit et al., 09])
  • 116.
    Prediction of RCH, SA and CN  We selected a set of 3262 protein chains from PDB-REPRDB with:  A resolution less than 2Å  Less than 30% sequence identity  Without chain breaks or non-standard residues  90% of this set was used for training (~490000 residues)  10% for test
  • 117.
    Prediction of RCH, SA and CN  All three features were predicted based on a window of ±4 residues around the target  Evolutionary information (as a Position-Specific Scoring Matrix) is the basis of this local information  Each residue is characterised by a vector of 180 values  The domain for all three features was partitioned into 5 states
  • 118.
    Characterisation of the contact map problem  Three types of input information were used  1. Detailed information of three different windows of residues centered around  The two target residues (2x)  The middle point between them  2. Information about the connecting segment between the two target residues and  3. Global protein information.  (Figure labels: 1, 2, 3)
  • 119.
    Contact Map dataset From the original set of 3262 proteins we kept all that had <250 AA and a randomly selected 20% of larger proteins  Still, the resulting training set contained 32 million pairs of AA and 631 attributes  Less than 2% of those are actual contacts  +60GB of disk space
  • 120.
    Samples and ensembles 50 samples of 660K examples are generated from the training set with a ratio of 2:1 non-contacts/contacts  BioHEL is run 25 times for each sample  Prediction is done by a consensus of 1250 rule sets  Confidence of prediction is computed based on the votes distribution in the ensemble.  Whole training process took about 25K CPU hours Training set x50 x25 Consensus Predictions Samples Rule sets
  • 121.
    Contact Map prediction in CASP  Predictor groups are asked to submit a list of predicted contacts and a confidence level for each prediction  The assessors then rank the predictions for each protein and take a look at the top L/x ones, where L is the length of the protein and x={5,10}  From these L/x top-ranked contacts two measures are computed  Accuracy: TP/(TP+FP)  Xd: difference between the distribution of predicted distances and a random distribution
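The accuracy measure over the top L/x predictions can be sketched as below. The tuple layout `(i, j, confidence)`, the use of integer division for L/x, and the toy data are assumptions for illustration; the Xd measure is not covered because the slides do not give its formula.

```python
def top_lx_accuracy(predictions, true_contacts, length, x):
    """Accuracy TP/(TP+FP) over the L/x most confident predicted contacts.

    predictions   : list of (i, j, confidence) tuples (assumed layout)
    true_contacts : set of (i, j) pairs that are real contacts
    length        : protein length L
    x             : 5 or 10 in the CASP assessment described above
    """
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)
    top = ranked[: max(1, length // x)]
    if not top:
        return 0.0
    true_positives = sum(1 for i, j, _ in top if (i, j) in true_contacts)
    return true_positives / len(top)

# Hypothetical predictions for a protein of length 20.
toy_predictions = [(1, 10, 0.9), (2, 12, 0.8), (3, 15, 0.7),
                   (4, 18, 0.6), (5, 19, 0.5), (6, 20, 0.4)]
toy_true = {(1, 10), (2, 12), (5, 19)}
acc_l5 = top_lx_accuracy(toy_predictions, toy_true, length=20, x=5)
acc_l10 = top_lx_accuracy(toy_predictions, toy_true, length=20, x=10)
print(acc_l5, acc_l10)
```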
  • 122.
    CASP9 results  These two groups derived contact predictions from 3D models  http://www.predictioncenter.org/casp9/doc/presentations/CASP9_RR.pdf
  • 123.
    Understanding the rule sets  Each rule set has on average 135 rules  We have a total of 168470 rules  Impossible to read all of them individually, but we can extract useful statistics  For instance, how often was each attribute used in the rules?  Full analysis
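A statistic of the kind mentioned above (how often each attribute appears in the rules) can be computed with a simple counter. The representation of a rule as a tuple of attribute names is hypothetical; the slides do not describe how BioHEL stores its rules.

```python
from collections import Counter

def attribute_usage(rule_sets):
    """Count how often each attribute appears across all rules.

    rule_sets: list of rule sets; each rule is a tuple of attribute names
    (a hypothetical representation chosen for this sketch).
    Returns {attribute: (count, share of all attribute uses)}.
    """
    counts = Counter(attr for rules in rule_sets for rule in rules for attr in rule)
    total = sum(counts.values())
    return {attr: (n, n / total) for attr, n in counts.most_common()}

# Two toy rule sets using attribute names that appear in the slides.
toy_rule_sets = [
    [("separation", "PredSA_r1"), ("separation",)],
    [("separation", "PredSS_r1")],
]
usage = attribute_usage(toy_rule_sets)
for attr, (count, share) in usage.items():
    print(f"{attr}: {count} uses, {share:.1%}")
```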
  • 124.
    Distribution of frequency of use of attributes  All 631 attributes are actually used (min frequency=429)  However, some of them are used much more frequently than others
  • 125.
    Top 10 attributes
    Attribute      Frequency   Counts
    PredSS_r1_1    1.48%       18141
    PredCN_r1      1.66%       20336
    propensity     1.74%       21288
    PredSS_r2      1.75%       21350
    PredSS_r1      1.82%       22205
    PredRCH_r2     1.87%       22856
    PredRCH_r1     2.04%       24961
    PredSA_r2      2.12%       25891
    PredSA_r1      2.39%       29246
    separation     4.17%       50951
     All four kinds of residue predictions are highly ranked
  • 127.
    Motivation PSP isa very costly process  As an example, one of the best PSP methods CASP8, Rosetta@Home could dedicate up to 104 computing years to predict a single protein’s 3D structure  One of the possible ways to alleviate this computational cost is to simplify the representation used to model the proteins
  • 128.
    Target for reduction: the primary sequence  The primary sequence of a protein is a common target for such simplification  It is composed of a fairly high-cardinality alphabet of 20 symbols, which share commonalities  One example of reduction widely used in the community is the hydrophobic-polar (HP) alphabet, which reduces these 20 symbols to just two  The HP representation is usually too simple; too much information is lost in the reduction process [Stout et al., 06]  Can we automatically generate these reduced alphabets and tailor them to the specific problem at hand?
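A minimal sketch of the HP reduction mentioned above. Which residues count as hydrophobic varies between authors; the set used here is an assumption for illustration, and the example sequence is made up.

```python
# Assumption: one of several hydrophobic residue sets found in the literature.
HYDROPHOBIC = set("ACFILMVW")

def to_hp(sequence):
    """Map a 20-letter amino acid sequence onto the two-letter HP alphabet."""
    return "".join("H" if aa in HYDROPHOBIC else "P" for aa in sequence.upper())

hp_string = to_hp("MKVLAAGIT")  # hypothetical sequence fragment
print(hp_string)
```

Every sequence position survives the mapping, but 20 residue types collapse into two, which illustrates how much information such a reduction discards.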
  • 129.
    Automated Alphabet Reduction [Bacardit et al., 09] • We will use an automated information theory-driven method to optimize alphabet reduction policies for PSP datasets • An optimization algorithm will cluster the AA alphabet into a predefined number of new letters • The fitness function of the optimization is based on the Mutual Information (MI) metric, which quantifies the interrelationship between two discrete variables – The aim is to find the reduced representation that maintains as much relevant information as possible for the feature being predicted • Afterwards we will feed the reduced dataset into a learning method to verify whether the reduction was appropriate
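To illustrate the idea of scoring a candidate alphabet with Mutual Information, the sketch below computes MI between the reduced letters and the class being predicted, for two candidate groupings. It covers only the scoring step under assumed toy data; the optimization algorithm and the exact fitness function used in the slides are not reproduced, and the function names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI (in bits) between two equally long sequences of discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def reduce_alphabet(residues, groups):
    """Replace each amino acid by the index of the group that contains it."""
    letter_of = {aa: k for k, group in enumerate(groups) for aa in group}
    return [letter_of[aa] for aa in residues]

# Toy data: residue types and a two-state label for each position.
residues = list("AVAVKEKE")
labels = ["buried"] * 4 + ["exposed"] * 4

good = reduce_alphabet(residues, ["AV", "KE"])  # keeps the class signal
bad = reduce_alphabet(residues, ["AK", "VE"])   # mixes the two classes
mi_good = mutual_information(good, labels)
mi_bad = mutual_information(bad, labels)
print(mi_good, mi_bad)
```

In this toy case the first grouping preserves all the information about the label and the second preserves none, which is the kind of difference an MI-based fitness is meant to expose.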
  • 130.
    Alphabet Reduction protocol  [Figure labels: Dataset (Card=20); ECGA; Mutual Information; Size=N; Dataset (Card=N); BioHEL; Test set; Accuracy; Ensemble of rule sets]
  • 131.
    Automated Alphabet Reduction Competent 5-letter alphabet (similar performance to the AA alphabet)  Different alphabets for CN and SA domains  Unexpected explanations: Alphabet reduction clustered AA types that experts did not expect
  • 132.
    Automated Alphabet Reduction Our method produces better reduced alphabets than other reduced alphabets from the literature and than other expert- designed ones Alphabets from the literature Expert designed alphabets Alphabet Letters CN acc. SA acc. Diff. Ref. AA 20 74.0±0.6 70.7±0.4 --- --- Our method 5 73.3±0.5 70.3±0.4 0.7/0.4 [Bacardit et al., 07] WW5 6 73.1±0.7 69.6±0.4 0.9/1.1 [Wang & Wang, 99] SR5 6 73.1±0.7 69.6±0.4 0.9/1.1 [Solis & Rackovsky, 00] MU4 5 72.6±0.7 69.4±0.4 1.4/1.3 [Murphy et al., 00] MM5 6 73.1±0.6 69.3±0.3 0.9/1.4 [Melo & Marti-Renom, 06] HD1 7 72.9±0.6 69.3±0.4 1.1/1.4 [Bacardit et al., 07] HD2 9 73.0±0.6 69.3±0.4 1.0/1.4 [Bacardit et al., 07] HD3 11 73.2±0.6 69.9±0.4 0.8/0.8 [Bacardit et al., 07]
  • 133.
    Efficiency gains from the alphabet reduction  We have extrapolated the reduced alphabet to the much larger and richer Position-Specific Scoring Matrices (PSSM) representation  Accuracy difference is still less than 1%  Obtained rule sets are simpler and the training process is much faster  Performance levels are similar to recent works in the literature [Kinjo et al., 05][Dor and Zhou, 07]  Won the bronze medal of the 2007 Humies awards
  • 134.
    Conclusions  Bioinformatics containmany challenges that computer science can tackle  Optimisation  Machine learning  Software engineering  Evolutionary computation has shown to be very competitive across a large range of bioinformatics problems  Facing these challenges for EC has led to the development of many new methods
  • 135.
    References/Bibliography Journals  TheBioinformatics Journal  BMC Bioinformatics  BMC Biodata Mining  Bioinformatics books  Introduction to Bioinformatics by Arthur Lesk, Oxford University Press.  Introduction to Bioinformatics. A. Tramontano, Chapman and Hall/CRC  Specialised topics  Bioinformatics for –omics data. Methods and Protocols. Bernd Mayer (ed). Springer  Next-Generation Sequencing special issue of the Bioinformatics Journal; http://www.oxfordjournals.org/our_journals/bioinformatics/nextge nerationsequencing.html
  • 136.
    References/Bibliography  J. Bacardit,M. Stout, J.D. Hirst, N. Krasnogor and J. Blazewicz, Coordination number prediction using Learning Classifier Systems: Performance and interpretability. In Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation (GECCO2006), pp. 247-254, ACM Press, 2006  Stout, M., Bacardit, J., Hirst, J.D. and Krasnogor, N. Prediction of Recursive Convex Hull Class Assignments for Protein Residues. Bioinformatics, 24(7):916-923, 2008  Stout, M., Bacardit, J., Hirst, J.D., Smith, R.E. and Krasnogor, N. Prediction of Topological Contacts in Proteins Using Learning Classifier Systems. Soft Computing Journal, 13(3):245-258, 2009  J. Bacardit, E.K. Burke and N. Krasnogor. Improving the scalability of rule-based evolutionary learning. Memetic Computing journal 1(1):55-67, 2009  J. Bacardit, M. Stout, J.D. Hirst, A. Valencia, R.E. Smith and N. Krasnogor. Automated Alphabet Reduction for Protein Datasets. BMC Bioinformatics 10:6, 2009  George W. Bassel, Enrico Glaab, Julietta Marquez, Michael J. Holdsworth and Jaume Bacardit. Functional Network Construction in Arabidopsis Using Rule-Based Machine Learning on Large-Scale Data Sets. The Plant Cell, 23(9):3101-3116, 2011  J. Bacardit, P. Widera, A. Márquez-Chamorro, F. Divina, J.S. Aguilar-Ruiz and Natalio Krasnogor. Contact map prediction using a large-scale ensemble of rule sets and the fusion of multiple predicted structural features. Bioinformatics first published online July 25, 2012 doi:10.1093/bioinformatics/bts472
  • 137.
    References/Bibliography  Jason H.Moore et al., Bioinformatics challenges for genome-wide association studies Bioinformatics (2010) 26(4): 445-455  Loris Nanni, Sheryl Brahnam, Alessandra Lumini, High performance set of PseAAC and sequence based descriptors for protein classification, Journal of Theoretical Biology 266(1):1-10, 2010  Fernando Otero et al., A hierarchical multi-label classification ant colony algorithm for protein function prediction, Memetic Computing 2(3):165-181, 2010  Daniel Barthel et al., Procksi: a decision support system for protein (structure) comparison, knowledge, similarity and information. BMC Bioinformatics, 8:416, 2007  http://omics.org/index.php/Alphabetically_ordered_list_of_omes_and_omics  Federico Divina and Jesus S. Aguilar-Ruiz. 2006. Biclustering of Expression Data with Evolutionary Computation. IEEE Trans. on Knowl. and Data Eng. 18, 5 (May 2006), 590- 602.  Martinez-Ballesteros, M Nepomuceno-Chamorro, J C Riquelme (2011) Inferring gene- gene associations from Quantitative Association Rules In: 11th International Conference on Intelligent Systems Design and Applications (ISDA 2011 ) 1241 – 1246  Rubén Armañanzas, Iñaki Inza, Roberto Santana, Yvan Saeys, Jose Flores, Jose Lozano, Yves Peer, Rosa Blanco, Víctor Robles, Concha Bielza, Pedro Larrañaga. A review of
  • 138.
    Acknowledgements • Prof. Natalio Krasnogor • Prof. Michael Holdsworth • Prof. Jonathan Hirst • Dr. Michael Stout • Dr. George Bassel • Dr. Enrico Glaab • Dr. Pawel Widera • EPSRC GR/T07534/01 & EP/H016597/1 • EU FP7 CADMAD project
  • 139.
    Dr. Jaume Bacardit Interdisciplinary Computing and Complex Systems (ICOS) research group University of Nottingham jaume.bacardit@nottingham.ac.uk