Introduction

Outline
 What is Bioinformatics?
 Basic molecular biology
 Public databases
 Sequence analysis
 The scales of bioinformatics
 Biological data mining

What is Bioinformatics?
 Several definitions exist. Michael Liebman proposed a
quite elegant definition:
 “The study of the information content and information flow in
biological systems and processes” (Michael Liebman)
 Information content: genome project
 Information flow in biological systems: molecular transport
 Biological systems: cells, organisms, …
 Biological processes: metabolic networks
 Bioinformatics is the science of using information to
understand aspects of Biology. That is, a discipline where
techniques such as applied mathematics, computer science,
statistics, artificial intelligence, etc. are integrated to solve
biological problems

Information, information, information
 As we know there have been major advances in the
field of molecular biology
 These have been coupled with advances in
laboratory (post)genomic technology
 This has led to an explosive growth in the
collection of biological information
 This deluge of information has led to an absolute
requirement for
1. Computerized databases to store, organize and index the
data
2. Specialized tools to view and analyze the data
3. Specialized tools to infer new knowledge from the data

Areas of research (taxonomy of the
Bioinformatics Journal)
 Genome Analysis
 Sequence Analysis
 Phylogenetics
 Structural Bioinformatics
 Gene Expression
 Genetics and Population Analysis
 Systems Biology
 Data and Text Mining
 Databases
 Bioimage Informatics

Life begins with Cell
 A cell is the smallest structural unit of an organism that is capable of
sustained independent functioning
 All cells have some common features
 What is Life? Can we create it in the lab? Read:
The imitation game—a computational chemical approach to
recognizing life. Nature Biotechnology, 24:1203-1206, 2006

2 types of cells:
Prokaryotes & Eukaryotes

Terminology
 The genome is an organism’s complete set of DNA.
 a small bacterial genome contains about 600,000 DNA base pairs
 human and mouse genomes have some 3 billion.
 the human genome is organised into 23 pairs of chromosomes.
 Each chromosome contains many genes.
 Gene
 basic physical and functional units of heredity.
 specific sequences of DNA bases that encode
instructions on how and when to make proteins.
 Proteins
 Make up the cellular structure
 large, complex molecules made up of smaller subunits
called amino acids.

All Life depends on 3 critical molecules
 DNAs
 Hold information on how cell works
 RNAs
 Act to transfer short pieces of information to different parts of cell
 Provide templates to synthesize into protein
 Proteins
 Form enzymes that send signals to other cells and regulate gene
activity
 Form body’s major components (e.g. hair, skin, etc.)
 Are life’s laborers!
 Computationally, all three can be represented as
sequences of a certain 4-letter (DNA/RNA) or 20-letter
(Proteins) alphabet
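
A minimal sketch of this representation, assuming plain Python strings and a hypothetical helper name `guess_alphabet` (not part of the slides), checks which of the alphabets above a sequence fits:

```python
# Alphabets as described above: 4 letters for DNA/RNA, 20 for proteins.
DNA = set("ACGT")
RNA = set("ACGU")
PROTEIN = set("ACDEFGHIKLMNPQRSTVWY")  # one-letter codes of the 20 standard amino acids


def guess_alphabet(seq: str) -> str:
    """Return the smallest alphabet that contains every symbol of seq."""
    symbols = set(seq.upper())
    if symbols <= DNA:
        return "DNA"
    if symbols <= RNA:
        return "RNA"
    if symbols <= PROTEIN:
        return "protein"
    return "unknown"


print(guess_alphabet("GATTACA"))
print(guess_alphabet("AUGGCC"))
print(guess_alphabet("MKYNNHDK"))
```

The guess is ambiguous by construction: a string such as "ACGT" is also a valid protein sequence, so real tools rely on file metadata rather than on the letters alone.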

DNA, RNA, and the Flow of Information
[Figure: Weismann Barrier / Central Dogma of Molecular Biology, showing replication, transcription and translation]

Overview of DNA to RNA to Protein
 A gene is expressed in two steps
1) Transcription: RNA synthesis
2) Translation: Protein synthesis

DNA: The Basis of Life
 Deoxyribonucleic Acid (DNA)
 Double stranded with complementary strands A-T, C-G
 DNA is a polymer
 Sugar-Phosphate-Base
 Bases held together by H bonding to the opposite strand

RNA
 RNA is similar to DNA chemically. It is usually
only a single strand. T(hymine) is replaced by
U(racil)
 Some forms of RNA can form secondary
structures by “pairing up” with themselves. This can
dramatically affect their properties.
DNA and RNA can pair with each other.
tRNA linear and 3D view: http://www.cgl.ucsf.edu/home/glasfeld/tutorial/trna/trna.gif

RNA, continued
Several types exist, classified by function:
 hnRNA (heterogeneous nuclear RNA): Eukaryotic mRNA primary
transcripts with introns that have not yet been excised (pre-mRNA).
 mRNA: this is what is usually being referred to when a
Bioinformatician says “RNA”. This is used to carry a gene’s
message out of the nucleus.
 tRNA: transfers genetic information from mRNA to an amino acid
sequence in order to build a protein
 rRNA: ribosomal RNA. Part of the ribosome which is involved in
translation.

Transcription Transcription is highly regulated. Most DNA is in a
dense form where it cannot be transcribed.
 To start, transcription requires a promoter, a small
specific sequence of DNA to which polymerase can
bind (~40 base pairs “upstream” of gene)
 Finding these promoter regions is only a partially
solved problem that is related to motif finding.
 There can also be repressors and inhibitors acting in
various ways to stop transcription. This makes
regulation of gene transcription complex to
understand.

Definition of a Gene
 Regulatory regions: up to 50 kb upstream of +1 site
 Exons: protein coding and untranslated regions (UTR)
1 to 178 exons per gene (mean 8.8)
8 bp to 17 kb per exon (mean 145 bp)
 Introns: splice acceptor and donor sites, junk DNA
average 1 kb – 50 kb per intron
 Gene size: Largest – 2.4 Mb (Dystrophin). Mean – 27 kb.

Splicing and other RNA processing
 In Eukaryotic cells, RNA is processed between
transcription and translation.
 This complicates the relationship between a DNA
gene and the protein it codes for.
 Sometimes alternate RNA processing can lead to an
alternate protein (a splice variant). This occurs, for
example, in the immune system.

Proteins: Crucial molecules
for the functioning of life
• Structural proteins: the organism's basic building blocks, e.g. collagen and the
keratin of nails and hair.
• Enzymes: biological engines which mediate a multitude of biochemical
reactions. Usually enzymes are very specific and catalyze only a single type
of reaction, but they can play a role in more than one pathway.
• Transmembrane proteins: the cell’s housekeepers, e.g. by
regulating cell volume, extracting and concentrating small molecules from
the extracellular environment, and generating the ionic gradients essential for
muscle and nerve cell function (the sodium/potassium pump is an example)
• Proteins are polypeptide chains, constructed by joining amino acids
through peptide bonds in a linear way
• The chain of amino acids, however, folds to create very complex 3D
structures

Translation
 The process of going
from RNA to
polypeptide.
 Three consecutive bases of
RNA (called a codon)
correspond to one
amino acid based on
a fixed table.
 Always starts with
Methionine and ends
with a stop codon
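
The two steps of gene expression can be sketched in a few lines. This is an illustrative sketch, not a production translator: it assumes the input is the coding (sense) DNA strand, uses the standard genetic code, starts at the first AUG it finds, and ignores splicing and regulation. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(coding_dna: str) -> str:
    """Transcription: the mRNA has the coding-strand sequence with U instead of T."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translation: read codons from the first AUG until a stop codon ('*')."""
    dna = mrna.upper().replace("U", "T")
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")
print(mrna, translate(mrna))
```

The sketch raises a `KeyError` on symbols outside A, C, G, T/U; handling ambiguous bases is left out on purpose.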

Protein Structure: Introduction
 Different amino acids
have different properties
 These properties will
affect the protein
structure and function
 Hydrophobicity, for
instance, is the main
driving force (but not
the only one) of the
folding process

Protein Structure: Hierarchical nature of protein
structure
MKYNNHDKIRDFIIIEAYMFRFKKKVKPEVDMTIKEFILLTYLFHQQENTL
PFKKIVSDLCYKQSDLVQHIKVLVKHSYISKVRSKIDERNTYISISEEQRE
KIAERVTLFDQIIKQFNLADQSESQMIPKDSKEFLNLMMYTMYFKNIIKK
HLTLSFVEFTILAIITSQNKNIVLLKDLIETIHHKYPQTVRALNNLKKQGYL
IKERSTEDERKILIHMDDAQQDHAEQLLAQVNQLLADKDHLHLVFE
Primary Structure = Sequence of amino acids
Secondary Structure = Local Interactions
Tertiary Structure = Global Interactions

Protein Structure: Why is structure
important?
 The function of a protein depends greatly on its
structure
 The structure that a protein adopts is vital to its
chemistry
 Its structure determines which of its amino acids are
exposed to carry out the protein’s function
 Its structure also determines what substrates it can
react with

Protein Structure: Mostly lacking
information
 Therefore, it is clear that knowing the structure of a
protein is crucial for many tasks
 However, we only know the structure for a very small
fraction of all the proteins that we are aware of
 The UniProtKB/TrEMBL archive contains 23165610
(16886838) sequences
 The PDB archive of protein structure contains only
84223 (76669) structures
 In the native state, proteins fold on their own as soon as
they are generated, amino acid by amino acid (with
few exceptions, e.g. those that need chaperones) → can we predict this
process so as to close the gap between protein sequences
and their 3D structures?

Central Dogma of Biology: A Bioinformatics
Perspective
The information for making proteins is stored in DNA. There is
a process (transcription and translation) by which DNA is
converted to protein. By understanding this process and how it
is regulated we can make predictions and models of cells.
[Figure: computational problems along the central dogma: sequence analysis, gene finding, assembly, protein sequence/structure analysis]

Information flow in bioinformatics
 Data enters the “bioinformatics scope” when a scientist
deposits an experimental result in an appropriate archive
 The archive curates and annotates the data
 The data is released to the public
 Afterwards, the data may be retrieved/analysed:
 Integrating the new entry into a search engine
 Extracting useful subsets of the data
 Deriving new types of information from the data
 Aggregating the data, by homology, function, structure
 Reannotating the data with new discovered/inferred info.
 Quality of data depends on many factors, the techniques used
to experimentally create the data, degree of inference and
prediction involved in the annotation process, etc.
 Many publicly available databases:
http://en.wikipedia.org/wiki/List_of_biological_databases

NCBI’s Entrez system
http://www.ncbi.nlm.nih.gov/
Entrez is a search and retrieval system that integrates
information from databases at NCBI (National Center for
Biotechnology Information).

Uniprot http://www.uniprot.org
 The Universal Protein Resource (UniProt) is a collaboration between
the European Bioinformatics Institute (EBI), the SIB Swiss Institute of
Bioinformatics and the Protein Information Resource (PIR)

KEGG - http://www.genome.jp/kegg/
 Not just about
genes/proteins but
also pathways, that is,
their interactions

DAVID - http://david.abcc.ncifcrf.gov/

Sequences
 Be it DNA, RNA or proteins we have many data that
can be represented as sequences of a certain alphabet
 Many generic algorithms to deal with biological
sequences exist
 Sequence alignment
 Motif representation

Sequence Alignment Is the assignment of residue-residue correspondences
between nucleotide/proteomic sequences
Query 1    MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRY 60
           MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRY
Sbjct 1    MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRY 60
Query 61   YETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSV 120
           YETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLL
Sbjct 61   YETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLL------------- 107
...
Query 301  QPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQTSSYSCMLPT 360
           QPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQ      +    C  P+
Sbjct 281  QPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQ-----DSFHLECQFPS 335
Query 361  S-PSVN 365
             P VN
Sbjct 336  KFPGVN 341
(Figure labels: gap = “-”; matches = letter repeated in the middle line; mismatches = blank or “+” in the middle line)

Motivation
 Similarity is expected among biomolecules that are
descended from a common ancestor.
 Mutations cause differences, but survival of the organism requires that
mutations occur in regions that are less critical to function
 Important catalytic, regulatory or structural regions remain similar
 An alignment between two or more genetic or proteomic
sequences represents an explicit hypothesis about their
evolutionary histories.
 Thus comparison of related gene/protein sequences has
been instrumental in shedding light on the information
content of these sequences and their biological functions.

Definition and aims
 Why align sequences?
1. Start with a query sequence with unknown properties
and search within a database of millions of sequences to
find those which share similarity with the query.
2. Start with a small set of sequences and identify
similarities and differences among them.
3. In many sequences or very long sequences, detect
commonly occurring patterns

Similarity vs. Homology
 Similarity is the observation or measurement of resemblance
and difference, independent of the source of resemblance.
 There are many examples of different organisms with
functionally similar organs that came from distinct evolutionary
origins
 When similarity is due to a common ancestry, we call it
homology.
 Sequence alignment helps inferring homology hypothesis:
 If two sequences are very similar, it is probable that there is a common
origin
 Therefore, if we know some information (structure, function) from
sequence X, and sequence X is similar to sequence Y, it is probable that
the same information applies to Y

Metrics of similarity: Definitions
 Gap: a break in the alignment, in either one of the
sequences.
 For nucleotides, a consequence of an insertion or deletion
mutation.
 For proteins, it’s more difficult to say.
 Regions of matching residues.
 Indicate parts of a sequence that are well conserved
 Mismatched residues.
 For nucleotides, a consequence of a substitution mutation
 Less conserved regions

Metrics of similarity: Distance scoring
 Distance scoring
 Given an alignment with matches, mismatches and
gaps, we compute a score following:
 For each mismatch, score is increased by 2
 For each gap, score is increased by 4
 For each match, no increase in score
 Higher score, less similarity
 Equivalent metrics exist for similarity (not
distance) where higher score means good
similarity
Example (distance per column on the last row, total = 18):
A - G C C G T A T
A C G A - - T - T
0 4 0 2 4 4 0 4 0
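
The scoring rule above can be checked with a few lines of code. This is a minimal sketch using the slide's values (mismatch = 2, gap = 4); the function name `distance_score` and the use of "-" as the gap symbol are assumptions made for illustration.

```python
MISMATCH = 2  # added for each mismatched column
GAP = 4       # added for each column that contains a gap


def distance_score(row_a: str, row_b: str) -> int:
    """Distance of an existing alignment given as two equal-length rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += GAP
        elif a != b:
            score += MISMATCH
        # matching columns add nothing
    return score


print(distance_score("A-GCCGTAT", "ACGA--T-T"))  # 18, as in the example above
```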

Metrics of similarity: Mismatches and gaps
 Are all mismatches equally bad?
 For protein sequences, there are several subgroups of amino
acids with similar properties. Mismatches within a group have
less impact
 For nucleotide sequences, transition mutations (a↔g and t↔c)
are more common than transversions (a or g ↔ t or c)
 Distance scoring of mismatches could be smarter → substitution
matrices
 Built by statistical analysis of a large corpus of real sequences to generate better
scores
 How to penalize gaps
 Each gap slot gets equal distance score
 One score to open a gap, another (smaller) score to extend the
same gap
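
Both refinements can be illustrated with a short sketch. The numeric values (transition = 1, transversion = 2, gap opening = 4, gap extension = 1) are assumptions chosen for illustration, not values from the slides, and real substitution matrices are estimated from data rather than hard-coded like this.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_cost(a: str, b: str, transition: int = 1, transversion: int = 2) -> int:
    """Distance for one aligned pair of nucleotides (no gap in the column)."""
    if a == b:
        return 0
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


def gap_runs(row: str) -> list:
    """Lengths of the maximal runs of '-' in one aligned row."""
    runs, current = [], 0
    for symbol in row:
        if symbol == "-":
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def affine_gap_cost(length: int, gap_open: int = 4, gap_extend: int = 1) -> int:
    """One score to open a gap, a smaller one for each extra slot."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


row = "ACGA--T-T"
print(gap_runs(row))                                   # [2, 1]
print(sum(affine_gap_cost(n) for n in gap_runs(row)))  # 9, versus 12 with a flat cost of 4 per slot
```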

Global vs Local alignment
 We know how to score good or bad alignments
 How to find the optimal one?
 Two classes of alignment methods
 Global alignment
 Finds the best alignment of one entire sequence with another
entire sequence
 Local alignment
 Find the best alignment of one segment of a sequence against
another segment of another sequence
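
As a sketch of how an optimal global alignment can be found, the classic dynamic programming recurrence (Needleman-Wunsch) is shown below, adapted to minimise the distance score used earlier (mismatch = 2, gap = 4) rather than to maximise a similarity. It returns only the optimal distance, not the alignment itself; the function name is hypothetical.

```python
def global_alignment_distance(s: str, t: str, mismatch: int = 2, gap: int = 4) -> int:
    """Minimum distance over all global alignments of s and t."""
    n, m = len(s), len(t)
    # dist[i][j] = minimum distance between the prefixes s[:i] and t[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * gap  # s[:i] aligned against gaps only
    for j in range(1, m + 1):
        dist[0][j] = j * gap  # t[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = dist[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch)
            gap_in_t = dist[i - 1][j] + gap
            gap_in_s = dist[i][j - 1] + gap
            dist[i][j] = min(substitute, gap_in_t, gap_in_s)
    return dist[n][m]


print(global_alignment_distance("AGCCGTAT", "ACGATT"))
```

A local alignment uses a similar table but works with similarity scores and lets an alignment start and end anywhere (Smith-Waterman); that variant is not shown here.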

Exact vs. Approximate methods
 Exact methods for both global and local alignment exist, based
on dynamic programming, but are slow
 Good enough when there are few sequences
 Not so good when comparing a target sequence to a database of
millions of known sequences
 Approximate methods have been used for many years for large-
scale alignment tasks
 They use some kind of heuristic to speed up the alignment process
 BLAST (Basic Local Alignment Search Tool) is the most famous
approximate method
 It identifies potential hits by looking for perfect matches of very small sub-sequences
(seeds)
 It only tries to create a full alignment for sequences where several seeds are identified
 PSI-BLAST: version that takes into account that multiple hits are identified. It
constructs a tailored substitution matrix based on hits and then refines the alignment
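
The seeding idea can be sketched as an exact k-mer lookup. This is only the first step of a BLAST-like search under simplifying assumptions: real BLAST also considers similar (neighbourhood) words, extends the seeds into alignments and scores their statistical significance, none of which is shown. Function names and the value of k are illustrative.

```python
from collections import defaultdict


def index_kmers(seq: str, k: int) -> dict:
    """Map every k-letter word of seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_seeds(query: str, subject: str, k: int = 3) -> list:
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = index_kmers(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds


print(find_seeds("ACGTT", "TTACGA", k=3))  # [(0, 2)]
```

Only database sequences that yield seeds would then be passed to the slower, full alignment step.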

Multiple Sequence Alignment
 When we have to align more than two sequences
 Progressive methods (e.g. ClustalW)
 Start with seed alignment
 Iteratively incorporate other alignments to seed, without
modifying what is aligned so far
 ClustalW uses phylogenetic trees (representations of the
evolutionary relationship between sequences) to
progressively construct MSA
 Iterative methods (e.g. MUSCLE)
 Can re-edit the partial MSA based on the newly
incorporated alignments

Motifs
 When visualising a MSA we can see regions of high
agreement and regions of low agreement.
 The high agreement regions define that a certain
protein belongs to a family
 What if we concentrate on modelling and identifying
these regions instead of the whole sequences  Motif
finding
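
A simple way to locate the high agreement regions is to measure conservation column by column. The sketch below assumes the MSA is given as a list of equal-length strings with "-" for gaps; the toy alignment and function names are made up for illustration, and real motif finders use richer models (e.g. position weight matrices or HMMs).

```python
from collections import Counter


def column_conservation(msa: list) -> list:
    """For each column, return (most common symbol, fraction of rows that have it)."""
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all rows of an MSA must have the same length")
    profile = []
    for column in zip(*msa):
        symbol, count = Counter(column).most_common(1)[0]
        profile.append((symbol, count / len(msa)))
    return profile


def conserved_columns(msa: list, threshold: float = 1.0) -> list:
    """Indices of columns whose most common residue reaches the threshold."""
    return [
        i for i, (symbol, fraction) in enumerate(column_conservation(msa))
        if symbol != "-" and fraction >= threshold
    ]


toy_msa = ["ACGT", "ACCT", "ACGA"]
print(conserved_columns(toy_msa))  # [0, 1]
```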

DNA
 Coding/non coding
 SNPs
 Copy number variation
 Assembly
 Methylation
 Primer design

Coding/Non Coding
 Identifying the regions from an organism’s genome
that contain genes
 Many different factors involved in this identification
 Promoter identification
 Long enough Open Reading Frames (ORF)
 Splice variants
 Introns/Exons (in Eucaryotes)
 Statistical properties of gene-coding DNA
 HMM are also used for gene finding

Single Nucleotide Polymorphisms
(SNPs)
 One base-pair variation in DNA
 In most cases in non-coding regions of DNA, but not
always
 When frequent enough in a population they can be
linked to specific traits, e.g. a disease
 SNP microarrays can be used to probe hundreds of
thousands of SNPs in parallel
 In reality few SNPs act on their own
 Genome-Wide Association Studies identify groups of
SNPs linked to a certain condition

Copy Number Variation
 In general two copies of each gene exist in a genome
 It may be the case that more or fewer than two copies
of a certain gene exist in a specific sub-population
 It has been suggested that certain CNV can be linked
to specific diseases

Genome assembly
 Sequencing technologies are able to read (sequence) a
complete genome as a series of short overlapping
fragments
 How to assemble back all these fragments?
 Greedy approach
 Pair-wise alignments of all fragments
 Merge fragments of largest overlap
 Keep iterating until all segments are merged
 Worked more or less well on old sequencing technologies,
not so well on next-generation sequencing data, due to
smaller fragment sizes and larger error rate
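
The greedy approach can be sketched as follows, assuming error-free fragments from a single strand and a non-empty input list; real assemblers must also handle sequencing errors, reverse complements and repeats. Function names are hypothetical.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments: list) -> str:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (-1, 0, 1)  # (overlap size, index of left fragment, index of right fragment)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    size = overlap(a, b)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        merged = frags[i] + frags[j][size:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags[0]


print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTGA"]))  # ATGCGTACCTGA
```

Comparing all pairs on every iteration is what makes this approach slow for large numbers of short fragments.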

Genome mapping
 Given a large set of short fragments, as a result of next-
generation sequencing, map them to a reference
genome
 This differs from genome assembly: we do not want to
reconstitute a complete genome, just identify to which
genes each fragment belongs (among other
applications).
 Speed is an issue
 Modern methods (e.g. SOAP2) compress the genome
and are able to align the fragments in the compressed
space
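
To illustrate the mapping task itself, the sketch below uses a plain hash index of k-mers. This is a deliberately simpler technique than the compressed index that the slide attributes to tools such as SOAP2; it allows mismatches but no gaps, and all names and parameter values are assumptions.

```python
def build_index(reference: str, k: int) -> dict:
    """Map every k-mer of the reference to the positions where it starts."""
    index = {}
    for pos in range(len(reference) - k + 1):
        index.setdefault(reference[pos:pos + k], []).append(pos)
    return index


def map_read(read: str, reference: str, index: dict, k: int, max_mismatches: int = 1) -> list:
    """Start positions where the read fits the reference with few mismatches."""
    hits = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)


reference = "ACGTACGTTAGC"
index = build_index(reference, k=4)
print(map_read("CGTTAG", reference, index, k=4))  # [5]
```

The index is built once and reused for every read, which is where the speed-up over pairwise alignment comes from.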

Methylation
 It is a chemical reaction that can block a certain region
of a chromosome, preventing its transcription
 The process can be reverted, so essentially it is an
on/off switch of the affected gene
 Specialised microarrays exist for the high-throughput
detection of methylated genes
 Afterwards, data analysis can take place

DNA library specification
• A DNA library is a combinatorial set of DNA sequences suited to
manufacture via DNA reuse
• The first stage towards the creation of a DNA library is the formal
specification of the target DNA molecules that comprise it
• A set of sequences does not convey the intention behind the library
• The key challenge is to enable precise editing of DNA sequences in an
extensible and reproducible manner whilst avoiding manual handling of
these unwieldy objects

DNALD library format
 A DNALD library consists of three sets of definitions:
inputs, intermediates and outputs, with different
semantics
 Inputs: existing DNA sequences to be provided with the design
 Intermediates: a conceptual means of factoring out common sequences
 Outputs: sequences to be produced through DNA reuse

DNALD expressions
 A DNALD expression is a combination of explicit sequences,
definition names, operators and functions that are interpreted
according to rules of precedence and association ("evaluated") to
produce a set of DNA sequences.
 Definitions bind names to the results of expressions.

Workbench interface
text editor with:
• syntax highlighting
• auto-completion
• code folding
• etc.
manage projects
viewed from different perspectives

CADMAD’s DNALD (DNA Library
Design)
A specification language that
produces a set of target DNA
sequences as a function of
operations on a set of inputs
To maximise CADMAD's impact the specification process must be:
 user friendly and debuggable
 but expressively powerful enough to:
 define non-trivial combinatorial constructs
 communicate degrees of freedom
>Ret_human
GGCCTCTACTTCTCGAGGGATGCTTACTGGGAGAAGCTGTATGTGGACCAGGCGGCCGGCA
CGCCCTTGCTGTACGTCCATGCCCTGCGGGACGCCCCTGAGGAGGTGCCCAGCTTCCGCCT
GGGCCAGCATCTCTACGGCACGTACCGCACACGGCTGCATGAGAACAACTGGATCTGCATC
CAGGAGGACACCGGCCTCCTCTACCTTAACCGGAGCCTGGACCATAGCTCCTGGGAGAAGC
TCAGTGTCCGCAACCGCGGCTTTCCCCTGCTCACCGTCTACCTCAAGGTCTTCCTGTCACC
CACATCCCTTCGTGAGGGCGAGTGCCAGTGGCCAGGCTGTGCCCGCGTATACTTCTCCTTC
TTCAACACCTCCTTTCCAGCCTGCAGCTCCCTCAAGCCCCGGGAGCTCTGCTTCCCAGAGA
CAAGGCCCTCCTTCCGCATTCGGGAGAACCGACCCCCAGGCACCTTCCACCAGTTCCGCCT
GCTGCCTGTGCAGTTCTTGTGCCCCAACATCAGCGTGGCCTACAGGCTCCTGGAGGGTGAG
GGTCTGCCCTTCCGCTGCGCCCCGGACAGCCTGGAGGTGAGCACGCGCTGGGCCCTGGACC
GCGAGCAGCGGGAGAAGTACGAGCTGGTGGCCGTGTGCACCGTGCACGCCGGCGCGCGCGA
GGAGGTGGTGATGGTGCCCTTCCCGGTGACCGTGTACGACGAGGACGACTCGGCGCCCACC
TTCCCCGCGGGCGTCGACACCGCCAGCGCCGTGGTGGAGTTC
>Ret_mouse
GGCCTCTATTTCTCAAGGGATGCTTACTGGGAGAGGCTGTATGTAGACCAGCCAGCTGGCA
CACCTCTGCTCTATGTCCATGCCCTACGGGATGCCCCTGGAGAAGTGCCGAGCTTCCGCCT
GGGCCAGCATCTCTATGGCGTCTACCGTACACGGCTGCATGAGAATGACTGGATCCGCATC
AATGAGACTACTGGCCTTCTCTACCTCAATCAGAGCCTGGACCACAGTTCCTGGGAACAGC
TCAGCATCCGCAATGGTGGTTTCCCCCTGCTCACCATCTTCCTCCAGGTCTTTCTGGTGGA
AAACTGCCAGGAGTTCAGCGGTGTCTCCATCCAGTACAAGCTGCAGCCTTCCAGCATCAAC
TGCACTGCCCTAGGTGTGGTCACCTCACCCGAGGACACCTCGGGGACCCTATTTGTAAATG
ACACAGAGGCCCTGCGGCGACCTGAGTGCACCAAGCTTCAGTACACGGTGGTAGCCACTGA
CCGGCAGACCCGCAGACAGACCCAGGCTTCGCTAGTGGTCACTGTGGAGGGGACATCCATT
ACTGAAGAAGTAGGCT
>Ret_zebrafish
GGGCTGTATTTTCCTCAAAGGCTTTACACAGAGAACATCTACGTGGGTCAGCAGCAGGGAT
CACCGTTGCTTCAGGTCATTTCAATGCGGGAATTCCCTACAGAGAGGCCTTATTTCTTCCT
GTGCTCGCACAGAGACGCTTTTACATCATGGTTTCACATAGATGAGGCGTCCGGAGTTCTT
TATCTCAACAAAACCCTGGAGTGGAGCGACTTCAGTAGTTTACGCAGCGGCTCAGTTCGCT
CCCCGAAGGATCTCTGACCTATCAGTTAGAGATTGTCGACAGGAACATCACTGCTGAAGCT
CAGTCCTGTTACTGGGCGGTTAGTCTTGCACAAAACCCGAATGATAATACAGGCGTTCTCT
ATGTGAACGACACCAAAGTGTTACGCAGACCAGAGTGCCAAGAGCTGGAGTATGTGGTCAT
TGCCCAGGAGCAGCAGAACAAGCTTCAGGCCAAGACACAGCTCACCGTCAGTTTTCAAGGC
GAAGCAGATTCACTGAAAACGGATG
>Ret_chicken
GGTCTGTACTTCCCCAGAAAGGAGTACTCAGAGAACGTCTACATTGACCAGCCAGCAGGTG
CGCCGCTCCTACGCATCCACGCCTTGAGGGATTCACATGGGAAACAGCCCACTTTCATCTG
TGCCAGAAGTCTCATCATTTCTCGAGCAAGATCCCATGAAAATCACTGGTTTCAAATCAGA
GAAAAAATGGGACTTCTCTACCTCAGCAAGAGCCTAGATAGAGAAGACTTTAACATGCTGT
CTGTAGGAAACTGGATGCCATTATCAAAGGTGATGCTGTATGTCTTCCTCTCATCTCACCC
TTTCCAAGAGAAGGAATGTGACTCTGCTACTCGTACCACAGTCGTCCTCTCTTTGATCAAT
GCTACTGCACCAGCTTGCAGTTCACTGTCAGCAAGGCAGCTTTGCTTCACAGAAATGGATC
TCTCCTTTCACATCAAGGAGAATAAACCCCCTGGTACATTTCATCAGCTCCAGTTACCCTC
AGTTCATCATCTGTGTCAGAATCTCAGCATTACCTACAAACTGTTGGCAGCCGAAGGCCTG
CCTTTTCGGTACAATGAGAACACCACTGGTGTGAGTGTAACACAGCGCCTAGATCGAGAGG
AGAGAGAGAGATATGAGCTGATCGCCAAATGCACCGTGAGAGAAGGCTTCAGGGAAATGGA
GGTTGAGGTGCCCTTCCTCGTCAACGTGTTAGATGAAGATGACTCTCCTCCCTTCCTTCCC

RNA
 Expression
 Structure prediction

RNA expression
 Not all genes are transcribed/translated into proteins all
the time
 The expression of genes is highly sophisticated and
depends on many factors
 Identifying the genes being expressed in a given point of
time in a specific tissue provides crucial information about
the roles and interactions of such genes
 Compare the genes expressed between different groups of
samples to identify those that are differentially expressed
 Identify co-expressed genes, that present patterns of
correlation
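
Co-expression is commonly measured with a correlation coefficient between expression profiles. The sketch below uses the Pearson correlation on made-up numbers; the gene names, values and the 0.9 threshold are purely illustrative, and profiles with zero variance are not handled.

```python
from math import sqrt


def pearson(x: list, y: list) -> float:
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def coexpressed_pairs(expression: dict, threshold: float = 0.9) -> list:
    """Gene pairs whose profiles are strongly correlated or anti-correlated."""
    genes = sorted(expression)
    pairs = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            r = pearson(expression[g], expression[h])
            if abs(r) >= threshold:
                pairs.append((g, h, r))
    return pairs


# Hypothetical expression levels of four genes across four samples.
expression = {
    "geneA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8],
    "geneC": [4, 3, 2, 1],
    "geneD": [1, 3, 2, 1],
}
for g, h, r in coexpressed_pairs(expression):
    print(g, h, round(r, 2))
```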

Measuring RNA expression
 RT-PCR (real-time reverse transcription polymerase chain reaction)
 Measures accurately the expression of a pre-determined
gene
 RNA Microarrays
 Measures, in parallel, the expression of tens of
thousands of genes
 RNA-Seq
 The next-generation sequencing variant for measuring
gene expression

RNA Structure prediction
 A RNA sequence can bind with itself to create complex
shapes with a certain pattern of loops
 Can we predict, from a given sequence, the structural
shape of the RNA?

Proteins
 Protein classification
 Structure prediction
 Structure comparison
 Function and interaction

Protein classification
 Proteins can be annotated in many different ways
 Function
 DNA-binding? Enzyme?
 Tissue/Cellular/Sub-cellular localisation
 Interacting with other proteins?
 Can we predict this annotation using ML?
 We need to transform the protein sequence into a uniform
representation of equal size for all proteins
 Many different representations exist
 Several of these problems can be modelled as a hierarchical
classification problem
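
One of the simplest fixed-size representations is the amino acid composition: every protein, whatever its length, becomes a vector of 20 frequencies. The sketch below is one possible encoding among the many mentioned above, not necessarily the one used in the case studies; symbols outside the 20 standard letters are ignored.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def composition(seq: str) -> list:
    """Fraction of each standard amino acid in seq, in AMINO_ACIDS order."""
    seq = seq.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [c / total for c in counts]


vector = composition("MKYNNHDKIRDFIIIEAYMF")
print(len(vector), round(sum(vector), 6))
```

Such vectors can be fed directly to a standard classifier, at the price of discarding the order of the residues.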

Protein Structure Prediction PSP aims to predict the 3D structure of a protein
based on its primary sequence

Protein Structure Prediction
 PSP is an open problem. The 3D structure
depends on many variables
 It has been one of the main holy grails of
computational biology for many decades
 The potential impacts of better protein structure models
are countless
 Genetic therapy
 Synthesis of drugs for incurable diseases
 Improved crops
 Environmental remediation

Prediction types of PSP There are several kinds of prediction problems within
the scope of PSP
 The main one, of course, is to predict the 3D coordinates of
all atoms of a protein (or at least the backbone) based on
its primary sequence
 There are many structural properties of individual residues
within a protein that can be predicted, for instance:
 The secondary structure state of the residue
 If a residue is buried in the core of the protein or exposed in the
surface
 Accurate predictions of these sub-problems can simplify the
general 3D PSP problem

3D Protein Structure Prediction
 Some PSP methods try to find similar proteins and then
adapt the structure of the homolog (template) to the
target protein  Homology Modeling
 Other methods try to find the structure of the protein
from scratch (Ab Initio Modelling), optimizing some
energy function that models the stability of the protein,
in case no homolog can be identified
 In between there are other kinds of methods, for varying
degrees of homology to our target, for instance,
Fold Recognition or Threading
• These methods identify a template based on more than
sequence similarity (i.e. sequence alignment).

Coordination Number PredictionTwo residues of a chain are said to be in contact if their
distance is less than a certain threshold (e.g. 8Å)
CN of a residue : count of contacts that a certain
residue has
CN gives us a simplified profile of the density of packing
of the protein
ContactPrimary
Sequence
Native State

Contact Map prediction Prediction, given two residues
from a chain, whether these two
residues are in contact or not
 This problem can be represented
by a binary matrix. 1= contact, 0 =
non contact
 Plotting this matrix reveals many
characteristics from the protein
structure
 Very sparse characteristic: Less
than 2% of contacts in native
structures
[Figure: contact map, with the patterns produced by helices and sheets marked]
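
When the native structure is known, the contact map and the coordination number (CN) of each residue follow directly from the coordinates; these are the target values that a predictor then tries to recover from sequence alone. The sketch below assumes one 3D point per residue (which atom to use is a modelling choice not fixed by the slides), an 8 Å threshold, and no exclusion of sequence neighbours. It requires Python 3.8+ for `math.dist`.

```python
from math import dist


def contact_map(coords: list, threshold: float = 8.0) -> list:
    """Binary matrix: 1 if two different residues are closer than threshold, else 0."""
    n = len(coords)
    return [
        [1 if i != j and dist(coords[i], coords[j]) < threshold else 0 for j in range(n)]
        for i in range(n)
    ]


def coordination_numbers(cmap: list) -> list:
    """CN of each residue: the number of contacts in its row of the map."""
    return [sum(row) for row in cmap]


# Hypothetical coordinates (in Angstrom) of four residues on a straight line.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cmap = contact_map(coords)
print(cmap)
print(coordination_numbers(cmap))  # [2, 3, 3, 2]
```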

Other predictions Other kinds of residue
structural aspects that can be
predicted
 Solvent accessibility: Amount of
surface of each residue that is
exposed to solvent
 Recursive Convex Hull: A metric
that models a protein as an
onion, and assigns each residue
to a layer. Formally, each layer is
a convex hull of points
 These features (and
others) are predicted in a
similar way to secondary structure (SS)

Protein Structure Comparison
 Protein Structure Comparison (PSC) aims at
 Assess the degree of similarity between protein structures
 Given a query structure, identify other proteins with similar
structure
 Why?
 Group proteins by structural similarities
 Determine the impact of individual residues on the protein
structure
 Identify distant homologues of protein families
 Predict function of proteins with low degree of primary structure
(i.e. sequence) similarity with other proteins
 Engineer new proteins for specific functions
 Assess ab-initio predictions

Protein Structure Comparison
 Sequence-Structure-Function relationships
1) Conserved 1º sequences → similar structures
2) Similar structures → conserved 1º sequences
3) Similar structures → conserved function
 PSC shares many similarities with sequence alignment.
Our aim is to infer new knowledge from the
comparison process
?

Protein Structure Comparison Existing Approaches
 SSAP (Orengo & Taylor, 96)
 ProSup (Feng & Sippl, 96)
 DALI (Holm & Sander, 93)
 CE (Shindyalov & Bourne, 98)
 LGA (Zemla, 2003)
 SCOP (Murzin, Brenner, Hubbard & Chothia, 95)
 CATH (Orengo, Mithie, Jones, Jones, Swindells &
Thornton, 97)
 ProCKSI – Consensus of multiple PSC methods

Prediction of Protein Function
 In an ideal world, the cascade of inference should flow
from sequence → structure → function
 That is, if we can identify similar sequences or structures
to our query target we can (at varying degrees of
certainty) infer that they have similar function

 As proteins evolve, they may
 Retain function and specificity
 Retain function but alter specificity
 Change to a related function, or a similar function in a
different metabolic context
 Change to a completely unrelated function
 How much must a protein change before the
function changes?
 Sometimes, not at all. There are many cases of
proteins with different functions in different
environments

 Thus, sequence or structure similarity is not always
reliable to assign function
 Other ways of determining protein function
 By identifying patterns of co-regulated genes
 Using data from Microarray experiments
 By identifying protein-protein interactions

 A related question is: where is the function of a protein
taking place? → active site
 Several methods exist to predict active/binding sites of
proteins from local patterns of sequence or structure
 A raw way of doing this prediction is to take a look at the
conserved residues of a sequence → they may be related
to either the core of the protein (structural stability) or
the function of a protein (a change of function is a risk for
survival)
 More sophisticated methods exist that learn how to
predict active sites. They use ML, in a similar way to that used to
predict residue structural features in PSP
 Still, it is a very tough problem, and ML methods are not
much better than BLAST-based methods

Three case studies
 Mining –omics data
 Predicting structural aspects of protein residues
 Automated alphabet reduction for protein datasets
 In all these three case studies we use the same
evolutionary learning system: BioHEL [Bacardit et al.,
09]

BioHEL BioHEL [Bacardit et al., 09] is an evolutionary
learning system that applies the Iterative Rule
Learning (IRL) approach
 Designed explicitly to deal with noisy large-scale
datasets
 IRL was first used in EC by the SIA system
[Venturini, 93]

BioHEL’s learning paradigm IRL has been used for many years in the ML community,
with the name of separate-and-conquer
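
The separate-and-conquer loop can be sketched generically: learn one rule, remove the examples it covers, and repeat on what remains. In the sketch below the rule learner is a deliberately trivial stand-in (the best single attribute test), not BioHEL's evolutionary search, and the data format, function names and toy examples are assumptions. Every example is assumed to have at least one attribute.

```python
from collections import Counter


def learn_one_rule(examples: list) -> tuple:
    """Stand-in learner: the single (attribute, value) test with the best accuracy, then coverage."""
    best = None
    for attrs, _ in examples:
        for attr, value in attrs.items():
            covered = [label for a, label in examples if a.get(attr) == value]
            label, hits = Counter(covered).most_common(1)[0]
            quality = (hits / len(covered), len(covered))
            if best is None or quality > best[0]:
                best = (quality, (attr, value, label))
    return best[1]


def separate_and_conquer(examples: list, max_rules: int = 10) -> list:
    """Iterative rule learning: each new rule is learnt on the still uncovered examples."""
    remaining = list(examples)
    rules = []
    while remaining and len(rules) < max_rules:
        attr, value, label = learn_one_rule(remaining)                      # conquer
        rules.append((attr, value, label))
        remaining = [(a, l) for a, l in remaining if a.get(attr) != value]  # separate
    return rules


def predict(rules: list, attrs: dict, default=None):
    """Apply the rules in the order they were learnt; fall back to a default class."""
    for attr, value, label in rules:
        if attrs.get(attr) == value:
            return label
    return default


toy = [
    ({"color": "red", "size": "big"}, "pos"),
    ({"color": "red", "size": "small"}, "pos"),
    ({"color": "blue", "size": "big"}, "neg"),
    ({"color": "blue", "size": "small"}, "neg"),
]
rules = separate_and_conquer(toy)
print(rules)
```

Because rules are learnt on progressively smaller sets, their order matters at prediction time, which is why `predict` stops at the first rule that matches.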

BioHEL’s objective function An objective function based on the Minimum-
Description-Length (MDL) (Rissanen,1978) principle
that tries to promote rules with
 High accuracy: not making mistakes
 High coverage: covering as much examples as possible
without sacrificing accuracy. Recall (TP/(TP+FN)) will be
used to define coverage
 Low complexity: rules as simple and general as possible
 The objective function is a linear combination of the three
objectives above

BioHEL’s objective function
 Intuitively, we would like to have accurate rules covering
as many examples as possible.
 However, in complex and inconsistent domains it is rare
to obtain such rules
 In these cases, the easier path for evolutionary search is to
maximize accuracy at the expense of coverage
 Therefore, we need to enforce that the evolved rules cover
enough examples

BioHEL’s objective function
 Three parameters define the shape of the function
 The choice of the coverage break is crucial for the proper performance of
the system
 Also, coverage term penalizes rules that do not cover a minimum
percentage of examples or that cover too many

BioHEL’s characteristics Attribute list rule representation
 Automatically identifying the relevant attributes for a given rule and
discarding all the other ones
 The ILAS windowing scheme
 Efficiency enhancement method, not all training points are used for
each fitness computation
 An explicit default rule mechanism
 Generating more compact rule sets
 Iterative process terminates when it is impossible to evolve a rule
where the associated class is the majority class among the matched
examples
 At this point, all remaining training instances are assigned to the default
class
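The windowing idea can be sketched as follows: the training set is split into strata that preserve the class proportions, and each iteration of the evolutionary search evaluates fitness on only one stratum. This is a simplified, deterministic sketch under those assumptions (a real implementation would shuffle the examples first); the function names are hypothetical.

```python
from collections import defaultdict

def ilas_strata(examples, n_strata):
    """Split examples into n_strata subsets with similar class proportions.

    Each example is a tuple whose last element is the class label.
    """
    by_class = defaultdict(list)
    for ex in examples:
        by_class[ex[-1]].append(ex)
    strata = [[] for _ in range(n_strata)]
    i = 0
    for label in sorted(by_class):
        for ex in by_class[label]:
            strata[i % n_strata].append(ex)  # deal examples round-robin
            i += 1
    return strata

def stratum_for_iteration(strata, iteration):
    """Alternate strata: iteration t uses stratum t mod n_strata."""
    return strata[iteration % len(strata)]

training = [(k, 0) for k in range(6)] + [(k, 1) for k in range(3)]
strata = ilas_strata(training, n_strata=3)
```

With three strata, each fitness computation touches a third of the training set, which is where the efficiency gain comes from.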

Mining –omics data
 Biological data can be generated at many different
levels
 Genomics (DNA)
 Transcriptomics (RNA)
 Proteomics (proteins)
 Metabolomics (small compounds)
 Lipidomics (lipids)
 Hundreds of –omics have been catalogued

What does an –omics dataset look like?
 In most cases datasets present a similar structure
 Each sample is characterised by a large number of
variables (RNA, proteins, lipids, etc.)
 Each variable indicates (usually quantitatively) the
presence of that element in the sample
 Due to the high cost of most –omics technologies,
variables >> samples
 This leads to problems of over-fitting

What can we do with the dataset?
 In most cases, samples are annotated with a
qualitative label
 Cancer/Non-cancer patients
 Samples of seed tissue for which it is known if the seed
germinated or not
 Age of the sample
 Therefore, we can treat these datasets as
classification problems, and generate prediction
models from the data
 Not just as classification problems
 Clustering/Biclustering
 Association Rule Mining
 Regression

But in most cases, domain experts
are not (only) interested in
predictions
 Biomarker identification
 Identify the key variables
 Most strongly associated with each outcome
 Using e.g. t-tests to identify those
 Presenting higher prediction capacity
 As identified by ML methods
 Identify interactions between variables
 By presenting very high (anti)correlation between them
 By acting together to generate predictions
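As a small illustration of ranking variables by their association with the outcome, the sketch below computes Welch's t statistic for each variable and sorts by its absolute value. The variable names and values are made up for the example, and a real analysis would also compute p-values and correct for multiple testing.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two groups of measurements."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def rank_variables(dataset, labels):
    """dataset maps variable name -> list of values, one per sample."""
    scores = []
    for name, values in dataset.items():
        group1 = [v for v, y in zip(values, labels) if y == 1]
        group0 = [v for v, y in zip(values, labels) if y == 0]
        scores.append((name, welch_t(group1, group0)))
    return sorted(scores, key=lambda s: abs(s[1]), reverse=True)

# Hypothetical expression values for six samples (three per outcome)
labels = [1, 1, 1, 0, 0, 0]
dataset = {
    "gene_a": [10, 11, 12, 1, 2, 3],   # clearly different between outcomes
    "gene_b": [5, 6, 4, 5, 4, 6],      # same mean in both outcomes
}
ranked = rank_variables(dataset, labels)
```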

Functional Network Reconstruction
for seed germination
 Microarray data obtained from seed tissue of Arabidopsis
thaliana
 122 samples represented by the expression level of
almost 14000 genes
 It had been experimentally determined whether each of
the seeds had germinated or not
 Can we learn to predict germination/dormancy from the
microarray data?
 [Bassel et al., 2011]

Generating rule sets
 BioHEL was able to predict the
outcome of the samples with
93.5% accuracy (10 x 10-fold cross-validation)
 Learning from a scrambled dataset
(labels randomly assigned to
samples) produced ~50% accuracy
If At1g27595>100.87 and At3g49000>68.13 and At2g40475>55.96 → Predict germination
…
If At5g54910>45.03 and At4g18975>16.74 and At3g28910>52.76 and
At1g48320>56.80 → Predict germination
Everything else → Predict dormancy
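To show how such a rule set is read, the sketch below encodes the two rules that are fully legible above plus the default rule, and applies them in order as a decision list. The thresholds come from the slide; the sample expression values are invented for illustration.

```python
# Each rule: ({gene: threshold that must be exceeded}, predicted outcome)
RULES = [
    ({"At1g27595": 100.87, "At3g49000": 68.13, "At2g40475": 55.96}, "germination"),
    ({"At5g54910": 45.03, "At4g18975": 16.74, "At3g28910": 52.76,
      "At1g48320": 56.80}, "germination"),
]
DEFAULT = "dormancy"  # "Everything else"

def predict(sample):
    """Return the outcome of the first rule whose conditions all hold."""
    for thresholds, outcome in RULES:
        if all(sample.get(gene, float("-inf")) > t for gene, t in thresholds.items()):
            return outcome
    return DEFAULT

# Hypothetical samples: gene -> expression level
sample_a = {"At1g27595": 120.0, "At3g49000": 70.0, "At2g40475": 60.0}
sample_b = {"At1g27595": 120.0, "At3g49000": 10.0, "At2g40475": 60.0}
outcome_a = predict(sample_a)
outcome_b = predict(sample_b)
```

A sample only triggers a rule when every condition holds at once, which is why each rule can be read as an interaction between the genes it mentions.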

Identifying regulators
 The rule building process is stochastic
 It generates different rule sets each time the system is run
 But if we run the system many times, we can see some
patterns in the rule sets
 Some genes appear far more frequently than the rest
 Some associated with dormancy
 Some associated with germination

Known regulators appear with high
frequency in the rules

Generating co-prediction networks of
interactions
• For each of the rules shown before to be
true, all of the conditions in it need to be
true at the same time
– Each rule expresses an interaction between
certain genes
• From a large number of rule sets we can
identify pairs of genes that co-occur with
high frequency and generate functional
networks
• The network shows a different topology
when compared to other types of network
construction methods (e.g. by gene co-expression)
• Different regions in the network contain
the germination and dormancy genes
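The pair counting described above can be sketched as follows. Each rule is reduced to the list of genes it tests, every pair of genes within a rule counts as one co-occurrence, and pairs seen often enough become edges of the network. The gene names, the data and the `min_count` threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def co_prediction_edges(rule_sets, min_count=2):
    """Count how often two genes appear in the same rule across rule sets."""
    pair_counts = Counter()
    for rule_set in rule_sets:
        for rule_genes in rule_set:
            # sorted() makes (g1, g2) and (g2, g1) the same edge
            for pair in combinations(sorted(set(rule_genes)), 2):
                pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

# Three hypothetical rule sets; each rule is the list of genes it tests
rule_sets = [
    [["g1", "g2", "g3"], ["g4", "g5"]],
    [["g1", "g2"], ["g3", "g4"]],
    [["g2", "g1", "g5"]],
]
edges = co_prediction_edges(rule_sets)
```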

Experimental validation
 We have experimentally verified this analysis
 By ordering and planting knockouts for the highly ranked
genes
 We have been able to identify four new regulators of
germination, with different phenotype from the wild type

Prediction of structural aspects of protein
residues
 Many of these features are due to local interactions of an amino
acid and its immediate neighbours
 Can they be predicted using information from the closest
neighbours in the chain?
 In this simplified example, to predict the SS state of residue i we
would use information from residues i-1, i and i+1. That is, a
window of ±1 residues around the target
[Diagram: chain of residues Ri-5 ... Ri+5, each with its SS state SSi-5 ... SSi+5]
Ri-1 Ri Ri+1 → SSi
Ri Ri+1 Ri+2 → SSi+1
Ri+1 Ri+2 Ri+3 → SSi+2

ARFF file for a simple PSP dataset
@relation AA+CN_Q2
@attribute AA_-4 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
@attribute AA_1 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute class {0,1}
@data
X,X,X,X,A,E,I,K,H,0
X,X,X,A,E,I,K,H,Y,0
X,X,A,E,I,K,H,Y,Q,0
X,A,E,I,K,H,Y,Q,F,0
A,E,I,K,H,Y,Q,F,N,0
E,I,K,H,Y,Q,F,N,V,0
I,K,H,Y,Q,F,N,V,V,0
K,H,Y,Q,F,N,V,V,M,1
H,Y,Q,F,N,V,V,M,T,0
Y,Q,F,N,V,V,M,T,C,1
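The data rows above can be reproduced with a sliding window of ±4 residues, padding positions beyond the chain ends with X. The sketch below is one possible way to generate them; the sequence fragment and class labels are read off the rows shown, while the function name is an assumption.

```python
def window_rows(sequence, labels, half_width=4, pad="X"):
    """Build one comma-separated row per labelled residue.

    Each row holds the residues at positions i-half_width .. i+half_width
    followed by the class of residue i.
    """
    padded = pad * half_width + sequence + pad * half_width
    rows = []
    for i, label in enumerate(labels):
        window = padded[i:i + 2 * half_width + 1]  # centred on residue i
        rows.append(",".join(window) + "," + str(label))
    return rows

# First residues of the chain and the classes of the first ten of them
sequence = "AEIKHYQFNVVMTC"
labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
rows = window_rows(sequence, labels)
```

Only the first ten residues are labelled here because the slide shows only ten rows; the remaining residues are needed to fill the right-hand side of the windows.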

What information do we include for each
residue?
 Early prediction methods used just the primary sequence
 The AA types of the residues in the window
 However, the primary sequence contains a limited amount of
information
 It does not contain any evolutionary information: it does not
say which residues are conserved and which are not
 Where can we obtain this information?
 Position-Specific Scoring Matrices, which are derived from a
Multiple Sequence Alignment

Position-Specific Scoring Matrices (PSSM)
– For each residue in the query sequence compute
the distribution of amino acids of the corresponding
residues in all aligned sequences (discarding those
too similar to the query)
– These distributions tell us which mutations are
likely and which mutations are less likely for each
residue in the query sequence
– In essence it’s similar to a substitution matrix but
tailored for the sequence that we are aligning
– A PSSM profile will also tell us which residues are
more conserved and which residues are more
subject to insertions or deletions
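A minimal sketch of the idea: for every column of the alignment, count the amino acids, turn the counts into frequencies and express them as log-odds against a background frequency. Tools such as PSI-BLAST additionally weight sequences and use empirical background frequencies; the uniform background, the pseudocount and the toy alignment below are simplifying assumptions.

```python
from math import log2

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def column_scores(column, pseudocount=1.0, background=1 / 20):
    """Log-odds score of each amino acid for one alignment column."""
    counts = {aa: pseudocount for aa in AMINO_ACIDS}
    for aa in column:
        if aa in counts:          # gaps ("-") are ignored
            counts[aa] += 1
    total = sum(counts.values())
    return {aa: log2((counts[aa] / total) / background) for aa in AMINO_ACIDS}

def pssm(aligned_sequences):
    """One score dictionary per position of the (equal-length) alignment."""
    return [column_scores(col) for col in zip(*aligned_sequences)]

# Toy alignment of four sequences, three columns
profile = pssm(["MKV", "MRV", "MKL", "M-V"])
```

A fully conserved column, like the first one here, gives a high score to the conserved residue and negative scores to everything else.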

PSSM for the first 10 residues of 1n7lA
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A:  4 -1 -2 -2  0 -1 -1  0 -2 -1 -2 -1 -1 -2 -1  1  0 -3 -2  0
M: -1 -2 -3 -4 -2 -1 -2 -3 -2  1  2 -2  7  0 -3 -2 -1 -2 -1  1
E: -1  0  0  2 -4  2  6 -2  0 -4 -3  1 -2 -4 -1  0 -1 -3 -2 -3
K: -1  2  0 -1 -4  1  1 -2 -1 -3 -3  5 -2 -4 -1  0 -1 -3 -2 -3
V:  0 -3 -3 -4 -1 -3 -3 -4 -4  3  1 -3  1 -1 -3 -2  0 -3 -1  5
Q: -1  1  0  0 -3  6  2 -2  0 -3 -3  1 -1 -4 -2  0 -1 -2 -2 -3
Y: -2 -1 -1 -3 -3 -1 -1 -3  6 -2 -2 -2 -1  2 -3 -2 -2  1  7 -2
L: -2 -3 -4 -4 -2 -3 -3 -4 -3  2  5 -3  2  0 -3 -3 -1 -2 -1  1
T:  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  2  5 -3 -2  0
R: -2  6 -1 -2 -4  1  0 -3  0 -3 -3  2 -2 -3 -2 -1 -1 -3 -2 -3

Secondary Structure Prediction
– The most usual way is to predict whether a
residue belongs to an α helix, a β sheet or is in coil
state
– Several programs can determine the actual SS
state of a protein from a PDB file. The most
common of them is DSSP
– Typically, a window of ±7 amino acids (15 in total)
is used. This means 15 x 20 = 300 attributes (when using
PSSM).
– A dataset with 1000 proteins with
~250 AA/protein would have ~250000 instances
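The attribute count follows directly from concatenating the PSSM rows of every residue in the window. The sketch below flattens such a window; padding positions beyond the chain ends with zeros is an assumption made for the example, and the profile values are dummies.

```python
def window_attributes(profile, i, half_width=7, n_scores=20):
    """Concatenate the PSSM rows of residues i-half_width .. i+half_width."""
    padding = [0] * n_scores      # assumed padding for positions outside the chain
    features = []
    for j in range(i - half_width, i + half_width + 1):
        row = profile[j] if 0 <= j < len(profile) else padding
        features.extend(row)
    return features

# Dummy profile of 30 residues with 20 scores each
profile = [[k] * 20 for k in range(1, 31)]
attrs = window_attributes(profile, i=0)
n_instances = 1000 * 250          # proteins x residues per protein
```

The same function with `half_width=4` yields 9 x 20 = 180 values per residue.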

Secondary Structure Prediction
[Diagram: primary sequence (R1 R2 R3 ... Rn-1 Rn) → MSA → PSSM profile of the sequence (PSSM1 PSSM2 PSSM3 ... PSSMn-1 PSSMn) → windows generation (PSSMi-1 PSSMi PSSMi+1) → prediction method → SSi?]

Other prediction problems
 This same structure of prediction can be applied to most
1D structural aspects
 However, many of these features are natively continuous
(or integer) measures
 To treat these problems as classification problems, we
need to discretise the output
 Unsupervised methods are applied
 Uniform length (UL) and uniform frequency (UF) discretisation
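The difference between the two discretisation schemes is easiest to see on skewed data. In the sketch below, uniform length splits the value range into equal-width intervals, while uniform frequency chooses cut points so that each class receives roughly the same number of examples. The function names and the toy values are assumptions.

```python
def uniform_length_cuts(values, n_bins):
    """Cut points that split [min, max] into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + step * k for k in range(1, n_bins)]

def uniform_frequency_cuts(values, n_bins):
    """Cut points that put about the same number of values in each bin."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]

def discretise(value, cuts):
    """Class index = number of cut points the value reaches or exceeds."""
    return sum(value >= c for c in cuts)

values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 100]   # one extreme value
ul_cuts = uniform_length_cuts(values, 2)
uf_cuts = uniform_frequency_cuts(values, 2)
ul_classes = [discretise(v, ul_cuts) for v in values]
uf_classes = [discretise(v, uf_cuts) for v in values]
```

On this data UL produces a 9/1 split between the two classes, whereas UF produces a balanced 5/5 split, which is how the same feature can give either unbalanced or balanced classification problems.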

PSP datasets are good ML benchmarks
 These problems can be modelled in many ways:
 Regression or classification problems
 Low/high number of classes
 Balanced/unbalanced classes
 Adjustable number of attributes
 Ideal benchmarks!
 http://icos.cs.nott.ac.uk/datasets/psp_benchmark.html

Contact Map Prediction
 We participated in the CASP9 competition
 CASP = Critical Assessment of Techniques for Protein Structure Prediction.
Biennial competition
 Every day, for about three months, the organizers release some protein
sequences for which nobody knows the structure (129 sequences were
released in CASP9, in 2010)
 Each prediction group is given three weeks to return their predictions
 If the machinery is not well oiled, it is not feasible to participate!
 For CM, prediction groups have to return a list of predicted contacts (the
assessors are not interested in non-contacts) and, for each predicted pair of
contacting residues, a confidence level

Contact Map prediction Prediction given two residues
from a chain whether these
two residues are in contact or
not
 This problem can be
represented by a binary matrix.
1= contact 0 = non contact
 Plotting this matrix reveals
many characteristics from the
protein structure
helices sheets
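For a protein of known structure, the binary matrix can be derived from residue coordinates by thresholding pairwise distances. The sketch below assumes one 3D point per residue and an 8 Å cut-off, which is the threshold commonly used in CASP contact assessment; the slides do not state the definition used, so treat both as assumptions. The coordinates are invented.

```python
from math import dist

def contact_map(coords, threshold=8.0):
    """Binary matrix: 1 if two different residues are closer than threshold."""
    n = len(coords)
    return [[1 if i != j and dist(coords[i], coords[j]) < threshold else 0
             for j in range(n)]
            for i in range(n)]

# Four hypothetical residues placed 3.8 Å apart along a straight line
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cm = contact_map(coords)
```

The matrix is symmetric, so only one triangle needs to be predicted.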

Steps for CM prediction (Nottingham
method)
1. Prediction of
 Secondary structure (using PSIPRED)
 Solvent Accessibility
 Recursive Convex Hull
 Coordination Number
2. Integration of all these predictions plus other sources of
information
3. Final CM prediction (using BioHEL [Bacardit et al., 09])

Prediction of RCH, SA and CN
 We selected a set of 3262 protein chains from PDB-REPRDB with:
 A resolution less than 2Å
 Less than 30% sequence identity
 No chain breaks or non-standard residues
 90% of this set was used for training (~490000 residues)
 10% for test

Prediction of RCH, SA and CN
 All three features were predicted based on a window of
±4 residues around the target
 Evolutionary information (as a Position-Specific Scoring
Matrix) is the basis of this local information
 Each residue is characterised by a vector of 180 values
 The domain for all three features was partitioned into 5
states

Characterisation of the contact map
problem
 Three types of input information were used
1. Detailed information of three different windows of
residues centered around
 The two target residues (2x)
 The middle point between them
2. Information about the connecting segment between the
two target residues
3. Global protein information

Contact Map dataset
 From the original set of 3262 proteins we kept all that
had <250 AA and a randomly selected 20% of larger
proteins
 Still, the resulting training set contained 32 million pairs
of AA and 631 attributes
 Less than 2% of those are actual contacts
 More than 60GB of disk space

Samples and ensembles
 50 samples of 660K examples are
generated from the training set with a
ratio of 2:1 non-contacts/contacts
 BioHEL is run 25 times for each sample
 Prediction is done by a consensus of
1250 rule sets (50 x 25)
 Confidence of prediction is computed
based on the distribution of votes in the
ensemble.
 The whole training process took about 25K
CPU hours
[Diagram: training set → 50 samples → 25 rule sets per sample → consensus → predictions]
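The sampling and voting steps can be sketched as below. `balanced_sample` draws contacts and non-contacts at the stated 2:1 ratio, and `ensemble_vote` returns the majority prediction with the fraction of agreeing votes as confidence. Using the vote fraction as the confidence is an assumption about how the vote distribution is turned into a score, and the data are toy values.

```python
import random
from collections import Counter

def balanced_sample(contacts, non_contacts, n_contacts, ratio=2, rng=random):
    """Draw n_contacts positives and ratio * n_contacts negatives."""
    return rng.sample(contacts, n_contacts) + rng.sample(non_contacts, ratio * n_contacts)

def ensemble_vote(predictions):
    """Majority label and the fraction of rule sets that voted for it."""
    label, count = Counter(predictions).most_common(1)[0]
    return label, count / len(predictions)

# Toy pools: integers below 100 stand for contacts, the rest for non-contacts
sample = balanced_sample(list(range(10)), list(range(100, 200)), 5,
                         rng=random.Random(0))
n_rule_sets = 50 * 25                       # samples x runs per sample
label, confidence = ensemble_vote([1] * 900 + [0] * 350)
```

At a 2:1 ratio, a sample of 660K examples would contain about 220K contacts and 440K non-contacts.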

Contact Map prediction in CASP
 Predictor groups are asked to submit a list of
predicted contacts and a confidence level for each
prediction
 The assessors then rank the predictions for each
protein and take a look at the top L/x ones, where L is
the length of the protein and x={5,10}
 From these L/x top ranked contacts two measures are
computed
 Accuracy: TP/(TP+FP)
 Xd: difference between the distribution of predicted
distance and a random distribution
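The accuracy measure can be sketched as follows: sort the submitted contacts by confidence, keep the top L/x and count how many are true contacts. Using integer division for L/x and dividing by the number of contacts actually kept are assumptions of this sketch; the predictions are invented.

```python
def top_lx_accuracy(predictions, true_contacts, length, x=5):
    """Accuracy TP/(TP+FP) over the L/x most confident predicted contacts.

    predictions: list of (i, j, confidence); true_contacts: set of (i, j).
    """
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)
    top = ranked[:max(1, length // x)]
    tp = sum((i, j) in true_contacts for i, j, _ in top)
    return tp / len(top)

# Hypothetical predictions for a protein of length L = 20
predictions = [(1, 10, 0.9), (2, 12, 0.8), (3, 15, 0.7), (4, 18, 0.6), (5, 19, 0.5)]
true_contacts = {(1, 10), (2, 12), (5, 19)}
acc_l5 = top_lx_accuracy(predictions, true_contacts, length=20, x=5)
acc_l10 = top_lx_accuracy(predictions, true_contacts, length=20, x=10)
```

Because only the top-ranked contacts are assessed, a well-calibrated confidence level matters as much as the predictions themselves.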

CASP9 results
These two groups derived contact
predictions from 3D models
http://www.predictioncenter.org/casp9/doc/presentations/CASP9_RR.pdf

Understanding the rule sets
 Each rule set has on average 135 rules
 We have a total of 168470 rules
 Impossible to read all of them individually, but we can
extract useful statistics
 For instance, how often was each attribute used in the
rules?
 Full analysis

Distribution of frequency of use of
attributes
 All 631 attributes are
actually used (min
frequency=429)
 However, some of
them are used much
more frequently than
others

Top 10 attributes
Attribute     Frequency  Count
PredSS_r1_1   1.48%      18141
PredCN_r1     1.66%      20336
propensity    1.74%      21288
PredSS_r2     1.75%      21350
PredSS_r1     1.82%      22205
PredRCH_r2    1.87%      22856
PredRCH_r1    2.04%      24961
PredSA_r2     2.12%      25891
PredSA_r1     2.39%      29246
separation    4.17%      50951
All four kinds of residue predictions (SS, CN, RCH and SA) are highly ranked

Motivation PSP is a very costly process
 As an example, one of the best PSP methods CASP8,
Rosetta@Home could dedicate up to 104 computing
years to predict a single protein’s 3D structure
 One of the possible ways to alleviate this
computational cost is to simplify the representation
used to model the proteins

Target for reduction: the primary sequence
 The primary sequence of a protein is
a usual target for such simplification
 It is composed of a fairly high-cardinality
alphabet of 20 symbols, which share
commonalities between them
 One example of reduction widely used in
the community is the hydrophobic-polar
(HP) alphabet, reducing these 20 symbols
to just two
 The HP representation is usually too simple:
too much information is lost in the
reduction process [Stout et al., 06]
 Can we automatically generate these
reduced alphabets and tailor them to
the specific problem at hand?

Automated Alphabet Reduction
[Bacardit et al., 09]
• We will use an automated information theory-driven
method to optimize alphabet reduction policies for PSP
datasets
• An optimization algorithm will cluster the AA alphabet into
a predefined number of new letters
• Fitness function of optimization is based on the Mutual
Information (MI) metric. A metric that quantifies the
interrelationship between two discrete variables
– Aim is to find the reduced representation that maintains as much
relevant information as possible for the feature being predicted
• Afterwards we will feed the reduced dataset into a
learning method to verify if the reduction was proper
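The fitness idea can be illustrated with a direct computation of mutual information between a reduced-alphabet attribute and the class. The sketch compares two hypothetical two-letter groupings on a toy dataset; the residues, labels and groupings are invented, and the optimisation over groupings itself is not shown.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """MI in bits between two equally long sequences of discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def reduce_alphabet(residues, grouping):
    """Replace each amino acid by the letter of its group."""
    return [grouping[aa] for aa in residues]

# Toy data: residue type and a binary structural class for each residue
residues = list("AVLDEK" * 2)
labels = [1, 1, 1, 0, 0, 0] * 2

good = {"A": "h", "V": "h", "L": "h", "D": "p", "E": "p", "K": "p"}
poor = {"A": "u", "D": "u", "L": "u", "V": "v", "E": "v", "K": "v"}
mi_good = mutual_information(reduce_alphabet(residues, good), labels)
mi_poor = mutual_information(reduce_alphabet(residues, poor), labels)
```

A grouping that keeps the class-relevant distinction retains the full 1 bit of information here, while a grouping that mixes the two kinds of residues loses most of it.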

Alphabet Reduction protocol
[Diagram: dataset (Card=20) → ECGA, guided by Mutual Information, for alphabet size N → dataset (Card=N) → BioHEL → ensemble of rule sets → test set accuracy]

 Competent 5-letter alphabet (similar performance to
the AA alphabet)
 Different alphabets for CN and SA domains
 Unexpected explanations: Alphabet reduction
clustered AA types that experts did not expect

 Our method produces better reduced alphabets than other
reduced alphabets from the literature and than other expert-
designed ones
Alphabet    Letters  CN acc.   SA acc.   Diff.    Ref.
AA          20       74.0±0.6  70.7±0.4  ---      ---
Our method  5        73.3±0.5  70.3±0.4  0.7/0.4  [Bacardit et al., 07]
Alphabets from the literature
WW5         6        73.1±0.7  69.6±0.4  0.9/1.1  [Wang & Wang, 99]
SR5         6        73.1±0.7  69.6±0.4  0.9/1.1  [Solis & Rackovsky, 00]
MU4         5        72.6±0.7  69.4±0.4  1.4/1.3  [Murphy et al., 00]
MM5         6        73.1±0.6  69.3±0.3  0.9/1.4  [Melo & Marti-Renom, 06]
Expert designed alphabets
HD1         7        72.9±0.6  69.3±0.4  1.1/1.4  [Bacardit et al., 07]
HD2         9        73.0±0.6  69.3±0.4  1.0/1.4  [Bacardit et al., 07]
HD3         11       73.2±0.6  69.9±0.4  0.8/0.8  [Bacardit et al., 07]

Efficiency gains from the alphabet
reduction
 We have extrapolated the reduced alphabet to the much
larger and richer Position-Specific Scoring Matrices (PSSM)
representation
 Accuracy difference is still less than 1%
 Obtained rule sets are simpler and training process is much
faster
 Performance levels are similar to recent works in the literature
[Kinjo et al., 05][Dor and Zhou, 07]
 Won the bronze medal of the 2007 Humies awards

Conclusions
 Bioinformatics contains many challenges that computer
science can tackle
 Optimisation
 Machine learning
 Software engineering
 Evolutionary computation has been shown to be very
competitive across a large range of bioinformatics
problems
 Facing these challenges has led to the
development of many new EC methods

References/Bibliography Journals
 The Bioinformatics Journal
 BMC Bioinformatics
 BMC Biodata Mining
 Bioinformatics books
 Introduction to Bioinformatics by Arthur Lesk, Oxford University Press.
 Introduction to Bioinformatics. A. Tramontano, Chapman and
Hall/CRC
 Specialised topics
 Bioinformatics for –omics data. Methods and Protocols. Bernd
Mayer (ed). Springer
 Next-Generation Sequencing special issue of the Bioinformatics
Journal;
http://www.oxfordjournals.org/our_journals/bioinformatics/nextge
nerationsequencing.html

References/Bibliography
 J. Bacardit, M. Stout, J.D. Hirst, N. Krasnogor and J. Blazewicz, Coordination number
prediction using Learning Classifier Systems: Performance and interpretability. In
Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation
(GECCO2006), pp. 247-254, ACM Press, 2006
 Stout, M., Bacardit, J., Hirst, J.D. and Krasnogor, N. Prediction of Recursive Convex Hull
Class Assignments for Protein Residues. Bioinformatics, 24(7):916-923, 2008
 Stout, M., Bacardit, J., Hirst, J.D., Smith, R.E. and Krasnogor, N. Prediction of Topological
Contacts in Proteins Using Learning Classifier Systems. Soft Computing Journal,
13(3):245-258, 2009
 J. Bacardit, E.K. Burke and N. Krasnogor. Improving the scalability of rule-based
evolutionary learning. Memetic Computing journal 1(1):55-67, 2009
 J. Bacardit, M. Stout, J.D. Hirst, A. Valencia, R.E. Smith and N. Krasnogor. Automated
Alphabet Reduction for Protein Datasets. BMC Bioinformatics 10:6, 2009
 George W. Bassel, Enrico Glaab, Julietta Marquez, Michael J. Holdsworth and Jaume
Bacardit. Functional Network Construction in Arabidopsis Using Rule-Based Machine
Learning on Large-Scale Data Sets. The Plant Cell, 23(9):3101-3116, 2011
 J. Bacardit, P. Widera, A. Márquez-Chamorro, F. Divina, J.S. Aguilar-Ruiz and Natalio
Krasnogor. Contact map prediction using a large-scale ensemble of rule sets and the
fusion of multiple predicted structural features. Bioinformatics first published online
July 25, 2012 doi:10.1093/bioinformatics/bts472

References/Bibliography
 Jason H. Moore et al., Bioinformatics challenges for genome-wide association studies
Bioinformatics (2010) 26(4): 445-455
 Loris Nanni, Sheryl Brahnam, Alessandra Lumini, High performance set of PseAAC and
sequence based descriptors for protein classification, Journal of Theoretical Biology
266(1):1-10, 2010
 Fernando Otero et al., A hierarchical multi-label classification ant colony algorithm for
protein function prediction, Memetic Computing 2(3):165-181, 2010
 Daniel Barthel et al., Procksi: a decision support system for protein (structure)
comparison, knowledge, similarity and information. BMC Bioinformatics, 8:416, 2007
 http://omics.org/index.php/Alphabetically_ordered_list_of_omes_and_omics
 Federico Divina and Jesus S. Aguilar-Ruiz. 2006. Biclustering of Expression Data with
Evolutionary Computation. IEEE Trans. on Knowl. and Data Eng. 18, 5 (May 2006), 590-
602.
 Martinez-Ballesteros, M Nepomuceno-Chamorro, J C Riquelme (2011) Inferring gene-
gene associations from Quantitative Association Rules In: 11th International Conference
on Intelligent Systems Design and Applications (ISDA 2011 ) 1241 – 1246
 Rubén Armañanzas, Iñaki Inza, Roberto Santana, Yvan Saeys, Jose Flores, Jose Lozano,
Yves Peer, Rosa Blanco, Víctor Robles, Concha Bielza, Pedro Larrañaga. A review of

Acknowledgements
• Prof. Natalio Krasnogor
• Prof. Michael Holdsworth
• Prof. Jonathan Hirst
• Dr. Michael Stout
• Dr. George Bassel
• Dr. Enrico Glaab
• Dr. Pawel Widera
• EPSRC GR/T07534/01 & EP/H016597/1
• EU FP7 CADMAD project

Dr. Jaume Bacardit
Interdisciplinary Computing and Complex Systems
(ICOS) research group
University of Nottingham
jaume.bacardit@nottingham.ac.uk
