Introduction to Bioinformatics


Tutorial given at the PPSN2012 conference in Taormina, Italy

Published in: Technology
  • Definitions consist of a name and a DNALD expression. Inputs are defined using a subset of DNALD expressions: unambiguous nucleotide sequences; the reverse and/or complement of those; the imported outputs of other DNALD libraries (facilitating iterative library consumption); standard sequence formats.
  • Unary operations include subsequence extraction and mutation. These can be chained together, each operating on the result of the previous one. Binary operations include concatenation, repetition, and unions. Functions include reverse, complement and back-translation. Sequences of nucleotides and amino acids are quoted strings containing single-letter symbols according to the IUPAC nomenclature (and numbers, which are ignored). Amino acid sequences are only expected in the context of back-translations. Ambiguous nucleotides expand to the set of unambiguous alternatives (within reason: 10×N = 4^10 ≈ 10^6 sequences). Circular sequences are defined by a parenthesised overlap at the 3'-end: 'ACGT…(AC)'. Reverse and complement functions: reverse(complement('ATAGAGTAG')). The repetition operation is multiplication of an expression by either a positive integer or a range of positive integers, creating a set: 'A'*3 -> 'AAA', 'A'*(2:4) -> {'AA', 'AAA', 'AAAA'}. Back-translation returns the set of DNA sequences that could encode an amino acid sequence using a particular codon table. The complete set of sequences will likely be unfeasibly large, so it must be handled appropriately. Various strategies for sampling the space of possible sequences will be developed, and algorithms such as GeneOptimizer will be incorporated if source is available, or reimplemented from their description if possible. User-defined constraints are yet to be formalised.
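The operations described in these notes can be mimicked in a few lines of Python. This is only an illustrative sketch, not the DNALD implementation: the function names `expand`, `reverse`, `complement` and `repeat` are hypothetical, and the DNALD syntax itself (quoted strings, `*`, ranges) is not parsed here.

```python
from itertools import product

# IUPAC nucleotide codes mapped to the unambiguous bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def expand(seq):
    """Expand an ambiguous sequence into the set of unambiguous sequences."""
    return {"".join(p) for p in product(*(IUPAC[base] for base in seq))}


def reverse(seq):
    """Return the sequence read backwards."""
    return seq[::-1]


def complement(seq):
    """Return the base-wise complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)


def repeat(seq, low, high=None):
    """Repeat seq a fixed number of times, or once per count in low..high."""
    high = low if high is None else high
    return {seq * n for n in range(low, high + 1)}
```

For instance, `repeat('A', 2, 4)` mirrors `'A'*(2:4)`, and `len(expand('N' * 10))` gives the 4^10 figure quoted in the notes, which shows why expansion has to be bounded.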
  • Introduction to Bioinformatics

    1. 1. Introduction to Bioinformatics Dr. Jaume Bacardit Interdisciplinary Computing and Complex Systems (ICOS) research group University of Nottingham
    2. 2. About me• Did my PhD in evolutionary learning• Postdoc in Protein Structure Prediction 2005- 2007• Since 2008 lecturer in Bioinformatics at the University of Nottingham• Research interests – Large-scale data mining – Biological data mining
    3. 3. Outline• What is Bioinformatics?• Basic molecular biology• Public databases• Sequence analysis• The scales of bioinformatics• Biological data mining
    5. 5. What is Bioinformatics? • Several definitions exist. Michael Liebman proposed a quite elegant one: – “The study of the information content and information flow in biological systems and processes” (Michael Liebman) – Information content: genome project – Information flow: molecular transport – Biological systems: cells, organisms, … – Biological processes: metabolic networks • Bioinformatics is the science of using information to understand aspects of Biology. That is, a discipline where techniques from applied mathematics, computer science, statistics, artificial intelligence, etc. are integrated to solve biological problems
    6. 6. Information, information, information • As we know there have been major advances in the field of molecular biology • These have been coupled with advances in laboratory (post)genomic technology • This has led to an explosive growth in the collection of biological information • This deluge of information has led to an absolute requirement for 1. Computerized databases to store, organise and index the data 2. Specialized tools to view and analyse the data 3. Specialized tools to infer new knowledge from the data
    7. 7. Areas of research (taxonomy of the Bioinformatics Journal) • Genome Analysis • Sequence Analysis • Phylogenetics • Structural Bioinformatics • Gene Expression • Genetics and Population Analysis • Systems Biology • Data and Text Mining • Databases and Ontologies • Bioimage Informatics
    8. 8. (Borrowed from “An Introduction to Bioinformatics Algorithms” by Neil C. Jones and Pavel A. Pevzner and further modified by Prof. Natalio Krasnogor) BASIC MOLECULAR BIOLOGY
    9. 9. Life begins with Cell• A cell is the smallest structural unit of an organism that is capable of sustained independent functioning• All cells have some common features• What is Life? Can we create it in the lab? Read:The imitation game—a computational chemical approach to recognizing life. Nature Biotechnology, 24:1203-1206, 2006
    10. 10. 2 types of cells: Prokaryotes & Eukaryotes
    11. 11. Example of cell signaling
    12. 12. Terminology • The genome is an organism’s complete set of DNA. – a small bacterial genome contains about 600,000 DNA base pairs – human and mouse genomes have some 3 billion. • The human genome is organised into 23 pairs of chromosomes. – Each chromosome contains many genes. • Gene – basic physical and functional unit of heredity. – specific sequences of DNA bases that encode instructions on how and when to make proteins. • Proteins – Make up the cellular structure – large, complex molecules made up of smaller subunits called amino acids.
    13. 13. All Life depends on 3 critical molecules• DNAs – Hold information on how cell works• RNAs – Act to transfer short pieces of information to different parts of cell – Provide templates to synthesize into protein• Proteins – Form enzymes that send signals to other cells and regulate gene activity – Form body’s major components (e.g. hair, skin, etc.) – Are life’s laborers!• Computationally, all three can be represented as sequences of a certain 4-letter (DNA/RNA) or 20-letter (Proteins) alphabet
    14. 14. DNA, RNA, and the Flow of Information Replication Transcription Translation Weismann Barrier / Central Dogma of Molecular Biology
    15. 15. Overview of DNA to RNA to Protein• A gene is expressed in two steps 1) Transcription: RNA synthesis 2) Translation: Protein synthesis
    16. 16. DNA: The Basis of Life• Deoxyribonucleic Acid (DNA) – Double stranded with complementary strands A-T, C-G• DNA is a polymer – Sugar-Phosphate-Base – Bases held together by H bonding to the opposite strand
    17. 17. RNA • RNA is chemically similar to DNA. It is usually only a single strand. T(hymine) is replaced by U(racil) • Some forms of RNA can form secondary structures by “pairing up” with themselves. This can change their properties dramatically. DNA and RNA can pair with each other. tRNA linear and 3D view:
    18. 18. RNA, continued • Several types exist, classified by function: • hnRNA (heterogeneous nuclear RNA): Eukaryotic mRNA primary transcripts with introns that have not yet been excised (pre-mRNA). • mRNA: this is what is usually being referred to when a Bioinformatician says “RNA”. This is used to carry a gene’s message out of the nucleus. • tRNA: transfer RNA. Carries amino acids to the ribosome and matches them to the mRNA codons so that a protein can be built • rRNA: ribosomal RNA. Part of the ribosome, which is involved in translation.
    19. 19. Transcription • Transcription is highly regulated. Most DNA is in a dense form where it cannot be transcribed. • To start, transcription requires a promoter, a small specific sequence of DNA to which polymerase can bind (~40 base pairs “upstream” of the gene) • Finding these promoter regions is only a partially solved problem that is related to motif finding. • There can also be repressors and inhibitors acting in various ways to stop transcription. This makes the regulation of gene transcription complex to understand.
    20. 20. Definition of a Gene • Regulatory regions: up to 50 kb upstream of +1 site • Exons: protein coding and untranslated regions (UTR); 1 to 178 exons per gene (mean 8.8); 8 bp to 17 kb per exon (mean 145 bp) • Introns: splice acceptor and donor sites, junk DNA; average 1 kb – 50 kb per intron • Gene size: Largest – 2.4 Mb (Dystrophin). Mean – 27 kb.
    21. 21. Splicing
    22. 22. Splicing and other RNA processing• In Eukaryotic cells, RNA is processed between transcription and translation.• This complicates the relationship between a DNA gene and the protein it codes for.• Sometimes alternate RNA processing can lead to an alternate protein (splice variants) as a result. This is true in the immune system.
    23. 23. Proteins: Crucial molecules for the functioning of life • Structural Proteins: the organism’s basic building blocks, e.g. collagen, nails, hair, etc. • Enzymes: biological engines which mediate a multitude of biochemical reactions. Usually enzymes are very specific and catalyze only a single type of reaction, but they can play a role in more than one pathway. • Transmembrane proteins: they are the cell’s housekeepers, e.g. by regulating cell volume, extraction and concentration of small molecules from the extracellular environment and generation of ionic gradients essential for muscle and nerve cell function (the sodium/potassium pump is an example) • Proteins are polypeptide chains, constructed by joining amino acids in a linear way through peptide bonds • The chain of amino acids, however, folds to create very complex 3D structures
    24. 24. Translation • The process of going from RNA to polypeptide. • Three consecutive RNA bases (called a codon) correspond to one amino acid based on a fixed table. • Always starts with Methionine and ends with a stop codon
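The two steps of gene expression described above (transcription, then translation codon by codon) can be sketched as follows. This is a simplified illustration assuming the standard genetic code, a coding-strand DNA input, and no splicing or other RNA processing; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, listed with bases in the order U, C, A, G
# for the first, second and third codon position ('*' marks a stop codon).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(coding_dna):
    """Transcription: the mRNA copies the coding strand with T replaced by U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translation: read codons from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

Calling `translate(transcribe("ATGGCCTAA"))` reads the codons AUG, GCC and UAA, giving methionine, alanine and then a stop.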
    25. 25. Amino Acids
    26. 26. Protein Structure: Introduction• Different amino acids have different properties• These properties will affect the protein structure and function• Hydrophobicity, for instance, is the main driving force (but not the only one) of the folding process
    28. 28. Protein Structure: Why is structure important? • The function of a protein depends greatly on its structure • The structure that a protein adopts is vital to its chemistry • Its structure determines which of its amino acids are exposed to carry out the protein’s function • Its structure also determines what substrates it can react with
    29. 29. Protein Structure: Mostly lacking information • Therefore, it is clear that knowing the structure of a protein is crucial for many tasks • However, we only know the structure for a very small fraction of all the proteins that we are aware of – The UniProtKB/TrEMBL archive contains 23165610 (16886838) sequences – The PDB archive of protein structure contains only 84223 (76669) structures • In the native state, proteins fold on their own as soon as they are generated, amino acid by amino acid (with few exceptions, e.g. chaperones) → can we predict this process so as to close the gap between protein sequences and their 3D structures?
    30. 30. Central Dogma of Biology: A Bioinformatics Perspective • The information for making proteins is stored in DNA. There is a process (transcription and translation) by which DNA is converted to protein. By understanding this process and how it is regulated we can make predictions and models of cells. [Figure: computational problems along this process: assembly, gene finding, sequence analysis, protein sequence/structure analysis]
    32. 32. Information flow in bioinformatics• Data enters the “bioinformatics scope” when a scientist deposits an experimental result in an appropriate archive• The archive curates and annotates the data• The data is released to the public• Afterwards, the data may be retrieved/analysed: – Integrating the new entry into a search engine – Extracting useful subsets of the data – Deriving new types of information from the data – Aggregating the data, by homology, function, structure – Reannotating the data with new discovered/inferred info.• Quality of data depends on many factors, the techniques used to experimentally create the data, degree of inference and prediction involved in the annotation process, etc.• Many publicly available databases:
    33. 33. NCBI’s Entrez system is a search and retrieval system that integrates information from databases at NCBI (National Center for Biotechnology Information).
    34. 34. Uniprot• The Universal Protein Resource (UniProt) is a collaboration between the European Bioinformatics Institute (EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR)
    35. 35. KEGG - • Not just about genes/proteins but also pathways, that is, their interactions
    36. 36. DAVID -
    38. 38. Sequences• Be it DNA, RNA or proteins we have many data that can be represented as sequences of a certain alphabet• Many generic algorithms to deal with biological sequences exist• Sequence alignment• Motif representation
    40. 40. Motivation • Similarity is expected among biomolecules that are descended from a common ancestor. – Mutations cause differences, but survival of the organism requires that mutations occur in regions that are less critical to function – Important catalytic, regulatory or structural regions remain similar • An alignment between two or more genetic or proteomic sequences represents an explicit hypothesis about their evolutionary histories. • Thus comparison of related gene/protein sequences has been instrumental in shedding light on the information content of these sequences and their biological functions.
    41. 41. Definition and aims• Why align sequences? 1. Start with a query sequence with unknown properties and search within a database of millions of sequences to find those which share similarity with the query. 2. Start with a small set of sequences and identify similarities and differences among them. 3. In many sequences or very long sequences, detect commonly occurring patterns
    42. 42. Similarity vs. Homology • Similarity is the observation or measurement of resemblance and difference, independent of the source of resemblance. • There are many examples of different organisms with functionally similar organs that came from distinct evolutionary origins • When similarity is due to a common ancestry, we call it homology. • Sequence alignment helps to infer homology hypotheses: – If two sequences are very similar, it is probable that there is a common origin – Therefore, if we know some information (structure, function) about sequence X, and sequence X is similar to sequence Y, it is probable that the same information applies to Y
    43. 43. Metrics of similarity: Definitions• Gap: a break in the alignment, in either one of the sequences. – For nucleotides, a consequence of an insertion or deletion mutation. – For proteins, it’s more difficult to say.• Regions of matching residues. – Indicate parts of a sequence that are well conserved• Mismatched residues. – For nucleotides, a consequence of a substitution mutation – Less conserved regions
    44. 44. Metrics of similarity: Distance scoring • Distance scoring – Given an alignment with matches, mismatches and gaps, we compute a score as follows: • For each mismatch, the score is increased by 2 • For each gap, the score is increased by 4 • For each match, there is no increase in score – Higher score, less similarity. Example:
            A  -  G  C  C  G  T  A  T
            A  C  G  A  -  -  T  -  T
            0  4  0  2  4  4  0  4  0   = 18
        • Equivalent metrics exist for similarity (not distance), where a higher score means greater similarity
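The distance scoring rule on this slide is simple enough to write down directly. The sketch below assumes the two rows of the alignment are given as equal-length strings with `-` marking a gap; the function name is hypothetical.

```python
def distance_score(top, bottom, mismatch=2, gap=4):
    """Score an existing alignment: 0 per match, 2 per mismatch, 4 per gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap          # one gap slot in either sequence
        elif a != b:
            score += mismatch     # substitution
    return score
```

Applied to the example alignment, `distance_score("A-GCCGTAT", "ACGA--T-T")` adds up one mismatch and four gap slots, which reproduces the total of 18 shown on the slide.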
    45. 45. Metrics of similarity: Mismatches and gaps • Are all mismatches equally bad? – For protein sequences, there are several subgroups of amino acids with similar properties. Mismatches within a group have less impact – For nucleotide sequences, transition mutations (a↔g and t↔c) are more common than transversion mutations (a or g ↔ t or c) – Distance scoring of mismatches could be smarter → substitution matrices • Using statistical analysis on a large corpus of real sequences to generate better scores • How to penalize gaps – Each gap slot gets an equal distance score – One score to open a gap, another (smaller) score to extend the same gap
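Both refinements above (cheaper transitions than transversions, and a gap-open score plus a smaller gap-extend score) can be added to a column-by-column scorer. The numeric costs below are arbitrary illustrative assumptions, not values from the tutorial or from any published substitution matrix.

```python
PURINES = {"A", "G"}  # the remaining bases, C and T, are pyrimidines


def substitution_cost(a, b, transition=1, transversion=2):
    """Transitions (purine<->purine, pyrimidine<->pyrimidine) cost less."""
    if a == b:
        return 0
    same_class = (a in PURINES) == (b in PURINES)
    return transition if same_class else transversion


def affine_distance(top, bottom, gap_open=4, gap_extend=1):
    """Score an alignment with separate costs to open and to extend a gap."""
    score = 0
    previous_gap_row = None  # row that held a gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            score += gap_extend if row == previous_gap_row else gap_open
            previous_gap_row = row
        else:
            score += substitution_cost(a, b)
            previous_gap_row = None
    return score
```

With these assumed costs, a run of two adjacent gap slots costs 4 + 1 rather than 4 + 4, so a single longer insertion or deletion is penalised less than several scattered ones.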
    46. 46. Global vs Local alignment• We know how to score good or bad alignments – How to find the optimal one?• Two classes of alignment methods – Global alignment • Finds the best alignment of one entire sequence with another entire sequence – Local alignment • Find the best alignment of one segment of a sequence against another segment of another sequence
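One classic way to find the optimal global alignment is dynamic programming (the Needleman-Wunsch scheme). The sketch below is an assumption about how one might do it with the distance scores used earlier (mismatch 2, gap 4), so it minimises a distance instead of maximising a similarity, and it returns only the optimal score, not the alignment itself.

```python
def global_alignment_distance(s, t, mismatch=2, gap=4):
    """Minimum distance score over all global alignments of s and t."""
    rows, cols = len(s) + 1, len(t) + 1
    # d[i][j] = best score for aligning the prefixes s[:i] and t[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * gap
    for j in range(1, cols):
        d[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            substitute = d[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch)
            d[i][j] = min(substitute, d[i - 1][j] + gap, d[i][j - 1] + gap)
    return d[-1][-1]
```

The table has (len(s)+1) × (len(t)+1) cells, so time and memory grow with the product of the two sequence lengths. A local variant (Smith-Waterman style) uses the same recurrence with a similarity score that is reset to zero whenever it would go negative.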
    47. 47. Exact vs. Approximate methods • Exact methods for both global and local alignment exist, based on dynamic programming, but are slow – Good enough when there are few sequences – Not so good when comparing a target sequence to a database of millions of known sequences • Approximate methods have been used for many years for large-scale alignment tasks – They use some kind of heuristic to speed up the alignment process – BLAST (Basic Local Alignment Search Tool) is the most famous approximate method • It identifies potential hits by looking for perfect matches of very small sub-sequences (seeds) • It only tries to create a full alignment for sequences where several seeds are identified • PSI-BLAST: version that takes into account that multiple hits are identified. It constructs a tailored substitution matrix based on hits and then refines the alignment
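The seeding idea can be illustrated with a k-mer index. This is a toy sketch of the heuristic only and not BLAST's actual algorithm (which also scores near-matches, extends seeds and evaluates statistical significance); the function names and the `min_seeds` parameter are hypothetical.

```python
from collections import defaultdict


def build_seed_index(database, k):
    """Map every length-k substring to the names of sequences containing it."""
    index = defaultdict(set)
    for name, sequence in database.items():
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].add(name)
    return index


def candidate_hits(query, index, k, min_seeds=2):
    """Return sequences sharing at least min_seeds exact k-mers with the query."""
    counts = defaultdict(int)
    for i in range(len(query) - k + 1):
        for name in index.get(query[i:i + k], ()):
            counts[name] += 1
    return {name for name, n in counts.items() if n >= min_seeds}
```

Only the candidates returned here would be passed on to the slow exact alignment, which is where the speed-up over aligning against every database entry comes from.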
    48. 48. Multiple Sequence Alignment• When we have to align more than two sequences• Progressive methods (e.g. ClustalW) – Start with seed alignment – Iteratively incorporate other alignments to seed, without modifying what is aligned so far – ClustalW uses phylogenetic trees (representations of the evolutionary relationship between sequences) to progressively construct MSA• Iterative methods (e.g. MUSCLE) – Can re-edit the partial MSA based on the newly incorporated alignments
    49. 49. ClustalW interface in UniProt
    50. 50. Motifs • When visualising an MSA we can see regions of high agreement and regions of low agreement. • The high agreement regions define that a certain protein belongs to a family • What if we concentrate on modelling and identifying these regions instead of the whole sequences? → Motif finding
    51. 51. Modelling motifs • Patterns – Model the subsequence as a regular expression • C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA] • Zinc finger motif • Can cope with a moderate level of variability • Profiles – Specify the most likely values for each position in the motif → acts as a substitution matrix – Use sequence similarity metrics to compute a score of the motif for a given sequence
            Position:  1     2     3     4     5     6     7     8     9
            A:         0.5   0.25  0     1     0     0     0     0     0.5
            T:         0     0.25  1     0     0     1     0     0.25  0
            G:         0.5   0.25  0     0     1     0     0     0.75  0.25
            C:         0     0.25  0     0     0     0     1     0     0.25
        • PROSITE implements both types of motifs
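Both motif models on this slide map naturally onto short code. The sketch below is an illustration under stated assumptions: `prosite_to_regex` handles only the PROSITE elements used here plus `{..}` exclusions and `x(m,n)` ranges, and `profile_score` multiplies the per-position frequencies from the table, which is one simple scoring choice among several (log-odds scores are a common alternative).

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = match.group(1), match.group(2)
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)


# Profile from the slide: frequency of each base at motif positions 1..9.
PROFILE = {
    "A": [0.5, 0.25, 0, 1, 0, 0, 0, 0, 0.5],
    "T": [0, 0.25, 1, 0, 0, 1, 0, 0.25, 0],
    "G": [0.5, 0.25, 0, 0, 1, 0, 0, 0.75, 0.25],
    "C": [0, 0.25, 0, 0, 0, 0, 1, 0, 0.25],
}


def profile_score(window):
    """Probability of a 9-base window under the profile (product of columns)."""
    score = 1.0
    for position, base in enumerate(window):
        score *= PROFILE[base][position]
    return score
```

A pattern either matches or it does not, whereas the profile gives a graded score, so a window such as `AATAGTCGA` can be ranked against other candidate windows.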
    52. 52. Modelling Motifs• Hidden Markov Models – Model the motif as a series of state transitions with probabilities associated to each input symbol and state – Easy to visualise – PFAM uses HMM motifs
    54. 54. DNA• Coding/non coding• SNPs• Copy number variation• Assembly• Methylation• Primer design
    55. 55. Coding/Non Coding • Identifying the regions from an organism’s genome that contain genes • Many different factors involved in this identification – Promoter identification – Long enough Open Reading Frames (ORF) – Splice variants – Introns/Exons (in eukaryotes) – Statistical properties of gene-coding DNA • HMMs are also used for gene finding
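Of the factors listed, the "long enough ORF" criterion is the easiest to illustrate. The sketch below scans the three forward reading frames only; a real gene finder would also scan the reverse complement and combine this with the other signals. The `min_codons` threshold is an arbitrary assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) slices of forward-strand ORFs: ATG ... stop."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next start
    return orfs
```

Each returned pair can be used directly as a slice, so `dna[start:end]` begins with ATG and ends with the stop codon.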
    56. 56. Single Nucleotide Polymorphisms (SNPs)• One base-pair variation in DNA• In most cases in non-coding regions of DNA, but not always• When frequent enough in a population they can be linked to specific traits, e.g. a disease• SNP microarrays can be used to probe hundreds of thousands of SNPs in parallel• In reality few SNPs act on their own – Genome-Wide Association Studies identify groups of SNPs linked to a certain condition
    57. 57. Copy Number Variation • In general two copies of each gene exist in a genome • It may be the case that more or fewer than two copies of a certain gene exist in a specific sub-population • It has been suggested that certain CNVs can be linked to specific diseases
    58. 58. Genome assembly• Sequencing technologies are able to read (sequence) a complete genome as a series of short overlapping fragments• How to assemble back all these fragments?• Greedy approach – Pair-wise alignments of all fragments – Merge fragments of largest overlap – Keep iterating until all segments are merged• Worked more or less well on old sequencing technologies, not so well on next-generation sequencing data, due to smaller fragment sizes and larger error rate
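The greedy approach above can be sketched as follows. As a simplifying assumption, the pair-wise alignment step is replaced by an exact suffix-prefix overlap, so this toy version does not tolerate sequencing errors and ignores reverse-complement reads.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: the remaining fragments stay separate
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags
```

Every round compares all pairs of fragments, which is part of why this strategy scales poorly to the very large numbers of short reads mentioned on the slide.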
    59. 59. Genome mapping• Given a large set of short fragments, as a result of next-generation sequencing, map them to a reference genome• Different from previous one. We do not want to reconstitute a complete genome, just identify to which genes each fragment belongs (among other applications).• Speed is an issue• Modern methods (e.g. SOAP2) compress the genome and are able to align the fragments in the compressed space
    60. 60. Methylation • It is a chemical reaction that can block a certain region of a chromosome, preventing its transcription • The process can be reversed, so essentially it is an on/off switch of the affected gene • Specialised microarrays exist for the high-throughput detection of methylated genes • Afterwards, data analysis can take place
    61. 61. DNA library specification • A DNA library is a combinatorial set of DNA sequences suited to manufacture via DNA reuse • The first stage towards the creation of a DNA library is the formal specification of the target DNA molecules that comprise it • A set of sequences does not convey the intention behind the library • Key challenge is to enable precise editing of DNA sequences in an extensible and reproducible manner whilst avoiding manual handling of these unwieldy objects
    62. 62. DNALD library format • A DNALD library consists of three sets of definitions: inputs, intermediates and outputs, with different semantics – Inputs: existing DNA sequences to be provided with the design – Intermediates: conceptual means of factoring out common sequences – Outputs: to be produced through DNA reuse
    63. 63. DNALD expressions• A DNALD expression is a combination of explicit sequences, definition names, operators and functions that are interpreted according to rules of precedence and association ("evaluated") to produce a set of DNA sequences.• Definitions bind names to the results of expressions.
    64. 64. Workbench interface • manage projects • text editor with: syntax highlighting, auto-completion, code folding, etc. • viewed from different perspectives
    66. 66. RNA• Expression• Structure prediction
    67. 67. RNA expression• Not all genes are transcribed/translated into proteins all the time• The expression of genes is highly sophisticated and depends on many factors• Identifying the genes being expressed in a given point of time in a specific tissue provides crucial information about the roles and interactions of such genes – Compare the genes expressed between different groups of samples to identify those that are differentially expressed – Identify co-expressed genes, that present patterns of correlation
    68. 68. Measuring RNA expression • RT-PCR (real-time reverse transcription polymerase chain reaction) – Accurately measures the expression of a pre-determined gene • RNA Microarrays – Measure, in parallel, the expression of tens of thousands of genes, but with a considerable level of noise • RNA-Seq – The next-generation sequencing variant for measuring gene expression
    69. 69. RNA Structure prediction • An RNA sequence can bind with itself to create complex shapes with a certain pattern of loops • Can we predict, from a given sequence, the structural shape of the RNA?
    70. 70. Proteins• Protein classification• Structure prediction• Structure comparison• Function and interaction
    71. 71. Protein classification• Proteins can be annotated in many different ways – Function • DNA-binding? Enzyme? – Tissue/Cellular/Sub-cellular localisation – Interacting with other proteins?• Can we predict this annotation using ML?• We need to transform the protein sequence into a uniform representation of equal size for all proteins• Many different representations exist• Several of these problems can be modelled as a hierarchical classification problem
    72. 72. Protein Structure Prediction• PSP aims to predict the 3D structure of a protein based on its primary sequence
    73. 73. Protein Structure Prediction • PSP is an open problem. The 3D structure depends on many variables • It has been one of the main holy grails of computational biology for many decades • The potential impacts of having better protein structure models are countless – Genetic therapy – Synthesis of drugs for incurable diseases – Improved crops – Environmental remediation
    74. 74. Prediction types of PSP• There are several kinds of prediction problems within the scope of PSP – The main one, of course, is to predict the 3D coordinates of all atoms of a protein (or at least the backbone) based on its primary sequence – There are many structural properties of individual residues within a protein that can be predicted, for instance: • The secondary structure state of the residue • If a residue is buried in the core of the protein or exposed in the surface – Accurate predictions of these sub-problems can simplify the general 3D PSP problem
    75. 75. 3D Protein Structure Prediction • Some PSP methods try to find similar proteins and then adapt the structure of the homolog (template) to the target protein → Homology Modeling • Other methods try to find the structure of the protein from scratch (Ab Initio Modelling), optimizing some energy function that models the stability of the protein, in case no homolog can be identified • In between there are other kinds of methods, for varying degrees of homology to our target, for instance Fold Recognition or Threading • These methods identify a template based on more than homology (i.e. sequence alignment).
    76. 76. Coordination Number Prediction • Two residues of a chain are said to be in contact if their distance is less than a certain threshold (e.g. 8 Å) • CN of a residue: count of contacts that a certain residue has • CN gives us a simplified profile of the density of packing of the protein [Figure labels: primary sequence, native state, contact]
    77. 77. Contact Map prediction • Predict, given two residues from a chain, whether these two residues are in contact or not • This problem can be represented by a binary matrix. 1 = contact, 0 = non contact • Plotting this matrix reveals many characteristics of the protein structure • Very sparse: less than 2% of contacts in native structures [Figure labels: helices, sheets]
    78. 78. Other predictions • Other kinds of residue structural aspects that can be predicted – Solvent accessibility: Amount of surface of each residue that is exposed to solvent – Recursive Convex Hull: A metric that models a protein as an onion, and assigns each residue to a layer. Formally, each layer is a convex hull of points • These features (and others) are predicted in a similar way to secondary structure (SS)
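When the native structure is known, the contact map and the coordination numbers that serve as prediction targets can be computed directly from coordinates. The sketch below assumes one 3D point per residue (which atom represents a residue, e.g. Cα or Cβ, is a modelling choice not stated here) and uses the 8 Å threshold given as an example above; `min_separation` is a hypothetical parameter for skipping residues that are close in the chain.

```python
import math


def contact_map(coords, threshold=8.0, min_separation=1):
    """Binary matrix: 1 if two residues are closer than threshold (in Å)."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap


def coordination_numbers(cmap):
    """CN of each residue: the number of contacts in its row of the map."""
    return [sum(row) for row in cmap]
```

The map is symmetric with a zero diagonal, and each coordination number is simply a row sum, which is why CN can be seen as a one-dimensional summary of the contact map.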
    79. 79. Protein Structure Comparison
    80. 80. Protein Structure Comparison • Protein Structure Comparison (PSC) aims to – Assess the degree of similarity between protein structures – Given a query structure, identify other proteins with similar structure • Why? – Group proteins by structural similarities – Determine the impact of individual residues on the protein structure – Identify distant homologues of protein families – Predict function of proteins with low degree of primary structure (i.e. sequence) similarity with other proteins – Engineer new proteins for specific functions – Assess ab-initio predictions
    81. 81. Protein Structure Comparison • Sequence-Structure-Function relationships 1) Conserved 1º sequences → similar structures 2) Similar structures → ? conserved 1º sequences 3) Similar structures → conserved function • PSC shares many similarities with sequence alignment. Our aim is to infer new knowledge from the comparison process
    82. 82. Protein Structure Comparison • Existing Approaches – SSAP (Orengo & Taylor, 96) – ProSup (Feng & Sippl, 96) – DALI (Holm & Sander, 93) – CE (Shindyalov & Bourne, 98) – LGA (Zemla, 2003) – SCOP (Murzin, Brenner, Hubbard & Chothia, 95) – CATH (Orengo, Michie, Jones, Jones, Swindells & Thornton, 97) – ProCKSI – Consensus of multiple PSC methods
    83. 83. Prediction of Protein Function • In an ideal world, the cascade of inference should flow from sequence → structure → function • That is, if we can identify similar sequences or structures to our query target we can (at varying degrees of certainty) infer that they have similar function
    84. 84. Prediction of Protein Function • As proteins evolve, they may – Retain function and specificity – Retain function but alter specificity – Change to a related function, or a similar function in a different metabolic context – Change to a completely unrelated function • How much must a protein change before the function changes? – Sometimes, not at all. There are many cases of proteins with different functions in different environments
    85. 85. Prediction of Protein Function• Thus, sequence or structure similarity is not always reliable to assign function• Other ways of determining protein function – By identifying patterns of co-regulated genes • Using data from Microarray experiments – By identifying protein-protein interactions
    86. 86. Prediction of Protein Function • A related question is: where is the function of a protein taking place? → active site • Several methods exist to predict active/binding sites of proteins from local patterns of sequence or structure • A crude way of doing this prediction is to look at the conserved residues of a sequence → they may be related to either the core of the protein (structural stability) or the function of a protein (a change of function is a risk for survival) • More sophisticated methods exist to learn how to predict active sites. They use ML, in a similar way to the prediction of residue structural features in PSP • Still, it is a very tough problem, and ML methods are not much better than BLAST-based methods
    88. 88. Three case studies• Mining –omics data• Predicting structural aspects of protein residues• Automated alphabet reduction for protein datasets• In all these three case studies we use the same evolutionary learning system: BioHEL [Bacardit et al., 09]
    89. 89. BioHEL• BioHEL [Bacardit et al., 09] is an evolutionary learning system that applies the Iterative Rule Learning (IRL) approach• Designed explicitly to deal with noisy large-scale datasets• IRL was first used in EC by the SIA system [Venturini, 93]
    90. 90. BioHEL’s learning paradigm– IRL has been used for many years in the ML community, with the name of separate-and-conquer
    91. 91. BioHEL’s objective function • An objective function based on the Minimum Description Length (MDL) principle (Rissanen, 1978) that tries to promote rules with – High accuracy: not making mistakes – High coverage: covering as many examples as possible without sacrificing accuracy. Recall (TP/(TP+FN)) is used to define coverage – Low complexity: rules as simple and general as possible – The objective function is a linear combination of the three objectives above
    92. 92. BioHEL’s objective function • Intuitively, we would like to have accurate rules covering as many examples as possible. • However, in complex and inconsistent domains it is rare to obtain such rules • In these cases, the easier path for evolutionary search is to maximize accuracy at the expense of coverage • Therefore, we need to enforce that the evolved rules cover enough examples
    93. 93. BioHEL’s objective function• Three parameters define the shape of the function• The choice of the coverage break is crucial for the proper performance of the system• Also, coverage term penalizes rules that do not cover a minimum percentage of examples or that cover too many
    94. 94. BioHEL’s characteristics• Attribute list rule representation – Automatically identifying the relevant attributes for a given rule and discarding all the other ones• The ILAS windowing scheme – Efficiency enhancement method, not all training points are used for each fitness computation• An explicit default rule mechanism – Generating more compact rule sets – Iterative process terminates when it is impossible to evolve a rule where the associated class is the majority class among the matched examples – At this point, all remaining training instances are assigned to the default class
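The iterative rule learning (separate-and-conquer) loop with an explicit default rule can be sketched in a few lines. This is not BioHEL: the evolutionary search, the MDL-based objective function, the attribute list representation and ILAS windowing are all replaced here by a deliberately naive stand-in (`learn_rule`, which exhaustively tries single `attribute > threshold` tests), and choosing the majority class as the default is an assumption made for this example.

```python
from collections import Counter


def learn_rule(examples, default):
    """Naive stand-in for the evolutionary search: best single-threshold rule
    whose majority class among the matched examples is not the default."""
    best_key, best_rule = None, None
    n_attrs = len(examples[0][0]) if examples else 0
    for attr in range(n_attrs):
        for threshold in sorted({x[attr] for x, _ in examples}):
            matched = [y for x, y in examples if x[attr] > threshold]
            if not matched:
                continue
            label, hits = Counter(matched).most_common(1)[0]
            if label == default:
                continue
            key = (hits / len(matched), len(matched))  # accuracy, then coverage
            if best_key is None or key > best_key:
                best_key, best_rule = key, (attr, threshold, label)
    return best_rule


def iterative_rule_learning(examples):
    """Learn one rule, remove the examples it covers, and repeat."""
    default = Counter(y for _, y in examples).most_common(1)[0][0]
    remaining, rules = list(examples), []
    while True:
        rule = learn_rule(remaining, default)
        if rule is None:
            break  # whatever is left falls through to the default rule
        rules.append(rule)
        attr, threshold, _ = rule
        remaining = [(x, y) for x, y in remaining if x[attr] <= threshold]
    return rules, default


def predict(rules, default, x):
    """Rules form an ordered list: the first one that matches decides."""
    for attr, threshold, label in rules:
        if x[attr] > threshold:
            return label
    return default
```

Because the default class needs no rules of its own, the resulting rule set stays compact, which is the point made about the explicit default rule mechanism above.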
    95. 95. MINING –OMICS DATA
    96. 96. Mining –omics data• Biological data can be generated at many different levels – Genomics (DNA) – Transcriptomics (RNA) – Proteomics (proteins) – Metabolomics (small compounds) – Lipidomics (lipids)• Hundreds of –omics have been catalogued
    97. 97. What does an –omics dataset look like?• In most cases datasets present a similar structure• Each sample is characterised by a large number of variables (RNA, proteins, lipids, etc.)• Each variable indicates (usually quantitatively) the presence of that element in the sample• Due to the high cost of most –omics technologies, variables >> samples – Problems of over-fitting
    98. 98. What can we do with the dataset?• In most cases, samples are annotated with a qualitative label – Cancer/Non-cancer patients – Samples of seed tissue for which it is known if the seed germinated or not – Age of the sample• Therefore, we can treat these datasets as classification problems, and generate prediction models from the data• Not just as classification problems – Clustering/Biclustering – Association Rule Mining – Regression
    99. 99. But in most cases, domain experts are not (only) interested in predictions• Biomarker identification – Identify the key variables • Most strongly associated with each outcome – Using e.g. t-tests to identify those • Presenting higher prediction capacity – As identified by ML methods – Identify interactions between variables • By presenting very high (anti)correlation between them • By acting together to generate predictions
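A minimal sketch of the t-test route to biomarker candidates, assuming each sample is a dictionary of variable values and using Welch's t-statistic (one of several possible choices). The function names are hypothetical, and no multiple-testing correction is applied, which a real analysis with variables >> samples would need.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two groups of measurements."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def rank_variables(samples, labels, positive):
    """Rank variables by |t| between the positive class and the rest.

    samples: list of dicts mapping variable name -> measured value.
    """
    scores = {}
    for var in samples[0]:
        pos = [s[var] for s, y in zip(samples, labels) if y == positive]
        neg = [s[var] for s, y in zip(samples, labels) if y != positive]
        scores[var] = abs(welch_t(pos, neg))
    return sorted(scores, key=scores.get, reverse=True)
```

A ranking like this only captures variables that are individually associated with the outcome; interactions between variables need methods that look at several variables at once, such as the rule-based approach described in this section.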
    100. 100. Functional Network Reconstruction for seed germination• Microarray data obtained from seed tissue of Arabidopsis thaliana• 122 samples represented by the expression level of almost 14000 genes• It had been experimentally determined whether each of the seeds had germinated or not• Can we learn to predict germination/dormancy from the microarray data? [Bassel et al., 2011]
    101. 101. Generating rule sets• BioHEL was able to predict the outcome of the samples with 93.5% accuracy (10 x 10-fold cross-validation)• Learning from a scrambled dataset (labels randomly assigned to samples) produced ~50% accuracy• If At1g27595>100.87 and At3g49000>68.13 and At2g40475>55.96 → Predict germination• If At4g34710>349.67 and At4g37760>150.75 and At1g30135>17.66 → Predict germination• If At3g03050>37.90 and At2g20630>96.01 and At3g02885>9.66 → Predict germination• If At5g54910>45.03 and At4g18975>16.74 and At3g28910>52.76 and At1g48320>56.80 → Predict germination• Everything else → Predict dormancy
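The rule set on this slide can be read as an ordered decision list with a default class. The sketch below encodes the four rules exactly as shown; the data layout (a dictionary of gene identifier to expression level) and the choice to treat a missing gene as expression 0 are assumptions made for illustration.

```python
# Each rule maps gene identifier -> threshold; a rule fires when every
# listed gene is expressed above its threshold (values from the slide).
GERMINATION_RULES = [
    {"At1g27595": 100.87, "At3g49000": 68.13, "At2g40475": 55.96},
    {"At4g34710": 349.67, "At4g37760": 150.75, "At1g30135": 17.66},
    {"At3g03050": 37.90, "At2g20630": 96.01, "At3g02885": 9.66},
    {"At5g54910": 45.03, "At4g18975": 16.74, "At3g28910": 52.76, "At1g48320": 56.80},
]


def predict(expression):
    """Return 'germination' if any rule fires, otherwise the default 'dormancy'.

    expression: dict mapping gene identifier -> expression level.
    Genes absent from the dict are treated as 0.0 (an assumption).
    """
    for rule in GERMINATION_RULES:
        if all(expression.get(gene, 0.0) > threshold for gene, threshold in rule.items()):
            return "germination"
    return "dormancy"
```

Since every condition in a rule must hold at the same time, each rule implicitly states that its genes act together, which is the property the co-prediction networks build on.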
    102. 102. Identifying regulators• The rule building process is stochastic – It generates different rule sets each time the system is run• But if we run the system many times, we can see some patterns in the rule sets – Some genes appear much more frequently than the rest – Some associated with dormancy – Some associated with germination
    103. 103. Known regulators appear with high frequency in the rules
    104. 104. Generating co-prediction networks of interactions• For each of the rules shown before to be true, all of the conditions in it need to be true at the same time – Each rule is expressing an interaction between certain genes• From a high number of rule sets we can identify pairs of genes that co-occur with high frequency and generate functional networks• The network shows a different topology when compared to other types of network construction methods (e.g. by gene co-expression)• Different regions in the network contain the germination and dormancy genes
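Counting how often genes, and pairs of genes, appear together in rules across many runs can be sketched as below. The representation (a rule as a collection of gene identifiers, a rule set as a list of rules) and the `min_count` cut-off are assumptions; the slides do not state which threshold was used to keep an edge.

```python
from collections import Counter
from itertools import combinations


def gene_statistics(rule_sets):
    """Count per-gene and per-pair occurrences over all rules of all rule sets."""
    gene_counts, pair_counts = Counter(), Counter()
    for rule_set in rule_sets:
        for rule in rule_set:
            genes = sorted(set(rule))  # sorted so each pair has one canonical order
            gene_counts.update(genes)
            pair_counts.update(combinations(genes, 2))
    return gene_counts, pair_counts


def co_prediction_edges(pair_counts, min_count):
    """Keep the gene pairs that co-occur in at least min_count rules."""
    return sorted(pair for pair, n in pair_counts.items() if n >= min_count)
```

The per-gene counts correspond to the frequency ranking used to spot candidate regulators, and the surviving pairs are the edges of the co-prediction network.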
    105. 105. Experimental validation• We have experimentally verified this analysis – By ordering and planting knockouts for the highly ranked genes – We have been able to identify four new regulators of germination, with a phenotype different from the wild type
    107. 107. Prediction of structural aspects of protein residues• Many of these features are due to local interactions of an amino acid and its immediate neighbours – Can they be predicted using information from the closest neighbours in the chain?• [Figure: residues Ri-5 … Ri+5 aligned with their secondary structure states SSi-5 … SSi+5; sliding windows Ri-1 Ri Ri+1 → SSi, Ri Ri+1 Ri+2 → SSi+1, Ri+1 Ri+2 Ri+3 → SSi+2] – In this simplified example, to predict the SS state of residue i we would use information from residues i-1, i and i+1, that is, a window of ±1 residues around the target
    108. 108. ARFF file for a simple PSP dataset
        @relation AA+CN_Q2
        @attribute AA_-4 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
        @attribute AA_-3 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
        @attribute AA_-2 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
        @attribute AA_-1 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
        @attribute AA {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
        @attribute AA_1 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
        @attribute AA_2 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
        @attribute AA_3 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
        @attribute AA_4 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
        @attribute class {0,1}
        @data
        X,X,X,X,A,E,I,K,H,0
        X,X,X,A,E,I,K,H,Y,0
        X,X,A,E,I,K,H,Y,Q,0
        X,A,E,I,K,H,Y,Q,F,0
        A,E,I,K,H,Y,Q,F,N,0
        E,I,K,H,Y,Q,F,N,V,0
        I,K,H,Y,Q,F,N,V,V,0
        K,H,Y,Q,F,N,V,V,M,1
        H,Y,Q,F,N,V,V,M,T,0
        Y,Q,F,N,V,V,M,T,C,1
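The data rows of this ARFF file follow directly from sliding a window of ±4 residues over the sequence, padding positions beyond the chain ends with `X`. The sketch below regenerates them; the function name is hypothetical, and the sequence prefix `AEIKHYQFNVVMTC` and the ten class labels are simply read off the rows shown on the slide.

```python
def window_instances(sequence, labels, half_width=4, pad="X"):
    """Build one comma-separated instance per labelled residue.

    Each instance holds the residues from i - half_width to i + half_width
    (padded with `pad` outside the chain) followed by the class of residue i.
    """
    padded = pad * half_width + sequence + pad * half_width
    rows = []
    for i, label in enumerate(labels):
        window = padded[i:i + 2 * half_width + 1]
        rows.append(",".join(window) + "," + str(label))
    return rows


rows = window_instances("AEIKHYQFNVVMTC", [0, 0, 0, 0, 0, 0, 0, 1, 0, 1])
```

The central attribute never takes the value `X`, because the target residue always exists, which is why its declared domain in the header has one symbol fewer than the others.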
    109. 109. What information do we include for each residue? – Early prediction methods used just the primary sequence, i.e. the AA types of the residues in the window – However, the primary sequence carries a limited amount of information • It does not contain any evolutionary information: it does not say which residues are conserved and which are not – Where can we obtain this information? • From Position-Specific Scoring Matrices, which are a product of a Multiple Sequence Alignment
    110. 110. Position-Specific Scoring Matrices (PSSM)– For each residue in the query sequence compute the distribution of amino acids of the corresponding residues in all aligned sequences (discarding those too similar to the query)– These distributions will tell us which mutations are likely and which mutations are less likely for each residue in the query sequence– In essence it is similar to a substitution matrix, but tailored for the sequence that we are aligning– A PSSM profile will also tell us which residues are more conserved and which residues are more subject to insertions or deletions
    111. 111. PSSM for the first 10 residues of 1n7lA
             A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
        A:   4  -1  -2  -2   0  -1  -1   0  -2  -1  -2  -1  -1  -2  -1   1   0  -3  -2   0
        M:  -1  -2  -3  -4  -2  -1  -2  -3  -2   1   2  -2   7   0  -3  -2  -1  -2  -1   1
        E:  -1   0   0   2  -4   2   6  -2   0  -4  -3   1  -2  -4  -1   0  -1  -3  -2  -3
        K:  -1   2   0  -1  -4   1   1  -2  -1  -3  -3   5  -2  -4  -1   0  -1  -3  -2  -3
        V:   0  -3  -3  -4  -1  -3  -3  -4  -4   3   1  -3   1  -1  -3  -2   0  -3  -1   5
        Q:  -1   1   0   0  -3   6   2  -2   0  -3  -3   1  -1  -4  -2   0  -1  -2  -2  -3
        Y:  -2  -1  -1  -3  -3  -1  -1  -3   6  -2  -2  -2  -1   2  -3  -2  -2   1   7  -2
        L:  -2  -3  -4  -4  -2  -3  -3  -4  -3   2   5  -3   2   0  -3  -3  -1  -2  -1   1
        T:   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   2   5  -3  -2   0
        R:  -2   6  -1  -2  -4   1   0  -3   0  -3  -3   2  -2  -3  -2  -1  -1  -3  -2  -3
    112. 112. Secondary Structure Prediction– The most usual way is to predict whether a residue belongs to an α helix, a β sheet, or is in coil state– Several programs can determine the actual SS state of a protein from a PDB file. The most common of them is DSSP– Typically, a window of ±7 amino acids (15 in total) is used. This means 300 attributes (when using PSSM)– A dataset with 1000 proteins with ~250AA/protein would have ~250000 instances
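The dataset sizes quoted on this slide follow from simple arithmetic, spelled out below. The helper names are hypothetical; the only assumption is that each residue contributes the 20 values of its PSSM row.

```python
def window_size(half_width):
    """Number of residues in a window of ±half_width around the target."""
    return 2 * half_width + 1


def n_attributes(half_width, values_per_residue=20):
    """Attributes per instance when each residue contributes one PSSM row."""
    return window_size(half_width) * values_per_residue


def n_instances(n_proteins, avg_length):
    """One instance per residue of every protein."""
    return n_proteins * avg_length


# A ±7 window has 15 residues, giving 15 x 20 = 300 attributes.
# 1000 proteins of ~250 residues give ~250000 instances.
```

The same formula gives 9 × 20 = 180 values for a ±4 window, which is consistent with the 180 values per residue quoted later for the RCH, SA and CN predictors.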
    113. 113. Secondary Structure Prediction• [Figure: the primary sequence (R1 R2 R3 … Rn-1 Rn) is converted through an MSA into the PSSM profile of the sequence (PSSM1 PSSM2 PSSM3 … PSSMn-1 PSSMn); the windows method generates a window of PSSM profiles (PSSMi-1 PSSMi PSSMi+1) from which SSi is predicted]
    114. 114. Other prediction problems• This same structure of prediction can be applied to most 1D structural aspects• However, many of these features are natively continuous measures (or integer)• To treat these problems as classification problems, we need to discretise the output• Unsupervised methods are applied – Uniform length (UL) and uniform frequency (UF) discretisation
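The two unsupervised discretisation schemes can be sketched as follows: uniform length places the cut points at equal distances across the value range, while uniform frequency places them so that each bin receives roughly the same number of examples. The function names and the tie-handling at the cut points are assumptions.

```python
def uniform_length_cuts(values, n_bins):
    """Cut points that split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]


def uniform_frequency_cuts(values, n_bins):
    """Cut points that put (roughly) the same number of values in each bin."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]


def discretise(value, cuts):
    """Index of the bin that value falls into (values on a cut go to the upper bin)."""
    return sum(value >= c for c in cuts)
```

On skewed measures the choice matters: with values such as 0, 1, 2, 3, 4, 10, 20, 100 and two bins, uniform length puts seven of the eight values in the first bin, whereas uniform frequency splits them four and four. This is how the same prediction problem can be made balanced or unbalanced.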
    115. 115. PSP datasets are good ML benchmarks• These problems can be modelled in many ways: – Regression or classification problems – Low/high number of classes – Balanced/unbalanced classes – Adjustable number of attributes• Ideal benchmarks !!• mark.html
    116. 116. Contact Map Prediction• We participated in the CASP9 competition• CASP = Critical Assessment of Techniques for Protein Structure Prediction. Biennial competition• Every day, for about three months, the organizers release some protein sequences for which nobody knows the structure (129 sequences were released in CASP9, in 2010)• Each prediction group is given three weeks to return their predictions• If the machinery is not well oiled, it is not feasible to participate !!• For CM, prediction groups have to return a list of predicted contacts (they are not interested in non-contacts) and, for each predicted pair of contacting residues, a confidence level
    117. 117. Contact Map prediction• Predict, given two residues from a chain, whether these two residues are in contact or not• This problem can be represented by a binary matrix: 1 = contact, 0 = non-contact• Plotting this matrix reveals many characteristics of the protein structure, such as helices and sheets
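A binary contact map can be derived from residue coordinates by thresholding pairwise distances. The sketch below assumes one representative 3D point per residue and a cut-off of 8 Å; that value is a commonly used convention, not something the slide specifies, and the function name is hypothetical.

```python
from math import dist


def contact_map(coords, threshold=8.0, min_separation=1):
    """Symmetric 0/1 matrix: 1 where two residues are closer than threshold.

    coords: list of (x, y, z) tuples, one representative atom per residue.
    min_separation: minimum sequence distance |i - j| for a pair to be considered.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap
```

In such a matrix, helices tend to show up as thick bands next to the main diagonal and sheets as lines parallel or perpendicular to it, which is why plotting it is informative.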
    118. 118. Steps for CM prediction (Nottingham method) 1. Prediction of – Secondary structure (using PSIPRED) – Solvent Accessibility – Recursive Convex Hull – Coordination Number (the last three using BioHEL [Bacardit et al., 09]) 2. Integration of all these predictions plus other sources of information 3. Final CM prediction (using BioHEL)
    119. 119. Prediction of RCH, SA and CN• We selected a set of 3262 protein chains from PDB-REPRDB with: – A resolution less than 2Å – Less than 30% sequence identity – No chain breaks or non-standard residues• 90% of this set was used for training (~490000 residues)• 10% for test
    120. 120. Prediction of RCH, SA and CN• All three features were predicted based on a window of ±4 residues around the target – Evolutionary information (as a Position-Specific Scoring Matrix) is the basis of this local information – Each residue is characterised by a vector of 180 values• The domain for all three features was partitioned into 5 states
    121. 121. Characterisation of the contact map problem• Three types of input information were used 1. Detailed information of three different windows of residues centered around – The two target residues (2x) – The middle point between them 2. Information about the connecting segment between the two target residues 3. Global protein information
    122. 122. Contact Map dataset• From the original set of 3262 proteins we kept all that had <250 AA and a randomly selected 20% of larger proteins• Still, the resulting training set contained 32 million pairs of AA and 631 attributes• Less than 2% of those are actual contacts• More than 60GB of disk space
    123. 123. Samples and ensembles• 50 samples of 660K examples are generated from the training set, with a ratio of 2:1 non-contacts/contacts• BioHEL is run 25 times for each sample• Prediction is done by a consensus of 1250 rule sets• Confidence of prediction is computed based on the votes distribution in the ensemble• The whole training process took about 25K CPU hours• [Figure: Training set → (x50) Samples → (x25) Rule sets → Consensus → Predictions]
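The sampling and voting scheme can be sketched as below. The function names are hypothetical, and using the fraction of rule sets voting "contact" as the confidence is an assumption about how the vote distribution is summarised; the slide only says confidence is based on that distribution.

```python
import random

N_SAMPLES = 50          # samples drawn from the training set
RUNS_PER_SAMPLE = 25    # BioHEL runs on each sample
N_RULE_SETS = N_SAMPLES * RUNS_PER_SAMPLE  # 1250 rule sets in the ensemble


def balanced_sample(contacts, non_contacts, n_contacts, ratio=2, rng=random):
    """Draw n_contacts contacts plus ratio times as many non-contacts."""
    return rng.sample(contacts, n_contacts) + rng.sample(non_contacts, ratio * n_contacts)


def consensus(votes):
    """Majority vote over the ensemble.

    votes: list of 0/1 predictions, one per rule set.
    Returns (predicted class, fraction of rule sets voting 'contact').
    """
    confidence = sum(votes) / len(votes)
    return (1 if confidence >= 0.5 else 0), confidence
```

Under-sampling the non-contacts to 2:1 counters the heavy class imbalance of the full dataset (fewer than 2% contacts), and drawing many different samples lets the ensemble still see a large share of the original data.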
    124. 124. Contact Map prediction in CASP• Predictor groups are asked to submit a list of predicted contacts and a confidence level for each prediction• The assessors then rank the predictions for each protein and look at the top L/x ones, where L is the length of the protein and x={5,10}• From these L/x top ranked contacts two measures are computed – Accuracy: TP/(TP+FP) – Xd: difference between the distribution of predicted distance and a random distribution
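The accuracy measure over the top L/x predictions can be sketched as follows. The data layout and function name are hypothetical, and rounding L/x down to an integer is an assumption; Xd is not covered here.

```python
def top_lx_accuracy(predictions, true_contacts, length, x):
    """Accuracy TP/(TP+FP) over the L/x most confident predicted contacts.

    predictions: list of (i, j, confidence) tuples.
    true_contacts: set of (i, j) pairs with i < j.
    length: protein length L; x: divisor (5 or 10 in CASP).
    """
    k = max(1, length // x)
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)[:k]
    if not ranked:
        return 0.0
    tp = sum((min(i, j), max(i, j)) in true_contacts for i, j, _ in ranked)
    return tp / len(ranked)
```

Because only the highest-confidence predictions are scored, a well-calibrated confidence value matters as much as the contact predictions themselves.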
    125. 125. CASP9 results These two groups derived contact predictions from 3D models
    126. 126. Understanding the rule sets• Each rule set has on average 135 rules• We have a total of 168470 rules• It is impossible to read all of them individually, but we can extract useful statistics• For instance, how often was each attribute used in the rules?
    127. 127. Distribution of frequency of use of attributes• All 631 attributes are actually used (min frequency=429)• However, some of them are used much more frequently than others
    128. 128. Top 10 attributes
        Attribute      Frequency   Count
        PredSS_r1_1    1.48%       18141
        PredCN_r1      1.66%       20336
        propensity     1.74%       21288
        PredSS_r2      1.75%       21350
        PredSS_r1      1.82%       22205
        PredRCH_r2     1.87%       22856
        PredRCH_r1     2.04%       24961
        PredSA_r2      2.12%       25891
        PredSA_r1      2.39%       29246
        separation     4.17%       50951
        • The four kinds of residue predictions are highly ranked
    130. 130. Motivation• PSP is a very costly process• As an example, one of the best PSP methods in CASP8, Rosetta@Home, could dedicate up to 10^4 computing years to predict a single protein’s 3D structure• One of the possible ways to alleviate this computational cost is to simplify the representation used to model the proteins
    131. 131. Target for reduction: the primary sequence• The primary sequence of a protein is a usual target for such simplification – It is composed of a quite high cardinality alphabet of 20 symbols, which share commonalities between them – One example of reduction widely used in the community is the hydrophobic-polar (HP) alphabet, reducing these 20 symbols to just two – The HP representation is usually too simple: too much information is lost in the reduction process [Stout et al., 06]• Can we automatically generate these reduced alphabets and tailor them to the specific problem at hand?
    132. 132. Automated Alphabet Reduction [Bacardit et al., 09]• We will use an automated information theory-driven method to optimize alphabet reduction policies for PSP datasets• An optimization algorithm will cluster the AA alphabet into a predefined number of new letters• Fitness function of optimization is based on the Mutual Information (MI) metric. A metric that quantifies the interrelationship between two discrete variables – Aim is to find the reduced representation that maintains as much relevant information as possible for the feature being predicted• Afterwards we will feed the reduced dataset into a learning method to verify if the reduction was proper
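The role of Mutual Information as a fitness function can be illustrated with a toy sketch. It scores a candidate grouping of amino acid letters by the MI between the reduced letter of a single residue and the class label; the actual method works on the full dataset and uses an optimization algorithm to search over groupings, neither of which is reproduced here. Function names and the example groupings are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """MI in bits between two equally long sequences of discrete symbols."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # sum over observed pairs of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum(c / n * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def reduce_sequence(residues, groups):
    """Rewrite residues using a reduced alphabet.

    groups: dict mapping each amino acid letter to its reduced letter.
    """
    return [groups[r] for r in residues]


def fitness(residues, labels, groups):
    """Score of a candidate reduction: information kept about the class."""
    return mutual_information(reduce_sequence(residues, groups), labels)
```

A grouping that merges letters behaving alike with respect to the target keeps the MI of the original alphabet, while one that merges letters with opposite behaviour destroys it; that difference is the signal the optimizer follows.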
    133. 133. Alphabet Reduction protocol• [Figure: Dataset (Card=20) → ECGA (fitness: Mutual Information; Size = N) → Dataset (Card=N) → BioHEL → Ensemble of rule sets → evaluated on the Test set → Accuracy]
    134. 134. Automated Alphabet Reduction• Competent 5-letter alphabet (similar performance to the AA alphabet)• Different alphabets for CN and SA domains• Unexpected explanations: alphabet reduction clustered AA types that experts did not expect
    135. 135. Automated Alphabet Reduction• Our method produces better reduced alphabets than other reduced alphabets from the literature and than other expert-designed ones
        Group                Alphabet     Letters   CN acc.    SA acc.    Diff.     Ref.
        ---                  AA           20        74.0±0.6   70.7±0.4   ---       ---
        ---                  Our method   5         73.3±0.5   70.3±0.4   0.7/0.4   [Bacardit et al., 07]
        From the literature  WW5          6         73.1±0.7   69.6±0.4   0.9/1.1   [Wang & Wang, 99]
        From the literature  SR5          6         73.1±0.7   69.6±0.4   0.9/1.1   [Solis & Rackovsky, 00]
        From the literature  MU4          5         72.6±0.7   69.4±0.4   1.4/1.3   [Murphy et al., 00]
        From the literature  MM5          6         73.1±0.6   69.3±0.3   0.9/1.4   [Melo & Marti-Renom, 06]
        Expert designed      HD1          7         72.9±0.6   69.3±0.4   1.1/1.4   [Bacardit et al., 07]
        Expert designed      HD2          9         73.0±0.6   69.3±0.4   1.0/1.4   [Bacardit et al., 07]
        Expert designed      HD3          11        73.2±0.6   69.9±0.4   0.8/0.8   [Bacardit et al., 07]
    136. 136. Efficiency gains from the alphabet reduction• We have extrapolated the reduced alphabet to the much larger and richer Position-Specific Scoring Matrices (PSSM) representation• Accuracy difference is still less than 1%• Obtained rule sets are simpler and training process is much faster• Performance levels are similar to recent works in the literature [Kinjo et al., 05][Dor and Zhou, 07]• Won the bronze medal of the 2007 Humies awards
    137. 137. Conclusions• Bioinformatics contains many challenges that computer science can tackle – Optimisation – Machine learning – Software engineering• Evolutionary computation has been shown to be very competitive across a large range of bioinformatics problems• Facing these challenges with EC has led to the development of many new methods
    138. 138. References/Bibliography• Journals – The Bioinformatics Journal – BMC Bioinformatics – BMC Biodata Mining• Bioinformatics books – Introduction to Bioinformatics by Arthur Lesk, Oxford University Press. – Introduction to Bioinformatics. A. Tramontano, Chapman and Hall/CRC• Specialised topics – Bioinformatics for –omics data. Methods and Protocols. Bernd Mayer (ed). Springer – Next-Generation Sequencing special issue of the Bioinformatics Journal; rationsequencing.html
    139. 139. References/Bibliography• J. Bacardit, M. Stout, J.D. Hirst, N. Krasnogor and J. Blazewicz, Coordination number prediction using Learning Classifier Systems: Performance and interpretability. In Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation (GECCO2006), pp. 247-254, ACM Press, 2006• Stout, M., Bacardit, J., Hirst, J.D. and Krasnogor, N. Prediction of Recursive Convex Hull Class Assignments for Protein Residues. Bioinformatics, 24(7):916-923, 2008• Stout, M., Bacardit, J., Hirst, J.D., Smith, R.E. and Krasnogor, N. Prediction of Topological Contacts in Proteins Using Learning Classifier Systems. Soft Computing Journal, 13(3):245- 258, 2009• J. Bacardit, E.K. Burke and N. Krasnogor. Improving the scalability of rule-based evolutionary learning. Memetic Computing journal 1(1):55-67, 2009• J. Bacardit, M. Stout, J.D. Hirst, A. Valencia, R.E. Smith and N. Krasnogor. Automated Alphabet Reduction for Protein Datasets. BMC Bioinformatics 10:6, 2009• George W. Bassel, Enrico Glaab, Julietta Marquez, Michael J. Holdsworth and Jaume Bacardit. Functional Network Construction in Arabidopsis Using Rule-Based Machine Learning on Large-Scale Data Sets. The Plant Cell, 23(9):3101-3116, 2011• J. Bacardit, P. Widera, A. Márquez-Chamorro, F. Divina, J.S. Aguilar-Ruiz and Natalio Krasnogor. Contact map prediction using a large-scale ensemble of rule sets and the fusion of multiple predicted structural features. Bioinformatics first published online July 25, 2012 doi:10.1093/bioinformatics/bts472
    140. 140. References/Bibliography• Jason H. Moore et al., Bioinformatics challenges for genome-wide association studies Bioinformatics (2010) 26(4): 445-455• Loris Nanni, Sheryl Brahnam, Alessandra Lumini, High performance set of PseAAC and sequence based descriptors for protein classification, Journal of Theoretical Biology 266(1):1-10, 2010• Fernando Otero et al., A hierarchical multi-label classification ant colony algorithm for protein function prediction, Memetic Computing 2(3):165-181, 2010• Daniel Barthel et al., Procksi: a decision support system for protein (structure) comparison, knowledge, similarity and information. BMC Bioinformatics, 8:416, 2007• Federico Divina and Jesus S. Aguilar-Ruiz. 2006. Biclustering of Expression Data with Evolutionary Computation. IEEE Trans. on Knowl. and Data Eng. 18, 5 (May 2006), 590-602.• Martinez-Ballesteros, M Nepomuceno-Chamorro, J C Riquelme (2011) Inferring gene-gene associations from Quantitative Association Rules In: 11th International Conference on Intelligent Systems Design and Applications (ISDA 2011 ) 1241 – 1246• Rubén Armañanzas, Iñaki Inza, Roberto Santana, Yvan Saeys, Jose Flores, Jose Lozano, Yves Peer, Rosa Blanco, Víctor Robles, Concha Bielza, Pedro Larrañaga. A review of estimation of distribution algorithms in bioinformatics. BioData Mining 2008, 1:6 (11 September 2008)
    141. 141. Acknowledgements• Prof. Natalio Krasnogor• Prof. Michael Holdsworth• Prof. Jonathan Hirst• Dr. Michael Stout• Dr. George Bassel• Dr. Enrico Glaab• Dr. Pawel Widera• EPSRC GR/T07534/01 & EP/H016597/1• EU FP7 CADMAD project
    142. 142. Introduction to Bioinformatics Dr. Jaume Bacardit Interdisciplinary Computing and Complex Systems (ICOS) research group University of Nottingham