# BITS: Basics of sequence analysis

Module 3 Sequence analysis.

Part of training session "Basic Bioinformatics concepts, databases and tools" - http://www.bits.vib.be/training

Published in: Education, Technology
### BITS: Basics of sequence analysis

1. Basic bioinformatics concepts, databases and tools. Module 3: Sequence analysis. Joachim Jacob, http://www.bits.vib.be. Updated Feb 2012. http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod3-intro_H1_2012_SeqAn.pdf
2. In this third module, we will discuss the possible analyses of sequences.
    - Module 1: Sequence databases and keyword searching
    - Module 2: Sequence similarity
    - Module 3: Sequence analysis: types, interpretation, results
3. In this third module, we will discuss the possible analyses of sequences.
4. Sequence analysis tries to read sequences to infer biological properties. Example: AGCTACTACGGACTACTAGCAGCTACCTCTCTG
    - Is this a coding sequence?
    - Can this sequence bind a certain transcription factor (TF)?
    - What is the melting temperature?
    - What is the GC content?
    - Does it fold into a stable secondary structure?
    - …
5. Tools that can predict a biological feature are trained with examples. Automatic annotation vs. experimentally verified annotations:
    - A training dataset of sequences (← experimentally verified)
    - An algorithm defines the parameters used for prediction
    - The algorithm determines/classifies whether the sequence(s) contain the feature (→ automatic annotation)
6. The assumption that biological function can be read from sequence is the central paradigm. DNA → protein sequence → structure → activity (binding, enzymatic activity, regulatory, ...)
    - So the premise for doing analysis: biological function can be read from the (DNA) sequence.
    - Predictions always serve as a basis for further experiments.
7. Analysis can be as simple as measuring properties or predicting features.
    - Protein
        - Metrics (e.g. how many alanines are in my sequence)
        - Modifications and other predictions
        - Domains and motifs
    - DNA/RNA
        - Metrics (e.g. GC content)
        - Predicting: gene prediction, promoter, structure
8. Simple protein sequence analysis. One might be interested in:
    - pI (isoelectric point) prediction
    - Composition metrics
    - Hydrophobicity calculation
    - Reverse translation (protein → DNA)
    - Occurrence of simple patterns (e.g. does KDL occur, and how many times)
    - ...
    - Links: http://en.wikipedia.org/wiki/Hydrophobicity_scales and http://www.sigmaaldrich.com/life-science/metabolomics/learning-center/amino-acid-reference-cha
9. Protein sequence analysis tools are gathered on Expasy: http://www.expasy.org/tools (SIB). Others:
    - http://www.ebi.ac.uk/Tools/protein.html
    - http://bioweb.pasteur.fr/protein/intro-en.html
    - SMS2
10. Never trust a tool's output blindly. Interpreting depends on the kind of output. When a prediction result is obtained, the question arises: is it true (in the biological sense)?
    - Programs giving a binary result: 1 or 0, a hit or a miss. Approach: compare different prediction programs for higher confidence. E.g. ScanProsite for a motif.
    - Programs giving a score/P-value result: the chance that the result is not real → the lower, the better. Approach: assess the p-value. E.g. SignalP for signal peptide prediction.
11. The basis for the prediction of features is nearly always a sequence alignment. Based on experimentally verified sequence annotations, a multiple sequence alignment is constructed. Different methods exist to capture the information gained from this multiple sequence alignment.
12. Alignment reveals similar residues, which can indicate identical structure. Same structure, hence most likely same function. Most protein pairs with more than 25-30 out of 100 identical residues were found to be structurally similar. Otherwise, chances are that the structure is not the same. Still, proteins with <10% identity can also have similar structure. http://peds.oxfordjournals.org/content/12/2/85.long
13. The structure of a protein sequence determines its biological function. Structure levels, with the number of reported structures:
    - Primary = AA chain (Feb 2012: ~535 000 in Swissprot)
    - Secondary = structural entities (helix, beta-strands, beta-sheets, loops)
    - Tertiary = 3D (Nov 2011: ~80 000 in PDB)
    - Quaternary = interactions
    - http://en.wikipedia.org/wiki/Protein_structure
14. Degree of similarity with other sequences varies over the length. [Figure: alignment of homologous histone H1 protein sequences, with the more conserved region marked.]
15. Protein sequences can consist of structurally different parts.
    - Domain: part of the tertiary structure of a protein that can exist, function and evolve independently of the rest, linked to a certain biological function.
    - Motif: part (not necessarily contiguous) of the primary structure of a protein that corresponds to the signature of a biological function. Can be associated with a domain.
    - Feature: part of the sequence for which some annotation has been added. Some features correspond to domain or motif assignments.
16. Based on motifs and domains, proteins are assigned to families. A protein family is nearly synonymous with a gene family: evolutionarily related proteins. Significant structural similarity of domains is reflected in sequence similarity, and is due to a common ancestral sequence part, resulting in domain families.
17. Domains and motifs are represented by simple and complex methods. [Figure: a domain shown as a gapped alignment.] In silico, a motif/domain can be represented by:
    1. Regular expression / pattern
    2. Frequency matrix / profile
    3. Machine learning techniques: Hidden Markov Model

    http://bioinfo.uncc.edu/zhx/binf8312/lecture-7-SequenceAnalyses.pdf
18. Regular expressions / patterns are the simplest way to represent motifs. A pattern is a representation of all residues with equal probability.

    Alignment (positions 1 to 6): ATPKAE, KKPKAA, AKPKAK, TKPKPA, AKPKT-, AKPAAK, KLPKAD, AKPKAA

    Pattern: [AKT] [AKLT] P [AK] [APT] [ADEKT-]

    Consensus (for every position, the most frequently occurring residue): AKPKAA

    Do these sequences match the pattern? (V = position matches, X = position does not match)

    | Query | 1 | 2 | 3 | 4 | 5 | 6 |
    |---|---|---|---|---|---|---|
    | AKPKTE | V | V | V | V | V | V |
    | KKPETE | V | V | V | X | V | V |
    | TLPATE | V | V | V | V | V | V |
19. Frequency matrices or profiles include the chance of observing the residues. For every position of a motif, a list of all amino acids is made with their frequency. This gives a position-specific weight/scoring matrix, or profile, which is a more sensitive representation than a pattern.

    Alignment (positions 1 to 6): ATPKAE, KKPKAA, AKPKAK, TKPKPA, AKPKT-, AKPAAK, KLPKAD, AKPKAA. Consensus: AKPKA-

    | Residue | 1 | 2 | 3 | 4 | 5 | 6 |
    |---|---|---|---|---|---|---|
    | A | 5/8 | 0 | 0 | 1/8 | 6/8 | 3/8 |
    | D | 0 | 0 | 0 | 0 | 0 | 1/8 |
    | E | 0 | 0 | 0 | 0 | 0 | 1/8 |
    | K | 2/8 | 6/8 | 0 | 7/8 | 0 | 2/8 |
    | L | 0 | 1/8 | 0 | 0 | 0 | 0 |
    | P | 0 | 0 | 1 | 0 | 1/8 | 0 |
    | T | 1/8 | 1/8 | 0 | 0 | 1/8 | 0 |
    | - | 0 | 0 | 0 | 0 | 0 | 1/8 |
    | Sum | 1 | 1 | 1 | 1 | 1 | 1 |

    Queries: AKPKTE, KKPETE, TLPATE. Example: http://expasy.org/prosite/PS51092 and http://prosite.expasy.org/prosuser.html#meth2
20. How well a sequence matches a profile is reported with a score. PSWM scores for the alignment ATPKAE, KKPKAA, AKPKAK, TKPKPA, AKPKT-, AKPAAK, KLPKAD, AKPKAA (consensus: AKPKA-):

    | Residue | 1 | 2 | 3 | 4 | 5 | 6 |
    |---|---|---|---|---|---|---|
    | A | 2.377 | -2.358 | -2.358 | 0.257 | 2.631 | 1.676 |
    | D | -2.358 | -2.358 | -2.358 | -2.358 | -2.358 | 0.257 |
    | E | -2.358 | -2.358 | -2.358 | -2.358 | -2.358 | 0.257 |
    | K | 1.134 | 2.631 | -2.358 | 2.847 | -2.358 | 1.134 |
    | L | -2.358 | 0.257 | -2.358 | -2.358 | -2.358 | -2.358 |
    | P | -2.358 | -2.358 | 0.257 | -2.358 | 0.257 | -2.358 |
    | T | 0.257 | 0.257 | -2.358 | -2.358 | 0.257 | -2.358 |

    | Query | Score |
    |---|---|
    | AKPKTE | 11.4 |
    | KKPETE | 5.0 |
    | TLPATE | 4.3 |

    http://prosite.expasy.org/prosuser.html#meth2
21. A hidden Markov model also takes into account the gaps in an alignment. [Figure: the schematic representation of an HMM.] http://www.myoops.org/twocw/mit/NR/rdonlyres/Electrical-Engineering-
22. Building an HMM from a multiple sequence alignment.
23. Use HMMER to very sensitively search a protein database with an HMM. You can search with a profile in a sequence database.
24. Some profile adjustments to the BLAST protocol exist for particular purposes.
    - PSI-BLAST (position specific iterated), to identify distantly related proteins: after a search result, a profile is made of the similar sequences, and this is used again to search a database.
    - PHI-BLAST (pattern hit initiated), for proteins matching a pattern: you provide a pattern, which all BLAST results should satisfy.
    - CSI-BLAST is more sensitive than PSI-BLAST in identifying distantly related proteins.
    - PSI-BLAST: http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE=Proteins&PROGRAM=b
    - PHI-BLAST: http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE=Proteins&PROGRAM=b
    - CSI-BLAST: http://toolkit.tuebingen.mpg.de/cs_blast
25. Many databases exist that keep patterns, profiles or models related to function. Motif / domain databases (see the NCBI bookshelf for a good overview):
    - http://www.ebi.ac.uk/interpro/ (integrated database)
    - http://expasy.org/prosite/ (motifs)
    - PFAM: hidden Markov profiles (domains)
    - CDD (Conserved Domains Database) (NCBI, integrated)
    - Prodom (domains, automatic extraction)
    - SMART (domains)
    - PRINTS (motifs): sets of local alignments without gaps, used as frequency matrices, made by searching manually made "seed alignments" against UniProt sequences
26. Prosite is a database gathering patterns from sequence alignments. The ScanProsite tool searches the Prosite database for a pattern (present or not). Example: [DE](2)-H-S-{P}-x(2)-P-x(2,4)-C>. You can retrieve sequences that correspond to a pattern, whether you made it up yourself, observed it in an alignment, or took a known one. The syntax is specific, but not difficult: http://prosite.expasy.org/scanprosite/scanprosite-doc.html#pattern_syntax
27. InterPro classifies the protein data into families based on the domains and motifs. InterPro takes all existing motif and domain databases as input (signatures), and aligns them to create protein domain families. This reduces redundancy. Each domain is then given an identifier IPRxxxxxxx. Differences in size between motifs and families are handled by relations: parent - child and contains - found in. Entry types include families, regions, domains, ... http://www.ebi.ac.uk/interpro/user_manual.html#type
28. InterPro summarizes domains and motifs from a dozen domain databases. http://www.ebi.ac.uk/interpro/databases.html and ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/README.html#2
29. InterPro entries are grouped in types.
    - Family: entries span the complete sequence
    - Domain: biologically functional units
    - Repeat
    - Region
    - Conserved site
    - Active site
    - Binding site
    - PTM site
30. InterPro entries are grouped in types.
31. You can search your sequence for known domains on InterProScan. InterProScan: http://www.ebi.ac.uk/Tools/pfa/iprscan/
32. A sequence logo provides a visual summary of a motif. Create a good-looking logo of a motif sequence: the size of the letters indicates their frequency.
    - WebLogo: a basic web application to create colorful logos
    - iceLogo: a powerful web application to create customized logos
33. A sequence logo provides a visual summary of a motif. [Figure: iceLogo of the alignment ATPKAE, KKPKAA, AKPKAK, TKPKPA, AKPKT-, AKPAAK, KLPKAD, AKPKAA; consensus: AKPKA-.] http://www.bits.vib.be/wiki/index.php/Exercises_on_multiple_sequence_alignment#Sequence_logo
34. There is always a chance that a prediction of a feature by a tool is false. [Figure: two plots of number of matches versus score, each with a score threshold. "Ideal situation" shows true negatives and true positives only; "Reality of the databases" also shows false negatives and false positives around the threshold.]
35. Assessing the performance of categorizing tools with sensitivity and specificity. The "confusion matrix" compares truth with prediction:

    | TRUTH \ PREDICTION | Feature is predicted | Feature is NOT predicted |
    |---|---|---|
    | Sequence contains feature | True positive | False negative ("Type II error") |
    | Sequence does NOT contain feature | False positive ("Type I error") | True negative |

36. Assessing the performance of categorizing tools with sensitivity and specificity. Sensitivity = TP / (TP + FN), computed over the sequences that contain the feature.
37. Assessing the performance of categorizing tools with sensitivity and specificity. Selectivity or specificity = TN / (FP + TN), computed over the sequences that do NOT contain the feature.
38. Assessing the performance of categorizing tools with sensitivity and specificity. Error rate (misclassification rate) = (FP + FN) / total.
39. Assessing the performance of categorizing tools with sensitivity and specificity. Accuracy = (TP + TN) / total.
40. Protein sequences can be searched for potential modifications.
    - http://www.expasy.org/tools/, e.g. modification (phosphorylation, acetylation, ...)
    - To deal with the confidence in the results, try different tools, and make a graph (Venn diagram) to compare the results.
    - E.g. predict secreted proteins with SignalP and RPSP, and combine the results in a Venn diagram: http://bioinformatics.psb.ugent.be/webtools/Venn/ or http://www.cmbi.ru.nl/cdd/biovenn/
    - Overview of signal peptide prediction tools: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2788353/
41. Protein sequences can be searched for secondary structural elements. Based on known structures, machine learning models of secondary structure elements are made and can be searched for. See http://bioinf.cs.ucl.ac.uk/psipred/
42. In case of multiple analyses on multiple sequences, mark instead of filter. [Figure: starting from a set of sequences, "Worse" applies analysis filters 1, 2 and 3 one after the other; "Better" runs analyses 1, 2 and 3 side by side on all sequences.] After performing all analyses on all sequences, different filters on the results can be applied (e.g. secreted sequence, ! phosphorylated and containing a motif).
43. NA sequences
44. NA sequence analyses.
    - GC%: http://mobyle.pasteur.fr/cgi-bin/portal.py?#forms::geecee
    - Melting temperature: for primer development, such as with Primer3
    - Structure
    - Codon usage: codon usage table with cusp, codon adaptation index calculation with cai
    - ...
    - A lot of tools can be found at the Mobyle Portal.
45. Profiles and models are being used to model biological function in NA seqs. To detect transcription factor binding sites:
    - TRANSFAC: commercial (BIOBASE, Wolfenbüttel, Germany), started as work of Edgar Wingender, contains eukaryotic binding sites as consensus sequences and as PSSMs. Also TRANSCompel with modules of binding sites.
    - ooTFD: commercial (IFTI, Pittsburgh PA, USA), started as work of David Ghosh, contains prokaryotic and eukaryotic binding sites as consensus sequences and as PSSMs.
    - JASPAR: open access, only representative sets of higher eukaryote binding sites as PSSMs. Can be searched against a sequence or sequence pair at ConSite.
    - ORegAnno: open access, collection of individual eukaryotic binding sites with their localization in the genome.
    - PAZAR: collection of open access TF databanks.
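46. Sequence logos can give an insight into the important residues of binding sites. DNA: an entry from JASPAR, the TATA box (counts per position):

    | Base | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | A | 61 | 16 | 352 | 3 | 354 | 268 | 360 | 222 | 155 | 56 | 83 | 82 | 82 | 68 | 77 |
    | C | 145 | 46 | 0 | 10 | 0 | 0 | 3 | 2 | 44 | 135 | 147 | 127 | 118 | 107 | 101 |
    | G | 152 | 18 | 2 | 2 | 5 | 0 | 10 | 44 | 157 | 150 | 128 | 128 | 128 | 139 | 140 |
    | T | 31 | 309 | 35 | 374 | 30 | 121 | 6 | 121 | 33 | 48 | 31 | 52 | 61 | 75 | 71 |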
47. The RNA world has the Vienna servers: http://rna.tbi.univie.ac.at/
    - secondary structure prediction of ribosomal sequences
    - siRNA design
48. RNA families can be modeled by conserved bases and structure. RNA motifs: http://rfam.sanger.ac.uk/search. Rfam is a databank of RNA motifs and families. It is made at the Sanger Centre (Hinxton, UK) from a subset of EMBL (well-annotated standard sequences excluding synthetic sequences, plus the WGS) using the INFERNAL suite of Sean Eddy. It contains local alignments with gaps, with secondary structure annotation included, plus covariance models (CMs).
49. Some interesting links.
    - Nucleic acid structure: Unafold, a program accessible through a web interface. After designing primers, you might want to check whether the primer product does (not) adopt a stable secondary structure.
    - Some collections of links: a good overview at http://www.imb-jena.de/RNA.html, and the European Ribosomal RNA database (VIB PSB).
50. Prediction of genes in genomes relies on the integration of multiple signals.
    - Signals surrounding the gene (transcription factor binding sites, promoters, transcription terminators, splice sites, polyA sites, ribosome binding sites, ...) → profile matching
    - Differences in composition between coding and noncoding DNA (codon preference), the presence of an Open Reading Frame (ORF) → compositional analyses
    - Similarity with known genes, aligning ESTs and (in translation) similarity with known proteins and the presence of protein motifs → similarity searches
51. Prediction of genes in genomes relies on the integration of multiple signals. [Figure: three kinds of evidence. Signals: e.g. potential methylation sites (profiles). Composition: GC. Similarity: alignment of ESTs.]
52. Software for predicting genes.
    - Simple software under EMBOSS: syco (codon frequency), wobble (%GC 3rd base), tcode (Fickett statistic: correlation between bases at distance 3).
    - Examples of software using an HMM model of a gene:
        - Wise2: also uses similarity with known proteins. http://www.ebi.ac.uk/Tools/Wise2
        - GENSCAN: commercial (Chris Burge, Stanford U.) but free for academics, has models for human/A. thaliana/maize, used at EBI and NCBI for genome annotation. http://mobyle.pasteur.fr/cgi-bin/portal.py?#forms::genscan
        - GeneMark: commercial (GeneProbe, Atlanta GA, USA) but free for academic users, developed by Mark Borodovsky, has models for many prokaryotic and eukaryotic organisms. http://exon.gatech.edu
    - Tutorial on gene prediction: http://www.embl.de/~seqanal/courses/spring00/GenePred.00.html
53. Short addendum about downloading files.
    - FTP (file transfer protocol), e.g. ftp://ftp.ebi.ac.uk/pub/databases/interpro/: most browsers have an integrated FTP client; free, easy to download files, possibility to resume after failures.
    - HTTP, e.g. http://www.ncbi.nlm.nih.gov/entrez: standard protocol for internet traffic; slowest method.
    - Aspera: for downloads of large datasets (>10 GB); in use at the Short Read Archive (SRA); fastest method currently available.
54. Conclusion.
    - Prediction vs. experimentally verified
    - Different algorithms need to be compared
    - Predictions need to be validated by an independent method
    - Software <-> Databases
    - Predictions are always only a basis for further wet-lab research
    - Questions? Get social! → www.seqanswers.com → http://biostar.stackexchange.com
55. Summary
    - In this third module, we will discuss the possible analyses of sequences
    - Sequence analysis tries to read sequences to infer biological properties
    - Tools that can predict a biological feature are trained with examples
    - The assumption that biological function can be read from sequence is the central paradigm
    - Analysis can be as simple as measuring properties or predicting features
    - Protein sequence analysis tools are gathered on Expasy
    - Never trust a tool's output blindly
    - The basis for the prediction of features is nearly always a sequence alignment
    - Alignment reveals similar residues which can indicate identical structure
    - The structure of a protein sequence determines its biological function
    - Degree of similarity with other sequences varies over the length
    - Protein sequences can consist of structurally different parts
    - Based on motifs and domains, proteins are assigned to families
    - Domains and motifs are represented by simple and complex methods
    - Regular expressions / patterns are the simplest way to represent motifs
    - Frequency matrices or profiles include the chance of observing the residues
    - How well a sequence matches a profile is reported with a score
    - A hidden Markov model also takes into account the gaps in an alignment
    - Use HMMER to very sensitively search a protein database with an HMM
    - Some profile adjustments to the BLAST protocol exist for particular purposes
    - Many databases exist that keep patterns, profiles or models related to function
    - Prosite is a database gathering patterns from sequence alignments
    - InterPro classifies the protein data into families based on the domains and motifs
    - InterPro summarizes domains and motifs from a dozen domain databases
    - You can search your sequence for known domains on InterProScan
    - A sequence logo can provide a visual summary of a motif
    - Protein sequences can be searched for potential modifications
    - Protein sequences can be searched for secondary structural elements
    - In case of multiple analyses on multiple sequences, mark instead of filter
    - Profiles and models are being used to model biological function in NA seqs
    - Sequence logos can give an insight into the important residues of binding sites
    - The RNA world has the Vienna servers
    - RNA families can be modeled by conserved bases and structure
    - Prediction of genes in genomes relies on the integration of multiple signals
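
The pattern matching on slide 18 can be sketched with an ordinary regular expression. This is a minimal illustration, not the method of any particular tool: the names `POSITIONS`, `matches_motif` and `per_position` are made up for this example, and the gap symbol `-` in the last column is simply treated as one more allowed character.

```python
import re

# Allowed residues per alignment column, copied from the pattern on slide 18.
# '-' is the alignment gap symbol and is kept here as a literal character.
POSITIONS = ["AKT", "AKLT", "P", "AK", "APT", "ADEKT-"]

# One character class per column, e.g. "[AKT][AKLT][P]...".
MOTIF = re.compile("".join("[" + re.escape(p) + "]" for p in POSITIONS))


def matches_motif(seq):
    """Return True if the whole sequence fits the pattern."""
    return MOTIF.fullmatch(seq) is not None


def per_position(seq):
    """Per-column verdicts as on the slide: 'V' for a match, 'X' for a mismatch."""
    return ["V" if res in allowed else "X" for res, allowed in zip(seq, POSITIONS)]


for query in ("AKPKTE", "KKPETE", "TLPATE"):
    print(query, matches_motif(query), " ".join(per_position(query)))
```

A pattern only answers yes or no: AKPKTE and TLPATE both match, although TLPATE is built from much rarer residues.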
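
The frequency matrix of slide 19 can be derived directly from the eight aligned sequences. The sketch below counts residues per column and, as an assumption made only for illustration, scores a query by multiplying the observed frequencies; this is not the PSWM scoring behind the numbers on slide 20, and the function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

# The example alignment used on slides 18 to 20.
ALIGNMENT = ["ATPKAE", "KKPKAA", "AKPKAK", "TKPKPA",
             "AKPKT-", "AKPAAK", "KLPKAD", "AKPKAA"]


def frequency_profile(alignment):
    """One dict per column, mapping residue -> observed frequency."""
    n = len(alignment)
    return [{res: Fraction(count, n) for res, count in Counter(column).items()}
            for column in zip(*alignment)]


def match_probability(profile, query):
    """Product of per-column frequencies (illustrative score, not the PSWM of slide 20)."""
    p = Fraction(1)
    for column, res in zip(profile, query):
        p *= column.get(res, Fraction(0))  # unseen residue -> frequency 0
    return p


profile = frequency_profile(ALIGNMENT)
for query in ("AKPKTE", "KKPETE", "TLPATE"):
    print(query, match_probability(profile, query))
```

Unlike the pattern, the profile ranks AKPKTE well above TLPATE. A residue that was never observed gives a product of zero, which is one reason real tools convert frequencies into scores with pseudocounts.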
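
To see what the Prosite pattern on slide 26 means, it can help to translate it into a regular expression. The converter below is a simplified sketch written for this example: `prosite_to_regex` is a hypothetical helper, not part of ScanProsite, and it handles only the common syntax elements (`x`, `[..]`, `{..}`, repetitions, and the `<` / `>` anchors outside brackets).

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple Prosite pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):        # '<' anchors the pattern at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # '>' anchors the pattern at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as "(2)" or "(2,4)".
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":                    # any residue
            core = "."
        elif core.startswith("{"):         # {P} means: anything except P
            core = "[^" + core[1:-1] + "]"
        parts.append(prefix + core + ("{" + repeat + "}" if repeat else "") + suffix)
    return "".join(parts)


regex = prosite_to_regex("[DE](2)-H-S-{P}-x(2)-P-x(2,4)-C>")
print(regex)
print(bool(re.search(regex, "AADEHSAGGPTTC")))   # made-up sequence that fits
print(bool(re.search(regex, "AADEHSPGGPTTC")))   # P right after H-S is forbidden
```

The two test sequences are invented for this example; real searches should go through ScanProsite, which implements the full syntax.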
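
The four performance measures of slides 36 to 39 follow directly from the confusion matrix counts. The sketch below uses made-up counts (40 TP, 10 FN, 5 FP, 45 TN) purely for illustration; the class name and its layout are assumptions of this example.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    tp: int  # feature present and predicted
    fn: int  # feature present but NOT predicted
    fp: int  # feature absent but predicted
    tn: int  # feature absent and NOT predicted

    @property
    def total(self):
        return self.tp + self.fn + self.fp + self.tn

    def sensitivity(self):
        return self.tp / (self.tp + self.fn)

    def specificity(self):
        return self.tn / (self.fp + self.tn)

    def error_rate(self):
        return (self.fp + self.fn) / self.total

    def accuracy(self):
        return (self.tp + self.tn) / self.total


# Hypothetical benchmark of a prediction tool on 100 annotated sequences.
cm = ConfusionMatrix(tp=40, fn=10, fp=5, tn=45)
print("sensitivity", cm.sensitivity())
print("specificity", cm.specificity())
print("error rate ", cm.error_rate())
print("accuracy   ", cm.accuracy())
```

Error rate and accuracy always add up to 1, so reporting sensitivity and specificity separately says more about where a tool goes wrong.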
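
The GC% metric from slides 4 and 44 is simple enough to compute by hand. As a minimal sketch, here it is applied to the example sequence of slide 4; `gc_content` is a hypothetical helper, and tools such as geecee do the same job on real data.

```python
def gc_content(seq):
    """Return (number of G and C bases, GC percentage) for a DNA sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return gc, 100.0 * gc / len(seq)


# Example sequence from slide 4.
SEQ = "AGCTACTACGGACTACTAGCAGCTACCTCTCTG"

gc_count, gc_percent = gc_content(SEQ)
print(f"{gc_count} of {len(SEQ)} bases are G or C ({gc_percent:.1f}%)")
```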
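
The JASPAR count matrix on slide 46 can be turned into a consensus sequence and into per-position frequencies, which are the quantities a sequence logo visualizes. This is a sketch under simple assumptions: the consensus takes the most frequent base per column, and the function names are invented for the example.

```python
# Count matrix copied from slide 46 (JASPAR TATA box entry), 15 positions.
COUNTS = {
    "A": [61, 16, 352, 3, 354, 268, 360, 222, 155, 56, 83, 82, 82, 68, 77],
    "C": [145, 46, 0, 10, 0, 0, 3, 2, 44, 135, 147, 127, 118, 107, 101],
    "G": [152, 18, 2, 2, 5, 0, 10, 44, 157, 150, 128, 128, 128, 139, 140],
    "T": [31, 309, 35, 374, 30, 121, 6, 121, 33, 48, 31, 52, 61, 75, 71],
}


def consensus(counts):
    """Most frequent base at every position."""
    length = len(next(iter(counts.values())))
    return "".join(max(counts, key=lambda base: counts[base][i])
                   for i in range(length))


def frequencies(counts):
    """Convert counts to per-position frequencies (each column sums to 1)."""
    length = len(next(iter(counts.values())))
    totals = [sum(counts[base][i] for base in counts) for i in range(length)]
    return {base: [counts[base][i] / totals[i] for i in range(length)]
            for base in counts}


print(consensus(COUNTS))
print(round(frequencies(COUNTS)["T"][3], 3))  # T at position 4
```

The strongly conserved positions 2 to 8 spell out the TATA box, while the flanking positions are much less informative.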
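
The advice of slide 42, mark instead of filter, can be sketched as follows: run every analysis on every sequence, store the verdicts, and only then combine them. The three analyses and the sequences below are toy stand-ins invented for this example; in practice each entry would wrap a real prediction tool.

```python
# Hypothetical stand-ins for real prediction tools.
ANALYSES = {
    "has_KDL": lambda seq: "KDL" in seq,
    "starts_with_M": lambda seq: seq.startswith("M"),
    "at_least_10_aa": lambda seq: len(seq) >= 10,
}


def mark_all(sequences):
    """Run every analysis on every sequence and keep all verdicts."""
    return {name: {analysis: test(seq) for analysis, test in ANALYSES.items()}
            for name, seq in sequences.items()}


def select(marks, **wanted):
    """Pick the sequences whose marks equal the requested values."""
    return [name for name, verdicts in marks.items()
            if all(verdicts[key] == value for key, value in wanted.items())]


# Made-up sequences for illustration only.
SEQS = {"seq1": "MKDLAAGTPKAE", "seq2": "AKPKTE", "seq3": "MAKPKAAKDL"}

marks = mark_all(SEQS)
print(select(marks, has_KDL=True, starts_with_M=True))
print(select(marks, has_KDL=False))
```

Because no sequence is discarded along the way, a different combination of criteria can be tried later without rerunning any analysis.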