Outline Workshop chronology on handouts Brief background information Applications & role Bioinformatics tools Practical classes Problem-solving exercises What’s expected of you? Questions/comments are welcome at any point
Aims To introduce the concepts and language of bioinformatics. To provide an understanding of how nucleic acid and protein sequence data is obtained and analysed. To develop skills in utilising online databases and interpreting data. To develop an understanding of how bioinformatics can be applied to solve specific problems in biomedical science. To develop transferable IT and communications skills.
In this workshop… You will learn how data is generated and analysed, what the generated data can tell us about the molecular biology of organisms, and various practical applications of this knowledge
Why bioinformatics? Over the past decade massive amounts of sequence data have been generated This has more recently been joined by gene expression data obtained from microarrays and proteomic technologies This vast amount of data can only be analysed using various specialised computer algorithms
Main Topics (Review) Genome organisation and analysis Functional genomics Advanced techniques in molecular biology Archives, information retrieval and alignments: Nucleic acid sequence databases; genome databases; protein sequence databases; database searching; Dot plots (SIMILARITY MATRIX) and sequence alignments (PSI-BLAST); Genome expression: Microarray analysis, proteomics, eukaryotic genome expression
Five Websites that all biologists should know
NCBI (The National Center for Biotechnology Information) http://www.ncbi.nlm.nih.gov/
EBI (The European Bioinformatics Institute) http://www.ebi.ac.uk/
The Canadian Bioinformatics Resource http://www.cbr.nrc.ca/
SwissProt/ExPASy (Swiss Bioinformatics Resource) http://expasy.cbr.nrc.ca/sprot/
PDB (The Protein Databank) http://www.rcsb.org/PDB/
Remember while using web server-based tools You are using someone else’s computer You are (probably) getting a reduced set of options or capacity Servers are great for sporadic or proof-of-principle work, but for intensive work, the software should be obtained and run locally
Human Gene Index Database HGI is a database of expressed DNA sequences, mostly made of ESTs, which are a type of partial cDNA EST stands for Expressed Sequence Tag These short sequences were created using essentially the same method used to make cDNAs As such they represent the expressed part of a genome and are made from mRNA which is ultimately expressed from GENES
Similarity Searching There are a variety of computer programs that are used for making comparisons between DNA sequences. The most popular is known as BLAST (Basic Local Alignment Search Tool) BLAST is free at the NCBI website
BLAST is Complex Similarity searching relies on the concepts of alignment and distance between pairs of sequences. Distances can only be measured between aligned sequences (match vs. mismatch at each position). A similarity search is a process of testing the best alignment of a query sequence with every sequence in a database.
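The idea that distance is only defined between aligned sequences can be illustrated with a minimal Python sketch. It assumes the two sequences are already aligned to the same length (gap characters such as "-" simply count as mismatches); the function names are illustrative and are not part of BLAST.

```python
def aligned_distance(seq_a: str, seq_b: str) -> int:
    """Count mismatched positions between two already-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # Compare position by position: each mismatch adds 1 to the distance.
    return sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a != b)


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions that match exactly."""
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = len(seq_a) - aligned_distance(seq_a, seq_b)
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    print(aligned_distance("GATTACA", "GACTATA"))
    print(round(percent_identity("GATTACA", "GACTATA"), 1))
```

A real similarity search also has to find the best alignment first; this sketch only covers the scoring step once an alignment is given.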
Workshop 1 (database search & inference of possible homology) Please refer to “Getting started with bioinformatics”. INTRO TO BLAST (Basic Local Alignment Search Tool) It is used to compare a query sequence with those contained in nucleotide databases by aligning the query sequence with previously characterised genes, thereby helping to identify genes. The emphasis of this tool is to find regions of sequence similarity between two different genes. These sequence alignments can yield clues about the structure and function of a novel sequence, and about its evolutionary history and homology with other sequences in the database.
BLAST has Automatic Translation BLASTX makes an automatic translation (in all 6 reading frames) of your DNA query sequence to compare with protein databanks. TBLASTN makes an automatic translation of an entire DNA database to compare with your protein query sequence. Only make a DNA-DNA search if you are working with a sequence that does not code for protein.
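To make "all 6 reading frames" concrete, here is a minimal Python sketch of a six-frame translation using the standard genetic code. It is an illustration only, not how BLASTX is implemented; the function names and the frame labels ("+1" to "-3") are assumptions chosen for this example, and any codon containing an ambiguous base is shown as "X".

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from the start of the sequence."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna: str) -> dict:
    """Translate three frames on each strand, keyed '+1'..'+3', '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Only one of the six frames usually gives a long stretch without stop codons, which is why searching all six against a protein databank is informative.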
A typical sequence ready for submission to BLAST
>THC2465887
GGCTGCGGAGGACCGACCGTCCCCACGCCTGCCGCCCCGCGACCCCGACCGCCAGCATGATCGCCGCGCAGCTCCTGGCCTATTACTTCACGGAGCTGAAGGATGACCAGGTCAAAAAGATTGACAAGTATCTCTATGCCATGCGGCTCTCCGATGAAACTCTCATAGATATCATGACTCGCTTCAGGAAGGAGATGAAGAATGGCCTCTCCCGGGATTTTAATCCAACAGCCACAGTCAAGATGTTGCCAACATTCGTAAGGTCCATTCCTGATGGCTCTGAAAAGGGAGATTTCATTGCCCTGGATCTTGGTGGGTCTTCCTTTCGAATTCTGCGGGTGCAAGTGAATCATGAGAAAAACCAGAATGTTCACATGGAGTCCGAGGTTTATGACACCCCAGAGAACATCGTGCACGGCAGTGGAAGCCAGCTTTTTGATCATGTTGCTGAGTGCCTGGGAGATTTCATGGAGAAAAGGAAGATCAAGGACAAGAAGTTACCTGTGGGATTCACGTTTTCTTTTCCTTGCCAACAATCCAAAATAGATGAGGCCATCCTGATCACCTGGACAAAGCGATTTAAAGCGAGCGGAGTGGAAGGAGCAGATGTGGTCAAACTGCTTAACAAAGCCATCAAAAAGCGAGGGGACTATGATGCCAACATCGTAGCTGTGGTGAA
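A sequence in this form (a ">" header line followed by the bases) is in FASTA format. As a minimal sketch, the Python below reads FASTA text into a dictionary of identifier to sequence. The record name `example_query` and its short sequence are made up for illustration, and the parser is deliberately simple rather than a replacement for an established library.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The identifier is the first word after ">".
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            # Sequence lines may be wrapped; collect and join them later.
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


if __name__ == "__main__":
    # Hypothetical record, wrapped over two lines as FASTA files often are.
    example = ">example_query made-up description\nGGCTGCGGAG\nGACCGACCGT\n"
    for identifier, sequence in parse_fasta(example).items():
        print(identifier, len(sequence), sequence)
```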
Understand the Statistics! BLAST produces an E-value for every match. This is the number of matches with a score at least as good that would be expected by chance in a database of that size; for small values it is approximately equal to the P value in a statistical test. A match is generally considered significant if the E-value < 0.05 (smaller numbers are more significant). Very low E-values (e-100) indicate homologs or identical genes. Moderate E-values indicate related genes. Long regions of moderate similarity are more important than short regions of high identity.
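The relationship between an E-value and a P value can be sketched in a few lines of Python. This assumes the usual Poisson model for chance matches, under which the probability of seeing at least one such match is P = 1 - exp(-E); the function name is illustrative.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """Probability of at least one chance match, assuming P = 1 - exp(-E)."""
    if e_value < 0:
        raise ValueError("E-value must be non-negative")
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


if __name__ == "__main__":
    for e in (1e-100, 0.001, 0.05, 1.0, 10.0):
        print(f"E = {e:g}  ->  P = {p_value_from_e_value(e):.6g}")
```

For small E the two numbers are nearly identical, but they diverge as E grows: an E-value can exceed 1, while a P value cannot.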
BLAST is Approximate BLAST makes similarity searches very quickly because it takes shortcuts: it looks for short, nearly identical “words” (11 bases). It also makes errors: it misses some important similarities, makes many incorrect matches, and is easily fooled by repeats or skewed composition.
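The word-based shortcut can be sketched as follows in Python. This is a simplified illustration rather than BLAST's actual code: it finds only exact shared words, uses a default word size of 11 to match the figure quoted above, and omits the extension and scoring steps that follow seeding. The function names are assumptions made for this example.

```python
from collections import defaultdict


def index_words(sequence: str, word_size: int = 11) -> dict:
    """Map every word of length word_size to its start positions."""
    index = defaultdict(list)
    for start in range(len(sequence) - word_size + 1):
        index[sequence[start:start + word_size]].append(start)
    return index


def find_seeds(query: str, subject: str, word_size: int = 11) -> list:
    """Return (query_pos, subject_pos) pairs where an exact word is shared."""
    index = index_words(subject.upper(), word_size)
    query = query.upper()
    seeds = []
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for s_pos in index.get(word, ()):
            seeds.append((q_pos, s_pos))
    return seeds


if __name__ == "__main__":
    print(find_seeds("ACGTACGTTAGC", "TTACGTACGTTAGCAA"))
```

The sketch also shows why matches get missed: two sequences that are related but never share a full word produce no seed at all, and a repeated word produces many seeds.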
Bad Genome Annotation Gene finding is at best only 90% accurate. New sequences are automatically annotated with BLAST scores. Bad annotations propagate. It’s going to take us 10-20 years or more to sort this mess out!
Conclusions We have only touched small parts of the elephant Trial and error (intelligently) is often your best tool Keep up with the main five sites, and you’ll have a pretty good idea of what is happening and available