Dr Justin Schonfeld - Bioinformatics Applications


Analysis of a typical informatics workflow: extracting data, aligning data, identifying problems, and uploading data to BOLD

Published in: Education, Technology

  • 94 fungi, 55 plant, 83 other
  • Does BOLD filter stop codons?

    1. Using BOLD Data in Bioinformatics Workflows
       Dr. Justin Schonfeld, Biodiversity Institute of Ontario
    2. DNA Barcodes
       • 166 full eukaryotic genomes
       • 2,471 metazoan mitochondrial genomes
       • 1,444,076 barcodes, ~118,000 species
       • DNA barcodes represent an enormous resource for researchers of all types.
    3. Applications
       • Species Identification
       • Taxonomy
       • Building the Reference Library
       • Ecology
       • Proteomics
       • Comparative Genomics
       • Teaching
       • Music
    4. High level data flow
       [Diagram; labels: Museums, Private collections, Regulatory Agencies, Researchers; CCDB; BOLD; GenBank; Mirrors; Educators, Researchers, Regulatory Agencies; Australian Museum]
    5. Typical Informatics Workflow
       [Diagram; steps: Extract Data, Filter Data, Align Data, Identify Problematic Sequences, Analyze Data; data states: BOLD, Local Copy, Filtered Data, Aligned Data, Cleaned Data]
    6. Extracting Data: BOLD Public
       • Easy to use
       • Flexible search tool
           • Search by taxonomic name, geographic region, collector, etc.
           • Example searches: “Hymenoptera”, “Lepidoptera Canada”
    7. Extracting Data: BOLD Public
       • Provides data in TSV, FASTA, and XML formats.
       • Can select sequence data, trace files, specimen data, or combined data.
    8. Extracting Data: web services
       • Provides data in TSV (tab-separated values) and XML formats
       • Sequence data or full records
       • Can be used to provide a complete dump of all public BOLD data
       http://services.boldsystems.org/
    9. Extracting Data: web services
       • Working with the raw data allows for custom queries
       • Not all fields are available as search terms in BOLD Public
       • Requires scripting knowledge, or a lot of patience with Excel
       • Example: all plants above 2000 ft, etc.
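A custom query like the "all plants above 2000 ft" example can be sketched in a few lines of Python against a local TSV copy. The column names `processid`, `phylum_name` and `elev`, the sample rows, and the assumption that elevation is stored in metres are all illustrative guesses about the export layout, not details taken from the slides; check them against the headers of your own download.

```python
import csv
import io

FEET_TO_METRES = 0.3048


def records_above(tsv_text, min_elev_ft, taxon_column, taxa, elev_column="elev"):
    """Return rows whose taxon is in `taxa` and whose elevation exceeds min_elev_ft.

    Assumes elevation is recorded in metres (an assumption, see above).
    Rows with a missing or non-numeric elevation are skipped, not guessed at.
    """
    threshold_m = min_elev_ft * FEET_TO_METRES
    selected = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if row.get(taxon_column) not in taxa:
            continue
        try:
            elevation_m = float(row.get(elev_column) or "")
        except ValueError:
            continue
        if elevation_m > threshold_m:
            selected.append(row)
    return selected


# Tiny made-up sample; a real export has many more columns and rows.
SAMPLE = (
    "processid\tphylum_name\telev\n"
    "EX001\tMagnoliophyta\t850\n"
    "EX002\tMagnoliophyta\t120\n"
    "EX003\tArthropoda\t900\n"
    "EX004\tMagnoliophyta\t\n"
)

high_plants = records_above(SAMPLE, 2000, "phylum_name", {"Magnoliophyta"})
print([row["processid"] for row in high_plants])
```

Skipping rows without a usable elevation is a deliberate choice here: treating a blank as zero would silently misclassify records.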
    10. Filter Data
        • The barcode data is collected from a wide variety of independent investigations
        • High degree of taxonomic bias
        • Tentative names
        • Variable sequence quality
    11. Impact of Alignment
        [Diagram; labels: Alignment; Build Phylogenetic Trees, Nearest Neighbor Analysis, Clustering, Distance Matrices]
    12. Impact of Alignment
        [Figure; panel labels: Pairwise Sequence Alignment; Muscle Multiple Sequence Alignment]
    13. Aligning Animal Barcode Data
        [Diagram; labels: Full CO1 sequence (5’ to 3’), CO1 Barcode, Short CO1, 3’ CO1, Barcode]
        Even a gene as straightforward as CO1 can provide alignment challenges.
    14. Aligning Barcode Data
        • Multiple Sequence Alignment
            • Accurate
            • Slow (a thousand sequences can take hours)
            • Trouble with variable sequences
        • Pairwise Sequence Alignment
            • Fast (thousands of sequences in minutes)
            • Inconsistent placement of indels
            • Highly dependent on choosing the right reference
        • Parameters
            • Amino acid vs nucleotide
            • Gap penalty
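To make the pairwise option and the gap penalty parameter concrete, here is a minimal sketch of global pairwise alignment (Needleman-Wunsch) with a linear gap penalty. It is a textbook illustration, not the method used by BOLD or by any tool named in the slides; the scoring values and the toy sequences are assumptions chosen for readability.

```python
def global_align(query, reference, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (aligned_query, aligned_reference, score).
    """
    rows, cols = len(query) + 1, len(reference) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if query[i - 1] == reference[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align two bases
                score[i - 1][j] + gap,       # gap in the reference
                score[i][j - 1] + gap,       # gap in the query
            )

    # Trace back from the bottom-right corner, preferring the diagonal.
    aligned_q, aligned_r = [], []
    i, j = len(query), len(reference)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if query[i - 1] == reference[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                aligned_q.append(query[i - 1])
                aligned_r.append(reference[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_q.append(query[i - 1])
            aligned_r.append("-")
            i -= 1
        else:
            aligned_q.append("-")
            aligned_r.append(reference[j - 1])
            j -= 1
    return "".join(reversed(aligned_q)), "".join(reversed(aligned_r)), score[-1][-1]


# Toy example: the reference has one extra T inside a run of T's.
aligned_query, aligned_reference, total = global_align("ACGTACGT", "ACGTTACGT")
print(aligned_query)
print(aligned_reference)
print(total)
```

The gap in this toy case could sit next to either T with the same score, and which one is reported depends only on the traceback tie-breaking rule. That is one small-scale source of the inconsistent indel placement the slide mentions.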
    15. Uploading your alignment to BOLD
        • Upload in FASTA format
        • Edit sequence permission on the records
    16. Identifying Problems
        • Stop codons: automatically annotated for coding regions
            • Even stop codons can be tricky
        • Frame shifts
        • Ambiguous characters
        • Chimeric sequences
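One reason stop codons can be tricky is that the set of stop codons depends on the genetic code in use. The sketch below scans a single reading frame under three NCBI translation tables and also counts ambiguous characters. The function names, the choice of frame 0, and the toy sequence are assumptions for illustration; this is not how BOLD performs its own annotation.

```python
# Stop codons under three NCBI translation tables.
STOP_CODONS = {
    "standard": {"TAA", "TAG", "TGA"},                # table 1
    "vertebrate_mito": {"TAA", "TAG", "AGA", "AGG"},  # table 2
    "invertebrate_mito": {"TAA", "TAG"},              # table 5
}


def find_stop_codons(sequence, code="invertebrate_mito", frame=0):
    """Return 0-based start positions of in-frame stop codons.

    Expects an ungapped nucleotide sequence; `frame` is 0, 1 or 2.
    """
    stops = STOP_CODONS[code]
    seq = sequence.upper()
    return [
        pos
        for pos in range(frame, len(seq) - 2, 3)
        if seq[pos:pos + 3] in stops
    ]


def count_ambiguous(sequence):
    """Count characters that are neither A, C, G, T nor an alignment gap."""
    return sum(1 for base in sequence.upper() if base not in "ACGT-")


# Toy sequence with codons ATG TGA GGA AGA TAA.
toy = "ATGTGAGGAAGATAA"
for code in STOP_CODONS:
    print(code, find_stop_codons(toy, code))
print(count_ambiguous("ACGNNRT-"))
```

The same five codons yield different stop positions under each table, so the genetic code should be chosen per marker and taxon before any sequence is flagged.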
    17. Identifying Problems: Frame Shifts
        • Frame shifts in the middle of the sequence are disruptive and easy to spot
        • Frame shifts at the ends of the sequence are more challenging
    18. Identifying Problems: Chimeric Sequences
        • Identify change points
        • Split the sequence at the point of discontinuity
        • BLAST each part
        [Diagram; labels: Hymenoptera, Lepidoptera, Chimera]
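The change-point step can be illustrated with a deliberately simplified stand-in: instead of BLAST, the sketch compares the query against two candidate parent sequences that are assumed to be already aligned to it at the same length, and picks the split that maximizes agreement on each side. The function name and the toy sequences are hypothetical; real data would still need the BLAST step from the slide to identify each part.

```python
def best_change_point(query, ref_a, ref_b):
    """Find the split k where query[:k] best matches ref_a and query[k:] ref_b.

    All three sequences must already be aligned to the same length.
    Returns (k, matching_positions); k == 0 or k == len(query) means the
    query is best explained by a single reference, i.e. no chimera signal.
    """
    if not (len(query) == len(ref_a) == len(ref_b)):
        raise ValueError("sequences must be aligned to the same length")
    match_a = [q == a for q, a in zip(query, ref_a)]
    match_b = [q == b for q, b in zip(query, ref_b)]

    # Start with k = 0: the whole query is attributed to ref_b.
    current = sum(match_b)
    best_k, best_score = 0, current
    for k in range(1, len(query) + 1):
        # Position k - 1 moves from the ref_b side to the ref_a side.
        current += match_a[k - 1] - match_b[k - 1]
        if current > best_score:
            best_k, best_score = k, current
    return best_k, best_score


# Toy example: first six bases follow one parent, the rest the other.
parent_a = "AAAAAAAAAA"
parent_b = "CCCCCCCCCC"
chimera = "AAAAAACCCC"

split, agreement = best_change_point(chimera, parent_a, parent_b)
left_part, right_part = chimera[:split], chimera[split:]
print(split, agreement, left_part, right_part)
```

Each of the two parts can then be searched separately, as the slide suggests, to confirm that they come from different taxa.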
    19. Cleaning Data: Updating BOLD
        • BOLD is curated by the community
            • Re-upload sequences
            • Delete sequences
            • Annotate sequences
            • Flag sequences
        [Diagram; labels: BOLD, GenBank, Mirrors, Educators, Researchers, Regulatory Agencies]
    20. Example Workflow: Occurrence of Indels
        [Flow diagram; steps as extracted:]
        • Download public BOLD Hymenoptera records using web services
        • Select sequences with full taxonomy
        • Align sequences using MAFFT, Muscle, Transalign
        • Select one representative per species
        • Remove problematic sequences
        • Tree
        • Map sequences onto phylogeny
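Two of the selection steps in this workflow, keeping only records with full taxonomy and keeping one representative per species, can be sketched as follows. The slide does not say how the representative was chosen, so picking the longest ungapped sequence is an assumption, as are the dictionary keys and the made-up records.

```python
def ungapped_length(sequence):
    """Length of a sequence with alignment gaps removed."""
    return len(sequence.replace("-", ""))


def select_representatives(records, ranks=("order", "family", "genus", "species")):
    """Keep records with every rank filled in, then one record per species.

    The representative is the record with the longest ungapped sequence
    (an assumed rule; the first such record wins ties).
    """
    best = {}
    for record in records:
        if not all(record.get(rank) for rank in ranks):
            continue  # incomplete taxonomy
        current = best.get(record["species"])
        if current is None or (
            ungapped_length(record["sequence"]) > ungapped_length(current["sequence"])
        ):
            best[record["species"]] = record
    return list(best.values())


# Made-up records for illustration only.
records = [
    {"id": "A1", "order": "Hymenoptera", "family": "Apidae", "genus": "Apis",
     "species": "Apis mellifera", "sequence": "ACGTAC"},
    {"id": "A2", "order": "Hymenoptera", "family": "Apidae", "genus": "Apis",
     "species": "Apis mellifera", "sequence": "ACGTACGTAC"},
    {"id": "B1", "order": "Hymenoptera", "family": "", "genus": "",
     "species": "", "sequence": "ACGTACGTACGT"},
    {"id": "C1", "order": "Hymenoptera", "family": "Formicidae", "genus": "Formica",
     "species": "Formica rufa", "sequence": "AC--GT"},
]

representatives = select_representatives(records)
print(sorted(record["id"] for record in representatives))
```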
    21. Example Workflow: Code shifts
        [Flow diagram; steps as extracted:]
        • Download public BOLD Hymenoptera records using web services
        • 80,000 sequences: align pairwise
        • Scan sequences for code shifts
        • Remove problematic sequences
        • Analyze results
    22. Acknowledgements
        • Paul Hebert
        • Sujeevan Ratnasingham
        • The BOLD Team