Dr Justin Schonfeld - Bioinformatics Applications
Analysis of typical informatics workflow, extracting data, aligning data, identifying problems and uploading data to BOLD

Published in: Education, Technology

  • 94 fungi, 55 plant, 83 other
  • Does BOLD filter stop codons?
  • Transcript

    • 1. Using BOLD Data in Bioinformatics Workflows Dr. Justin Schonfeld Biodiversity Institute of Ontario
    • 2. DNA Barcodes
      • 166 Full Eukaryotic genomes
      • 2,471 Metazoan Mitochondrial Genomes
      • 1,444,076 Barcodes - ~118,000 species
      • DNA Barcodes represent an enormous resource for researchers of all types.
    • 3. Applications
      • Species Identification
      • Taxonomy
      • Building the Reference Library
      • Ecology
      • Proteomics
      • Comparative Genomics
      • Teaching
      • Music
    • 4. High level data flow
      • Diagram labels: Museums, Private collections, Regulatory Agencies, Researchers; CCDB; BOLD; GenBank; Mirrors; Educators, Researchers, Regulatory Agencies; Australian Museum
    • 5. Typical Informatics Workflow
      • Diagram labels (data): BOLD, Local Copy, Filtered Data, Aligned Data, Cleaned Data
      • Diagram labels (steps): Extract Data, Filter Data, Align Data, Identify Problematic Sequences, Analyze Data
    • 6. Extracting Data: BOLD Public
      • Easy to use
      • Flexible search tool
        • Search by taxonomic name, geographic region, collector, etc.
        • Example Searches: “Hymenoptera”, “Lepidoptera Canada”
    • 7. Extracting Data: BOLD Public
      • Provides data in .tsv, fasta, and xml formats.
      • Can select sequence data, trace files, specimen data, combined data.
    • 8. Extracting Data: web services
      • Provides data in tsv (tab separated value) and xml formats
      • Sequence data or full records
      • Can be used to provide a complete dump of all public BOLD data
      http://services.boldsystems.org/
    • 9. Extracting Data: web services
      • Working with the raw data allows for custom queries
      • Not all fields are available as search terms in BOLD Public
      • Requires scripting knowledge, or a lot of patience with Excel
      • Example: All plants above 2000 ft, etc.
    • 10. Filter Data
      • The Barcode data is collected from a wide variety of independent investigations
      • High degree of taxonomic bias
      • Tentative Names
      • Variable sequence quality
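One possible filter for tentative names and variable sequence quality is sketched below. The name markers (`sp.`, `cf.`, `aff.`, `nr.`, or any digit), the 500 bp minimum length, and the 1% ambiguity ceiling are illustrative assumptions, not BOLD rules.

```python
import re

# Markers that often signal a provisional identification (an assumed list).
TENTATIVE = re.compile(r"\b(sp|cf|aff|nr)\.|\d")


def is_tentative(name):
    """True if the species name is empty or looks provisional."""
    return not name or bool(TENTATIVE.search(name))


def ambiguous_fraction(seq):
    """Fraction of non-gap characters that are not A, C, G or T."""
    bases = [c for c in seq.upper() if c != "-"]
    if not bases:
        return 1.0
    return sum(c not in "ACGT" for c in bases) / len(bases)


def keep(name, seq, min_length=500, max_ambiguous=0.01):
    """Apply the name, length and ambiguity checks to one record."""
    length = len(seq.replace("-", ""))
    return (
        not is_tentative(name)
        and length >= min_length
        and ambiguous_fraction(seq) <= max_ambiguous
    )


records = [
    ("Apis mellifera", "ACGT" * 150),
    ("Apis sp. 1", "ACGT" * 150),
    ("Bombus terrestris", "ACGT" * 50),
    ("Bombus impatiens", "ACGT" * 140 + "N" * 40),
]
print([name for name, seq in records if keep(name, seq)])
```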
    • 11. Impact of Alignment
      • Diagram labels: Alignment; Build Phylogenetic Trees, Nearest Neighbor Analysis, Clustering, Distance Matrices
    • 12. Impact of Alignment
      • Diagram labels: Pairwise Sequence Alignment, Muscle Multiple Sequence Alignment
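To illustrate why downstream analyses such as distance matrices depend on the alignment, here is a minimal sketch of an uncorrected p-distance. The choice of p-distance and the function names are assumptions made for illustration; the slides do not say which distance measure was used.

```python
def p_distance(a, b):
    """Proportion of differing sites among columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        differing += x != y
    return differing / compared if compared else float("nan")


def distance_matrix(seqs):
    """Pairwise p-distances for a list of aligned sequences."""
    return [[p_distance(a, b) for b in seqs] for a in seqs]


well_aligned = p_distance("ACGTACGT", "ACGTACGT")
shifted_by_one = p_distance("ACGTACGT", "-ACGTACG")
print(well_aligned, shifted_by_one)
```

In this toy case the same sequence, shifted by a single column, goes from identical to completely different, which is the kind of artifact a poor alignment passes on to trees and clusters.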
    • 13. Aligning Animal Barcode Data
      • Even a gene as straightforward as CO1 can provide alignment challenges.
      • Diagram labels: CO1 Barcode, Short CO1, 3’ CO1, Full CO1 sequence, Barcode, 5’, 3’
    • 14. Aligning Barcode Data
      • Multiple Sequence Alignment
        • Accurate
        • Slow (a thousand sequences can take hours)
        • Trouble with variable sequences
      • Pairwise Sequence Alignment
        • Fast (Thousands of sequences in minutes)
        • Inconsistent placement of indels
        • Highly dependent on choosing the right reference
      • Parameters
        • Amino Acid vs Nucleotide
        • Gap Penalty
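As a sketch of pairwise alignment against a reference, the following implements textbook Needleman-Wunsch global alignment on nucleotides with a linear gap penalty. The scoring values (match 1, mismatch -1, gap -2) are arbitrary assumptions, and production tools use more elaborate scoring such as affine gaps.

```python
def global_align(query, reference, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (aligned_query, aligned_reference, score).
    """
    rows, cols = len(query) + 1, len(reference) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if query[i - 1] == reference[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align the two characters
                score[i - 1][j] + gap,       # gap in the reference
                score[i][j - 1] + gap,       # gap in the query
            )

    # Trace back from the bottom-right corner, preferring the diagonal on ties.
    aligned_q, aligned_r = [], []
    i, j = len(query), len(reference)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if query[i - 1] == reference[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                aligned_q.append(query[i - 1])
                aligned_r.append(reference[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_q.append(query[i - 1])
            aligned_r.append("-")
            i -= 1
        else:
            aligned_q.append("-")
            aligned_r.append(reference[j - 1])
            j -= 1
    return "".join(reversed(aligned_q)), "".join(reversed(aligned_r)), score[-1][-1]


aligned_query, aligned_reference, total = global_align("ACGTACGT", "ACGTTACGT")
print(aligned_query)
print(aligned_reference)
print(total)
```

The query lacks one of the two adjacent T's in the reference, and either position gives the same score. Which one receives the gap depends only on the tie-breaking rule in the traceback, a small instance of the inconsistent indel placement noted above.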
    • 15. Uploading your alignment to BOLD
      • Upload in fasta format
      • Edit sequence permission on the records
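A minimal sketch of writing aligned sequences out as FASTA text is shown below. The identifiers are hypothetical placeholders; the header format that BOLD expects for uploads is not given in the slides and should be checked before use.

```python
def to_fasta(records, width=60):
    """Format (identifier, sequence) pairs as FASTA text, wrapping sequence lines."""
    lines = []
    for identifier, seq in records:
        lines.append(f">{identifier}")
        # Slice rather than word-wrap, so gap characters never affect line breaks.
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "\n".join(lines) + "\n"


# Hypothetical record identifiers, used only for illustration.
fasta_text = to_fasta([("REC1", "ACGT" * 20), ("REC2", "AC--GT")])
print(fasta_text, end="")
```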
    • 16. Identifying Problems
      • Stop codons – Automatically annotated for coding regions
        • Even stop codons can be tricky
      • Frame shifts
      • Ambiguous characters
      • Chimeric sequences
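One reason stop codons can be tricky is that the set of stop codons depends on the genetic code. The sketch below scans a single reading frame using the stop codons of three NCBI translation tables; the function names, and the assumption that the caller already knows the reading frame, are illustrative.

```python
# Stop codons by NCBI translation table number.
STOP_CODONS = {
    1: {"TAA", "TAG", "TGA"},         # standard code
    2: {"TAA", "TAG", "AGA", "AGG"},  # vertebrate mitochondrial
    5: {"TAA", "TAG"},                # invertebrate mitochondrial
}


def find_stops(seq, table=5, frame=0):
    """Return 0-based nucleotide positions of in-frame stop codons."""
    seq = seq.upper()
    stops = STOP_CODONS[table]
    return [i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in stops]


def count_ambiguous(seq):
    """Count characters that are neither A, C, G, T nor a gap."""
    return sum(base not in "ACGT-" for base in seq.upper())


example = "ATGTGAAGATAA"  # codons: ATG TGA AGA TAA
for table in (1, 2, 5):
    print(table, find_stops(example, table=table))
print(count_ambiguous("ACGTNRY-"))
```

The same twelve bases contain two, two, or one stop codon depending on the table, and at different positions, so a sequence flagged under the wrong code may be perfectly valid.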
    • 17. Identifying Problems: Frame Shifts
      • Frame-shifts in the middle of the sequence are disruptive and easy to spot
      • Frame-shifts at the ends of the sequence are more challenging
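A simple heuristic for spotting frame-shift candidates in a pairwise alignment against an in-frame reference is to flag internal indel runs whose length is not a multiple of three. This is a sketch under that assumption, not the method used in the talk. It deliberately ignores terminal gaps in the query, which is why shifts near the ends remain hard to detect this way.

```python
import re


def frame_shift_candidates(aligned_query, aligned_reference):
    """Return sorted (start, length, kind) for internal indel runs not divisible by 3."""
    if len(aligned_query) != len(aligned_reference):
        raise ValueError("aligned sequences must have equal length")
    # Columns covered by the query once leading and trailing gaps are ignored.
    start = len(aligned_query) - len(aligned_query.lstrip("-"))
    end = start + len(aligned_query.strip("-"))
    found = []
    for kind, seq in (("deletion", aligned_query), ("insertion", aligned_reference)):
        for run in re.finditer(r"-+", seq):
            internal = start <= run.start() and run.end() <= end
            if internal and len(run.group()) % 3 != 0:
                found.append((run.start(), len(run.group()), kind))
    return sorted(found)


query = "ATGGCC-TTAAAGGG---CCC"
reference = "ATGGCCATTAAAGGGTTTCCC"
print(frame_shift_candidates(query, reference))
```

The three-base gap keeps the reading frame and is not reported; only the single-base gap is.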
    • 18. Identifying Problems: Chimeric Sequences
      • Identify change points
      • Split the sequence at the point of discontinuity
      • Blast each part
      • Diagram labels: Hymenoptera, Lepidoptera, Chimera
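The change-point and split steps can be sketched as follows, assuming the query has already been aligned, without gaps, to two candidate parent sequences of the same length. The function name and the toy sequences are hypothetical, and the BLAST step is not shown.

```python
def best_change_point(query, parent_a, parent_b):
    """Return (k, matches) such that query[:k] best matches parent_a and query[k:] parent_b."""
    if not (len(query) == len(parent_a) == len(parent_b)):
        raise ValueError("all three sequences must have the same length")
    # With k = 0 the whole query is compared against parent_b.
    current = sum(q == b for q, b in zip(query, parent_b))
    best_k, best = 0, current
    for k in range(1, len(query) + 1):
        i = k - 1  # this site moves from the parent_b side to the parent_a side
        current += (query[i] == parent_a[i]) - (query[i] == parent_b[i])
        if current > best:
            best_k, best = k, current
    return best_k, best


parent_a = "AAAAAAAAAA"
parent_b = "CCCCCCCCCC"
query = "AAAACCCCCC"

k, matches = best_change_point(query, parent_a, parent_b)
left, right = query[:k], query[k:]  # the two parts to search separately
print(k, matches, left, right)
```

A split at 0 or at the full length means the query is best explained by a single parent, which is the expected outcome for a non-chimeric sequence.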
    • 19. Cleaning Data: Updating BOLD
      • BOLD is curated by the community
        • Re-upload sequences
        • Delete sequences
        • Annotate sequences
        • Flag sequences
      • Diagram labels: BOLD, GenBank, Mirrors, Educators, Researchers, Regulatory Agencies
    • 20. Example Workflow: Occurrence of Indels
      • Download public BOLD Hymenoptera records using web services
      • Select sequences with full taxonomy
      • Align sequences using MAFFT, Muscle, Transalign
      • Select one representative per species
      • Remove problematic sequences
      • Tree
      • Map sequences onto phylogeny
    • 21. Example Workflow: Code shifts
      • Download public BOLD Hymenoptera records using web services
      • 80,000 sequences – Align pairwise
      • Scan sequences for code shifts
      • Remove problematic sequences
      • Analyze results
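The "one representative per species" step could be sketched as below. Choosing the longest ungapped sequence is an assumed criterion, since the slides do not say how representatives were picked, and the record identifiers are hypothetical.

```python
def one_per_species(records):
    """Keep the longest ungapped sequence per species.

    records: iterable of (record_id, species, sequence).
    Returns {species: (record_id, sequence)}; ties keep the first record seen.
    """
    best = {}
    for record_id, species, seq in records:
        length = len(seq.replace("-", ""))
        if species not in best or length > best[species][0]:
            best[species] = (length, record_id, seq)
    return {sp: (rid, seq) for sp, (_, rid, seq) in best.items()}


records = [
    ("REC1", "Apis mellifera", "ACGTAC"),
    ("REC2", "Apis mellifera", "ACGTACGT"),
    ("REC3", "Bombus terrestris", "AC--GT"),
    ("REC4", "Bombus terrestris", "ACGT"),
]
representatives = one_per_species(records)
print(representatives)
```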
    • 22. Acknowledgements
      • Paul Hebert
      • Sujeevan Ratnasingham
      • The BOLD Team