Dr Justin Schonfeld - Bioinformatics Applications

Analysis of a typical informatics workflow: extracting data, aligning data, identifying problems, and uploading data to BOLD

  • 94 fungi, 55 plant, 83 other
  • Does BOLD filter stop codons?

Transcript

  • 1. Using BOLD Data in Bioinformatics Workflows
    • Dr. Justin Schonfeld
    • Biodiversity Institute of Ontario
  • 2. DNA Barcodes
    • 166 Full Eukaryotic genomes
    • 2,471 Metazoan Mitochondrial Genomes
    • 1,444,076 Barcodes - ~118,000 species
    • DNA Barcodes represent an enormous resource for researchers of all types.
  • 3. Applications
    • Species Identification
    • Taxonomy
    • Building the Reference Library
    • Ecology
    • Proteomics
    • Comparative Genomics
    • Teaching
    • Music
  • 4. High level data flow
    • Data sources: Museums, Private collections, Regulatory Agencies, Researchers
    • Data stores: CCDB, BOLD, GenBank, Mirrors
    • Data users: Educators, Researchers, Regulatory Agencies
    • Australian Museum
  • 5. Typical Informatics Workflow
    • Steps: Extract Data, Filter Data, Align Data, Identify Problematic Sequences, Analyze Data
    • Data states: BOLD, Local Copy, Filtered Data, Aligned Data, Cleaned Data
  • 6. Extracting Data: BOLD Public
    • Easy to use
    • Flexible search tool
      • Search by taxonomic name, geographic region, collector, etc.
      • Example Searches: “Hymenoptera”, “Lepidoptera Canada”
  • 7. Extracting Data: BOLD Public
    • Provides data in TSV, FASTA, and XML formats.
    • Can select sequence data, trace files, specimen data, or combined data.
  • 8. Extracting Data: web services
    • Provides data in TSV (tab-separated values) and XML formats
    • Sequence data or full records
    • Can be used to provide a complete dump of all public BOLD data
    http://services.boldsystems.org/
  • 9. Extracting Data: web services
    • Working with the raw data allows for custom queries
    • Not all fields are available as search terms in BOLD Public
    • Requires scripting knowledge, or a lot of patience with Excel
    • Example: all plants collected above 2000 ft, etc.
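
A custom query of this kind can be sketched in a few lines of Python once the raw TSV data is saved locally. This is only an illustration: the column names `record_id`, `kingdom`, and `elev_m` are hypothetical stand-ins, so check the header of the file you actually download.

```python
import csv
import io

FEET_PER_METRE = 3.28084


def plants_above(tsv_text, min_feet=2000.0):
    """Return plant records collected above min_feet.

    Assumes hypothetical columns 'kingdom' and 'elev_m' (elevation in metres).
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = []
    for row in reader:
        elevation = (row.get("elev_m") or "").strip()
        if row.get("kingdom") != "Plantae" or not elevation:
            continue  # skip non-plants and records without an elevation
        try:
            feet = float(elevation) * FEET_PER_METRE
        except ValueError:
            continue  # skip elevations that are not plain numbers
        if feet > min_feet:
            hits.append(row)
    return hits


# Tiny made-up sample standing in for a downloaded TSV file.
SAMPLE = (
    "record_id\tkingdom\telev_m\n"
    "A1\tPlantae\t750\n"
    "A2\tPlantae\t120\n"
    "A3\tAnimalia\t900\n"
    "A4\tPlantae\t\n"
)

if __name__ == "__main__":
    for record in plants_above(SAMPLE):
        print(record["record_id"], record["elev_m"])
```

Records with a missing or non-numeric elevation are skipped rather than guessed at, which is usually the safer default for data gathered by many independent projects.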
  • 10. Filter Data
    • Barcode data is collected from a wide variety of independent investigations
    • High degree of taxonomic bias
    • Tentative Names
    • Variable sequence quality
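
One way to act on these points is a simple record filter, sketched below. The thresholds (`min_length`, `max_ambiguous_fraction`) and the pattern used to recognise a tentative name are assumptions chosen for illustration, not BOLD rules.

```python
import re

# Assumed markers of a tentative or interim name: "sp.", "cf.", "aff.", "nr.",
# or any digit in the name (e.g. "Apis sp. 1").
TENTATIVE = re.compile(r"\b(sp|cf|aff|nr)\.|\d")


def keep_record(species, sequence, min_length=500, max_ambiguous_fraction=0.01):
    """Return True if a record has a full binomial name and a usable sequence."""
    if not species or len(species.split()) < 2:
        return False  # no species-level identification
    if TENTATIVE.search(species):
        return False  # tentative name
    bases = sequence.replace("-", "").upper()
    if len(bases) < min_length:
        return False  # too short to be a useful barcode
    ambiguous = sum(1 for base in bases if base not in "ACGT")
    return ambiguous / len(bases) <= max_ambiguous_fraction
```

Filtering does not remove taxonomic bias; that needs a sampling decision such as keeping one representative per species.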
  • 11. Impact of Alignment
    • Alignment feeds into:
      • Build Phylogenetic Trees
      • Nearest Neighbor Analysis
      • Clustering
      • Distance Matrices
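
To make the dependence on alignment concrete, here is a minimal sketch of a distance matrix and a nearest neighbor lookup computed from already aligned sequences. It uses the uncorrected p-distance, which is an assumption made for simplicity; the slides do not name a distance model.

```python
def p_distance(a, b):
    """Proportion of differing sites among columns where both have A, C, G, or T."""
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            differing += x != y
    return differing / compared if compared else float("nan")


def distance_matrix(aligned):
    """aligned: dict mapping name -> aligned sequence (all the same length)."""
    names = list(aligned)
    return {m: {n: p_distance(aligned[m], aligned[n]) for n in names} for m in names}


def nearest_neighbor(matrix, name):
    """Name of the closest other sequence in the matrix."""
    others = {n: d for n, d in matrix[name].items() if n != name}
    return min(others, key=others.get)
```

Because every comparison is made column by column, a misplaced gap changes the distances, and with them any tree or clustering built on the matrix.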
  • 12. Impact of Alignment
    • Figure panels: Pairwise Sequence Alignment; Muscle Multiple Sequence Alignment
  • 13. Aligning Animal Barcode Data
    • Even a gene as straightforward as CO1 can provide alignment challenges.
    • Diagram labels: Full CO1 sequence (5’ to 3’), CO1 Barcode, Short CO1, 3’ CO1
  • 14. Aligning Barcode Data
    • Multiple Sequence Alignment
      • Accurate
      • Slow (a thousand sequences can take hours)
      • Trouble with variable sequences
    • Pairwise Sequence Alignment
      • Fast (Thousands of sequences in minutes)
      • Inconsistent placement of indels
      • Highly dependent on choosing the right reference
    • Parameters
      • Amino Acid vs Nucleotide
      • Gap Penalty
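
The role of the reference and the gap penalty in pairwise alignment can be seen in a textbook Needleman-Wunsch global alignment, sketched below. This is a generic illustration with assumed scoring values, not the aligner used for the workflows in these slides.

```python
def global_align(query, reference, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (aligned_query, aligned_reference, score).
    """
    rows, cols = len(query) + 1, len(reference) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == reference[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two characters
                score[i - 1][j] + gap,      # gap in the reference
                score[i][j - 1] + gap,      # gap in the query
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_q, out_r = [], []
    i, j = len(query), len(reference)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if query[i - 1] == reference[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_q.append(query[i - 1])
                out_r.append(reference[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_q.append(query[i - 1])
            out_r.append("-")
            i -= 1
        else:
            out_q.append("-")
            out_r.append(reference[j - 1])
            j -= 1
    return "".join(reversed(out_q)), "".join(reversed(out_r)), score[-1][-1]
```

When several alignments tie for the best score, the traceback order decides where the indel lands, which is one source of the inconsistent indel placement noted above.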
  • 15. Uploading your alignment to BOLD
    • Upload in FASTA format
    • Requires edit-sequence permission on the records
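
Writing an alignment out as FASTA takes only a few lines. The sketch below assumes each record is an (identifier, aligned sequence) pair; which identifier BOLD expects in the header line is not stated here, so the IDs in the example are hypothetical.

```python
def to_fasta(records, width=60):
    """Format (identifier, sequence) pairs as FASTA text, wrapping sequence lines."""
    lines = []
    for identifier, sequence in records:
        lines.append(">" + identifier)
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    print(to_fasta([("SAMPLE001", "ACGT-ACGT"), ("SAMPLE002", "ACGTTACGT")]), end="")
```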
  • 16. Identifying Problems
    • Stop codons – Automatically annotated for coding regions
      • Even stop codons can be tricky
    • Frame shifts
    • Ambiguous characters
    • Chimeric sequences
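
One reason stop codons can be tricky is that the set of stop codons depends on the genetic code: a codon that ends translation under the standard code may encode an amino acid in a mitochondrial code. The sketch below checks a gap-free sequence, assumed to be in the given reading frame, against three NCBI translation tables, and also lists ambiguous characters.

```python
# Stop codons under three NCBI translation tables.
STOP_CODONS = {
    "standard": {"TAA", "TAG", "TGA"},                # table 1
    "vertebrate_mito": {"TAA", "TAG", "AGA", "AGG"},  # table 2
    "invertebrate_mito": {"TAA", "TAG"},              # table 5
}


def stop_codon_positions(sequence, code="invertebrate_mito", frame=0):
    """Start positions (0-based) of in-frame stop codons in a gap-free sequence."""
    stops = STOP_CODONS[code]
    seq = sequence.upper()
    return [i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in stops]


def ambiguous_positions(sequence):
    """Positions (0-based) of characters other than A, C, G, T, or a gap."""
    return [i for i, base in enumerate(sequence.upper()) if base not in "ACGT-"]
```

For example, in `ATGTGAAGATAA` the codon `TGA` is a stop only under the standard code and `AGA` only under the vertebrate mitochondrial code, while `TAA` is a stop under all three.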
  • 17. Identifying Problems: Frame Shifts
    • Frame-shifts in the middle of the sequence are disruptive and easy to spot
    • Frame-shifts at the ends of the sequence are more challenging
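
A rough way to see the difference is to count stop codons in each of the three forward reading frames. The sketch below assumes the invertebrate mitochondrial code (stops `TAA` and `TAG`); the test sequence is artificial and chosen only to make the effect visible.

```python
INVERTEBRATE_MITO_STOPS = {"TAA", "TAG"}


def stops_per_frame(sequence, stops=INVERTEBRATE_MITO_STOPS):
    """Number of stop codons in each of the three forward reading frames."""
    seq = sequence.replace("-", "").upper()
    return [
        sum(seq[i:i + 3] in stops for i in range(frame, len(seq) - 2, 3))
        for frame in range(3)
    ]


def open_frames(sequence, stops=INVERTEBRATE_MITO_STOPS):
    """Reading frames that contain no stop codon."""
    return [f for f, count in enumerate(stops_per_frame(sequence, stops)) if count == 0]


if __name__ == "__main__":
    intact = "CTAACTAGC" * 2                      # artificial sequence, frame 0 is open
    middle = intact[:9] + "G" + intact[9:]        # one base inserted mid-sequence
    near_end = intact[:16] + "G" + intact[16:]    # one base inserted near the 3' end
    for name, seq in [("intact", intact), ("middle", middle), ("near end", near_end)]:
        print(name, stops_per_frame(seq), open_frames(seq))
```

In this example the mid-sequence insertion leaves no open frame, while the insertion near the end leaves frame 0 open, so a stop-codon check alone would miss it.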
  • 18. Identifying Problems: Chimeric Sequences
    • Identify change points
    • Split the sequence at the point of discontinuity
    • Blast each part
    Diagram labels: Hymenoptera, Hymenoptera, Lepidoptera, Chimera, Lepidoptera
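
The change-point step can be sketched as follows, assuming the suspect sequence is already aligned to two candidate references of equal length (`ref_a` and `ref_b` are hypothetical inputs, for example a Hymenoptera and a Lepidoptera barcode). The function finds the split that best explains the query as `ref_a` on the left and `ref_b` on the right; the BLAST search of each part is not shown.

```python
def best_change_point(query, ref_a, ref_b):
    """Find the split where query[:split] best matches ref_a and query[split:] ref_b.

    All three sequences must be aligned to the same length.
    Returns (split, matching_sites). A split of 0 or len(query) means no chimera
    in the ref_a-then-ref_b direction.
    """
    if not (len(query) == len(ref_a) == len(ref_b)):
        raise ValueError("sequences must be aligned to the same length")
    match_a = [q == a for q, a in zip(query, ref_a)]
    match_b = [q == b for q, b in zip(query, ref_b)]
    score = sum(match_b)  # split at 0: the whole query is compared with ref_b
    best_split, best_score = 0, score
    for split in range(1, len(query) + 1):
        # Moving the split one site right hands that site from ref_b to ref_a.
        score += match_a[split - 1] - match_b[split - 1]
        if score > best_score:
            best_split, best_score = split, score
    return best_split, best_score


if __name__ == "__main__":
    ref_a, ref_b = "AAAAAAAAAA", "CCCCCCCCCC"
    query = "AAAAACCCCC"
    split, _ = best_change_point(query, ref_a, ref_b)
    print(query[:split], query[split:])  # the two parts to search separately
```

To test the opposite orientation, call the function again with the two references swapped.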
  • 19. Cleaning Data: Updating BOLD
    • BOLD is curated by the community
      • Re-upload sequences
      • Delete sequences
      • Annotate sequences
      • Flag sequences
    Diagram labels: BOLD, GenBank, Mirrors, Educators, Researchers, Regulatory Agencies
  • 20. Example Workflow: Occurrence of Indels
    • Download public BOLD Hymenoptera records using web services
    • Select sequences with full taxonomy
    • Align sequences using MAFFT, Muscle, Transalign
    • Select one representative per species
    • Remove problematic sequences
    • Tree
    • Map sequences onto phylogeny
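
The step of selecting one representative per species can be sketched as below. The slide does not say how the representative is chosen; keeping the longest ungapped sequence is an assumption made for this example.

```python
def one_per_species(records):
    """Keep one record per species: the one with the longest ungapped sequence.

    records: iterable of (record_id, species, sequence) tuples.
    Returns a dict mapping species -> (record_id, sequence).
    Ties keep the record seen first.
    """
    best = {}
    for record_id, species, sequence in records:
        length = len(sequence.replace("-", ""))
        if species not in best or length > best[species][0]:
            best[species] = (length, record_id, sequence)
    return {species: (rid, seq) for species, (_, rid, seq) in best.items()}
```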
  • 21. Example Workflow: Code shifts
    • Download public BOLD Hymenoptera records using web services
    • 80,000 sequences – Align pairwise
    • Scan sequences for code shifts
    • Remove problematic sequences
    • Analyze results
  • 22. Acknowledgements
    • Paul Hebert
    • Sujeevan Ratnasingham
    • The BOLD Team