Dr David Schindel and Mike Trizna - BOL Data Portal


Using the BOL Data Portal in conjunction with BOLD, its capabilities, and a case study: the Smithsonian frozen bird tissue project

Published in: Education, Technology

  1. 1. The Barcode of Life Data Portal( Dr. David E. Schindel, Executive Secretary; Michael Trizna, Database Specialist; Consortium for the Barcode of Life (CBOL), Smithsonian Institution, Washington, DC; and
  2. 2. Contents of Presentation – Crowd-sourced open source software – How does Data Portal complement BOLD and GenBank? – Data Portal capabilities – Case Study: Smithsonian frozen bird tissue project
  3. 3. An Experiment in Museum Tissue Mining and Fast Data Release Tissue sampling winter/spring Sequencing completed in September Sequence quality control in October Taxonomic checking in early November – Obvious errors removed – Minor discrepancies remain Data released for Adelaide Conference – Crowd-sourced annotation by community – Will data be mis-used?
  4. 4. Unique Data Portal Capabilities Creating customized datasets from public and/or your private data Online library of standard datasets Support sharing within project teams using Connect IDs, easy link to Working Groups Running different identification analyses based on different methodologies: – Standard sequence input using FASTA format – Use standard or customized datasets
  5. 5. Barcode Aggregator 727,170 public records
  6. 6. Summary Statistics per Family
  7. 7. Creating Customized Datasets
  8. 8. Existing Data Analysis Packages LIST of packages – BLOG – BRONX – Kernel – CAOS – USEARCH – BLAST Output of identification routines as probabilities of assignment
  9. 9. Data Analysis Methods Session New packages presented Friday afternoon: – Damon Little: Automatic Plants Barcode pipeline (from raw traces to trimmed/edited sequences) – Ka Hou Chu: Composite Vector Method (profile trees for faster alignment and tree-based analysis) – Alain Franc: Matching Next Generation results to Sanger-based reference records
  10. 10. Sample output
  11. 11. CONNECT for Data Portal Collaboration
  12. 12. The USNM Bird Project USNM Division of Birds frozen tissue collection: – 21,104 specimens, 2,512 species Which new ones to sample/barcode? Public records for birds: – All public bird COI records: 10,967 – All BARCODE records in GenBank: 8,419 – BARCODE with taxonomic names: 7,965 – BARCODE, name and 2 traces: 2,388
  13. 13. Moving Data Among BOLD, GenBank, Data Portal [Diagram; box labels:] USNM (KE-Emu Source) – Excel Spreadsheet – BOLD: Split into projects that consist of 2-4 plates – Local database that holds all fields from the original spreadsheet – Data Portal Aggregator database
  14. 14. Creating a ‘Pick List’ Spreadsheet of tissue samples compared with: – ITIS taxonomy – Clements species list in BOLD – Counts of GenBank and/or public BOLD records – Geographic information Screenshot of USNM list side-by-side with BOLD records
  15. 15. Identifying Samples to be Subsampled
  16. 16. Side-by-Side Lists
  17. 17. USNM Bird Dataset – 3,150 tissues sampled – 168 failed sequences – 94 problematic sequences – 166 clustered badly – 2,761 ‘BARCODE-ready’ samples – 1,147 ‘first-BARCODE’ species – 91% increase over 1,259 barcoded species (3,892 listed in BOLD includes BINs, others)
  18. 18. Two problematic clades, USNM data Flycatchers: Family Tyrannidae – Sublegatus arenarum, S. modestus, S. obscurior, S. sp. – Conopias parvus, C. albovittatus – Myiarchus ferox, M. swainsoni, M. sp. Hummingbirds: Family Trochilidae – Phaethornis longuemareus Inconsistencies within USNM dataset Incompatibilities with public, other data
  19. 19. Resolving Mis-identified Specimens
  20. 20. What testing dataset to use? ID trees and analytical routines could use: – All public bird COI records: 10,967 – All BARCODE records in GenBank: 8,419 – BARCODE with taxonomic names: 7,965 – BARCODE, name and 2 traces: 2,388 Which ones have reliable taxonomic IDs?
  21. 21. Preparing a Data Release Paper Summary statistics from Data Portal Figures from BOLD
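
The pick-list idea from slides 14 to 17 (compare the tissue spreadsheet against counts of public records, then prioritize species that have no barcode yet) can be sketched roughly as below. This is only an illustration, not the Data Portal's actual procedure: the function names, the dictionary keys, and the toy record counts are all assumptions made for the example. The species names are borrowed from slide 18 purely as placeholders. The last line re-derives the 91% figure on slide 17 from the two species counts given there.

```python
from collections import Counter


def build_pick_list(tissue_species, public_record_species):
    """Rank species in the tissue collection by how few public records they have.

    Both arguments are plain lists of species names, one entry per tissue
    sample or per public record (a hypothetical input layout).
    """
    tissue_counts = Counter(tissue_species)
    public_counts = Counter(public_record_species)
    rows = [
        {"species": sp, "tissues": n, "public_records": public_counts.get(sp, 0)}
        for sp, n in tissue_counts.items()
    ]
    # Species with no public records sort first; ties are broken by name.
    rows.sort(key=lambda r: (r["public_records"], r["species"]))
    return rows


def percent_increase(new_species, already_barcoded):
    """Percentage growth in barcoded species from adding `new_species`."""
    return 100.0 * new_species / already_barcoded


# Toy inputs: the counts here are invented for illustration only.
tissues = ["Myiarchus ferox", "Myiarchus ferox", "Phaethornis longuemareus"]
public = ["Myiarchus ferox"]

for row in build_pick_list(tissues, public):
    print(row["species"], row["tissues"], row["public_records"])

# Slide 17: 1,147 'first-BARCODE' species against 1,259 already barcoded.
print(f"{percent_increase(1147, 1259):.1f}% increase")
```

Sorting on the public-record count puts species with no existing barcode at the top, which matches the question on slide 12 about which new ones to sample. A real pick list would also need the taxonomy reconciliation and geographic information that slide 14 mentions.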