Dr David Schindel and Mike Trizna - BOL Data Portal
The Barcode of Life Data Portal (http://bol.uvm.edu)
Dr. David E. Schindel, Executive Secretary
Michael Trizna, Database Specialist
Consortium for the Barcode of Life (CBOL)
Smithsonian Institution, Washington, DC
www.barcodeoflife.org; SchindelD@si.edu and TriznaM@si.edu

Contents of Presentation
Crowd-sourced open source software
How does Data Portal complement BOLD and GenBank?
Data Portal capabilities
Case Study: Smithsonian frozen bird tissue project

An Experiment in Museum Tissue Mining and Fast Data Release
Tissue sampling winter/spring
Sequencing completed in September
Sequence quality control in October
Taxonomic checking in early November
  – Obvious errors removed
  – Minor discrepancies remain
Data released for Adelaide Conference
  – Crowd-sourced annotation by community
  – Will data be misused?

Unique Data Portal Capabilities
Creating customized datasets from public and/or your private data
Online library of standard datasets
Support sharing within project teams using Connect IDs, easy link to Working Groups
Running different identification analyses based on different methodologies:
  – Standard sequence input using FASTA format
  – Use standard or customized datasets

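The FASTA input mentioned above is a plain-text format in which each record starts with a ">" header line followed by one or more sequence lines. As a minimal sketch only (this is not the Data Portal's own code; the function name `parse_fasta` and the sample record IDs are hypothetical), a reader for such input might look like this in Python:

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}. Hypothetical helper."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            if header in records:
                raise ValueError(f"duplicate header: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data before first '>' header")
        else:
            # Sequence may be wrapped over several lines; join them.
            records[header] += line.upper()
    return records


# Made-up input: IDs and sequences are illustrative only.
example = """>sample_001
acgtacgt
ACGT
>sample_002
TTGACA
""".splitlines()

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

Keeping the parser independent of any particular identification package means the same query file could, in principle, be passed to several analyses unchanged.
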
Existing Data Analysis Packages
List of packages
  – BLOG
  – BRONX
  – Kernel
  – CAOS
  – USEARCH
  – BLAST
Output of identification routines as probabilities of assignment

Data Analysis Methods Session
New packages presented Friday afternoon:
  – Damon Little: Automatic Plants Barcode pipeline (from raw traces to trimmed/edited sequences)
  – Ka Hou Chu: Composite Vector Method (profile trees for faster alignment and tree-based analysis)
  – Alain Franc: Matching Next Generation results to Sanger-based reference records

The USNM Bird Project
USNM Division of Birds frozen tissue collection:
  – 21,104 specimens, 2,512 species
Which new ones to sample/barcode?
Public records for birds
  – All public bird COI records: 10,967
  – All BARCODE records in GenBank: 8,419
  – BARCODE with taxonomic names: 7,965
  – BARCODE, name and 2 traces: 2,388

Moving Data Among BOLD, GenBank, Data Portal
[Data-flow diagram; box labels:]
USNM (KE-Emu Source)
Excel Spreadsheet
BOLD: split into projects that consist of 2-4 plates
Local database that holds all fields from the original spreadsheet
Data Portal Aggregator database

Creating a ‘Pick List’
Spreadsheet of tissue samples compared with:
  – ITIS taxonomy
  – Clements species list in BOLD
  – Counts of GenBank and/or public BOLD records
  – Geographic information
Screenshot of USNM list side-by-side with BOLD records

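One way to picture the pick-list comparison: given a species name for each tissue sample and a count of existing public records per species, rank the species with the fewest public records first. The sketch below assumes simple in-memory data; the function name `build_pick_list`, the placeholder species names, and all counts are hypothetical and are not taken from the USNM spreadsheet.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_pick_list(
    tissue_species: List[str],
    public_counts: Dict[str, int],
) -> List[Tuple[str, int, int]]:
    """Return (species, tissues_available, public_records), least-covered first."""
    tissues = Counter(tissue_species)
    rows = [
        # Species absent from the public counts are treated as having 0 records.
        (species, n_tissues, public_counts.get(species, 0))
        for species, n_tissues in tissues.items()
    ]
    # Sort by public record count, then by name for a stable order.
    return sorted(rows, key=lambda row: (row[2], row[0]))


# Placeholder data, for illustration only.
tissue_species = ["Genus alpha", "Genus beta", "Genus alpha", "Genus gamma"]
public_counts = {"Genus alpha": 12, "Genus beta": 0}

pick_list = build_pick_list(tissue_species, public_counts)
for species, n_tissues, n_public in pick_list:
    print(f"{species}: {n_tissues} tissue(s), {n_public} public record(s)")
```

Species with no public records sort to the top, which matches the question of which new ones to sample first; a real pick list would also need the taxonomy reconciliation and geographic information listed above.
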
USNM Bird Dataset
3,150 tissues sampled
168 failed sequences
94 problematic sequences
166 clustered badly
2,761 ‘BARCODE-ready’ samples
1,147 ‘first-BARCODE’ species
91% increase over 1,259 barcoded species (3,892 listed in BOLD includes BINs, others)

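The 91% figure follows from the two species counts on the slide: 1,147 newly barcoded species relative to 1,259 already barcoded. A quick check in Python, using only the numbers above (the variable names are illustrative):

```python
first_barcode_species = 1147
previously_barcoded_species = 1259

# Relative increase in the number of barcoded bird species.
increase_pct = 100 * first_barcode_species / previously_barcoded_species
print(f"{increase_pct:.1f}% increase")  # 91.1% increase

# Total barcoded species if the two groups do not overlap, which is
# what 'first-BARCODE' implies.
total = previously_barcoded_species + first_barcode_species
print(total)  # 2406

# The three flagged categories sum to 428. The slide does not say whether
# they overlap, so this is not reconciled with the 2,761 figure here.
flagged = 168 + 94 + 166
print(flagged)  # 428
```
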
Two problematic clades, USNM data
Flycatchers: Family Tyrannidae
  – Sublegatus arenarum, S. modestus, S. obscurior, S. sp.
  – Conopias parvus, C. albovittatus
  – Myiarchus ferox, M. swainsoni, M. sp.
Hummingbirds: Family Trochilidae
  – Phaethornis longuemareus
Inconsistencies within USNM dataset
Incompatibilities with public, other data

What testing dataset to use?
ID trees and analytical routines could use:
  – All public bird COI records: 10,967
  – All BARCODE records in GenBank: 8,419
  – BARCODE with taxonomic names: 7,965
  – BARCODE, name and 2 traces: 2,388
Which ones have reliable taxonomic IDs?

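The four candidate datasets read as successively stricter filters on the same pool of records. The sketch below illustrates that nesting under stated assumptions: the record fields (`has_barcode_keyword`, `species_name`, `trace_count`) are hypothetical and do not reflect the actual GenBank or BOLD schema, "2 traces" is interpreted as at least two, and the sample records are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CoiRecord:
    accession: str              # placeholder identifier
    has_barcode_keyword: bool   # assumed flag for a BARCODE record
    species_name: Optional[str]
    trace_count: int


def tier_counts(records: List[CoiRecord]) -> Dict[str, int]:
    """Count records surviving each successively stricter filter."""
    barcode = [r for r in records if r.has_barcode_keyword]
    named = [r for r in barcode if r.species_name]
    with_traces = [r for r in named if r.trace_count >= 2]
    return {
        "all_public_coi": len(records),
        "barcode": len(barcode),
        "barcode_named": len(named),
        "barcode_named_2_traces": len(with_traces),
    }


# Made-up records, for illustration only.
records = [
    CoiRecord("X1", False, "Genus alpha", 0),
    CoiRecord("X2", True, None, 2),
    CoiRecord("X3", True, "Genus beta", 1),
    CoiRecord("X4", True, "Genus gamma", 2),
]

counts = tier_counts(records)
print(counts)
```

None of these filters addresses the closing question on the slide: a record can pass all three and still carry an unreliable taxonomic identification.
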
Preparing a Data Release Paper
Summary statistics from Data Portal
Figures from BOLD