140127 bioinformatics wg summary

  • Meeting Notes (1/28/14 17:34): pipeline should be reproducible by the time we characterize the second genome

    1. GIAB Data Access
       • Locations: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/ and s3://giab/
       • Listing (in slide order):
         - README.ftp_structure
         - README.sequence_data
         - alignment_indices
         - current.sequence.index
         - current.tree
         - release
         - sequence_indices
         - technical
         - tools
         - RTG
         - data
         - NA12878
         - sequence_read
         - alignment
         - variant_call
         - COMPLETE
         - GARVAN
         - INOVA
         - NIST
         - RTG
         - ILLUMINA
         - NCI
         - NOVARTIS
         - NA12891
         - NA12892
    2. GIAB Data Submitter/Submission
       Columns: Submitter | Experiments | Variant calls submitted | # Calls | Bams submitted | Fastq submitted | Software submitted
       • NIST: Integrated call set, highly confident set (NISTIntegratedCalls_14datasets_131103_allcall_UGHapMerge_HetHomVarPASS_VQSRv2.18_all.primitives.vcf.gz), 3,877,721 calls
       • COMPLETE: WGS (vcfBetaGS000025639ASM_2ad.tsv.gz), 8,815,397 calls; LFR/GS01957DNA_F09.vcf, 12,763,430 calls (Bams/Fastq/Software submitted: yes)
       • GARVAN: Exome, testing 6 vials (project.NIST.hc.snps.indels.vcf), 416,689 calls (Bams/Fastq/Software submitted: yes)
       • ILLUMINA: Platinum genomes, pedigree call (NA12878_S1_200x.genome.vcf.gz), 8,815,397 calls (Bams/Fastq/Software submitted: yes)
       • INOVA: SNP array (NA12878_GIAB_CytoSNP850K_FinalReport.120813.txt), 851,274 calls
       • RTG: Pedigree call (singletonilluminawgs.vcf.gz), 4,589,551 calls; phasing_annotated.vcf.gz, 7,085,181 calls
       • NCI: Proton runs (Bams/Fastq/Software submitted: yes)
       • Software: rtg-tools-1.0.0
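The "# Calls" column on this slide can be sanity-checked against a downloaded call set. The following is a minimal sketch, assuming that a "call" corresponds to one data line of a VCF; the slide does not say how its counts were derived, so filtered or multi-allelic records may be counted differently there. The function name `count_vcf_records` is hypothetical.

```python
import gzip


def count_vcf_records(path):
    """Count data lines in a plain or gzipped VCF.

    Assumption: header lines start with '#', and every other
    non-blank line is one record.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(
            1 for line in handle
            if line.strip() and not line.startswith("#")
        )
```

Calling `count_vcf_records` on a local copy of one of the submitted `.vcf.gz` files returns its record count, which can then be compared with the figure reported by the submitter.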
    3. Feedback from Bioinformatics Group
       • FTP structure
         – Separate the pedigree calls and add a new folder
         – Tools folder (documents for each of the tools used by the consortium)
       • Accessioning of data submitted to repository
         – Metadata collection
         – Archiving fastq, bams, and vcfs at NCBI
         – VCF 4.1 spec
       • Data Model
         – Global Alliance initiatives
         – Long range reads
         – Optical mapping
         – APIs (google, NCBI …)
         – Fast format converter is essential
       • GRCh38
         – Primary + alternate sequences, rich set of sequences (e.g. HLAs), better than the 1kg-based decoy reference
         – Complexity for the analysis
         – Easy version of Build38 (GIAB-defined primary); currently b37 still essential
         – Lifting the b37-based highly confident set over to b38 and performing analysis
         – Figuring out alternate regions for NA12878
         – What is the consortium release of the integrated set (other than Justin’s NIST set)?
         – Defining what product we are going to release as a consortium (documentation of integration)
       • Hard region characterization
         – Regions of discordance
         – Clone validation
         – SRA data
         – b37/b38 alignment assessment
         – List of such regions (defining the candidate regions, sharing such regions)
         – PacBio data for NA12878 from the 1000 Genomes Project
       • Low coverage data
         – Tools (API)
         – Download utilities for comparing the performance
         – Population analysis (for related individuals vs. unrelated individuals)
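The hard-region items on this slide mention discordant regions and a shared list of candidate regions. One way such a list could be drafted is sketched below, assuming two call sets already reduced to `(chrom, pos, ref, alt)` tuples. The function name `discordant_regions` and the `max_gap` merging window are illustrative assumptions, not something the group specified.

```python
def discordant_regions(calls_a, calls_b, max_gap=100):
    """Merge sites called in only one set into candidate regions.

    calls_a, calls_b: iterables of (chrom, pos, ref, alt) tuples.
    Returns a sorted list of (chrom, start, end), inclusive coordinates.
    Assumption: sites within max_gap bases on the same chromosome
    belong to the same candidate region.
    """
    # Calls present in exactly one of the two sets.
    discordant = set(calls_a) ^ set(calls_b)
    sites = sorted({(chrom, pos) for chrom, pos, _ref, _alt in discordant})

    regions = []
    for chrom, pos in sites:
        last = regions[-1] if regions else None
        if last and last[0] == chrom and pos - last[2] <= max_gap:
            last[2] = pos  # extend the current region
        else:
            regions.append([chrom, pos, pos])  # start a new region
    return [tuple(region) for region in regions]
```

This exact-match comparison treats differently represented but equivalent variants as discordant, so call sets would need to be normalized to a common representation first.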
    4. Bioinformatics
       • Data Integration
         – NIST with pedigree calls
         – Phasing with pedigree
         – Some people interested in looking at utility of long read methods
       • Call Quality metrics
         – Incorporate lots of INFO fields (GQ, pedigree, # of platforms, …)
       • VCF call representation
         – Do some regularization of submitted datasets for display (vcfallelicprimitives, bcbio, vcfeval)
         – Overlap with Performance Metrics
       • Pipeline reproducibility
         – Arvados, DNAnexus, NCBI
         – Run on Amazon and Google?
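The regularization item on this slide concerns making differently represented calls comparable. As a rough illustration of the idea only, the sketch below splits a multi-nucleotide substitution into single-base changes. This is not the algorithm used by vcfallelicprimitives, bcbio, or vcfeval, which also handle indels and complex alleles; the function name `split_mnp` is hypothetical.

```python
def split_mnp(chrom, pos, ref, alt):
    """Split an equal-length substitution into single-base changes.

    Assumption: only REF/ALT alleles of equal length are decomposed;
    anything else (insertions, deletions) is returned unchanged.
    """
    if len(ref) != len(alt):
        return [(chrom, pos, ref, alt)]
    return [
        (chrom, pos + offset, ref_base, alt_base)
        for offset, (ref_base, alt_base) in enumerate(zip(ref, alt))
        if ref_base != alt_base  # skip positions where the bases agree
    ]
```

For example, `split_mnp("1", 100, "ACG", "TCA")` yields two single-base records at positions 100 and 102, so the same change reported as one record by one submitter and as two by another would compare as equal after decomposition.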
