140127 bioinformatics wg summary
  • Meeting Notes (1/28/14 17:34): pipeline to be reproducible by the time we characterize the second genome
  • Transcript

    • 1. GIAB Data Access
      • Locations: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/ and s3://giab/
      • Entries listed on the slide:
        – README.ftp_structure
        – README.sequence_data
        – alignment_indices
        – current.sequence.index
        – current.tree
        – release
        – sequence_indices
        – technical
        – tools
        – RTG
        – data
        – NA12878
        – sequence_read
        – alignment
        – variant_call
        – COMPLETE
        – GARVAN
        – INOVA
        – NIST
        – RTG
        – ILLUMINA
        – NCI
        – NOVARTIS
        – NA12891
        – NA12892
    • 2. GIAB Data Submitter/Submission
      • Columns: Submitter; Experiments; Variant calls submitted; # Calls; Bams submitted; Fastq submitted; Software submitted
      • NIST: Integrated call set (highly confident set); NISTIntegratedCalls_14datasets_131103_allcall_UGHapMerge_HetHomVarPASS_VQSRv2.18_all.primitives.vcf.gz; 3,877,721 calls
      • COMPLETE: WGS; vcfBetaGS000025639ASM_2ad.tsv.gz, 8,815,397 calls; LFR/GS01957DNA_F09.vcf, 12,763,430 calls; Bams/Fastq/Software submitted: yes
      • GARVAN: Exome (testing 6 vials); project.NIST.hc.snps.indels.vcf; 416,689 calls; Bams/Fastq/Software submitted: yes
      • ILLUMINA: Platinum genomes, pedigree call; NA12878_S1_200x.genome.vcf.gz; 8,815,397 calls; Bams/Fastq/Software submitted: yes
      • INOVA: SNP array; NA12878_GIAB_CytoSNP850K_FinalReport.120813.txt; 851,274 calls
      • RTG: Pedigree call; singletonilluminawgs.vcf.gz, 4,589,551 calls; phasing_annotated.vcf.gz, 7,085,181 calls; software: rtg-tools-1.0.0
      • NCI: Proton runs; Bams/Fastq/Software submitted: yes
    • 3. Feedback from Bioinformatics Group
      • FTP structure
        – Separate pedigree calls and add a new folder
        – Tools folder (documents for each of the tools used by the consortium)
      • Accessioning of data submitted to repository
        – Metadata collection
        – Archiving FASTQs, BAMs, and VCFs at NCBI
        – VCF 4.1 spec
      • Data model
        – Global Alliance initiatives
        – Long range reads
        – Optical mapping
        – APIs (Google, NCBI …)
        – Fast format converter is essential
      • GRCh38
        – Primary + alternate sequences, rich set of sequences (e.g. HLAs), better than the 1kg-based decoy reference
        – Complexity for the analysis
        – Easy version of Build 38 (GIAB-defined primary); currently b37 is still essential
        – Lifting the b37-based highly confident set over to b38 and performing analysis
        – Figuring out alternate regions for NA12878
        – What is the consortium release of the integrated set (other than Justin’s NIST set)?
        – Defining what product we are going to release as a consortium (documentation of integration)
      • Hard region characterization
        – Regions of discordance
        – Clone validation
        – SRA data
        – b37/b38 alignment assessment
        – List of such regions (defining the candidate regions, sharing such regions)
        – PacBio data for NA12878 from the 1000 Genomes project
      • Low coverage data
        – Tools (API)
        – Download utilities for comparing the performance
        – Population analysis (for related individuals vs unrelated individuals)
    • 4. Bioinformatics
      • Data integration
        – NIST with pedigree calls
        – Phasing with pedigree
        – Some people interested in looking at utility of long read methods
      • Call quality metrics
        – Incorporate lots of INFO fields (GQ, pedigree, # of platforms, …)
      • VCF call representation
        – Do some regularization of submitted datasets for display (vcfallelicprimitives, bcbio, vcfeval)
        – Overlap with Performance Metrics
      • Pipeline reproducibility
        – Arvados, DNAnexus, NCBI
        – Run on Amazon and Google?
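
The "VCF call representation" item above is about making differently written calls comparable before display or comparison. As a rough illustration only, the sketch below trims bases shared by REF and ALT so that the same variant submitted with different padding collapses to one record. The function name `trim_alleles` is hypothetical, and this is not what vcfallelicprimitives, bcbio, or vcfeval actually do: those tools also decompose complex alleles or compare haplotypes, and full normalization additionally needs left-alignment against the reference sequence.

```python
def trim_alleles(pos, ref, alt):
    """Trim bases shared by REF and ALT (simplified, assumption-laden sketch).

    pos is the 1-based VCF position of the first REF base. Each allele
    always keeps at least one base, as VCF requires.
    """
    # Drop the shared suffix first; this does not move the position.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then drop the shared prefix, advancing the position by one per base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Two submitters describing the same SNP with different padding
# end up with the same trimmed record.
padded = trim_alleles(100, "GCAT", "GTAT")
minimal = trim_alleles(101, "C", "T")
print(padded == minimal)
```

Trimming the suffix before the prefix keeps the leading anchor base that VCF expects for insertions and deletions, so `trim_alleles(100, "CAG", "CG")` stays anchored at position 100.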