Jan 2015 GIAB Bioinformatics Summary

Characterization/Bioinformatics
Working Group
Chunlin Xiao

GIAB data in SRA finally
• Many new project datasets generated since the last workshop
– HG001, Ashkenazim trio, Asian trio
– Illumina HiSeq, CG, Ion
– Long read technologies: PacBio, Bionano, CG LFR, Illumina Moleculo
• Groups have started submitting project data to SRA directly
• For existing project data on the FTP site, we are handling the SRA submission
– original submitters do not need to resubmit
• GIAB BioProject (200694) page updated
– Now covers NA12878, Ashkenazim trio, and Asian trio
– Some summary info, but more is needed, e.g., coverage by sample
• GIAB NIST reference samples are distinguishable from Coriell samples (e.g., NIST_HG001 vs. NA12878, NIST_HG002 vs. NA24385)

Data release and usage
• FTP structure changed to make official data easier to find
– For NA12878-HG001, the most recent VCF is ALWAYS under the “latest” directory ( ftp://ftp-trace.ncbi.nih.gov/giab/ftp/release/NA12878_HG001/latest/ )
– One VCF file, one BED file, and one README file, as suggested at the last workshop
• Monthly downloads from the GIAB FTP site have increased (10k to 15k per month after Aug 2014; 30k in Nov 2014)
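
The one-VCF, one-BED, one-README layout can be checked mechanically once a directory listing is in hand. The following is a minimal sketch, not an official GIAB tool: the function names and the example file names are hypothetical, and the file extensions it matches are an assumption about how release files are named.

```python
def count_release_files(filenames):
    """Count VCF, BED and README files in a directory listing."""
    counts = {"vcf": 0, "bed": 0, "readme": 0}
    for name in filenames:
        lower = name.lower()
        if lower.endswith((".vcf", ".vcf.gz")):
            counts["vcf"] += 1
        elif lower.endswith((".bed", ".bed.gz")):
            counts["bed"] += 1
        elif "readme" in lower:
            counts["readme"] += 1
    return counts


def has_expected_layout(filenames):
    """True when the listing holds exactly one file of each kind."""
    return all(n == 1 for n in count_release_files(filenames).values())


# Hypothetical listing, for illustration only (not real release file names).
example_listing = ["example_calls.vcf.gz", "example_regions.bed", "README.txt"]
print(has_expected_layout(example_listing))
```

Other files in the listing, such as index files, are ignored rather than treated as errors.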

Korean Genome Project
• Changhoon Kim from Macrogen
• Provided a good use case for integrating long reads from PacBio (72X) with HiSeq PE sequencing (60X, PCR-free)
• BAC clones with ends sequenced (~83k clones)
• De novo assembly with Falcon produced the AK1 diploid genome
– 2.787 Gb of sequence assembled
– 5,806 contigs with an N50 of 5.78 Mb
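
For readers unfamiliar with the N50 statistic: it is the length of the contig at which, going from longest to shortest, the running total first reaches half of the assembled bases. A minimal sketch follows; the contig lengths are a toy example and are not taken from the AK1 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy contig lengths (hypothetical): total is 290, so half is 145.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))  # 80 + 70 = 150 >= 145, so this prints 70
```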

GRCh38
• Deanna Church from Personalis
• GRCh38 assembly has many improvements over GRCh37
– The new assembly model moves toward a graph-based assembly representation
– alternative sequence representations for regions of excess diversity
– GRCh37 annotation improvements
– 178 regions with alt loci, and there are alt loci specific genes
• Naming conventions and standardization (e.g., sequence_name, data reporting format) to facilitate data analysis/exchange
• Analysis tools should be “alt aware”
– Current tools and data structures expect a flat assembly
– Remap/liftover for variants is not the best answer
• Suggestion: include alt sequences in alignment and variant analysis to improve accuracy
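
A first step toward being alt aware is simply telling alt loci apart from primary sequences. The sketch below assumes UCSC-style GRCh38 sequence names, in which alt loci carry an `_alt` suffix; the function name is hypothetical, and references that use other naming schemes (for example accession-based names) would need a lookup table instead.

```python
def split_primary_and_alt(sequence_names):
    """Partition sequence names into primary and alt-locus lists.

    Assumes UCSC-style names where alt loci end in "_alt".
    """
    primary, alt = [], []
    for name in sequence_names:
        if name.endswith("_alt"):
            alt.append(name)
        else:
            primary.append(name)
    return primary, alt


# Small illustrative subset of sequence names, not a full reference.
names = ["chr1", "chr6", "chr6_GL000250v2_alt"]
primary, alt = split_primary_and_alt(names)
print(primary)
print(alt)
```

A pipeline could use such a partition to decide which alignments need alt-specific handling instead of treating every sequence as part of a flat assembly.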

NIST Integration
• Integrated NA12878 SNP/indel calls (from GIAB + RTG + Platinum) available
(NIST_RTG_PlatGen_merged_highconfidence_v0.2_Allannotate.vcf.gz)
• Integrating Illumina, CG & Ion Torrent for NA12878 and PGP trios
• Integrating long read technologies (PacBio, Bionano, CG LFR, Illumina
Moleculo) for NA12878 and PGP trios
• Integrating latest Platinum data
• Structural Variant calling
• Dealing with Complex regions
• Analysis team building
