Characterization/Bioinformatics
Working Group
Chunlin Xiao
GIAB data in SRA finally
• Many new project datasets generated since the last workshop
– HG001, Ashkenazim trio, Asian trio
– Illumina HiSeq, CG, Ion
– Long read technologies: PacBio, Bionano, CG LFR, Illumina Moleculo
• Groups have started submitting project data directly to SRA
• Existing project data on the ftp site is being submitted to SRA by us
– original submitters do not need to resubmit
• GIAB BioProject (200694) page updated to
– cover NA12878, Ashkenazim trio, and Asian trio
– include some summary info, but more is needed, e.g. coverage by sample
• GIAB NIST reference samples are distinguishable from Coriell samples (e.g. NIST_HG001 vs. NA12878, NIST_HG002 vs. NA24385, etc.)
Data release and usage
• ftp structure changed to make official data easier to find
– For NA12878-HG001, the most recent VCF is ALWAYS under the “latest” directory (ftp://ftp-trace.ncbi.nih.gov/giab/ftp/release/NA12878_HG001/latest/)
– one VCF file, one BED file, and one README file, as suggested at the last workshop
• Monthly downloads from the GIAB ftp site have increased (10~15k per month after Aug-2014; 30k in Nov-2014)
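As a rough illustration of the one-VCF, one-BED, one-README layout described above, the sketch below sorts a directory listing into those three roles and complains if the layout is not as expected. The function name and the example file names are hypothetical placeholders, not actual release file names, and the listing is passed in as a plain list so that no network access is assumed.

```python
# URL of the "latest" directory as given on the slide (not accessed by this sketch).
LATEST_URL = "ftp://ftp-trace.ncbi.nih.gov/giab/ftp/release/NA12878_HG001/latest/"


def pick_release_files(names):
    """Return the single VCF, BED and README found in a 'latest' listing.

    Files that fit none of the three roles (for example index files such as
    *.vcf.gz.tbi) are ignored. Raises ValueError unless exactly one file of
    each role is present.
    """
    roles = {"vcf": [], "bed": [], "readme": []}
    for name in names:
        lower = name.lower()
        if lower.endswith((".vcf", ".vcf.gz")):
            roles["vcf"].append(name)
        elif lower.endswith((".bed", ".bed.gz")):
            roles["bed"].append(name)
        elif "readme" in lower:
            roles["readme"].append(name)
    for role, found in roles.items():
        if len(found) != 1:
            raise ValueError(f"expected exactly one {role} file, found {len(found)}")
    return {role: found[0] for role, found in roles.items()}


# Hypothetical listing, for illustration only.
example_listing = [
    "example_calls.vcf.gz",
    "example_calls.vcf.gz.tbi",
    "example_regions.bed",
    "README.txt",
]
release = pick_release_files(example_listing)
```

A check like this is only a convenience for scripts that mirror the directory; the README in the release remains the authoritative description of the files.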
Korean Genome Project
• Changhoon Kim from Macrogen
• Provided a good use case for integrating long reads from PacBio (72X) and HiSeq PE sequencing (60X, PCR-free)
• BAC clones with ends sequenced (~83k clones)
• De novo assembly with Falcon – produced AK1 diploid genome
– 2.787 Gb of sequence assembled
– 5806 contigs with N50 of 5.78 Mb
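For readers less familiar with the N50 figure quoted above, here is a minimal sketch of the usual definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of all assembled bases. The function name and the toy contig lengths are illustrative only; they are not AK1 data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together contain
    at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Toy example: total is 32, and the two 8-base contigs already cover 16.
toy_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```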
GRCh38
• Deanna Church from Personalis
• GRCh38 assembly has many improvements over GRCh37
– The new assembly model moves toward the realization of a graph-based assembly representation
– alternative sequence representations for regions of excess diversity
– GRCh37 annotation improvements
– 178 regions with alt loci, and there are alt loci specific genes
• Naming conventions and standardization (e.g. sequence_name, data reporting format, etc.) to facilitate data analysis/exchange
• Analysis tools should be “Alt aware”
– Current tools and data structures expect a flat assembly
– Remap/liftover of variants is not the best answer
• Suggested that alt sequences be included in alignment and variant analysis to improve accuracy
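One small piece of being “Alt aware” is simply knowing which sequences in a reference are alt loci rather than primary chromosomes. The sketch below assumes UCSC-style GRCh38 sequence names, in which alt loci carry an `_alt` suffix (for example `chr6_GL000250v2_alt`); other naming schemes would need a different rule, which is part of why the naming standardization above matters. The function names are hypothetical, and this is a sketch of the bookkeeping only, not of alt-aware alignment itself.

```python
def split_primary_and_alt(sequence_names):
    """Split sequence names into (primary, alt) lists.

    Assumes UCSC-style naming where alt loci end in "_alt".
    """
    primary, alt = [], []
    for name in sequence_names:
        if name.endswith("_alt"):
            alt.append(name)
        else:
            primary.append(name)
    return primary, alt


def alt_loci_by_chromosome(sequence_names):
    """Group alt loci under the chromosome prefix of their name."""
    grouped = {}
    for name in sequence_names:
        if name.endswith("_alt"):
            # "chr6_GL000250v2_alt" -> "chr6"
            chrom = name.split("_", 1)[0]
            grouped.setdefault(chrom, []).append(name)
    return grouped


names = ["chr1", "chr6", "chr6_GL000250v2_alt", "chr6_GL000251v2_alt"]
primary, alt = split_primary_and_alt(names)
by_chrom = alt_loci_by_chromosome(names)
```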
NIST Integration
• Integrated NA12878 SNP/indel calls (from GIAB + RTG + Platinum) available
(NIST_RTG_PlatGen_merged_highconfidence_v0.2_Allannotate.vcf.gz)
• Integrating Illumina, CG & Ion Torrent for NA12878 and PGP trios
• Integrating long read technologies (PacBio, Bionano, CG LFR, Illumina
Moleculo) for NA12878 and PGP trios
• Integrating latest Platinum data
• Structural Variant calling
• Dealing with Complex regions
• Analysis team building

Jan2015 giab bioinformatics summary


Editor's Notes

  • #7 Merged NA12878 calls (from GIAB + RTG + PG) available NIST_RTG_PlatGen_merged_highconfidence_v0.2_Allannotate.vcf.gz