
Why graph genome storage and updating wakes me up at 4 am


Presentation at the PanGenomics in the Cloud Hackathon, run by NCBI at UCSC (https://ncbiinsights.ncbi.nlm.nih.gov/2019/02/06/pangenomics-cloud-hackathon-march-2019/). Presents points to consider when adopting a pangenome reference, emphasizing aspects relevant to long-term data management and widespread adoption.

Published in: Science


  1. Why Graph Genome Storage and Updating Wakes Me Up at 4 am
     Valerie Schneider, NCBI/GRC
  2. GRCh38 Curation
       • Spoiler alert: GRCh38 isn’t perfect!
       • >300 unresolved issues remain
       • Incorrectly assembled seg dups
       • Catch-22: Constant coordinates vs. correct sequence?
       • Patch releases to date (GRCh38.p13):
           • 72 novel patches (future alt loci)
           • 113 fix patches!
           • Gap closures/extensions
           • Path updates (replacements, rearrangements)
       • There are other updates that can’t be released as patches
       • Sequence removal (including alt loci)
       [Figure labels: chromosome, novel patch scaffold, fix patch scaffold]
  3. More genome assemblies
       • Managing updates: incremental or full rebuilds?
       • Impact of assembly quality on the pan-genome?
       • 7.7 billion people: 300 genomes (99% of MAF 1%)
       • Intra-population diversity
       • Under-representation from Africa, Middle East and Oceanic populations
  4. Pan-genome reference data definition
       • What does the reference become?
           • Collection of assemblies
           • Graph representation
           • A “golden” path
           • Specific representations
       • Data representation
       • Identifier for the pan-genome (e.g. GCA_000001405.$$)
       • Versioning: what changes trigger an update?
       • Distributed data: what authority manages updates?
       • File formats: sequence = FASTA; graph = ?; graph-based annotations = VCF, BED, GFF, ??
       • Metadata
       • Assembly quality (old: finishing status, alignment criteria)
     Today’s reference assembly does not represent:
       1. The most common allele/haplotype
       2. The longest allele/haplotype
       3. The ancestral allele/haplotype
  5. Diverse users, diverse needs
       • Mapping reads
       • Coordinate system
       • Annotations
       • Relating samples to one another
       • Visualization (as a means for analysis)
       • Clinical reporting
       • Regulations for reporting on a graph?
       • Truth sets, documented changes essential
       • Clinical tools lag by at least 1 year
       • And the tools to support these things…
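
Slide 3 pairs "300 genomes" with "99% of MAF 1%" without showing the arithmetic. One possible reading, which is an assumption here and not something the slides state, is that it refers to the chance of sampling an allele with 1% minor allele frequency at least once. The sketch below works that out under a simple model: haplotypes are drawn independently from a single population, and each diploid genome contributes two haplotypes. The function names are hypothetical and exist only for illustration.

```python
import math


def detection_probability(maf: float, n_haplotypes: int) -> float:
    """Chance that an allele at frequency `maf` is seen at least once
    among `n_haplotypes` independently sampled haplotypes (assumed model)."""
    return 1.0 - (1.0 - maf) ** n_haplotypes


def haplotypes_needed(maf: float, target: float) -> int:
    """Smallest number of haplotypes for which the detection probability
    reaches `target`, under the same independence assumption."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - maf))


if __name__ == "__main__":
    maf = 0.01            # minor allele frequency of 1%
    genomes = 300         # figure quoted on the slide
    haplotypes = 2 * genomes  # assumes diploid genomes

    p = detection_probability(maf, haplotypes)
    print(f"P(seen at least once in {genomes} genomes) = {p:.4f}")
    print(f"Haplotypes needed for 99%: {haplotypes_needed(maf, 0.99)}")
```

Under these assumptions, 300 diploid genomes clear the 99% threshold for a 1% allele. Real populations are structured, so the sampling concerns on the same slide (intra-population diversity, under-represented populations) would push the required number higher than this idealized estimate.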
