Agbt2015 workshop schneider

The human reference genome is becoming more complex, moving from a single consensus sequence to a representation of multiple haplotypes and genomic diversity. The current assembly, GRCh38, uses the GRC assembly model and includes 178 regions with alternate loci; its 261 alt loci add 3.6 Mb of novel sequence relative to the chromosomes. Future assemblies will aim to better define sequence contexts and provide coordinate information for multiple genomes and patches. Challenges include developing compatible analysis tools and determining how best to represent updated regions in new assembly releases.

Advancing the Human Reference Assembly
Valerie Schneider
NCBI
25 February 2015
The Human Reference Genome: Today, Tomorrow and Next ?
http://genomereference.org

Dilthey et al.; Paten et al.
Scientific Models

Outline
• The assembly model
• Basics
• Value added
• Challenges
• Future relevance of the reference
• Multiple genomes
• Haploid genomes
• Assembly updates
• Mechanisms
• Requirements/Challenges

Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
Current Assembly model: represent both haplotypes
GRC Assembly Model
many

GRC Assembly Model
Assembly (e.g. GRCh38)
• Primary Assembly Unit
• Non-nuclear assembly unit (e.g. MT)
• ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7
• PAR
• Genomic Region (MHC)
• Genomic Region (UGT2B17)
• Genomic Region (MAPT)
Church et al., PLoS Biol. 2011 Jul;9(7):e1001091

GRC Assembly Model
Alt loci alignments are an integral part of the assembly model
alignment to chr + scaffold sequence = Alt

GRCh38
• 178 regions with alt loci: 2% of chromosome
sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to
chromosomes
• Average alt length = 400 kb, max = ~5 Mb
GRCh38

GRC Assembly Model
The human reference assembly represents population
genomic diversity in the context of linear sequences

GRCh38: Alt Loci
Alignment Legend
no alignment, mismatch, deletion

GRCh38: Alt Loci
GRCh38 alt
loci alignment
GRCh37 chr. 7

chromosome
alt/patch
reads On-target alignment
Off-target alignments
(n=122,922)
GRCh38: Alt Loci

GRC Assembly Model
http://notvine.co/
http://designtaxi.com/

Challenges: Allelic Duplication

Challenges: Reporting Multiple Locations

SRPRISM
Challenges: Solutions
https://github.com/samtools/hts-specs/issues/51

Multiple Genome Era
ADMIXTURE?
http://medcitynews.com/

Assembly Updates
Assembly (e.g. GRCh38.p1)
• Primary Assembly Unit
• Non-nuclear assembly unit (e.g. MT)
• ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7
• PAR
• Genomic Region (MHC)
• Genomic Region (UGT2B17)
• Genomic Region (MAPT)
• Patches
• Genomic Region (ABO)
• Genomic Region (FOXO6)
• Genomic Region (FCGBP)
Patches
• FIX: scaffold status at next major assembly release: -- (integrated); treat as: Preferred
• NOVEL: scaffold status at next major assembly release: ALT LOCI; treat as: Allelic

Assembly Updates
Criteria for Reference Assembly Component Sequences (GRC)
• Finished Quality
• INSDC Accessioned
• Representative of an actual DNA molecule

Summary
• Reference Assembly: Today
• Multi-allelic
• Need compatible toolsuites
• Reference Assembly: Tomorrow
• Defining sequence context
• Providing coordinates
• Reference Assembly: Next ?
• Patches
• Challenges

GRCh38 Collaborators
• NCBI RefSeq and gpipe annotation team
• Havana annotators
• Karen Miga
• David Schwartz
• Steve Goldstein
• Mario Caceres
• Giulio Genovese
• Jeff Kidd
• Peter Lansdorp
• Mark Hills
• David Page
• Jim Knight
• Stephan Schuster
• 1000 Genomes
GRC SAB
• Rick Myers
• Granger Sutton
• Evan Eichler
• Jim Kent
• Roderic Guigo
• Carol Bult
• Derek Stemple
• Jan Korbel
• Liz Worthey
• Matthew Hurles
• Richard Gibbs
GRC Credits
Workshop sponsor:
http://genomereference.org

Editor's Notes

I’d like to open this workshop by reminding everyone of the difference between a genome and an assembly. A human genome is a physical object. An assembly is our representation of that object. It is a model. And as shown here, genome models can take many forms. And as these atomic models illustrate, scientific models evolve over time to reflect our growing knowledge base. And so it is with the human assembly model, the reference genome.
Today’s workshop addresses the advancement of the human reference genome assembly in the context of new data and technologies. In my talk, I’ll discuss the current reference assembly, highlighting the following topics (read outline). I’ll be followed by Karyn, Tina and Deanna, who will each be talking in more detail about some of these items that I introduce.
When assembling the genome of a single diploid individual, there may be divergent haplotypes that confound genome assembly. In the original reference assembly model, which was essentially a stick model of linear chromosomes, there really wasn’t a good way to represent highly variant or complex genomic regions. Different haplotypes were simply compressed into a consensus. The insertion of different haplotypes, however, often led to non-existent allele combinations and artificial gaps, as illustrated here. This issue led the GRC to develop a new assembly model several years ago that has a mechanism to cleanly represent multiple haplotypes: alternate loci. They allow the reference assembly to contain alternate representations for regions where haplotype compression isn’t appropriate or a single sequence path is considered insufficient. At the same time, the model retains the linear chromosomes with which most users are comfortable. As a result of the adoption of this model, it’s important to understand that the reference assembly isn’t a haploid or even a diploid genome representation. For any locus, it can represent many haplotypes.
This slide explains how the assembly model accomplishes this. The first thing to know is that the “assembly” is composed of multiple assembly units. The primary assembly unit is the collection of chromosomes and unlocalized and unplaced scaffolds. This is essentially the original haploid assembly model. Non-nuclear genomes are assigned to their own assembly unit. Regions are defined for those areas of the genome for which alternate sequence representation is desired. Alternate sequence representations for those regions go into alternate loci assembly units. The first alternate sequence representation for each region goes into one assembly unit. Each additional sequence representation for a region goes into its own assembly unit. We also define the PARs, to account for sequence shared by the sex chromosomes.
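As a rough illustration of the unit assignment just described, here is a minimal Python sketch. It reads the rule as "the k-th alternate representation of every region goes into alt-loci unit k", which is an interpretation of the notes rather than a GRC specification. The function name, the unit labels (ALT_1, ALT_2, ...) and the scaffold names are hypothetical; only the region names come from the slides.

```python
from collections import defaultdict


def assign_alt_units(region_to_alts):
    """Group alternate scaffolds into alt-loci assembly units.

    The k-th alternate representation of each region is placed in unit k,
    so a region with three alts contributes one scaffold each to units 1-3.
    """
    units = defaultdict(list)
    for region, alts in region_to_alts.items():
        for k, scaffold in enumerate(alts, start=1):
            units[f"ALT_{k}"].append((region, scaffold))
    return dict(units)


# Hypothetical scaffold names; region names are those shown on the slide.
example = {
    "MHC": ["mhc_alt_a", "mhc_alt_b", "mhc_alt_c"],
    "UGT2B17": ["ugt2b17_alt_a"],
    "MAPT": ["mapt_alt_a", "mapt_alt_b"],
}

units = assign_alt_units(example)
for name in sorted(units):
    print(name, units[name])
```

Under this reading, the number of alt-loci units equals the largest number of alternate representations held by any single region.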
The alternate loci are stand-alone accessioned scaffold sequences that are given chromosome context via their alignment to the primary assembly unit. This image shows a portion of GRCh38 chr. 17, with its regions and alt loci alignments. As you can see, the relationships of the alts to the primary assembly can be complex, with indels and inversions. For this reason, the GRC curates these alignments. One point I want to make is that the alignments of the alt loci to the chromosomes are an integral part of the assembly model. The alignment, in conjunction with the sequence, is what defines the alt. The alignments are available for download with the assembly from GenBank.
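To make concrete why the alignment is part of what defines an alt, the following Python sketch projects a position on an alt scaffold onto the chromosome through a list of ungapped alignment blocks. The block representation, the function name and all coordinates are assumptions made up for illustration; they are not the format in which the GRC distributes its curated alignments.

```python
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    alt_start: int   # 0-based start on the alt scaffold
    chr_start: int   # 0-based start on the chromosome
    length: int      # ungapped block length
    reverse: bool    # True when the block aligns in inverted orientation


def alt_to_chr(pos: int, blocks: List[Block]) -> Optional[int]:
    """Return the chromosome coordinate for an alt position, or None
    when the position lies in alt sequence that has no alignment."""
    for b in blocks:
        offset = pos - b.alt_start
        if 0 <= offset < b.length:
            if b.reverse:
                # Inverted block: alt start pairs with chromosome block end.
                return b.chr_start + b.length - 1 - offset
            return b.chr_start + offset
    return None


# Toy alignment: two forward blocks separated by an alt-only insertion
# (alt positions 100-149), followed by one inverted block.
blocks = [
    Block(alt_start=0, chr_start=1000, length=100, reverse=False),
    Block(alt_start=150, chr_start=1100, length=50, reverse=False),
    Block(alt_start=200, chr_start=1150, length=20, reverse=True),
]

print(alt_to_chr(10, blocks))
print(alt_to_chr(120, blocks))
print(alt_to_chr(200, blocks))
```

Positions inside the insertion return None, which corresponds to the "no alignment" stretches in the alignment legend: sequence that exists only on the alt and has no chromosome coordinate.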
The ideogram image in this slide shows the genome-wide locations of alternate loci in GRCh38, along with some basic alt loci stats.
What all this means is that you don’t have to wait for the development of a graph-based genome representation and corresponding tool suites to do genomic analyses that benefit from variant sequence representations. The current assembly model allows the reference to represent population genomic diversity in the context of linear sequences, which are the currency for most existing analysis pipelines. The next couple of slides show you some of the value added to analyses by use of the full assembly model.
Gene content is one way in which alt loci add value to the assembly. In this slide, you can see several genes annotated in the regions of this alternate representation of the chr. 19 KIR region that have no alignment to the chromosome. Deanna will tell you more about genes unique to alt loci in her talk. [Thus, if you’re not using the entire assembly in your analyses, you may be missing genes. This can affect the development of exome capture reagents. In addition, many of these alts contain paralogous gene copies that will affect alignments and your understanding of the protein content of the genome.]
Alternate loci also have implications for read mapping and data interpretation. This image from the NCBI 1000G browser shows a region of GRCh37 chr. 7 encompassing UPK3B, a gene expressed in primary mesothelial cells. The chromosomal representation of UPK3B has not changed in GRCh38, and is believed to represent a relatively rare insertion allele. An alternate locus for this region is included in GRCh38, and represents the deletion allele, as illustrated by its alignment to GRCh37. As illustrated by the 3 samples shown here, alignment profiles in this region vary depending on the alignment method used, in this case bwa or mosaik. As a result, it’s difficult to ascertain the genotypes of these samples or the distribution of these alleles in the human population. However, with the inclusion of the alternate scaffold, we can better interpret the data. This slide shows the alignment of previously unmapped reads from one of these 1000G samples to the GRCh38 alt across the indel boundary, indicating that the sample contains the deletion allele. From analyses such as these, we can see how the inclusion of alternate loci in alignment target sets may improve alignments and data interpretation.
Alternate loci also have a broader impact on read alignments. Since we first developed this model, we’ve been interested in the effect of alt loci on read mapping. This slide describes a study we did a few years ago with the GRCh37.p9 assembly. We looked at the alignment behavior of simulated reads sourced from sequence unique to alt loci. We asked what happened to them when aligned to the primary assembly unit without the alt loci, where their true target is missing. We aligned the reads either as singletons or pairs, using two different aligners (BWA and SRPRISM). As shown in this graph, regardless of read pairing or the aligner, 25% of these reads failed to align (red). What’s particularly concerning is that nearly three-quarters had an off-target alignment on the primary assembly unit (in blue). These off-target alignments are likely to result in errors in variation analyses. This analysis demonstrates the broader value of including alternate loci in alignment target sets.
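The bookkeeping behind this kind of experiment can be sketched in a few lines of Python. Because the simulated reads come from sequence unique to the alts and the target contains only the primary assembly unit, any alignment at all is off-target by construction. The function name and the toy input below are hypothetical; the four toy reads are chosen only to mirror the proportions quoted above and are not the study's data.

```python
from collections import Counter


def summarize_outcomes(read_hits):
    """Return the fraction of reads that were unaligned vs. off-target.

    read_hits maps a read name to the list of primary-assembly sequences
    it aligned to; an empty list means the read failed to align.
    """
    counts = Counter(
        "off_target" if hits else "unaligned" for hits in read_hits.values()
    )
    total = sum(counts.values())
    if total == 0:
        return {"unaligned": 0.0, "off_target": 0.0}
    return {k: counts[k] / total for k in ("unaligned", "off_target")}


# Toy input (made up): one unaligned read and three off-target reads.
toy_hits = {
    "read_1": [],
    "read_2": ["chr6"],
    "read_3": ["chr6", "chr17"],
    "read_4": ["chr1"],
}

fractions = summarize_outcomes(toy_hits)
print(fractions)
```

Running the same tally per aligner and per pairing mode (singletons vs. pairs) would give the four bars of a graph like the one on the slide.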
While it’s clear that alternate loci add value to the reference assembly, you need the right set of tools to take advantage of them. Unfortunately, using many common analysis suites and file formats with the current assembly model is kind of like eating yogurt with chopsticks. They give you a taste of the richness of the data, but leave a lot behind. This is a point that Deanna will address in greater detail later today, so I’ll only outline the challenges researchers face in using the full assembly. But because the assembly model is still based on linear sequences, it should be possible to modify our current tools and file formats to take full advantage of the reference, rather than starting from scratch.
The first issue is allelic duplication. Most current aligners cannot distinguish the allelic duplication introduced by alternate loci from segmental duplication. As a result, reads aligning to sequence common to the chromosomes and alternate loci tend to be down-weighted and excluded from further analysis. This slide shows a graphical view of an alt locus scaffold, with the alignments of the chromosome and reads from a 1000G sample. The top set of alignments represents reads that aligned to both the alt and the chr. The bottom set are the reads that aligned only to the alt. Zooming in, we see these are reads aligning to an insertion in the alt sequence. Unless the aligner can recognize the chromosomal regions associated with alt loci and avoid down-weighting alignments in those regions, the gains in picking up new read alignments are likely to be offset by the discarding of other alignments.
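One way to picture the check an alt-aware tool needs is sketched below. It is an assumption-laden illustration, not how any particular aligner works: the `Hit` and `AltPlacement` types, the half-open coordinates, and the scaffold names are all hypothetical. It only covers the case of one chromosomal hit plus hits on alts placed over that same chromosomal region.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Hit:
    contig: str
    start: int  # 0-based, half-open interval [start, end)
    end: int


@dataclass(frozen=True)
class AltPlacement:
    alt: str          # name of the alt scaffold
    chrom: str        # chromosome the alt is placed on
    chrom_start: int  # chromosomal region covered by the alt's alignment
    chrom_end: int


def is_allelic_only(hits: Sequence[Hit], placements: Sequence[AltPlacement]) -> bool:
    """True if a read's multiple hits are explained by alt loci and their chromosomal region."""
    if len(hits) < 2:
        return False
    contigs = [h.contig for h in hits]
    if len(set(contigs)) != len(contigs):
        return False  # two hits on one sequence look like segmental duplication
    by_alt = {p.alt: p for p in placements}
    chrom_hits = [h for h in hits if h.contig not in by_alt]
    alt_hits = [h for h in hits if h.contig in by_alt]
    if len(chrom_hits) != 1 or not alt_hits:
        return False
    c = chrom_hits[0]
    # Every alt hit must belong to an alt placed over the chromosomal hit.
    return all(
        by_alt[h.contig].chrom == c.contig
        and c.start < by_alt[h.contig].chrom_end
        and c.end > by_alt[h.contig].chrom_start
        for h in alt_hits
    )


placements = [AltPlacement("alt_scaffold_1", "chr17", 1000, 5000)]  # hypothetical
allelic = [Hit("chr17", 1200, 1300), Hit("alt_scaffold_1", 200, 300)]
paralogous = [Hit("chr17", 1200, 1300), Hit("chr17", 90000, 90100)]
print(is_allelic_only(allelic, placements), is_allelic_only(paralogous, placements))
```

A read flagged this way could keep its mapping confidence rather than being treated as ambiguously placed.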
Another challenge to using alternate loci comes in reporting features associated with more than one location. As shown in this image that illustrates the TNXB locus on the reference chromosome and 3 alts, genes may have different structures in different locations. Modifications to file formats such as GFF will make it easier to recognize sequence relationships across the assembly when reporting gene and exon locations. Variant analysis and reporting is another area where changes are needed. As illustrated here, GRCh38 includes representations for the two major haplotypes at the MAPT locus. Depending on sample genotype, it may be desirable to report on more than one representation. However, the VCF format requires modification to support this.
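Because each alt scaffold comes with an alignment to the chromosome, a feature position on an alt can in principle be related back to chromosome coordinates. The sketch below shows the idea with ungapped, forward-strand alignment blocks. The `Block` type, the function name, and the coordinates are assumptions made for illustration and do not correspond to any existing file format or tool.

```python
from typing import NamedTuple, Optional, Sequence


class Block(NamedTuple):
    alt_start: int    # 0-based start of the block on the alt scaffold
    chrom_start: int  # 0-based start of the block on the chromosome
    length: int       # ungapped block length (forward strand assumed)


def alt_to_chrom(pos: int, blocks: Sequence[Block]) -> Optional[int]:
    """Project a 0-based alt position onto the chromosome.

    Returns None when the position is in alt sequence with no chromosomal
    counterpart, for example an insertion unique to the alt.
    """
    for block in blocks:
        offset = pos - block.alt_start
        if 0 <= offset < block.length:
            return block.chrom_start + offset
    return None


# Hypothetical alt with a 50 bp insertion (alt positions 100-149)
# relative to the chromosome.
blocks = [Block(0, 1000, 100), Block(150, 1100, 50)]
print(alt_to_chrom(10, blocks), alt_to_chrom(120, blocks), alt_to_chrom(160, blocks))
```

Positions that return `None` are exactly the ones that need a way to be reported on the alt itself.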
A GRC workshop held last fall led to a publication that helped raise awareness of these issues, and some proposals, such as this one by Aaron Quinlan to make VCF alt-friendly, were discussed. There’s a git issue available for those who are interested. Additionally, bwa-mem recently became alt-aware, joining SRPRISM as an alt-aware short read aligner. These changes show that use of the full assembly model is possible and that the necessary tools are starting to become available.
I now want to shift gears and discuss the place of the reference as we enter a new era in which this assembly may no longer stand apart in terms of its quality or completeness. It’s important to remember that the human reference assembly is a special kind of genome model. In today’s era of personal genome sequencing, most assemblies only model a haploid or diploid genome. But the reference assembly is a model of many diploid genomes, meant to represent the “human” genome. This slide shows the assembly composition of the GRCh38 primary assembly. While 70% of the genome comes from one donor, sequence from >70 individuals is represented.
Because the reference is derived from many individuals and includes alternate sequence representations, it is likely to remain our best resource for putting sequences identified in any individual into a genomic context. Likewise, because a common coordinate system remains critical for communication and reporting purposes, we’re likely to see the reference retain this role as well. The table shows the latest versions of human genome assemblies in GenBank. Those in red are population-specific, and more population-specific genomes are under construction today. When analyzing samples from known populations, population-specific references or collections of population-specific genomes may be particularly valuable for variant or haplotype analysis. Even with a reference that is a graph of population variation, certain analyses may benefit from using only sub-paths in the graph. However, it’s important to realize that the utility of population-specific references may be limited for admixed samples. Given that much of the US population is admixed, this is an important consideration for resource development. Today you’ll hear from Tina about gold genomes, a set of genomes from diverse populations that are being sequenced to provide new representations for some of the genome’s most variable regions. These data will be incorporated into the reference.
Karyn and Tina will also be talking today about platinum assemblies that are derived from hydatidiform moles, which have haploid genomes. Without allelic duplication complicating their assembly, these resources facilitate the resolution of some of the most complex segmentally duplicated genome regions. These platinum genomes will be assembled to reference quality. However, it’s important to realize that there are no plans to replace the reference with either of these platinum mole assemblies. Like other individual genomes, they are limited in their representation of diversity. As you’ll hear, the GRC does intend to use these genomes to improve or augment the reference. As we enter a new era of multiple high quality genomes, we still envision the reference playing important roles.
In the last few minutes of this talk, I’ll discuss ongoing efforts to improve the reference. The “patches” feature of the model allows the GRC to make assembly updates available in a timely fashion without disrupting the chromosome coordinates upon which other users rely. Regions are defined for the genomic locations to be updated, and the sequences representing those updates are put into the “Patches” assembly unit. Like the alt loci, the patches are stand-alone scaffold sequences with alignments. It’s important to distinguish the two types of patches and the ways in which they should be used for analysis: (1) FIX patches correct problems in the assembly and are deprecated in the next assembly release. (2) NOVEL patches add new alternate sequence representations to the assembly and become alternate loci in the next assembly release.
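The different fates of the two patch types can be written down as a small lookup. This is only a sketch of the rule as stated above. The enum, the function names, and the patch scaffold names are hypothetical and do not reflect any GRC software or file format.

```python
from enum import Enum
from typing import Dict, List


class PatchType(Enum):
    FIX = "FIX"
    NOVEL = "NOVEL"


def next_release_fate(patch_type: PatchType) -> str:
    """What happens to a patch scaffold at the next assembly release."""
    if patch_type is PatchType.FIX:
        return "deprecated"       # FIX patches are deprecated
    return "alternate_locus"      # NOVEL patches become alt loci


def plan_next_release(patches: Dict[str, PatchType]) -> Dict[str, List[str]]:
    """Group patch scaffolds by their fate in the next assembly release."""
    plan: Dict[str, List[str]] = {"deprecated": [], "alternate_locus": []}
    for name, patch_type in sorted(patches.items()):
        plan[next_release_fate(patch_type)].append(name)
    return plan


# Hypothetical patch scaffold names, for illustration only.
patches = {"patch_A": PatchType.FIX, "patch_B": PatchType.NOVEL, "patch_C": PatchType.FIX}
print(plan_next_release(patches))
```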
An example of a GRCh38 fix patch is shown on top in this issue summary from the GRC website, where sequence from a fosmid was used to patch a deleted BAC disrupting representation of the FOXO6 gene. An example of a GRCh38 novel patch is shown on the bottom, where the GRC (in collaboration with the Pharmacogenomics Research Network) added representation for another structural variant of the CYP2D6 locus. The GRC releases patches on a quarterly cycle with the next release planned for the end of March.
With all of the NGS and new genome data, you might think that the GRC is awash in sequences with which to update the assembly. But the reality sometimes feels more like this. Although there is a lot of sequence data available, sequence meeting all 3 of these reference criteria is still limited. Quality is less of an issue today than a couple of years ago. However, more groups doing sequencing and assembly are putting their data on “public” FTP sites without submitting it to an INSDC database. We encourage groups to submit their data so that it can contribute to this valuable public resource. Lastly, the reference assembly is clone based, and all component sequences are representative of a DNA molecule found in an actual individual. As long as the community feels it is important for the reference to represent actual sequences, the ability to phase or resolve haplotypes in newly sequenced genomes, or to generate finished quality sequence from single molecules, will be critical to incorporating new sequence into the assembly.

Editor's Notes