GRC Workshop at Churchill College on Sep 21, 2014. This is Aaron Quinlan's talk on issues with representing variants in the full assembly, with suggestions for VCF modifications for handling variant calls on the alts.
Presentation by Valerie Schneider discussing Genome Reference Consortium (GRC) plans for the mouse and zebrafish reference genome assemblies, presented at the 2016 meeting of the The Allied Genetic Conference (TAGC). Includes description of resources at the National Center for Biotechnology Information (NCBI) for working with reference genome assemblies.
Presentation by Valerie Schneider discussing Genome Reference Consortium (GRC) plans for the mouse and zebrafish reference genome assemblies, presented at the 2016 meeting of the The Allied Genetic Conference (TAGC). Includes description of resources at the National Center for Biotechnology Information (NCBI) for working with reference genome assemblies.
Presentation by Benedict Paten at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on updates to the human reference assembly, GRCh38.
Presentation at IMGC 2019 workshop describing the latest improvements to the mouse reference genome assembly and analyses performed in preparation for the next release of the mouse genome assembly (GRCm39).
Presentation by Tina Graves-Lindsay at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on production of reference grade assemblies for various human populations.
Presentation by Valerie Schneider at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on updates to the human reference assembly, GRCh38.
[2020-09-01] IIBMP2020 Generating annotation texts of HLA sequences with anti...Eli Kaminuma
IIBMP2020 Poster #77 Generating annotation texts of HLA sequences with antigen classes by a T5 (Text-to-Text Transfer Transformer) model using International Nucleotide Sequence Database
Presentation by Benedict Paten at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on updates to the human reference assembly, GRCh38.
Presentation at IMGC 2019 workshop describing the latest improvements to the mouse reference genome assembly and analyses performed in preparation for the next release of the mouse genome assembly (GRCm39).
Presentation by Tina Graves-Lindsay at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on production of reference grade assemblies for various human populations.
Presentation by Valerie Schneider at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on updates to the human reference assembly, GRCh38.
[2020-09-01] IIBMP2020 Generating annotation texts of HLA sequences with anti...Eli Kaminuma
IIBMP2020 Poster #77 Generating annotation texts of HLA sequences with antigen classes by a T5 (Text-to-Text Transfer Transformer) model using International Nucleotide Sequence Database
Using the GRCh38 reference assembly for clinical interpretation in VSClinicalGolden Helix
Although the latest reference genome (build 38) was released in 2009, it has taken quite a while to come into its own as a baseline for the clinical interpretation of variants in human disease. A lot of this was momentum, while some of it was concerns about compatibility with other labs and published literature. Yet the largest hindrance was the lack of support of the bioinformatic tools and requisite databases required to analyze variants. When we released VSClinical, we wanted those concerns to be removed from the choice of what reference genome a lab may choose to use.
In this webcast, we will:
Review the cost and benefits of using the new GRCh38 reference genome in the context of clinical genetics.
Look at the public data resources that have native support, and Golden Helix’s effort to lift over the ones that do not.
Provide examples of situations where changing reference genomes introduces or negates artifacts caused by errors in the reference sequence.
Demonstrate VarSeq’s new ability to lift over to the new reference genome while importing your VCFs into a project.
Go through VSClinical using the new human reference genome, with full annotation support and downstream use of assessment catalogs and writing of reports.
Whether you have already made the switch or considering the possibility, this webcast will provide you with what you need to know about using the new reference genome for clinical genetic testing as well as human disease research using VarSeq. We hope to see you there!
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...Elia Brodsky
This workshop will address critical issues related to Transcriptomics data:
Processing raw Next Generation Sequencing (NGS) data:
1. Next Generation Sequencing data preprocessing:
Trimming technical sequences
Removing PCR duplicates
2. RNA-seq based quantification of expression levels:
Conventional pipelines (looking at known transcripts)
Identification of novel isoforms
Analysis of Expression Data Using Machine Learning:
3. Unsupervised analysis of expression data:
Principal Component Analysis
Clustering
4. Supervised analysis:
Differential expression analysis
Classification, gene signature construction
5. Gene set enrichment analysis
The workshop will include hands-on exercises utilizing public domain datasets:
breast cancer cell lines transcriptomic profiles (https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-10-r110),
patient-derived xenograft (PDX) mouse model of tumor and stroma transcriptomic profiles (http://www.oncotarget.com/index.php?journal=oncotarget&page=article&op=view&path[]=8014&path[]=23533), and
processed data from The Cancer Genome Atlas samples (https://cancergenome.nih.gov/).
Team: The workshops are designed by the researchers at the Tauber Bioinformatics Research Center at University of Haifa, Israel in collaboration with academic centers across the US. Technical support for the workshops is provided by the Pine Biotech team. https://edu.t-bio.info/a-critical-approach-to-transcriptomic-data-analysis/
A Genome Sequence Analysis System Built with HypertableDATAVERSITY
Deep genome sequencing has revolutionized the fields of biology and medicine. Since January 2008, the capacity to generate sequence data has increased exponentially, far outpacing Moore's Law. The emergence of scalable NoSQL database technologies has made the analysis of this vast amount of sequence data not only feasible, but cost effective.
The University of California at San Francisco UCSF-Abbott Viral Detection and Discovery Center, led by director Charles Chiu, MD, PhD, Taylor Sittler, MD and the Hypertable development team have embarked upon a project to build a scalable software platform to facilitate deep sequencing analysis in diagnostic microbiology, transcriptomic analysis, and clinical / environmental metagenomics, areas for which existing commercial and academic solutions are sorely lacking. Doug Judd, the original creator of Hypertable, will present an overview of this genome sequencing analysis system. The presentation will cover the following topics:
Rationale for choosing NoSQL
Schema design
Sources and description of input data
Algorithms for generating and querying lookup tables
Table sizes and compression ratios
Lessons learned during system deployment
The Clinical Significance of Transcript Alignment DiscrepanciesReece Hart
Gene transcripts are the lens through which we understand variants that are identified by genome sequencing, reported in scientific literature, and communicated on clinical reports. An accurate, shared representation of transcripts is essential to communicating variants reliably. This talk presents observations of significant discrepancies between sources of transcripts that will lead to discrepancies in the clinical interpretation of variants, and tools that we have released to contend with these complexities.
Presentation carried out by Sergi Beltran Agulló, from the CNAG, at the course: Identification and analysis of sequence variants in sequencing projects: fundamentals and tools .
Approach for limited cell ChIP-Seq on a semiconductor-based sequencing platformThermo Fisher Scientific
Dendritic cell (DC) lineages coordinate immune system activity
through functional specialization.
• Irf4, a transcription factor(TF), is required for CD11b+ DC
lineage development from bone marrow stem cells and has
been implicated in multiple inflammatory diseases, eg. asthma.
• The epigenetic consequences of immune specialization in
CD11b+ DCs and relation to inflammatory diseases remain
largely unexplored partly due to the difficulty of using highly
purified, and typically, limited populations of cells in ChIP-seq
(chromatin immunoprecipitation then sequencing) assays.
• A robust, multiplexed ChIP-seq protocol – using an input
control, TF (CTCF) and histone modification marks (H3K9me3-
methylation, H3K27ac-acetylation) - was developed using
limited amounts of K562 cells, for the Ion ProtonTM system.
• Peak-calling analysis was performed using using MACS2.
• Significant data correlations were observed with ENCODE.
• The Ion ProtonTM results are based on chromatin derived from
1 million(M) cells, making it viable for generating data from a
limited number of primary cells. This is in contrast to the 10M
cells recommended by ENCODE.
• The developed methodology was used to compare Irf4 genomic
binding sites generated from flow-sorted populations of 1, 3, 5,
and 20M CD11b+ lineage murine DCs.
• Comparable Irf4 ChIP-seq results were obtained from 5M
versus 20M cells, indicating that as low as 5M flow-sorted cells
can be used to acquire high quality(FDR: 10-19) data.
• We identified genomic Irf4 binding sites proximal to genes,
whose activity is consistent with CD11b+ DC lineage activity
and/or known to contribute to inflammatory disease.
• We examined Irf4 functional regulation of the identified gene
targets via RNA-seq analysis with CD11b+ DCs and a related
lineage, CD103+ DCs. Integrating expression analysis with
ChIP-seq indicates a unique CD11b+ DC gene expression
program concordant with Irf4 loci association in comparison to
CD103+ DC (data not shown).
Building Genomic Data Processing and Machine Learning Workflows Using Apache ...Databricks
Epinomics is advancing epigenetic research to drive personalized medicine, using epigenomic data analysis. Their goal is to provide an analysis resource to the community that will promote high-quality data and replicable and interpretable results. They work with academic and commercial users to ingest and analyze their genomic sequencing data and metadata. They extract epigenetic features from the sequenced genome, called “chromatin accessibility”, which are indicative of instrumental epigenetic changes responsible for differential gene expression and disease development.
Epinomics has built an Apache Spark-based pipeline that retrieves chromatin accessibility data from the epigenome, uses GraphX to find overlapping accessibility atlas and then clusters the data and runs machine learning algorithms. This session will provide a primer on epigenomics, details about Epinomics’ Spark-based data pipeline focusing on parallel bioinformatic analysis, and how they use machine learning models to build the epigenomic landscape and accelerate the field of personalized immunotherapy. use GraphX to find overlapping accessibility atlas and then cluster the data and run machine learning algorithms.
In this talk we will provide a primer on epigenomics, details about our Spark based data pipeline focusing on parallel bioinformatic analysis and how we use machine learning models to build the epigenomic landscape and accelerate the field of personalized immunotherapy.
Presentation at 2019 ASHG GRC/GIAB workshop describing history of the human reference genome, current curation efforts and future plans, and the relationship of all 3 to efforts to produce a human pan-genome.
Platform presentation at ASHG 2019 describing recent updates to the human reference genome assembly (GRCh38) and future plans with relevance to pan-genomic representations.
Presentation at 2019 ASHG GRC/GIAB workshop describing goals and progress of the telomere-to-telomere consortium to generate a genome assembly that provides representation of all sequences, including repetitive regions.
Presentation at 2019 ASHG GRC/GIAB workshop describing features and recent updates to the vg toolkit, including examples of comparisons to other methods used for alignment and variant detection.
Presentation at 2019 ASHG GRC/GIAB workshop describing recent updates to the MANE project, which aims to provide matched annotation from RefSeq and GENCODE.
Presentation at PanGenomics in the Cloud Hackathon, run by NCBI at UCSC (https://ncbiinsights.ncbi.nlm.nih.gov/2019/02/06/pangenomics-cloud-hackathon-march-2019/). Presents points to consider about the adoption of a pangenome reference, emphasizing aspects for long-term data management and wide-spread adoption.
Presentation by Fritz Sedlazeck at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on characterizing human structural variation.
Presentation by Justin Zook at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on benchmarks for indels and structural variants.
Presentation by Karen Miga at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on centromere assemblies.
What is greenhouse gasses and how many gasses are there to affect the Earth.moosaasad1975
What are greenhouse gasses how they affect the earth and its environment what is the future of the environment and earth how the weather and the climate effects.
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Sérgio Sacani
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes
on Io’s surface have been monitored from both spacecraft and ground-based telescopes.
Here, we present the highest spatial resolution images of Io ever obtained from a groundbased telescope. These images, acquired by the SHARK-VIS instrument on the Large
Binocular Telescope, show evidence of a major resurfacing event on Io’s trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images
show that a plume deposit from a powerful eruption at Pillan Patera has covered part
of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high resolution imaging of Io’s surface using adaptive
optics at visible wavelengths.
A brief information about the SCOP protein database used in bioinformatics.
The Structural Classification of Proteins (SCOP) database is a comprehensive and authoritative resource for the structural and evolutionary relationships of proteins. It provides a detailed and curated classification of protein structures, grouping them into families, superfamilies, and folds based on their structural and sequence similarities.
Nutraceutical market, scope and growth: Herbal drug technologyLokesh Patil
As consumer awareness of health and wellness rises, the nutraceutical market—which includes goods like functional meals, drinks, and dietary supplements that provide health advantages beyond basic nutrition—is growing significantly. As healthcare expenses rise, the population ages, and people want natural and preventative health solutions more and more, this industry is increasing quickly. Further driving market expansion are product formulation innovations and the use of cutting-edge technology for customized nutrition. With its worldwide reach, the nutraceutical industry is expected to keep growing and provide significant chances for research and investment in a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
Multi-source connectivity as the driver of solar wind variability in the heli...Sérgio Sacani
The ambient solar wind that flls the heliosphere originates from multiple
sources in the solar corona and is highly structured. It is often described
as high-speed, relatively homogeneous, plasma streams from coronal
holes and slow-speed, highly variable, streams whose source regions are
under debate. A key goal of ESA/NASA’s Solar Orbiter mission is to identify
solar wind sources and understand what drives the complexity seen in the
heliosphere. By combining magnetic feld modelling and spectroscopic
techniques with high-resolution observations and measurements, we show
that the solar wind variability detected in situ by Solar Orbiter in March
2022 is driven by spatio-temporal changes in the magnetic connectivity to
multiple sources in the solar atmosphere. The magnetic feld footpoints
connected to the spacecraft moved from the boundaries of a coronal hole
to one active region (12961) and then across to another region (12957). This
is refected in the in situ measurements, which show the transition from fast
to highly Alfvénic then to slow solar wind that is disrupted by the arrival of
a coronal mass ejection. Our results describe solar wind variability at 0.5 au
but are applicable to near-Earth observatories.
Cancer cell metabolism: special Reference to Lactate PathwayAADYARAJPANDEY1
Normal Cell Metabolism:
Cellular respiration describes the series of steps that cells use to break down sugar and other chemicals to get the energy we need to function.
Energy is stored in the bonds of glucose and when glucose is broken down, much of that energy is released.
Cell utilize energy in the form of ATP.
The first step of respiration is called glycolysis. In a series of steps, glycolysis breaks glucose into two smaller molecules - a chemical called pyruvate. A small amount of ATP is formed during this process.
Most healthy cells continue the breakdown in a second process, called the Kreb's cycle. The Kreb's cycle allows cells to “burn” the pyruvates made in glycolysis to get more ATP.
The last step in the breakdown of glucose is called oxidative phosphorylation (Ox-Phos).
It takes place in specialized cell structures called mitochondria. This process produces a large amount of ATP. Importantly, cells need oxygen to complete oxidative phosphorylation.
If a cell completes only glycolysis, only 2 molecules of ATP are made per glucose. However, if the cell completes the entire respiration process (glycolysis - Kreb's - oxidative phosphorylation), about 36 molecules of ATP are created, giving it much more energy to use.
IN CANCER CELL:
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
introduction to WARBERG PHENOMENA:
WARBURG EFFECT Usually, cancer cells are highly glycolytic (glucose addiction) and take up more glucose than do normal cells from outside.
Otto Heinrich Warburg (; 8 October 1883 – 1 August 1970) In 1931 was awarded the Nobel Prize in Physiology for his "discovery of the nature and mode of action of the respiratory enzyme.
WARNBURG EFFECT : cancer cells under aerobic (well-oxygenated) conditions to metabolize glucose to lactate (aerobic glycolysis) is known as the Warburg effect. Warburg made the observation that tumor slices consume glucose and secrete lactate at a higher rate than normal tissues.
Slide 1: Title Slide
Extrachromosomal Inheritance
Slide 2: Introduction to Extrachromosomal Inheritance
Definition: Extrachromosomal inheritance refers to the transmission of genetic material that is not found within the nucleus.
Key Components: Involves genes located in mitochondria, chloroplasts, and plasmids.
Slide 3: Mitochondrial Inheritance
Mitochondria: Organelles responsible for energy production.
Mitochondrial DNA (mtDNA): Circular DNA molecule found in mitochondria.
Inheritance Pattern: Maternally inherited, meaning it is passed from mothers to all their offspring.
Diseases: Examples include Leber’s hereditary optic neuropathy (LHON) and mitochondrial myopathy.
Slide 4: Chloroplast Inheritance
Chloroplasts: Organelles responsible for photosynthesis in plants.
Chloroplast DNA (cpDNA): Circular DNA molecule found in chloroplasts.
Inheritance Pattern: Often maternally inherited in most plants, but can vary in some species.
Examples: Variegation in plants, where leaf color patterns are determined by chloroplast DNA.
Slide 5: Plasmid Inheritance
Plasmids: Small, circular DNA molecules found in bacteria and some eukaryotes.
Features: Can carry antibiotic resistance genes and can be transferred between cells through processes like conjugation.
Significance: Important in biotechnology for gene cloning and genetic engineering.
Slide 6: Mechanisms of Extrachromosomal Inheritance
Non-Mendelian Patterns: Do not follow Mendel’s laws of inheritance.
Cytoplasmic Segregation: During cell division, organelles like mitochondria and chloroplasts are randomly distributed to daughter cells.
Heteroplasmy: Presence of more than one type of organellar genome within a cell, leading to variation in expression.
Slide 7: Examples of Extrachromosomal Inheritance
Four O’clock Plant (Mirabilis jalapa): Shows variegated leaves due to different cpDNA in leaf cells.
Petite Mutants in Yeast: Result from mutations in mitochondrial DNA affecting respiration.
Slide 8: Importance of Extrachromosomal Inheritance
Evolution: Provides insight into the evolution of eukaryotic cells.
Medicine: Understanding mitochondrial inheritance helps in diagnosing and treating mitochondrial diseases.
Agriculture: Chloroplast inheritance can be used in plant breeding and genetic modification.
Slide 9: Recent Research and Advances
Gene Editing: Techniques like CRISPR-Cas9 are being used to edit mitochondrial and chloroplast DNA.
Therapies: Development of mitochondrial replacement therapy (MRT) for preventing mitochondrial diseases.
Slide 10: Conclusion
Summary: Extrachromosomal inheritance involves the transmission of genetic material outside the nucleus and plays a crucial role in genetics, medicine, and biotechnology.
Future Directions: Continued research and technological advancements hold promise for new treatments and applications.
Slide 11: Questions and Discussion
Invite Audience: Open the floor for any questions or further discussion on the topic.
Richard's entangled aventures in wonderlandRichard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich on features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and quality to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization. To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Variant Calling II
1. Variant calling
while accounting for
alternate haplotypes
Aaron Quinlan
University of Virginia
!
quinlanlab.org
Genome Reference Consortium Workshop
Sept 21, 2014
2. Motivation: HG has long had alt loci, but we don’t handle them properly
Deanna Church and Brad Holmes
Regions w/ alternate loci
(261 in total)
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/
3. A case study: The MAPT locus (17q21.31)
doi:10.1038/nrneurol.2012.169
Disease phenotypes associated with each haplotype
H2 positively selected in Europeans, carriers predisposed to 17q21.31
microdeletion syndrome via NAHR
H1 associated with Alzheimer and Parkinson diseases
9. How do we support
variant detection with
alternate alleles using the
VCF format?
10. A G A C A C T A G T C T
H1 (chr17) H2 (GL000258v2_alt)
NA10847 is heterozygous H1/H2
11. A G A C A C T A G T C T
H1 (chr17) H2 (GL000258v2_alt)
A G A C A C
T A G T C T
NA10847 (het. H1/H2)
NA10847 is heterozygous H1/H2
12. A G A C A C T A G T C T
H1 (chr17) H2 (GL000258v2_alt)
A G A C A C
T A G T C T
NA10847 (het. H1/H2)
NA10847 is heterozygous H1/H2
A G A C A C
T A G T C T
A G A C A C
T A G T C T
Requires that we align
reads to both loci*
* We need to be able to distinguish multiple alignments arising from ALT loci versus multiple
mappings arising from segdups, repetitive elements. Else, MAPQs penalized and/or alignments
not reported, depending on the behavior of the aligner.
!
Heng Li has started a discussion about how best to make BWA alt-aware
Colin Hercus has started a discussion about how best to make Novoalign alt-aware
13. chr17 A T 0/1
chr17 G A 0/1
chr17 A G 0/1
chr17 C T 0/1
chr17 A C 0/1
chr17 C T 0/1
GL000258v2_alt T A 0/1
GL000258v2_alt A G 0/1
GL000258v2_alt G A 0/1
GL000258v2_alt T C 0/1
GL000258v2_alt C A 0/1
GL000258v2_alt T C 0/1
VCF entries for chr17 (H1) VCF entries for alt locus (H2)
Variant calling
NA10847 is heterozygous H1/H2
14. chr17 A T 0/1
chr17 G A 0/1
chr17 A G 0/1
chr17 C T 0/1
chr17 A C 0/1
chr17 C T 0/1
GL000258v2_alt T A 0/1
GL000258v2_alt A G 0/1
GL000258v2_alt G A 0/1
GL000258v2_alt T C 0/1
GL000258v2_alt C A 0/1
GL000258v2_alt T C 0/1
VCF entries for chr17 (H1) VCF entries for alt locus (H2)
Variant calling
NA10847 is heterozygous H1/H2
Note: Allelic
relationship is
not reflected in
“raw” VCF
15. Proposal: Develop a downstream
tool that leverages informative SNPs
to distinguish and assign haplotypes
at alt loci via a standard VCF file.
Intermediate solution until variant callers handle this
complexity natively
17. Overview
“Raw”
V/BCF file
“database” of
alt loci &
inform.
SNPs
alt_locus main_locus
GL000258.2 (H2) chr17:45309498-46836265 (H1)
infor. markers GRCh38 position H1 H2
rs241039 45637307 A T
rs2049515 45684490 C T
. . .
18. Overview
“Raw”
V/BCF file
“database” of
alt loci &
inform.
SNPs
New tool(s)
alt_locus main_locus
GL000258.2 (H2) chr17:45309498-46836265 (H1)
infor. markers GRCh38 position H1 H2
rs241039 45637307 A T
rs2049515 45684490 C T
. . .
19. Overview
“Raw”
V/BCF file
“database” of
alt loci &
inform.
SNPs
New tool(s)
Updated
V/BCF file
Haplotype
predictions
per indiv.
alt_locus main_locus
GL000258.2 (H2) chr17:45309498-46836265 (H1)
infor. markers GRCh38 position H1 H2
rs241039 45637307 A T
rs2049515 45684490 C T
. . .
20. Augment VCF with assembly information
##seq-info=<name=chr17, id=CM000679.2>
!
##region-info=<name=MAPT, id=GL000258.2,
assoc_id=CM000679.2, reg=45309498-46836265>
Based on ideas from Deanna Church
21. Introduce new (reserved) VCF INFO tags
##INFO=<ID=ALTLOCS,
Number=.,
Type=String,
Description=“A list of the alternate
loci in the reference
genome that are
associated with this
locus”>
##INFO=<ID=ALTHAPS,
Number=.,
Type=String,
Description=“A list of the known
haplotypes that are
associated with this
locus”>
##FORMAT=<ID=HT,Number=1,Type=String,Description=“Haplotype
combination based on ALTHAPS">
22. (Draft) example VCF output, post “correction”
##seq-info=<name=chr17, id=CM000679.2>
##region-info=<name=MAPT,id=GL000258.2,assoc_id=CM000679.2,reg=45309498-46836265>
##INFO=<ID=ALTLOC,Number=.,Type=String,Description=“A list of the alternate loci is
the reference genome that are associated with this locus”>
##INFO=<ID=ALTHAP,Number=.,Type=String,Description=“A list of the known haplotypes
that are associated with this locus”>
!
#CHROM POS REF ALT INFO FORMAT NA10847
chr17 111 A T ALTLOCS=GL000258.2;ALTHAPs=H1,H2 GT:HT 0/1:0/1
chr17 222 G A ALTLOCS=GL000258.2;ALTHAPs=H1,H2 GT:HT 0/1:0/1
chr17 333 A G ALTLOCS=GL000258.2;ALTHAPS=H1,H2 GT:HT 0/1:0/1
. . .
GL000258.2 111 A T ALTLOCS=chr17;ALTHAPs=H1,H2 GT:HT 0/1:0/1
GL000258.2 222 G A ALTLOCS=chr17;ALTHAPs=H1,H2 GT:HT 0/1:0/1
GL000258.2 333 A G ALTLOCS=chr17;ALTHAPS=H1,H2 GT:HT 0/1:0/1
. . .
23. (Draft) example VCF output, post “correction”
##seq-info=<name=chr17, id=CM000679.2>
##region-info=<name=MAPT,id=GL000258.2,assoc_id=CM000679.2,reg=45309498-46836265>
##INFO=<ID=ALTLOC,Number=.,Type=String,Description=“A list of the alternate loci is
the reference genome that are associated with this locus”>
##INFO=<ID=ALTHAP,Number=.,Type=String,Description=“A list of the known haplotypes
that are associated with this locus”>
!
#CHROM POS REF ALT INFO FORMAT NA10847
chr17 111 A T ALTLOCS=GL000258.2;ALTHAPs=H1,H2 GT:HT 0/1:0/1
chr17 222 G A ALTLOCS=GL000258.2;ALTHAPs=H1,H2 GT:HT 0/1:0/1
chr17 333 A G ALTLOCS=GL000258.2;ALTHAPS=H1,H2 GT:HT 0/1:0/1
. . .
GL000258.2 111 A T ALTLOCS=chr17;ALTHAPs=H1,H2 GT:HT 0/1:0/1
GL000258.2 222 G A ALTLOCS=chr17;ALTHAPs=H1,H2 GT:HT 0/1:0/1
GL000258.2 333 A G ALTLOCS=chr17;ALTHAPS=H1,H2 GT:HT 0/1:0/1
. . .
Solely two different haplotypes is the base case.
24. For example NA10847 is actually H1/H2D
H2D derived from
ancestral H2.
Markers that distinguish
H2D from H2
25. chr17 A T 0/1
chr17 G A 0/1
chr17 A G 0/1
chr17 C T 0/1
chr17 A C 0/1
chr17 C T 0/1
chr17 C T 0/1
chr17 C T 0/1
chr17 G A 0/1
chr17 A G 0/1
chr17 C T 0/1
GL000258v2_alt T A 0/1
GL000258v2_alt A G 0/1
GL000258v2_alt G A 0/1
GL000258v2_alt T C 0/1
GL000258v2_alt C A 0/1
GL000258v2_alt T C 0/1
GL000258v2_alt C T 0/1
GL000258v2_alt C T 0/1
GL000258v2_alt G A 0/1
GL000258v2_alt A G 0/1
GL000258v2_alt C T 0/1
VCF entries for chr17 VCF entries for alt locus
Interpretation is much harder w/ many haplotypes
H1 H2 H2D
26. (Draft) example VCF output, post “correction”
##seq-info=<name=chr17, id=CM000679.2>
##region-info=<name=MAPT,id=GL000258.2,assoc_id=CM000679.2,reg=45309498-46836265>
##INFO=<ID=ALTLOC,Number=.,Type=String,Description=“A list of the alternate loci is the reference genome that
are associated with this locus”>
##INFO=<ID=ALTHAP,Number=.,Type=String,Description=“A list of the known haplotypes that are associated with
this locus”>
!
#CHROM POS REF ALT INFO FORMAT NA10847 NA12878 NA21599
chr17 111 A T ALTLOCS=GL000258.2;ALTHAPs=H1,H2,H2D GT:HT 0/1:0/2 0/0:0/0 1/1:2/2
chr17 222 G A ALTLOCS=GL000258.2;ALTHAPs=H1,H2,H2D GT:HT 0/1:0/2 0/0:0/0 1/1:2/2
chr17 333 A G ALTLOCS=GL000258.2;ALTHAPS=H1,H2,H2D GT:HT 0/1:0/2 0/0:0/0 1/1:2/2
. . .
GL000258.2 111 A T ALTLOCS=chr17;ALTHAPs=H1,H2,H2D GT:HT 0/1:0/2 0/0:0/0 1/1:2/2
GL000258.2 222 G A ALTLOCS=chr17;ALTHAPs=H1,H2,H2D GT:HT 0/1:0/2 0/0:0/0 1/1:2/2
GL000258.2 333 A G ALTLOCS=chr17;ALTHAPS=H1,H2,H2D GT:HT 0/1:0/2 0/0:0/0 1/1:2/2
. . .
27. (Draft) example VCF output, post “correction”
##seq-info=<name=chr17, id=CM000679.2>
##region-info=<name=MAPT,id=GL000258.2,assoc_id=CM000679.2,reg=45309498-46836265>
##INFO=<ID=ALTLOC,Number=.,Type=String,Description=“A list of the alternate loci is the reference genome that
are associated with this locus”>
##INFO=<ID=ALTHAP,Number=.,Type=String,Description=“A list of the known haplotypes that are associated with
this locus”>
!
#CHROM POS REF ALT INFO FORMAT NA10847 NA12878 NA21599
chr17 111 A T ALTLOCS=GL000258.2;ALTHAPs=H1,H2,H2D GT:HT 0/1:0/2 0/0:0/0 1/1:2/2
chr17 222 G A ALTLOCS=GL000258.2;ALTHAPs=H1,H2,H2D GT:HT 0/1:0/2 0/0:0/0 1/1:2/2
chr17 333 A G ALTLOCS=GL000258.2;ALTHAPS=H1,H2,H2D GT:HT 0/1:0/2 0/0:0/0 1/1:2/2
. . .
GL000258.2 111 A T ALTLOCS=chr17;ALTHAPs=H1,H2,H2D GT:HT 0/1:0/2 0/0:0/0 1/1:2/2
GL000258.2 222 G A ALTLOCS=chr17;ALTHAPs=H1,H2,H2D GT:HT 0/1:0/2 0/0:0/0 1/1:2/2
GL000258.2 333 A G ALTLOCS=chr17;ALTHAPS=H1,H2,H2D GT:HT 0/1:0/2 0/0:0/0 1/1:2/2
. . .
NA10847 chr17 45309498 46836265 H1 H2D rs241039,rs434428,…
NA12878 chr17 45309498 46836265 H1 H1 rs241039,rs434428,…
NA21599 chr17 45309498 46836265 H2D H2D rs241039,rs434428,…
sample chrom start end hap1 hap2 markers
28. (Draft) example VCF output, post “correction”
##seq-info=<name=chr17, id=CM000679.2>
##region-info=<name=MAPT,id=GL000258.2,assoc_id=CM000679.2,reg=45309498-46836265>
##INFO=<ID=ALTLOC,Number=.,Type=String,Description=“A list of the alternate loci is the reference genome that
are associated with this locus”>
##INFO=<ID=ALTHAP,Number=.,Type=String,Description=“A list of the known haplotypes that are associated with
this locus”>
!
#CHROM POS REF ALT INFO FORMAT NA10847 NA12878 NA21599
chr17 111 A T ALTLOCS=GL000258.2;ALTHAPs=H1,H2,H2D GT:HT 0/1:0/2 0/0:0/0 1/1:2/2
chr17 222 G A ALTLOCS=GL000258.2;ALTHAPs=H1,H2,H2D GT:HT 0/1:0/2 0/0:0/0 1/1:2/2
chr17 333 A G ALTLOCS=GL000258.2;ALTHAPS=H1,H2,H2D GT:HT 0/1:0/2 0/0:0/0 1/1:2/2
. . .
GL000258.2 111 A T ALTLOCS=chr17;ALTHAPs=H1,H2,H2D GT:HT 0/1:0/2 0/0:0/0 1/1:2/2
GL000258.2 222 G A ALTLOCS=chr17;ALTHAPs=H1,H2,H2D GT:HT 0/1:0/2 0/0:0/0 1/1:2/2
GL000258.2 333 A G ALTLOCS=chr17;ALTHAPS=H1,H2,H2D GT:HT 0/1:0/2 0/0:0/0 1/1:2/2
. . .
NA10847 chr17 45309498 46836265 H1 H2D rs241039,rs434428,…
NA12878 chr17 45309498 46836265 H1 H1 rs241039,rs434428,…
NA21599 chr17 45309498 46836265 H2D H2D rs241039,rs434428,…
sample chrom start end hap1 hap2 markers
29. The good and the bad
Good
• No burden on existing
variant callers to adapt to
calling w.r.t. alt loci
• Tool for updating VCF can
be updated and improved in
parallel with variant callers.
Bad
• One more step / file in the
variant interpretation pipeline
• This strategy is only
applicable to cases where
informative markers exist.
Use CNVs in WGS: e.g.,
KANSL1 partial duplications
to distinguish MAPT alt loci
30. The good and the bad
Good
• No burden on existing
variant callers to adapt to
calling w.r.t. alt loci
• Tool for updating VCF can
be updated and improved in
parallel with variant callers.
Bad
• One more step / file in the
variant interpretation pipeline
• This strategy is only
applicable to cases where
informative markers exist.
Use CNVs in WGS: e.g.,
KANSL1 partial duplications
to distinguish MAPT alt loci
To Do / Discuss
• How best to improve alignment strategies to facilitate variant
and haplotype detection?
• How to best represent the resolved alternate loci in VCF format?
31. Many thanks for helpful discussions with:
Deanna Church
Brad Holmes
Karyn Meltz-Steinberg
Heng Li
Colin Hercus