GRCh38 is a new version of the human reference genome that features several improvements over the previous version, GRCh37. It includes 178 regions comprising 3.15% of the genome sequence that have been updated based on new data. GRCh38 also includes 261 alternate loci comprising 3.6 Mb of novel sequence not present in GRCh37. Model centromeres have been added to chromosomes for the first time, representing heterochromatic regions totaling over 60 Mb. In addition, over 800 kb of novel sequence has been added through 73 patches of previously unrepresented DNA.
7. GRCh38 Sequence Updates
Pile-Up Analysis: “Never Seen” Mismatched Bases Originating from RP11 Components
n=10489
79% of these bases are heterozygous in RP11 WGS
14. GRCh38 Path Updates
HYDIN: chr16 (16q22.2)
Doggett et al., 2006
HYDIN2: chr1 (1q21.1)
Missing in NCBI35/NCBI36
Unlocalized in GRCh37
Placed in GRCh38
Alignment of HYDIN2 Genomic, 300 Kb, 99.4% ID
Alignment of HYDIN CHM1_1.0, >99.9% ID
16. GRCh38 Alt Loci
Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
New Assembly model: represent both haplotypes
21. GRCh38: Alt Loci
Masks and alt-aware aligners reduce the incidence of ambiguous alignments observed when aligning reads to the full assembly
Mask1: mask chr for fix patches, scaffold for novel/alts.
Mask2: mask only on scaffolds
23. GRCh38 Credits
Collaborators
GRC SAB
NCBI RefSeq and gpipe annotation team
Havana annotators
Karen Miga
David Schwartz
Steve Goldstein
Mario Caceres
Giulio Genovese
Jeff Kidd
Peter Lansdorp
Mark Hills
David Page
Jim Knight
Stephan Schuster
1000 Genomes
Rick Myers
Granger Sutton
Evan Eichler
Jim Kent
Roderic Guigo
Carol Bult
Derek Stemple
Matthew Hurles
Richard Gibbs
Editor's Notes
The Genome Reference Consortium released the latest human reference assembly, GRCh38, on Dec. 24. While this updated assembly has many improvements, and some groups have been eagerly awaiting its release, the GRC is well aware that many users may feel the same way about GRCh38 as we all feel about the gift of new socks. Today I’m going to tell you about some of the new features in the assembly and how these updates make GRCh38 a better substrate for analyses. In the end, I’d like to convince you that whether GRCh38 was on your wish list or not, like a new pair of socks, it’s in better shape than what’s sitting in your wardrobe and ultimately, you’ll be able to put it to good use.
GRCh37 was released in 2009 and used a new assembly model in which alternate loci scaffolds were included to provide additional sequence representations for variant genomic regions. GRCh37 had 3 such regions and 9 alternate loci scaffolds. Since then, the GRC has continued to update the assembly. Many of these updates were released as non-coordinate-changing patch scaffolds. The patches came in two flavors:
- FIX patches corrected problems in existing assembly sequence.
- NOVEL patches added new alternate sequence representations.
As shown in the box, nearly 200 regions of GRCh37 were associated with a patch, and these updates added almost 8 Mb of novel sequence to the reference assembly. Furthermore, not every assembly update was released as a patch. As this pie chart shows, the GRC resolved just over 1000 issues for GRCh38. As a result, the GRC and members of its SAB agreed that it was time for a major assembly release. So today we have GRCh38, which now has 178 regions associated with 261 alternate loci scaffolds. There is more than 3 Mb of sequence whose only representation in the assembly occurs in the alternate loci.
I’d now like to introduce GRCh38 with some basic assembly statistics. These and additional stats for the GRCh38 assembly are available on the GRC website. One measure of an assembly’s continuity is scaffold N50. You can see here that scaffold N50s increased for almost every chromosome in GRCh38, indicating the reference assembly is more contiguous than ever.
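For readers less familiar with the statistic, here is a minimal Python sketch of one common way to compute N50: sort the scaffolds from longest to shortest and report the length at which the running total first reaches half of the summed length. The function name and the scaffold lengths are invented for illustration; they are not GRCh38 values and this is not the GRC's own tooling.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the total length.
    """
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare doubled running total to avoid fractional arithmetic.
        if 2 * running >= total:
            return length


# Hypothetical scaffold lengths (arbitrary units), for illustration only.
scaffold_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(scaffold_lengths))
```

Closing a gap merges two scaffolds into one longer scaffold, which is why gap closure tends to raise N50.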
We can also compare GRCh38 to GRCh37, using a common annotation input set. There was a 5% increase in the number of aligned genes and a 3% increase in the number of aligned protein coding transcripts. There was also a decrease in both the numbers of annotated partial CDS and split genes (genes that span gaps). An example of one such improvement is shown here. In blue, you can see the tiling path in GRCh37, where there is a gap. The TWIST2 gene spans this gap. In GRCh38, the gap has been closed by the addition of new sequence and there is complete representation for the gene. In this example, the added sequence was RP11 WGS, provided by Jim Knight, who has been working with Stephan Schuster and others on an RP11 WGS assembly (poster). The GRC used WGS sequence from this and several other WGS assemblies, including HuRef, CHM1_1.1 and the NA12878 ALLPATHS, to extend into or span gaps when clone-based sequence could not be found.
One of the updates made in the assembly was the correction of erroneous bases. The human genome is approximately 2.85 billion bases and the finished human reference assembly is accurate to an error rate of 1 per 100,000 bases. While this represents the highest quality mammalian genome assembly in existence today, it still means that approximately 28,000 bases are incorrect. The GRC made the correction of erroneous bases a priority for GRCh38. This slide shows the bases whose updates were considered by the GRC:
- The largest set was ~15K SNVs with MAF=0 in the 1000G phase 1 analysis.
- 1000G also identified ~2.5K indels with MAF=0. These two sets represented bases that were asserted to be incorrect in the reference assembly, as they were never seen in 1000G.
- An additional 1413 bases with MAF<5% (but >0%) that overlap pseudogenes, processed transcripts or polymorphic pseudogenes were also considered.
- So were ~200 base update requests from annotators and clinical labs.
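The error estimate quoted in these notes is simple arithmetic, and a short Python sketch makes it explicit. The function name is made up for illustration; the genome size and error rate are the figures given in the notes.

```python
def expected_errors(genome_size, bases_per_error):
    """Expected number of incorrect bases at a rate of 1 error per `bases_per_error` bases."""
    return genome_size / bases_per_error


GENOME_SIZE = 2_850_000_000   # ~2.85 billion bases, as stated in the notes
BASES_PER_ERROR = 100_000     # 1 error per 100,000 bases, as stated in the notes

print(expected_errors(GENOME_SIZE, BASES_PER_ERROR))
```

The result is 28,500 bases, in line with the roughly 28,000 quoted in the notes.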
Before attempting any of the updates, the GRC did some analysis to determine whether the bases with MAF=0 were sequencing errors or unrecognized variants. To do this, we performed a read pile-up analysis for a subset of these bases for which we had WGS data from the same genome as the reference assembly sequence. These were bases in RP11 BAC clones, which make up 70% of the reference assembly. The RP11 WGS sequence used in this analysis was generated at WashU. The first graph shows the results of the pile-up analysis for the SNVs (X axis is chromosomes):
- Purple: proportion of “never seen” bases that are heterozygous in RP11 (hetalt: not errors)
- Red: proportion of “never seen” bases that are not seen in RP11 (hmalt: genuine errors)
Across all chromosomes, 79% of “never seen” SNVs are heterozygous in RP11 WGS, indicative of unrecognized variation rather than sequencing error. The GRC did not update the heterozygous RP11 bases.
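The hetalt/hmalt distinction above can be illustrated with a small Python sketch that classifies a site from its read pile-up. This is only an illustration of the idea, not the GRC's actual pipeline: the function name, the depth and allele-fraction thresholds, and the extra "hmref" and "low_coverage" labels are all assumptions introduced here.

```python
def classify_site(ref_reads, alt_reads, min_depth=10, min_fraction=0.2):
    """Classify a reference base from read counts in same-genome WGS data.

    ref_reads: reads supporting the base present in the reference assembly
    alt_reads: reads supporting the alternate base
    Thresholds are hypothetical and chosen only for illustration.
    """
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return "low_coverage"      # too few reads to make a call
    ref_fraction = ref_reads / depth
    if ref_fraction < min_fraction:
        return "hmalt"             # reference base essentially unseen: likely a genuine error
    if ref_fraction > 1 - min_fraction:
        return "hmref"             # only the reference base is seen
    return "hetalt"                # both bases seen: unrecognized variation, not an error


print(classify_site(14, 16))   # both alleles well supported
print(classify_site(0, 30))    # reference base never observed
```

Under this scheme, only sites called "hmalt" would be candidates for correction, which mirrors the decision in the notes not to update the heterozygous RP11 bases.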
Ultimately, the GRC attempted to update 9359 bases. Of these, we succeeded in updating 8128 sites (86.8%) with mini-contigs we built from WGS reads from 1000G samples or the RP11 genome. The reads were assembled into the mini-contigs with cortex_con; the mini-contigs differ from the reference only at the selected base. These were all submitted to GenBank. The Ensembl VEP found 8188 variants associated with the sites updated by mini-contigs. Most updates are not in coding sequence. Among those variants with coding consequences, most are missense or synonymous, consistent with most of the updates being SNVs. Consequences of note include:
- 15 genes that had an internal stop codon in GRCh37 are now coding
- 78 genes had a frameshift relative to GRCh37 that restored gene function
- 2 genes that were coding in GRCh37 are now non-coding, but do represent the more common allele (CASP12/PRM3)
The first new feature of GRCh38 I want to mention is the centromeres. Until now, centromeres have been represented in the reference assembly by very large gaps. This is unfortunate, because centromeres play important roles in biology. Contrary to popular belief, centromeres aren’t difficult to sequence. In fact, there are large datasets of centromere sequence out there that are just waiting for a reference so that they can be analyzed. The challenge has been their assembly, which is complicated by their highly repetitive nature. As illustrated here, centromeres are composed largely of tandemly repeated alpha-satellite sequences that exhibit a wide range of variation. These short repeats are organized into longer higher order arrays that are highly identical. Because the centromeres are so long, they are difficult to assemble with even the longest read technology.
Centromeric sequence assembly is further complicated by the fact that these higher order arrays can vary between individuals and vary between homologous chromosomes in the same individual.
The GRC was fortunate to be contacted by Karen Miga, a postdoc in Jim Kent’s lab, who was developing an approach for generating modeled centromere sequences. All of the work I’m going to talk about was done by Karen and will soon be published in Genome Research. In short, Karen created a database of centromeric WGS reads from the HuRef genome. She determined the chromosome-specific higher order array structures and then built statistical linear models that could be used in the reference assembly, where they will serve as targets for read mapping. This next slide just shows a schematized version of graph-based representations for each of the chromosome-specific higher order arrays.
In these graphs, the nodes represent identical monomers and the edges are the likelihood of their adjacency in the array. Karen used a hidden Markov-based tool called LinearSat to build statistically based linear models from these graphs. It’s important to understand that each model represents the variants and monomer ordering in a proportional manner to that observed in the initial read database, but the long-range ordering of the repeats represents only an inferred sequence. Karen further used mate pair mapping to identify euchromatic WGS sequences from the HuRef assembly that are associated with the arrays. Like the repeats, the long-range ordering of these euchromatic contigs in the models is also an inference. Users can find the coordinates of the centromere sequences in a table on the GRC website.
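To give a feel for how a linear sequence can be derived from such a graph, here is a toy Python sketch that walks a graph whose edge weights are adjacency likelihoods. It is a plain first-order Markov chain written for illustration and is not LinearSat or its algorithm. The monomer names, the transition probabilities, and the function name are all hypothetical.

```python
import random


def sample_monomer_order(transitions, start, length, seed=0):
    """Generate a linear monomer ordering by a weighted random walk.

    transitions: dict mapping each monomer to {next_monomer: adjacency likelihood}
    start: monomer to begin the walk from
    length: number of monomers in the generated ordering
    seed: fixed seed so the walk is reproducible
    """
    rng = random.Random(seed)
    order = [start]
    while len(order) < length:
        successors = transitions[order[-1]]
        names = list(successors)
        weights = [successors[name] for name in names]
        # Pick the next monomer in proportion to its adjacency likelihood.
        order.append(rng.choices(names, weights=weights, k=1)[0])
    return order


# Hypothetical three-monomer array with made-up adjacency likelihoods.
transitions = {
    "m1": {"m2": 0.9, "m3": 0.1},
    "m2": {"m3": 1.0},
    "m3": {"m1": 0.8, "m2": 0.2},
}
model = sample_monomer_order(transitions, "m1", 12)
print(" ".join(model))
```

As in the notes, local adjacencies in the output reflect the observed proportions, while the long-range order is only one of many orderings consistent with the graph.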
In addition to adding centromere sequence, the GRC has focused on adding human-specific sequences to the reference assembly. An example of this is the SRGAP gene family, which is involved in cortical development. The ancestral 1q32 gene has been duplicated in humans to 1p21 and 1q21. Work from Evan Eichler’s lab found that not only were the 3 SRGAP2 human paralogs incompletely sequenced in GRCh37, but that allelic and paralogous sequences had been mixed in the assembly. 1q21 was the worst of these misassemblies, containing multiple haplotypes due to the highly duplicated nature of the region. Only by use of a single haplotype hydatidiform mole resource was it possible to disambiguate the correct paths at each locus. These updated paths were originally released as fix patches to GRCh37 and are now incorporated in the GRCh38 chromosomes. This panel shows the GRCh37.p13-GRCh38 assembly-assembly alignments in the 1q21 region. The alignment of the GRCh37 chromosome sequence is highly fragmented, indicative of the large changes that were made. Also aligning to this region of GRCh38 is a GRCh37 chr. 1 unlocalized scaffold. This scaffold contained the HYDIN2 gene.
HYDIN2 represents another human-specific gene duplication, also involved in neuronal phenotypes. The human genome contains two HYDIN loci: HYDIN on chr. 16, and HYDIN2 on chr. 1. The HYDIN2 locus was absent from earlier assembly versions, was an unlocalized scaffold in GRCh37 and is placed in GRCh38. This slide shows the alignment of the HYDIN2 and HYDIN genes from the CHM1 genome assembly (TINA POSTER) to the chr. 16 HYDIN locus in the GRCh37 assembly. The HYDIN2 alignment reflects paralogous sequence differences, while the HYDIN alignment reflects allelic differences. The alignments show that the 2 loci are highly similar, explaining why it was so difficult to disambiguate the two genes. In fact, the sequences are so similar that in NCBI34, sequences from the two genes were mixed at the same locus. The high degree of similarity has complicated variation analysis of these two paralogous genes. The absence of the chr. 1 paralog in previous assembly versions has likely led to erroneous variant calling at the chr. 16 locus. Zooming in, we see a paralogous sequence variant in HYDIN2 that occurs at the position of an annotated SNP in HYDIN. Now that HYDIN2 is present in GRCh38, we can begin to address issues such as this.
Another set of sequences that the GRC was interested in capturing for GRCh38 was the 1000G decoy sequence. This was a 35 Mb collection of sequences that were not represented in the GRCh37 primary assembly. They were included in the 1000G phase 2 alignment target set as a read trap, as analyses showed they improved variation calling. The decoy sequences had an average repeat content of ~80%. In order to assess decoy capture in GRCh38, we looked at reads from two 1000G samples that previously aligned only to the decoy. Depending on the sample, we find that 70-75% of such reads now align to the GRCh38 primary assembly. An additional 1% of reads are captured when the full assembly is used as a substrate and the alt loci are present. Thus, while not fully representing the decoy, GRCh38 does include a significant portion of this important sequence and is therefore a better alignment target than GRCh37. We continue to pursue the capture of the remaining decoy, much of which is highly repetitive, in a meaningful way in the reference assembly.
This brings me to the alternate loci, which are now present in greater numbers and at more locations than ever. In the original reference assembly model, there was no good way to handle variant genomic regions. Frequently, sequences from multiple haplotypes were inserted and confounded assembly, leading to artificial gaps. In the assembly model we’re using now, there’s a mechanism to cleanly represent multiple haplotypes: these are the alternate loci. They allow the reference assembly to contain alternate representations for regions where a single sequence path is considered insufficient, while retaining the linear chromosome models that most users are comfortable with. The corollary of this statement is that the reference assembly may represent >1 allele at a locus.
So, why is it important to use the alternate loci? One simple reason is gene content. In GRCh38, there are 64 protein coding and 112 non-protein coding genes that are found only on the alternate loci. An example is shown in this slide. This image shows an alternate locus scaffold from chromosome 22. The grey bar is the assembly component, the green bars are genes, and the alignment is below. You can see several genes annotated in the region of the alt that has no alignment to the chromosome. Thus, if you’re not using the entire assembly in your analyses, you may be missing genes. This can affect the development of exome capture reagents. In addition, many of these alts contain paralogous gene copies that will affect alignments and your understanding of the protein content of the genome.
Alternate loci also have implications for genome interpretation. In this example, we’re looking at structural variation in the APOBEC locus on chr. 22. There is a deletion variant that results in the fusion of the APOBEC3A and 3B genes. The deletion allele is prevalent in Asian and South American populations. GRCh38 contains the deletion allele on an alt locus scaffold. This is a common polymorphism for which the alt contains the predominant allele for certain populations. This image shows reads from two Asian 1000G samples that align in the APOBEC intergenic region in GRCh37, displayed in the NCBI 1000G browser. Because the samples are heterozygous but are aligned to the primary assembly, which has only the insertion variant, the alignments are complicated. You can see that different methods give different results. Use of the full assembly, an alignment substrate that includes both variants, would likely improve the interpretation of the data.
We’ve been doing some analyses to investigate the severity of mapping errors that can occur when alternate loci aren’t used in alignment target sets. Since our analyses of GRCh38 are ongoing, I’ll talk today about a study we did with the GRCh37.p9 assembly. In that study, we looked at the behavior of simulated reads sourced from sequence unique to GRCh37.p9 patches or alternate loci. We asked what happened to them when they were aligned to the GRCh37 primary assembly+MT, where their true target is missing. We aligned the reads either as singletons or pairs, using two different aligners (BWA and srprism). As shown in this graph, regardless of read pairing or the aligner, 25% of these reads failed to align (red). What’s particularly concerning is that nearly three-quarters have an off-target alignment on the GRCh37 primary assembly (in blue). These off-target alignments are likely to result in errors in variation analyses. This analysis demonstrates the value of including alternate loci in alignment target sets.
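To make the read bookkeeping in this kind of experiment concrete, here is a minimal Python sketch of how simulated reads could be sorted into the categories shown in the graph. It is not the code used in the study. The `SimulatedRead` record, the `classify` and `summarize` helpers, the category labels, and the scaffold name `ALT_SCAFFOLD_1` are all hypothetical, and the four toy reads are made-up inputs, not study data.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Optional


class SimulatedRead(NamedTuple):
    """One simulated read (hypothetical record layout, for illustration only)."""
    name: str
    source: str                # sequence the read was simulated from
    aligned_to: Optional[str]  # sequence the aligner placed it on; None if unaligned


def classify(read: SimulatedRead) -> str:
    """Label a read by comparing its reported placement with its true source."""
    if read.aligned_to is None:
        return "unaligned"
    if read.aligned_to == read.source:
        return "on_target"
    # The read was placed somewhere other than the sequence it came from.
    return "off_target"


def summarize(reads: Iterable[SimulatedRead]) -> Dict[str, float]:
    """Return the fraction of reads in each category."""
    counts = Counter(classify(read) for read in reads)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


# Toy input: reads simulated from an alt scaffold that is absent from the
# alignment target set, so no read can be placed on its true source.
toy_reads = [
    SimulatedRead("r1", "ALT_SCAFFOLD_1", None),
    SimulatedRead("r2", "ALT_SCAFFOLD_1", "chr22"),
    SimulatedRead("r3", "ALT_SCAFFOLD_1", "chr22"),
    SimulatedRead("r4", "ALT_SCAFFOLD_1", "chr1"),
]
fractions = summarize(toy_reads)
print(fractions)
```

When the true source is missing from the target set, every aligned read necessarily lands in the off-target category, which is the situation the graph describes.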
That being said, most commonly used short read aligners can’t currently handle the allelic duplication introduced into the assembly by non-unique sequences in alt loci. Mapping scores for reads aligning to both the alt and the corresponding chromosome region are depressed, and those reads are excluded from analysis. As a result, new alternate-aware tools that understand the relationship of the alt to the chromosome and don’t depress scores are needed in order for users to take advantage of the full reference assembly. Some aligners, such as iBWA and srprism, can now do this, but other aspects of variant calling tool chains still need to be updated to address this issue of allelic duplication. In the interim, the GRC has been looking at approaches that may help users make use of existing tool chains. For example, we’ve tested use of a mask that hides the duplication in the alts. In this slide, you can see the mask we’ve generated for this NOVEL patch, which has an insertion relative to the chromosome, but is identical for much of the remaining length.
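As a rough illustration of the masking idea, and not the GRC’s actual mask format or procedure, the Python sketch below replaces the portions of an alt or patch sequence that duplicate the chromosome with N, leaving only the novel insertion visible to an aligner. The `mask_intervals` helper, the 0-based half-open interval convention, and the toy sequence are assumptions made for the example. In practice the duplicated intervals would come from the alignment of the alt to the chromosome.

```python
from typing import Iterable, Tuple


def mask_intervals(sequence: str,
                   intervals: Iterable[Tuple[int, int]],
                   mask_char: str = "N") -> str:
    """Replace each 0-based, half-open [start, end) interval with mask_char."""
    masked = list(sequence)
    for start, end in intervals:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"interval ({start}, {end}) is outside the sequence")
        # Same-length replacement, so coordinates on the scaffold are preserved.
        masked[start:end] = mask_char * (end - start)
    return "".join(masked)


# Toy patch: two flanks assumed identical to the chromosome around a novel insertion.
left_flank, insertion, right_flank = "ACGTACGT", "TTTTGGGG", "CATGCATG"
patch = left_flank + insertion + right_flank

# Hypothetical intervals where the patch duplicates the chromosome.
duplicated = [(0, 8), (16, 24)]

masked_patch = mask_intervals(patch, duplicated)
print(masked_patch)  # NNNNNNNNTTTTGGGGNNNNNNNN
```

Masking with same-length runs of N keeps scaffold coordinates unchanged, so any alignments to the unmasked insertion can still be reported against the original patch.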
We have looked at the effect of masking on BWA alignments and compared results to those obtained with use of the alternate-aware aligner, srprism. In this analysis, simulated reads were aligned to the GRCh37.p9 primary or full assembly. For BWA, we tested masking of the alts/patches only, or masking a combination of sequences on the alts/patches and the chromosome. We then looked at the incidence of reads with ambiguous alignments. As shown in the first two columns of the figure, there is an expected increase in multiple alignments when reads are aligned to the full assembly with BWA and no mask (expanded red). In the next two columns, you can see how use of either masking approach suppresses the increase in multiple alignments. The last two columns show that srprism, the alt-aware aligner, does not need a mask to prevent ambiguous mappings. We’ll be following up this analysis on GRCh38, but I hope that even these preliminary data make the point that it is possible to develop tools that can handle the alternate loci and may allow users to reap the benefits of using the full assembly in analyses.
On that note, I’d like to wrap things up. I’d like to think I’ve convinced you that it was time for an update, that the reference has improved, and that updates and new features will make the reference a better substrate for analysis. For those of you ready to make the switch, I’d like to plug the NCBI remapping service, which uses assembly-assembly alignments to remap features from one assembly to another. This tool can be used for mapping between GRCh37 and GRCh38. It is available as a web interface, as well as a Perl script API. While you may not be as excited by the new assembly as these folks are with their socks, it’s a far cry from a lump of coal.