Presentation by Tina Graves-Lindsay at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on production of reference grade assemblies for various human populations.
The summary is as follows:
1) Creating reference-grade human genome assemblies is an ongoing process as technologies improve and additional samples are sequenced to better represent global genetic diversity.
2) New long-read sequencing technologies have enabled improved assemblies of genomes like CHM1 that resolve structural variants and fill gaps compared to the current reference.
3) Additional "gold standard" genomes from diverse populations are being sequenced, assembled and improved to provide more representative references.
The document discusses efforts to improve the human reference genome by generating new reference-grade assemblies using long-read sequencing technologies. Several human genomes are being sequenced to high coverage and assembled using PacBio long reads to generate "gold standard" references representing more of human genetic diversity. Assemblies are being improved using optical mapping data and finished BAC sequences. The improved assemblies will help represent structural variation and allelic diversity more accurately than the current reference.
The human reference genome is a work in progress that does not fully represent global genetic diversity. This project aims to improve reference genomes by sequencing additional genomes from diverse populations at high coverage, including genomes from Yoruba, Puerto Rican, Han Chinese, and Colombian individuals. New long read sequencing technologies allow generation of more complete diploid genome assemblies. These "Gold Standard" genomes will help improve and expand the human reference to better represent human genetic variation worldwide.
This document describes a method for haplotype-resolved structural variant assembly using long reads. PacBio and BioNano data are hybrid assembled to generate highly contiguous and complete haplotype-specific assemblies. The hybrid approach resolves many gaps in current reference assemblies and detects complex structural variants and rearrangements. Analysis of trios from the 1000 Genomes Project and GIAB project using this pipeline detects numerous insertions, deletions, inversions and other structural variants.
The document discusses new challenges in reference assembly management due to changing sequencing technologies and new resources. It summarizes an evaluation of de novo assemblies of the CHM1 and CHM13 haploid hydatidiform mole samples from different assemblers. Key metrics like contiguity, alignment to reference transcripts and protein coding regions, and error rates are compared between the assemblies and reference. The evaluation aims to assess suitability of the assemblies for use in reference curation.
This document discusses relating new genome assemblies to the human reference genome GRCh38. It provides an overview of changes in reference sources, evaluating new sequences, and the future of assembly curation. GRCh38 contains 178 regions with alternative loci comprising 2% of the sequence from multiple whole genome sequencing projects. The Assemblathon project evaluated assemblies of the CHM13 hydatidiform mole genome to assess data quality, aligners, and identify improvements for the reference. Future work will integrate additional platinum genomes and develop local reassembly and graph-based references.
The document discusses laboratory techniques for generating high-quality genome assemblies, including PacBio long-read sequencing, 10X Genomics linked reads, and BioNano physical mapping. PacBio sequencing of various library preparation methods produced reads over 10kb in length. 10X Genomics linked reads provided long-range phasing information to resolve alleles from repeats. BioNano mapping revealed a large inversion in one genome through detection of nick sites. The integration of these long-read and long-range techniques aims to capture more human genetic diversity in reference genomes.
1. The PacBio assembly of the CHM1 genome had an N50 contig length of 4.5 Mb and potentially fills gaps in the GRCh38 reference genome.
2. Multiple assemblies of the CHM1 genome were generated using different techniques and are being evaluated based on contiguity, annotation, and concordance with other data to select the best assembly.
3. The goal is to generate a high-quality "Platinum Genome" for CHM1 by improving the best assembly with additional data sources like BAC clones and using it as a new reference genome. A second individual, CHM13, is also being assembled to increase genome diversity.
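The N50 contig length cited above is a standard contiguity metric: the contig length L at which contigs of length L or longer together cover at least half of the assembled bases. The following is a minimal sketch of that calculation, not the workflow used by the presenters; the function name `n50` and the toy contig lengths are illustrative assumptions.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L
    together contain at least half of all assembled bases.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases
    # until at least half of the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths, in bases): total is 30,
# and the two longest contigs (10 + 6 = 16) cover more than half.
toy_contigs = [2, 3, 4, 5, 6, 10]
print(n50(toy_contigs))
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why the evaluation described above also considers annotation and concordance with other data.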
The human reference genome is incomplete and does not fully represent structural variation. Additional sequences are needed to represent diversity. A hydatidiform mole genome (CHM1) provides an alternate haploid reference with differences from the diploid human reference. The current CHM1 assembly incorporates BAC sequences and Illumina reads. Future work includes improving the assembly using long read technologies and integrating it into the human reference to better represent human variation.
This document summarizes sequencing and mapping plans and results for generating reference genomes. It discusses using PacBio to generate long reads, 10X Genomics for linked reads, and BioNano for physical mapping. Optimization of protocols was needed. Preliminary results showed the approaches provided complementary data to improve reference genomes, with each system having unique benefits and challenges. Further integration of the data sets could generate more robust reference genomes representing human genetic diversity.
Generating haplotype phased reference genomes for the dikaryotic wheat strip... (Benjamin Schwessinger)
The document summarizes work to generate haplotype phased reference genomes for the wheat stripe rust fungus Puccinia striiformis f. sp. tritici. High quality DNA was extracted and sequenced using PacBio long reads, resulting in an assembly of under 400 contigs. Mapping of the primary and associated contigs showed heterozygosity between the two dikaryotic nuclei. Future work includes repeat annotation, RNAseq mapping, sequencing additional isolates, and single nucleus sequencing to better understand the dikaryotic nature of the fungus and its success. The work aims to generate chromosome-level assemblies of both dikaryotic nuclei.
The document discusses the human reference genome assembly GRCh38 and how to access and utilize its data. It provides an overview of assembly basics, describes how GRCh38 represents haplotypes and alternative loci. Key statistics about GRCh38 such as improved contiguity, novel sequence, and updated annotations are highlighted. Resources for accessing GRCh38 data from the Genome Reference Consortium website and other sources are also reviewed.
The document discusses the human reference assembly and some key points:
1. The current reference assembly (GRCh38) represents two haplotypes and includes alternate loci to represent structural variation.
2. The assembly is improved from previous versions through the inclusion of 178 regions containing 261 alternate loci, plus 96 patches of novel sequence totaling over 5 Mb.
3. Relevant assembly data can be accessed from the Genome Reference Consortium website including sequence files, annotations, and reports on assembly regions like alternate loci and centromeres.
The document discusses updates to the human reference genome assembly GRCh38. It provides background on reference assemblies and describes how the Genome Reference Consortium manages and models genome assemblies. Key points include that GRCh38 contains refined centromere regions based on new data, novel sequence detections, and 261 alternate loci representing structural variants. The assembly is now incorporated into public sequence databases to improve access and use of the reference genome data.
The document discusses gEVAL Browser, a tool for evaluating genome assemblies developed by the Sanger Institute. It allows users to navigate and view annotations of different assemblies, including the GRCh38, GRCh37, and HuRef assemblies. The document also describes the GRC TrackHub, which displays genomic issues and regions of interest identified by the Genome Reference Consortium on the Ensembl and UCSC browsers.
This document summarizes work being done to improve human reference genomes using alternative samples. It notes that the initial human reference is incomplete and additional sequences are needed to represent diversity. It then describes efforts to generate "platinum" quality assemblies of additional samples, including CHM1 and CHM13, using long read sequencing and scaffolding with optical mapping. Initial assembly stats are provided for CHM13 and NA19240, and future plans include integrating targeted sequences, adding more diversity, and developing tools to utilize alternate haplotypes in the reference.
GRCh38 is a new version of the human reference genome that features several improvements over the previous version, GRCh37. It includes 178 regions comprising 3.15% of the genome sequence that have been updated based on new data. GRCh38 also includes 261 alternate loci comprising 3.6 Mb of novel sequence not present in GRCh37. Model centromeres have been added to chromosomes for the first time, representing heterochromatic regions totaling over 60 Mb. In addition, over 800 kb of novel sequence has been added through 73 patches of previously unrepresented DNA.
The document discusses the human reference genome assembly GRCh38. It provides information on assembly basics, the assembly model used, updates made in GRCh38 compared to the previous version, and how to access and utilize the sequence and annotation data. Key points include that GRCh38 represents 178 genomic regions with alternative loci sequences, contains over 400kb of novel sequence from targeted updates, and is collaboratively maintained to integrate new data.
Presentation at IMGC 2019 workshop describing the latest improvements to the mouse reference genome assembly and analyses performed in preparation for the next release of the mouse genome assembly (GRCm39).
This document summarizes RefSeq's curation and annotation of the reference human genome GRCh38. It discusses how RefSeq provides manual curation of known transcripts and proteins as well as model annotations from computational pipelines. It also describes RefSeq's collaboration with other groups to transition annotations from GRCh37 to GRCh38 and handle structural variations and alternative loci.
Presentation by Benedict Paten at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on updates to the human reference assembly, GRCh38.
The human reference genome is becoming more complex, moving from a single consensus sequence to representing multiple haplotypes and genomic diversity. The current assembly model, GRCh38, includes 178 regions with alternative loci sequences totaling 3.6 Mb of novel sequence not present in previous assemblies. Future assemblies will aim to better define sequence contexts and provide coordinate information for multiple genomes and patches. Challenges include developing compatible analysis tools and determining how to best represent updated regions in new assembly releases.
Presentation by Valerie Schneider at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on updates to the human reference assembly, GRCh38.
Generating high-quality human reference genomes using PromethION nanopore seq... (Miten Jain)
The document describes using PromethION nanopore sequencing to generate high-quality human reference genomes. 11 reference genomes were sequenced in 9 days using PromethION, achieving high consensus accuracy (>99%) and continuity. The approach leverages long reads for assembly followed by polishing and scaffolding. This high-throughput and accurate method can generate reference genomes at an estimated cost of $10,000 per genome.
This document discusses using presence-absence variation (PAV) analysis to assess genome assemblies and identify foreign DNA. It describes the ScanPAV pipeline which extracts and analyzes PAV sequences between assemblies. The document provides examples of ScanPAV being used to evaluate human genome assemblies and identify contaminants in Tasmanian devil cancer genomes. Key findings include Mycoplasma and Streptococcus contamination identified in some devil cancer assemblies but no exogenous contribution found to the emergence of transmissible devil facial tumors.
The document summarizes research on generating high-quality human reference genomes using PromethION nanopore sequencing. Key points:
- 11 human reference genomes were sequenced in 9 days using PromethION nanopore sequencing and assembly tools, achieving finished assemblies.
- The sequencing strategy included enriching for ultra-long reads over 100kb using a short read eliminator kit to boost overall coverage of long reads.
- Evaluation of one genome assembly showed over 99% consensus base accuracy when aligned to the human reference genome and over 99.76% accuracy for alignments of complete BAC sequences.
- The research aims to further improve assembly quality and reduce costs while increasing throughput using PromethION sequencing and optimized assembly tools.
Design and evaluation of a genomics variant analysis pipeline using GATK Spar... (Paolo Missier)
A paper presented at the annual Italian Database conference (SEBD): http://sisinflab.poliba.it/sebd/2018/
The paper is available at: http://sisinflab.poliba.it/sebd/2018/papers/June-27-Wednesday/1-Big-Data/SEBD_2018_paper_23.pdf
This document provides an overview and discussion of next-generation sequencing technologies by C. Titus Brown. It begins by outlining some basics of shotgun sequencing and how increasing density leads to squared increases in the number of sequences obtained. It then discusses current costs for Illumina sequencing and the amount of data needed for different applications. Some challenges and problems with sequencing data are also reviewed, such as systematic bias, genome assembly and scaffolding difficulties, reference gene models, and mRNA isoform construction. Emerging long-read sequencing technologies are also briefly discussed.
This document discusses using Microsoft Azure cloud computing resources to conduct genome-wide association studies (GWAS) and polygenic risk scoring (PRS) to predict COVID-19 mortality. Key steps include acquiring genotype and phenotype data, performing quality control, running GWAS and PRS analyses using HPC clusters on Azure, and downloading results. Azure provides scalable computing and storage for the large genomic datasets. Its HPC capabilities allow accelerating the analyses, which could otherwise take months to complete on-premises.
The document discusses the human reference genome assembly. It provides information on what a reference assembly is, how it is constructed, and how it has evolved over time. Key points include:
- The reference assembly is a model of the human genome built from many sequencing reads and is continually improved.
- Early assemblies had gaps and errors that have been improved on in newer releases. The current primary assembly is GRCh38.
- Alternate loci are now included to represent structural and haplotype variations not in the primary assembly.
- The reference assembly is important for mapping variants and interpreting genomic data.
This document summarizes work on generating haplotype phased reference genomes for the wheat stripe rust fungus Puccinia striiformis f. sp. tritici. Key points:
1) Long-read PacBio sequencing was used to generate improved genome assemblies with fewer contigs and the ability to distinguish between the two haplotypes of the dikaryotic fungus.
2) Mapping of the assemblies showed distinct sequences corresponding to the two haplotypes.
3) Future work includes manual curation of the genome assembly, annotating genes and repeats, and investigating the interaction between the two fungal nuclei.
Toward A Better Understanding Of Plant Genome Structure: Combining NGS, Optic... (Fabio Caligaris)
Presented at Plant Genomics and Gene Editing Congress: Europe. For more information visit: www.global-engage.com
In a context of climate change and limited energy resources, better understanding of how plants evolve and adapt is a major goal. However, despite the revolution of the NGS technologies, the study of plant genomes remains challenging due to their size, polyploidy and high percentage of repetitive elements.
Next generation sequencing techniques were discussed including an overview of various sequencing platforms, their output, and common analysis workflows. Mapping short reads to reference genomes using alignment programs is a key first step for most applications. Formats like FASTQ, SAM, and BAM are commonly used to store sequencing reads and mapping results.
Open pacbiomodelorgpaper j_landolin_20150121 (Jane Landolin)
Jane Landolin's slides on the Open Data Paper (http://www.nature.com/articles/sdata201445), presented at the Balti and Bioinformatics virtual meeting on Jan. 21st 2015. (http://bit.ly/1KYGxr4)
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015 (Torsten Seemann)
Using Snippy to call variants in bacterial short read datasets via alignment to reference, and then using these alignments to produce core SNP alignments for phylogenomics.
The document provides an overview of plant genome sequence assembly, including:
1) A brief history of sequencing technologies and their improvements over time, from Sanger sequencing to newer technologies producing longer reads.
2) Key steps in a sequencing project including read processing, filtering, and corrections before assembly into contigs and scaffolds using appropriate software.
3) Factors to consider for experimental design and assembly optimization such as sequencing depth, library types, and software choices depending on the genome and data characteristics.
This document summarizes the work of Hans Jansen and Christiaan Henkel with long read nanopore sequencing. They have sequenced several genomes including carp, eel, king cobra, and Agrobacterium using MinION. Their longest reads were 120 kbp and 93.5 kbp. They also established the MinION Access Program to improve genomes by resolving repeats. As part of this, they formed the MinION Analysis and Reference Consortium to standardize protocols and understand variability between labs. Their work with the E. coli genome demonstrated sources of variation in read counts, lengths, and alignments between labs.
Miten Generating high-quality reference human genomes using Promethion nanopo... (GenomeInABottle)
This document discusses using nanopore sequencing to generate high-quality reference human genomes. It outlines how nanopore sequencing can generate very long reads, including reads exceeding 1 megabase in length, that can help assemble complex genomic regions that have been difficult to assemble with short read sequencing. The document describes ongoing work to sequence the genomes of two individuals using the PromethION nanopore sequencing device, generating 60-130 gigabases of sequence data per flow cell with average read lengths of 20-30 kilobases. Long read nanopore sequencing will play an important role in fully resolving the human genome sequence and understanding DNA modifications.
This document discusses advances in long-read genome assembly methods and their applications in clinical research. It summarizes that:
1) New long-read technologies have enabled assembly of complete telomere-to-telomere human genomes, introducing nearly 200 Mb of new sequence compared to previous references.
2) A human pangenome reference consortium aims to generate complete diploid assemblies of over 350 diverse individuals to improve representation of global genomic diversity.
3) Complete genome assemblies are revealing new variation in medically important regions like those associated with spinal muscular atrophy that were previously difficult to study with short reads.
Butler - a framework for a large-scale scientific analysis on the cloud - EOS...ATMOSPHERE .
Butler is a framework for analyzing thousands of human genomes from cloud infrastructure. It provides functionality for provisioning infrastructure, configuration management, defining and executing workflows, and operational management. The document describes how Butler has been used to analyze over 2800 human genomes from several cancer studies across multiple cloud providers. Key components that improve analysis time include tools for workflow management and operational monitoring.
This document discusses high-throughput DNA sequencing technologies and their application to genome assembly projects. It provides a brief history of DNA sequencing, from early chemical and chain termination methods to current massively parallel sequencing technologies. It also describes several long-read sequencing technologies, including Pacific Biosciences SMRT sequencing and Oxford Nanopore sequencing. Examples are given of genome projects utilizing these technologies along with short-read sequencing data.
Slide Deck from Josh's 2014 presentation at the Illumina user group meeting in RTP. Slides describe our experience with V3 and V4 chemistries on a very large cohort of exome sequenced samples.
Presentation at 2019 ASHG GRC/GIAB workshop describing history of the human reference genome, current curation efforts and future plans, and the relationship of all 3 to efforts to produce a human pan-genome.
Platform presentation at ASHG 2019 describing recent updates to the human reference genome assembly (GRCh38) and future plans with relevance to pan-genomic representations.
1. Karen Miga presented on reaching complete telomere-to-telomere chromosome assemblies using long-read sequencing.
2. The current human reference genome is incomplete, with gaps and unresolved issues remaining.
3. Miga's lab generated a high-quality assembly of chromosome 21 for the CHM13 genome using Oxford Nanopore long reads totaling 155 GB of sequence data.
Presentation at 2019 ASHG GRC/GIAB workshop describing features and recent updates to the vg toolkit, including examples of comparisons to other methods used for alignment and variant detection.
The MANE Project aims to define a standardized set of "representative" transcripts for protein-coding genes by selecting a single "default" transcript (MANE Select) and additional well-supported transcripts (MANE Plus) that are identical between RefSeq and Ensembl. This will provide a consistent annotation across genomic resources. So far, MANE Select covers 67% of the genome and aims to reach 75-80% coverage by the end of the year. While MANE cannot capture full biological complexity, it establishes a baseline for clinical reporting and comparative genomics.
Presentation at PanGenomics in the Cloud Hackathon, run by NCBI at UCSC (https://ncbiinsights.ncbi.nlm.nih.gov/2019/02/06/pangenomics-cloud-hackathon-march-2019/). Presents points to consider about the adoption of a pangenome reference, emphasizing aspects for long-term data management and wide-spread adoption.
This document discusses the Matched Annotation from NCBI and EMBL-EBI (MANE) project. The project aims to define a set of identical reference transcripts between RefSeq and Ensembl/GENCODE for protein-coding genes. The MANE set will include a "Select" transcript that is representative of each gene locus, as well as additional "Plus" and "Extended" transcripts. The methodology involves choosing the most representative transcript based on expression, conservation, and manual curation. 5' and 3' ends will be matched based on CAGE and polyA sequencing data. The project aims to initially match 50% of genes and have the dataset available in late 2018, with the goal of matching 90%
The Locus Reference Genomic (LRG) project provides stable reference sequences for reporting clinically-relevant genetic variants. It aims to address challenges of keeping track of variants over time as genomes and transcripts change. LRG records are manually curated and contain genomic, transcript, and protein sequences selected by experts that do not change versions. They are used to report variants in a standardized way compatible with HGVS nomenclature. While the LRG project is not scalable due to being fully manual, there is potential convergence with the automated but expert-guided MANE Plus project to cover more transcripts. The LRG project also works independently of the GRCh38 human reference genome, creating records for genes where an alternate sequence is preferred.
1) De novo genome assembly and phasing methods can reconstruct complete haplotype information from sequenced reads without relying on a reference genome.
2) Trio binning uses k-mer profiling of parents' genomes to separate child's reads into maternal and paternal bins before assembly, producing fully phased haplotig assemblies.
3) The human pan-genome project aims to build a collection of diverse, high-quality haplotype-resolved genomes from different populations using trio binning to improve representation of genetic variations.
The document discusses challenges in characterizing structural variations across genomes and populations using long-read sequencing technologies. It describes tools developed for accurate mapping and structural variant calling from long reads, including NGMLR for mapping and Sniffles for variant calling. It also presents results of applying these tools to call structural variants in the NA12878 genome from PacBio and Nanopore long reads, detecting many more variants than from short read data. The goal is to better understand variability between genomes and impact on gene regulation.
Presentation by Justin Zook at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on benchmarks for indels and structural variants.
Presentation by Karen Miga at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on centromere assemblies.
This document discusses how linked-read technology from 10x Genomics now enables routine de novo diploid genome assembly from individual human samples. Linked-reads generate long-range genomic information by barcoding DNA fragments derived from the same long input molecules. This allows for improved physical coverage compared to short reads or synthetic long reads. The document shows how a single 10x Genomics library with only 1 ng of input DNA and modest sequencing can generate high-quality human genome assemblies in 2 days. De novo assembly excels at resolving structurally complex and diverse genomic regions that remain difficult for short-read and alignment-based methods.
This document discusses the Genome in a Bottle Consortium's efforts to develop reference materials and data to evaluate whole genome sequencing performance. It summarizes the release of new reference materials, including additional Genome in a Bottle samples from the Personal Genome Project and microbial genomic DNA standards. The consortium aims to apply principles of metrology to genome analysis by generating extensively characterized reference genomes and associated data that can be used to develop and validate analysis methods.
This document discusses ClinVar, a public archive of interpretations of genetic variants and their relationship to human health and disease. It summarizes that ClinVar integrates information on genetic variants, conditions, interpretations of clinical significance, and evidence from multiple sources. The document also outlines ClinVar's processes for submissions, standardizing data, aggregating interpretations from multiple labs and studies, and making data accessible to users.
2. The Human Reference is a Work in Progress!
• The current reference, GRCh38, is not optimal for some regions of the genome and/or some individuals/ancestries.
• GRCh38 is comprised of DNA from several individual humans.
• Allelic diversity and structural variation present major challenges when assembling a representative diploid genome.
• New technologies, methods, and resources since 2003 have allowed for substantial improvements in the reference genome.
• Additional high-quality reference sequences are needed to represent the full range of genetic diversity in humans.
6. Genome Status
Data Source   Origin            Assembly Accession   Status
CHM1          NA                GCA_001297185.1      Assembly Improvement
CHM13         NA                GCA_000983455.2      Assembly Assessment
NA19240       Yoruban           GCA_001524155.4      Chr-level Assembly Submitted
HG00733       Puerto Rican      GCA_002208065.1      Contig Assembly Submitted
HG00514       Han Chinese       GCA_002180035.1      Contig Assembly Submitted
NA12878       European          GCA_002077035.2      Chr-level Assembly Submitted
HG01352       Colombian         GCA_002209525.1      Contig Assembly Submitted
HG02818       Gambian           -                    Assembly Underway
HG02059       Kinh-Vietnamese   -                    Assembly Assessment
NA19434       Luhya             -                    Assembly Assessment
HG04217       Telugu            -                    Data Production Underway
HG03486       Mende             -                    Assembly Underway**
** First Sequel only data set
8. Assembly QC and Submission Steps
• Multiple Falcon assemblies
• Using stats and alignment to Bionano, pick the best assembly
• Quiver and Pilon on best assembly
• Use Bionano to identify mis-assemblies
• Submit contig level AGPs to Genbank
• Run through NCBI assembly QA pipeline
• Evaluate and curate output of QA pipeline
• Generate final chromosome level AGPs and submit
• Annotation of chromosome level assembly
22. Falcon Assembly of NA12878 in CYP2D6 Region
[Figure: alignment of NA12878 to GRCh38 across CYP2D8, CYP2D7, and CYP2D6. A region of NA12878 that doesn’t exist in GRCh38 shows duplication of the CYP2D7 gene in the NA12878 genome.]
26. 10X Data – Separating a Heterozygous Allele
[Figure: tracks for GRCh38, NA12878 Falcon, 10X Allele 1, and 10X Allele 2, showing a heterozygous SV identified by Bionano.]
10X Supernova assembly used: GCA_002022845.1
27. Short Term Future Plans
• Lots of assemblies to analyze!
• Generate the latest Falcon Unzip assemblies for all samples
• Improve those assemblies
• Identifying misassemblies
• Making the breaks where needed
• Scaffolding the assemblies
• Incorporating BACs as they are finished
• Create Chromosomal AGPs
• Submit to Genbank
28. Longer Term Future Work
• Better Utilization of the Reference
• Mapping Strategies
• Graph based alignments
• Other alt-aware read mapping strategies
• Alternative reference data display challenges – How should we present data
• Do we continue the current scheme of alt alleles?
• Full reference sequences?
• 2 Haplo-resolved sequences for each allele
• Using Falcon unzip
• Using 10X
• Other technologies?
29. Acknowledgements
The McDonnell Genome Institute at Washington University in St. Louis: Susan Dutcher, Bob Fulton, Wes Warren, Karyn Meltz Steinberg, Derek Albracht, Milinn Kremitzki, Susan Rock, Chad Tomlinson, Patrick Minx, Chris Markovic, Eddie Belter, Lee Trani, Sara Kohlberg
University of Washington: Evan Eichler
NCBI: Valerie Schneider
University of Pittsburgh School of Medicine (CHM1 and CHM13 cell line): Urvashi Surti
BioNano Genomics: Alex Hastie
Pacific Biosciences: Nick Sisneros, Sarah Kingan, Luke Hickey, Greg Concepcion
UCSF: Pui-Yan Kwok, Yvonne Lai, Chin Lin, Catherine Chu
10X Genomics: Deanna Church
Nationwide Children’s Hospital: Richard Wilson, Vince Magrini, Sean McGrath
Editor's Notes
As part of our work as a member of the Genome Reference Consortium, we have been working to improve the current reference, GRCh38. In doing this work, we have found that there are still a few regions of the genome that are not fully resolved. There are still a few genes that are not optimally represented for all individuals or ancestries, although we have fixed quite a few of them in GRCh38. The reference is composed of DNA from many individuals, so there are regions where allelic diversity and structural variation present challenges in the assembly. Many of the newer technologies have allowed for improvements to the reference genome, but we realize that there is still a need for additional high-quality human genome assemblies to fully cover the range of genetic diversity in humans.
This is a great example of how allelic differences can cause assembly problems. The gene UGT2B17 is known to be copy number variable: some individuals have 1 copy of this gene and other individuals lack it altogether. In Build 36, the clones in this region were all from the RP11 sample, which happened to be heterozygous for this indel. The blue, red, and black boxes represent the clone path through the region, the yellow blocks indicate annotated segmental duplication, and there were two genes annotated in this region. In Build 36 we were representing both the insertion and deletion alleles in the assembly.
By removing the black clones from the path, we were able to close the gap. We then created an alternate allele from the black clones, which required sequencing one additional clone, so that we end up with both the insertion and deletion alleles in GRCh37.
This changes our understanding of the biology of this region: we closed the gap that existed, we removed the falsely annotated duplication, and there is really only one copy of this gene present in the assembly, with an allelic variant.
This example shows how multiple haplotypes in the assembly can cause problems.
In the past few years we have been working on a project funded by NIH to sequence additional human reference genomes. These are the samples we have been working on. Originally we planned to sequence 5 diploid genomes and 2 haploid genomes. Currently we are working on our 10th diploid genome. These genomes will help to add diversity to the reference.
As part of this project, we are generating ~60X coverage of PacBio long read data. We will do a de novo assembly of that PacBio data. Then we are using a variety of additional tools to help inform the assembly. BioNano has been very useful in helping to scaffold the assembly as well as to identify potential mis-assemblies. We are also starting to work with 10X Genomics data. For the initial few genomes, we were targeting difficult-to-assemble regions of the genome by sequencing BACs. Once the BACs are incorporated, we plan to align all of this data to the reference very stringently to produce chromosomal AGPs. The end product will be a very high quality whole genome assembly.
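To make the ~60X figure concrete, here is a minimal Python sketch of the coverage arithmetic: fold coverage is total sequenced bases divided by genome size. The function names and the ~3.1 Gb genome size are illustrative assumptions, not values taken from the project.

```python
def fold_coverage(read_lengths, genome_size):
    """Return mean fold coverage: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


def bases_needed(target_coverage, genome_size):
    """Return the number of bases required to reach a target fold coverage."""
    return target_coverage * genome_size


# Assumed haploid human genome size of ~3.1 Gb (illustrative only).
HUMAN_GENOME_SIZE = 3_100_000_000

if __name__ == "__main__":
    needed = bases_needed(60, HUMAN_GENOME_SIZE)
    print(f"Bases needed for 60X: {needed / 1e9:.0f} Gb")
```

Under that assumed genome size, 60X works out to roughly 186 Gb of long-read data per genome.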
To date, data has been generated for 2 haploid genomes and 10 diploid genomes, all at ~60X coverage or higher. We have a lot of data and a lot of assemblies to work with. For 2 of the diploid genomes, we have chromosome-level assemblies; the rest are at the contig level.
**2 additional genomes – data will be generated soon
Here are the assembly stats for all of the genomes we have assembled to date. All of these genomes are being assembled using Falcon. With the newer version of Falcon, we are seeing a huge increase in contiguity; in most cases, the N50 has increased threefold. The assemblies were generated with FALCON-integrate 1.7.5, varying the minimum seed read length and min_cov.
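Since N50 is the contiguity measure quoted throughout, a short Python sketch of how it is computed may help. It uses the common definition (the contig length at which the running total of lengths, sorted longest first, reaches half of the assembled bases); the function name and the toy lengths are illustrative, not project data.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Toy example: same total bases, fewer and longer contigs.
    older = [5, 5, 5, 5]
    newer = [15, 5]
    print(n50(older), n50(newer))
```

In the toy example, merging three contigs into one leaves the total unchanged but triples the N50, which is the kind of change described above.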
We generate multiple assemblies, varying the minimum seed read length and min_cov. From those 20 or so assemblies, we pick the best one using the assembly stats and alignment to the Bionano maps. The raw data is generally submitted a month or so after production of the data is completed.
This diagram shows the workflow for the Bionano Irys system. It is a nanochannel technology in which long DNA molecules are nicked and labeled at specific recognition sites. You end up with nick sites along the DNA molecules, similar to a restriction digest, with the added benefit that the nicks are kept in context relative to one another. Once the data is assembled, the resulting genome maps can be used for SV detection, gap sizing, assembly QC, and scaffolding.
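The idea of labels "in context to one another" can be sketched as an in silico labeling of a sequence: find every occurrence of the recognition motif on either strand and look at the distances between neighbouring sites. This is a minimal Python illustration, not Bionano's software. The default motif GCTCTTC is an assumption (the site I understand to be recognized by Nt.BspQI, an enzyme commonly used with Irys); the talk does not name the enzyme, and all function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def label_positions(sequence, motif="GCTCTTC"):
    """Return sorted 0-based start positions of the motif on either strand.

    The default motif is an assumption for illustration only.
    """
    sequence = sequence.upper()
    motif = motif.upper()
    targets = {motif, reverse_complement(motif)}
    k = len(motif)
    return [
        i
        for i in range(len(sequence) - k + 1)
        if sequence[i:i + k] in targets
    ]


def label_spacings(positions):
    """Return distances between consecutive labels (the map 'pattern')."""
    return [b - a for a, b in zip(positions, positions[1:])]


if __name__ == "__main__":
    toy = "AA" + "GCTCTTC" + "AAAA" + "GAAGAGC" + "TT"
    sites = label_positions(toy)
    print(sites, label_spacings(sites))
```

Comparing the spacing pattern predicted from a sequence assembly with the pattern observed on the optical map is what makes conflicts, insertions, and gap sizes visible.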
Here is an example of one of the BioNano hybrid scaffolds that was generated. The top line in green represents the hybrid scaffold, the first set of blue bars represents the PacBio assembly, and the S2 lines represent the BioNano map contigs. From this you can see how these two technologies are very complementary to one another. This is a snapshot of a 10 Mb portion of a larger hybrid scaffold.
BioNano has also identified a second enzyme that nicks well for human genomes. You can create a second map with the other enzyme and then, through software improvements that are coming in the next month, you will be able to align your sequence to both maps. This will increase the N50 by 2 times.
used 14k_120_120_1
Once we identified which assembly version we wanted to improve, we aligned it to the BioNano maps; SV calls were generated and hybrid scaffolding was performed. During the hybrid scaffolding process, conflicts are identified. For this genome, 51 conflicts were identified. We looked at the sequence alignments for all of these conflicts and found 35 to be PacBio assembly errors. We also looked through the translocation and complex SV calls, as well as a rough alignment of the assembly to GRCh38, to identify contigs that crossed chromosomes. After looking through all of this data, we made 69 breaks. You will see that breaking the obvious chimeric contigs only brought the N50 down a little bit, to 25.7 Mb.
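The effect of breaking chimeric contigs on N50 can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, assuming breakpoints are given as coordinates within each contig; the names, data structures, and toy numbers are hypothetical and do not reflect the actual pipeline.

```python
def break_contig(length, breakpoints):
    """Split a contig of the given length at the breakpoints.

    Returns the lengths of the resulting pieces. Breakpoints outside
    (0, length) are ignored and duplicates are collapsed.
    """
    cuts = sorted({p for p in breakpoints if 0 < p < length})
    edges = [0] + cuts + [length]
    return [b - a for a, b in zip(edges, edges[1:])]


def apply_breaks(contig_lengths, breaks):
    """Apply per-contig breakpoints; return all piece lengths.

    contig_lengths: dict of contig name -> length
    breaks: dict of contig name -> list of break coordinates
    """
    pieces = []
    for name, length in contig_lengths.items():
        pieces.extend(break_contig(length, breaks.get(name, [])))
    return pieces


def n50(lengths):
    """Length at which the longest-first running total reaches half the bases."""
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    contigs = {"ctg1": 100, "ctg2": 50}   # toy lengths
    breaks = {"ctg1": [40]}               # one chimeric join in ctg1
    before = n50(list(contigs.values()))
    after = n50(apply_breaks(contigs, breaks))
    print(before, after)
```

Breaks always lower or preserve N50, so a small drop after many breaks suggests most of the chimeric joins sat near contig ends or in shorter contigs.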
Sequence alignments were examined for all conflicts; then, to narrow down the complex SV and translocation calls, we first looked at the BioNano alignments in IrysView.
This is the same PacBio contig as in the last slide, only this time it is compared to GRCh38. In the top panel you can see
We have also been using the BioNano maps to identify variation between our genomes and the reference. In this example, there are 2 haplotypes in the BN map compared to GRCh38; this appears to be a heterozygous inversion in NA19240.
Here is a list of the initial set of SV calls for our genomes when compared to GRCh38. These contain both homozygous and heterozygous calls.
I have a few examples of what we have been seeing in these assemblies. We decided to take a look at the MHC region of NA19240. This is a comparison of the BioNano map of NA19240 to the reference; the reference is in green and the NA19240 BN map is in blue. From the BN map, it looks like there is a ~65 kb insertion.
We then aligned the contig from Jason’s most recent assembly to the current reference as well as the alts. This is the region that corresponds to the insertion in the BN map, so from this initial look, it appears there is an insertion here in this assembly. We need to look at it further to evaluate whether this would be a useful addition to the alts that are already present.
CYP2D6 is a very diverse genomic region that has implications for drug metabolism. In collaboration with the Pharmacogenomics Research Network (PGRN), we have sequenced multiple alleles in this region using fosmid libraries created from ethnically diverse individuals. Within the region, there is also another CYP gene, CYP2D7, and a pseudogene called CYP2D8, with common repeats interspersed between the gene and pseudogene copies, facilitating genomic rearrangements. The gene CYP2D6 and the associated pseudogenes are shown here, along with some of the different alleles we have sequenced.
This is the alignment of NA12878 to GRCh38, as well as the genes aligned to NA12878.
It was important, especially in highly variable regions of the genome, to capture both alleles from the diploid samples. In collaboration with us, PacBio has generated an unzip assembly. Here is a diagram showing how with Falcon you will be missing allelic variation, but by using Falcon unzip, you should capture the variation that is present. You end up with a set of very contiguous primary contigs and then a set of smaller haplotigs that contain the variation.
The Gambian assembly was done at PacBio for us, and this version is polished.
I want to acknowledge all of the collaborators on this project and all of the work that has gone into it thus far.