GRC Workshop at Churchill College on Sep 21, 2014. This is Paul Kitts's talk describing the NCBI approach to annotating the full human reference assembly.
Course: Bioinformatics for Biomedical Research (2014).
Session: 4.1- Introduction to RNA-seq and RNA-seq Data Analysis.
Statistics and Bioinformatics Unit (UEB) & High Technology Unit (UAT) from Vall d'Hebron Research Institute (www.vhir.org), Barcelona.
Transcriptomics is the study of RNA, single-stranded nucleic acid, which was not separated from the DNA world until the central dogma was formulated by Francis Crick in 1958, i.e., the idea that genetic information is transcribed from DNA to RNA and then translated from RNA into protein.
This workshop is intended for those who are interested in, and in the planning stages of, conducting an RNA-Seq experiment. Topics to be discussed will include:
* Experimental Design of RNA-Seq experiment
* Sample preparation, best practices
* High throughput sequencing basics and choices
* Cost estimation
* Differential Gene Expression Analysis
* Data cleanup and quality assurance
* Mapping your data
* Assigning reads to genes and counting
* Analysis of differentially expressed genes
* Downstream analysis/visualizations and tables
AGRF in conjunction with EMBL Australia recently organised a workshop at Monash University Clayton. This workshop was targeted at beginners and biologists who are new to analysing Next-Gen Sequencing data. The workshop also aimed to provide users with a snapshot of bioinformatics and data analysis tips on how to begin to analyse project data. An introduction to RNA-seq data analysis was presented by AGRF Senior Bioinformatician Dr. Sonika Tyagi.
Presented: 1st August 2012
With the DNA sequences of more than 90 genomes completed, as well as a draft sequence of the human genome, a major challenge in modern biology is to understand the expression, function, and regulation of the entire set of proteins encoded by an organism—the aims of the new field of proteomics. This information will be invaluable for understanding how complex biological processes occur at a molecular level, how they differ in various cell types, and how they are altered in disease states. The term proteomics describes the study and characterization of a complete set of proteins present in a cell, organ, or organism at a given time.
In general, proteomic approaches can be used (a) for proteome profiling, (b) for comparative expression analysis of two or more protein samples, (c) for the localization and identification of posttranslational modifications, and (d) for the study of protein-protein interactions. The human genome harbours 26000–31000 protein-encoding genes, whereas the total number of human protein products, including splice variants and essential posttranslational modifications (PTMs), has been estimated to be close to one million. It is evident that most of the functional information on the genes resides in the proteome, which is the sum of multiple dynamic processes that include protein phosphorylation, protein trafficking, localization, and protein-protein interactions. Moreover, the proteomes of mammalian cells, tissues, and body fluids are complex and display a wide dynamic range of protein concentrations: one cell can contain between one and more than 100000 copies of a single protein.
A rapidly emerging set of key technologies is making it possible to identify large numbers of proteins in a mixture or complex, to map their interactions in a cellular context, and to analyze their biological activities. Mass spectrometry has evolved into a versatile tool for examining the simultaneous expression of more than 1000 proteins and the identification and mapping of posttranslational modifications. High-throughput methods performed in an array format have enabled large-scale projects for the characterization of protein localization, protein-protein interactions, and the biochemical analysis of protein function. Finally, the plethora of data generated in the last few years has led to approaches for the integration of diverse data sets that greatly enhance our understanding of both individual protein function and elaborate biological processes.
Gene prediction is the process of determining where a coding gene might be in a genomic sequence. A protein-coding sequence begins with a start codon (where translation begins) and ends with a stop codon (where translation ends).
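As a minimal illustration of the start/stop-codon idea, the sketch below scans the forward strand of a DNA string for open reading frames in all three frames. The function name `find_orfs` and the `min_codons` parameter are hypothetical and chosen only for this example. Real gene predictors also consider the reverse strand, introns, and statistical signals, none of which are modelled here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) pairs of forward-strand ORFs.

    An ORF here is an in-frame run from ATG to the first stop codon.
    `end` is exclusive and includes the stop codon. `min_codons` is the
    minimum number of codons before the stop (an illustrative cut-off).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    for start, end in find_orfs("CCATGAAATTTTAGGG"):
        print(start, end)
```

Under these assumptions an ORF without an in-frame stop codon is simply not reported, which is one of several simplifications a real predictor would not make.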
After a genome has been sequenced, the first question that comes to mind is "Where are the genes?". Genome annotation is the process of attaching information to biological sequences. It is an active area of research, and knowing the coding parts of a genome helps scientists considerably in carrying out their wet-lab projects.
Description of functional genomics and structural genomics and the techniques involved, together with the models and techniques of forward genetics and reverse genetics.
Genome annotation, NGS sequence data, decoding sequence information. The genome contains all the biological information required to build and maintain any given living organism.
Apollo — Collaborative and Scalable Manual Genome AnnotationNathan Dunn
Manual curation is crucial to improving the quality of the annotations of a genome. It enables curators to refine automated gene predictions using experimental data and aligned predictions from closely related organisms to more accurately represent the underlying biology. Apollo is a web-based genome annotation editor that allows curators to manually revise and edit the structure and function of predicted genomic elements.
Apollo, built on top of the JBrowse genome browser, offers an ‘Annotator Panel’ that allows users to efficiently navigate the genome and its annotations. Changes are reflected in real time to all users (similar to Google Docs) and aggregated in a revertible, visual history of structural edits. Apollo allows the export of sequences and metadata associated with each annotated genomic element in FASTA, GFF3, or Chado. A single Apollo server can be scaled to support multiple genome projects and curators. Access to genomes is controlled with fine-grained permissions (e.g. administrator, curator, public). To support integration into larger workflows, we expose the suite of web services that drives user-interface functionality. These web services have been leveraged to integrate with Docker and the Galaxy platform.
Striving to increase Apollo’s repertoire of visual exploration and exploratory analytics tools, two major undertakings are currently under development. The first is the ability to visualize variant data and to annotate their predicted effects, primarily on coding regions. New technology trends and scientific paradigms point to new needs in genomic analytic tools to leverage information about variants that impact human health. Driven by a growing need to identify disease-causing variants across diverse groups, we are working towards providing full functionality in genomic variant analysis and curation. The second is the transformation of separate genomic coordinates into a single, synthetic region. This will allow the visualization of two or more genomic regions, from the length of entire chromosomes to just a few exons, within an artificially constructed genomic region. Artificially joining scaffolds facilitates annotation of genomic features split across two or more regions of a fragmented assembly (e.g., scaffolds), likely informing potential improvements to the genome assembly in the process. Additionally, this will allow hiding (visual genome folding) intra- and intergenic regions to provide a more information-rich visualization of the genome. For example, bringing exons closer together will facilitate annotating gene models with long introns, as sequences at the edge of exons separated by thousands of base pairs will be shown adjacently.
Apollo is currently being used in over one hundred genome annotation projects around the world, ranging from annotation of a single species to lineage-specific efforts supporting the annotation of dozens of species at a time.
Talk at the Salk Institute's 2012 Systems to Synthesis Symposium. Discusses the use of online games with the purpose of annotating the human genome and building better phenotype predictors.
Automated sequencing of genomes requires automated gene assignment
Includes detection of open reading frames (ORFs)
Identification of the introns and exons
Gene prediction a very difficult problem in pattern recognition
Coding regions generally do not have conserved sequences
Much progress made with prokaryotic gene prediction
Eukaryotic genes more difficult to predict correctly
Fly chromatin dynamics using bidirectional hidden Markov model (Sanju K. Sinha)
Analysis of various chromatin states, such as promoter, enhancer, and early gene, using a bidirectional hidden Markov model (BDHMM).
This HMM can include a state-specific trait, "direction", which makes it distinctive and could help lead to interesting discoveries.
Here, we have developed a simple computational model to find unstable transcription via combinations of contiguous states.
Single Nucleotide Polymorphism Genotyping Using Kompetitive Allele Specific ... (MANGLAM ARYA)
Single Nucleotide Polymorphism
Single nucleotide polymorphism (SNP) refers to a single base change in a DNA sequence
SNP: Commonly biallelic
Two types (based on presence in genome)
Synonymous
Non-synonymous
SNPs have largely replaced simple sequence repeats (SSRs)
Advantage of using SNPs
Low assay cost
High genomic abundance
Locus specificity
co-dominant inheritance
Simple documentation
Potential for high-throughput Analysis
Relatively low genotyping error rates
SNP genotyping platforms
BeadXpress™, GoldenGate™ and Infinium from Illumina
GeneChip™ and GenFlex™ Tag array from Affymetrix
SNaPshot™ and TaqMan™ from Applied Biosystems
SNPWave™ from KeyGene
iPLEX Gold™ Assay and MassARRAY™ from Sequenom
Variables to be considered
Throughput
Data turnaround time
Ease of use
Performance (sensitivity, reliability, reproducibility, and accuracy),
Flexibility (genotyping few samples with many SNPs or many samples with few SNPs),
Number of markers generated per run (uniplex versus multiplex assay capability)
Assay development requirements and genotyping cost per sample or data point.
KASP
KBioscience Competitive Allele-Specific PCR
Homogeneous, fluorescence-based genotyping technology, based on
Allele-specific oligo extension (primer)
Fluorescence resonance energy transfer
KASP Applications
Genotyping a wide range of species for various purposes.
KASP for Quality analysis, QTL mapping, MARS, and allele mining
Quality Control Analysis
QC analysis should be done by genotyping the parents and F1s with the same subset of SNPs, for two reasons:
confirm that the F1s contain true-to-type alleles from their parents
check the genetic purity of the inbred parents.
F1s with true-to-type parental alleles for at least 90 % of the SNPs that were polymorphic between the parents should be advanced, while those with more than 10 % nonparental alleles should be discarded.
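The 90 % rule can be sketched as a small check. This is only an illustration under stated assumptions: parents are taken to be homozygous inbreds with one allele per SNP, an F1 call counts as true-to-type when it carries exactly the two parental alleles, and the names `true_to_type_fraction` and `advance_f1` are hypothetical.

```python
def true_to_type_fraction(parent_a, parent_b, f1):
    """Fraction of parent-polymorphic SNPs where the F1 carries both parental alleles.

    parent_a, parent_b: dict of SNP id -> single allele (assumed homozygous inbreds).
    f1: dict of SNP id -> two-letter genotype call such as "AG".
    """
    # Only SNPs that differ between the parents are informative for this check.
    informative = [
        snp for snp in parent_a
        if snp in parent_b and snp in f1 and parent_a[snp] != parent_b[snp]
    ]
    if not informative:
        raise ValueError("no SNPs are polymorphic between the parents")
    matching = sum(
        1 for snp in informative
        if set(f1[snp]) == {parent_a[snp], parent_b[snp]}
    )
    return matching / len(informative)


def advance_f1(parent_a, parent_b, f1, threshold=0.90):
    """Apply the 90 % true-to-type rule (threshold taken from the text)."""
    return true_to_type_fraction(parent_a, parent_b, f1) >= threshold


if __name__ == "__main__":
    pa = {"s1": "A", "s2": "C", "s3": "G", "s4": "T"}
    pb = {"s1": "G", "s2": "C", "s3": "T", "s4": "C"}
    f1 = {"s1": "AG", "s2": "CC", "s3": "GT", "s4": "TT"}
    print(advance_f1(pa, pb, f1))
```

In this toy data `s2` is monomorphic between the parents and is ignored, and the F1 fails at `s4`, so it would not be advanced.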
QTL Mapping
QTL mapping identifies a subset of markers that are significantly associated with one or more QTL influencing the expression of the trait of interest.
1) Select or develop a bi-parental mapping population.
2) Phenotype the population for a trait under greenhouse or field conditions.
3) Choose a molecular marker system – genotype parents of the mapping population and F1s with large numbers of markers, then select 200-400 markers exhibiting polymorphism between the parents.
4) Choose a genotyping approach, then generate molecular data for polymorphic markers
5) Identify the molecular markers associated with major QTL using statistical programs.
Large-scale allele mining
Allele mining is a promising approach to dissecting naturally occurring allelic variation at candidate genes controlling key agronomic traits.
KASP platform at CIMMYT has been used for the systematic mining of large germplasm collections for specific functional polymorphisms.
SNPs or small indels that
Using VarSeq to Improve Variant Analysis Research Workflows (Delaina Hawkins)
Many questions must be answered when analyzing DNA sequence variants: How do I determine which variants are potentially deleterious? Is the sequencing quality sufficient? How do I prioritize the results? Which annotation sources may help answer my research question?
In this webinar presentation, we will review workflow strategies for quality control and analysis of DNA sequence variants using the VarSeq software package from Golden Helix. VarSeq is a powerful platform for analysis of DNA sequence variants in clinical and translational research settings. VarSeq provides researchers with easy access to curated public databases of variant annotation information, and also enables users to incorporate their own local databases or downloaded information about variants and genomic regions.
The presentation will include interactive demonstrations using VarSeq to analyze variants found by exome sequencing of an extended family with a complex disease. We will review strategies for assessing variant quality, applying genomic annotations, incorporating custom annotation sources, and creating variant filters in VarSeq. We will also demonstrate the PhoRank gene ranking algorithm and its application for prioritizing variants.
Presentation at IMGC 2019 workshop describing the latest improvements to the mouse reference genome assembly and analyses performed in preparation for the next release of the mouse genome assembly (GRCm39).
Presentation at 2019 ASHG GRC/GIAB workshop describing history of the human reference genome, current curation efforts and future plans, and the relationship of all 3 to efforts to produce a human pan-genome.
Platform presentation at ASHG 2019 describing recent updates to the human reference genome assembly (GRCh38) and future plans with relevance to pan-genomic representations.
Presentation at 2019 ASHG GRC/GIAB workshop describing goals and progress of the telomere-to-telomere consortium to generate a genome assembly that provides representation of all sequences, including repetitive regions.
Presentation at 2019 ASHG GRC/GIAB workshop describing features and recent updates to the vg toolkit, including examples of comparisons to other methods used for alignment and variant detection.
Presentation at 2019 ASHG GRC/GIAB workshop describing recent updates to the MANE project, which aims to provide matched annotation from RefSeq and GENCODE.
Presentation at PanGenomics in the Cloud Hackathon, run by NCBI at UCSC (https://ncbiinsights.ncbi.nlm.nih.gov/2019/02/06/pangenomics-cloud-hackathon-march-2019/). Presents points to consider about the adoption of a pangenome reference, emphasizing aspects for long-term data management and wide-spread adoption.
Presentation by Benedict Paten at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on updates to the human reference assembly, GRCh38.
Presentation by Valerie Schneider at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on updates to the human reference assembly, GRCh38.
Presentation by Tina Graves-Lindsay at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on production of reference grade assemblies for various human populations.
Presentation by Fritz Sedlazeck at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on characterizing human structural variation.
Presentation by Justin Zook at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on benchmarks for indels and structural variants.
Presentation by Karen Miga at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on centromere assemblies.
Nucleophilic Addition of carbonyl compounds.pptx (SSR02)
Nucleophilic addition is the most important reaction of carbonyls, not just of aldehydes and ketones but also of carboxylic acid derivatives in general.
Carbonyls undergo addition reactions with a large range of nucleophiles.
Comparing the relative basicity of the nucleophile and the product is extremely helpful in determining how reversible the addition reaction is. Reactions with Grignards and hydrides are irreversible. Reactions with weak bases like halides and carboxylates generally don’t happen.
Electronic effects (inductive effects, electron donation) have a large impact on reactivity.
Large groups adjacent to the carbonyl will slow the rate of reaction.
Neutral nucleophiles can also add to carbonyls, although their additions are generally slower and more reversible. Acid catalysis is sometimes employed to increase the rate of addition.
BREEDING METHODS FOR DISEASE RESISTANCE.pptx (RASHMI M G)
Plant breeding for disease resistance is a strategy to reduce crop losses caused by disease. Plants have an innate immune system that allows them to recognize pathogens and provide resistance. However, breeding for long-lasting resistance often involves combining multiple resistance genes.
ESR spectroscopy in liquid food and beverages.pptx (PRIYANKA PATEL)
With an increasing population, people need to rely on packaged foodstuffs. Packaging food requires preserving it. There are various treatments for preserving food, and irradiation is one of them. It is the most common and the most harmless method of food preservation, as it does not alter the necessary micronutrients of food materials. Although irradiated food does not harm human health, quality assessment is still required to provide consumers with the necessary information about the food. ESR spectroscopy is the most sophisticated way to investigate the quality of the food and the free radicals induced during its processing. The ESR spin trapping technique is useful for detecting highly unstable radicals in food. The antioxidant capability of liquid food and beverages is mainly assessed by the spin trapping technique.
The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdfmediapraxi
The rise of virtual labs has been a key tool in universities and schools, enhancing active learning and student engagement.
💥 Let’s dive into the future of science and shed light on PraxiLabs’ crucial role in transforming this field!
The ability to recreate computational results with minimal effort and actionable metrics provides a solid foundation for scientific research and software development. When people can replicate an analysis at the touch of a button using open-source software, open data, and methods to assess and compare proposals, it significantly eases verification of results, engagement with a diverse range of contributors, and progress. However, we have yet to fully achieve this; there are still many sociotechnical frictions.
Inspired by David Donoho's vision, this talk aims to revisit the three crucial pillars of frictionless reproducibility (data sharing, code sharing, and competitive challenges) with the perspective of deep software variability.
Our observation is that multiple layers — hardware, operating systems, third-party libraries, software versions, input data, compile-time options, and parameters — are subject to variability that exacerbates frictions but is also essential for achieving robust, generalizable results and fostering innovation. I will first review the literature, providing evidence of how the complex variability interactions across these layers affect qualitative and quantitative software properties, thereby complicating the reproduction and replication of scientific studies in various fields.
I will then present some software engineering and AI techniques that can support the strategic exploration of variability spaces. These include the use of abstractions and models (e.g., feature models), sampling strategies (e.g., uniform, random), cost-effective measurements (e.g., incremental build of software configurations), and dimensionality reduction methods (e.g., transfer learning, feature selection, software debloating).
I will finally argue that deep variability is both the problem and solution of frictionless reproducibility, calling the software science community to develop new methods and tools to manage variability and foster reproducibility in software systems.
Exposé invité Journées Nationales du GDR GPL 2024
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a... (Ana Luísa Pinho)
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich on features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and quality to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization. 
To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
This presentation explores a brief idea about the structural and functional attributes of nucleotides, the structure and function of genetic materials along with the impact of UV rays and pH upon them.
The thematic appreciation test is a psychological assessment tool used to measure an individual's appreciation and understanding of specific themes or topics. This test helps to evaluate an individual's ability to connect different ideas and concepts within a given theme, as well as their overall comprehension and interpretation skills. The results of the test can provide valuable insights into an individual's cognitive abilities, creativity, and critical thinking skills.
What are greenhouse gases and how many gases are there that affect the Earth (moosaasad1975)
What greenhouse gases are, how they affect the Earth and its environment, what the future of the environment and the Earth may be, and how weather and climate are affected.
Toxic effects of heavy metals: Lead and Arsenic (sanjana502982)
Heavy metals are naturally occurring metallic chemical elements that have relatively high density and are toxic even at low concentrations. All toxic metals are termed heavy metals irrespective of their atomic mass and density, e.g. arsenic, lead, mercury, cadmium, thallium, chromium, etc.
The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences
1. GRC Assembly Analysis Workshop At Genome Informatics
September 21, 2014
The NCBI Eukaryotic Genome Annotation Pipeline
And Alternate Genomic Sequences
Paul Kitts
NCBI
National Center for Biotechnology Information
2. Genomes Annotated By NCBI
Human GRCh38
2014-02-03
Zebrafish GRCz10
in progress
Mouse GRCm38.p2
2013-12-27
3. Outline
• Overview of the NCBI Eukaryotic Genome Annotation Pipeline
• What to do with alternate loci & patch scaffolds?
• How we use the alt/patch/PAR alignments to inform our annotation
• Examples:
– Annotation only on alternate loci
– Different alleles annotated on primary assembly and alternate loci
– Annotation improved by patches
– Pseudoautosomal Regions annotated consistently on X & Y
• Recent enhancements:
– Using RNA-Seq evidence for gene prediction
– Gap-filling gene models using transcript sequences
– Annotation reports
5. Ranking Alignments
• Rank alignments for each query sequence
– using a quality score that combines identity & coverage
– Rank-1 > Rank-2 > Rank-3…
• Conflicting alignments cannot have same rank
– alignments of the same query sequence to an assembly conflict if they have significant overlap (>= 30%)
• A subset of rank-1 alignments is used for annotation
[Figure: spans in alignment A and alignment B, illustrating insignificant and significant overlap]
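The ranking rules on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's implementation: the slide says the quality score "combines identity & coverage" without giving a formula, so the product used here is an assumption, as are the `Alignment` class, the measurement of overlap on the query as a fraction of the shorter span, and the greedy rank assignment.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    query: str       # e.g. "mRNA-F1"
    target: str      # where the query aligned (hypothetical label)
    q_start: int     # start of the aligned span on the query
    q_end: int       # end of the aligned span on the query (exclusive)
    identity: float  # fraction of matching bases, 0..1
    coverage: float  # fraction of the query that is aligned, 0..1


def quality(aln):
    # ASSUMPTION: the product is a stand-in for the real scoring formula.
    return aln.identity * aln.coverage


def overlap_fraction(a, b):
    # ASSUMPTION: overlap is measured on the query, relative to the shorter span.
    shared = min(a.q_end, b.q_end) - max(a.q_start, b.q_start)
    if shared <= 0:
        return 0.0
    return shared / min(a.q_end - a.q_start, b.q_end - b.q_start)


def rank_alignments(alignments, min_overlap=0.30):
    """Rank the alignments of ONE query; conflicting ones never share a rank."""
    ranked = []  # list of (rank, alignment), best first
    for aln in sorted(alignments, key=quality, reverse=True):
        rank = 1
        # Move down until no alignment already at this rank conflicts with aln.
        while any(r == rank and overlap_fraction(aln, other) >= min_overlap
                  for r, other in ranked):
            rank += 1
        ranked.append((rank, aln))
    return ranked
```

With this scheme, two alignments that place the same part of a query in different locations receive different ranks, while alignments covering disjoint parts of the query may both be rank-1.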
6. Annotation Of A Simple Assembly Using Ranked Alignments
[Figure: cartoon with input mRNAs (mRNA-F1, mRNA-F2), genes in the assembly (GeneF1, GeneF2) and sequences Chr1 and Unplaced scaffold1; asterisks mark mismatches. Panels: Align mRNAs; Rank alignments (Rank-1, Rank-2, Rank-3); Filter out alignments that are not rank-1; Resulting annotation]
7. What to do with alternate loci & patch scaffolds?
1. Omit the alternate loci & patch scaffolds
2. Include the alternate loci & patch scaffolds;
no special treatment
3. Include the alternate loci & patch scaffolds;
use known relationships to primary assembly
8. Annotation Omitting Alt-scaffolds
[Figure: Primary Chr1, Alt-scaffold1 and Alt-scaffold2 with the genes/alleles represented in the assembly (labels: Gene1/A, Gene1/B, Gene2, Gene3, G2-Allele-A); input mRNAs mRNA-1A, mRNA-1B, mRNA-2A and mRNA-3A; rank-1 mRNA alignments; resulting annotation]
Scenario 1: no annotation for Gene3
no annotation for Gene1/Allele-B
Scenario 2: Gene3 annotated at the wrong location
no annotation for Gene1/Allele-B
11. Pros & cons of different choices for dealing with
alternate loci & patch scaffolds
1. Omit the alternate loci & patch scaffolds
Pros: Easy to implement
Cons: No representation for genes or alleles only on alts.
Incorrect models for genes that have been patched.
2. Include the alternate loci & patch scaffolds;
no special treatment
Pros: Easy to implement
Cons: Incorrectly annotate genes that have alternate alleles or patches
as if they were paralogs.
Wrongly penalize sequences for having multiple or ambiguous
placements.
3. Include the alternate loci & patch scaffolds;
use known relationships to primary assembly
Pros: Genes only on alts are annotated.
Correctly annotate genes with alternate alleles.
Correctly annotate patched genes
Cons: Requires software and pipelines changes
13. Ranking Alignments Across Assembly Units
• Create graph of related alignments
– Alignments that are collocated or mappable
– Transcript/protein to genomic
– Alt or patch scaffold to primary assembly
• Partition graph into clusters
– Each alignment in the cluster is related to at least one other
alignment in the same cluster
– No alignment is related to any alignment in another cluster
– Split conflicting alignments within a cluster into separate groups
– Merge non-conflicting clusters into groups
• Evaluate groups, sort and assign ranks
– All alignments in a group get the same rank
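The clustering step can be pictured as finding connected components in a graph whose nodes are alignments and whose edges link related alignments. The sketch below is a simplified illustration, not NCBI's code: the function name, the input format, and the rule that a group is scored by its best member are all assumptions, and the splitting of conflicting alignments within a cluster is left out.

```python
def cluster_and_rank(scores, related_pairs):
    """Group related alignments and give every member of a group the same rank.

    scores:        {alignment_id: quality score}
    related_pairs: iterable of (alignment_id, alignment_id) that are collocated
                   or mappable through an alt/patch-to-primary alignment
    Returns {alignment_id: rank}.
    """
    parent = {a: a for a in scores}

    def find(a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Merge related alignments into the same cluster.
    for a, b in related_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for a in scores:
        groups.setdefault(find(a), []).append(a)

    # ASSUMPTION: a group is evaluated by its best-scoring member.
    ordered = sorted(groups.values(),
                     key=lambda g: max(scores[a] for a in g),
                     reverse=True)
    return {a: rank for rank, g in enumerate(ordered, start=1) for a in g}
```

For example, if an mRNA aligns to equivalent locations on the primary assembly and on two alt scaffolds, and those three alignments are linked by alt-to-primary alignments, all three fall into one group and share rank-1, while an unrelated alignment to a paralog ends up at rank-2.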
14. Ranked Alignment Groups Across Assembly Units
[Figure: assembly units Assembly1-Primary, Assembly1-Alt1 and Assembly1-Alt2, showing assembly alignments, mRNA1 and mRNA2 alignments, clusters, and rank groups (Rank-1, Rank-2)]
15. Ranked Alignment Groups Across Assemblies
[Figure: assembly units Assembly1-Primary, Assembly1-Alt1, Assembly1-Alt2, Assembly2-Primary and Assembly3-Primary, showing assembly alignments, mRNA1 and mRNA2 alignments, clusters, and rank groups (Rank-1, Rank-2)]
16. Ranked Alignment Groups Across
Pseudoautosomal Regions (PARs)
[Figure: Chromosome X and Chromosome Y with PAR#1 and PAR#2, showing PAR alignments, mRNA1 and mRNA2 alignments, clusters, and a Rank-1 rank group]
17. Genes Only Annotated On GRCh38 Alternate Loci
NCBI> Gene> "Homo sapiens"[orgn] AND "only annotated on alternate loci in reference assembly"[Text Word] AND gene_nucleotide_pos[filter]
NCBI> Gene> "Homo sapiens"[orgn] AND "only annotated on alternate loci in reference assembly"[Text Word] AND gene_nucleotide_pos[filter] AND "genetype protein coding"[prop] AND srcdb_refseq_known[prop]
Num. Gene Type
20 Protein Coding
40 Protein Coding (model)
21 Pseudogene
32 Pseudogene (model)
32 ncRNA (model)
5 Other
3 Other (model)
18. Different Alleles Annotated On GRCh38
Primary Assembly And Alternate Loci
ALT_REF_LOCI_2
ALT_REF_LOCI_7
NM_001243042.1 comment:
This variant represents the C*07:01:01:01 allele of the HLA-C gene.
NM_002117.5 comment:
This variant represents the C*07:02:01 allele of the HLA-C gene.
19. Annotation Of GRCh37 Improved By Patch Scaffold
EPPK1 gene on primary assembly chromosome 8 has an internal deletion.
EPPK1 gene on patch scaffold is complete.
Primary Assembly chromosome 8
Patch scaffold HG104_HG975_PATCH
21. Recent Enhancements To The Genome Annotation Pipeline:
#1 Using RNA-Seq Evidence For Gene Prediction
[Figure: two bar charts comparing predictions without RNA-Seq and with RNA-Seq: number of genes predicted (axis 0 to 60,000) and number of coding transcripts predicted (axis 0 to 80,000)]
75 organisms annotated with RNA-Seq data
22. Example Of Tracks Made Using RNA-Seq Data
NCBI > GENE > Xenopus (Silurana) tropicalis nbr1 [neighbor of BRCA1 gene 1]
23. Recent Enhancements To The Genome Annotation Pipeline:
#2 Gap-filling Gene Models Using Transcript Sequences
How gap-filling works
[Figure: genomic sequence containing a gap, a transcript alignment with exons 1 to 4, and the resulting RefSeq model]
Reporting of gap-filled regions
25. Summary
Including the alternate loci & patch scaffolds and
using their known relationships to the primary
assembly significantly improves the annotation of
GRC assemblies.
It is worth the extra effort!
26. CREDITS
Genome pipeline infrastructure
Alex Astashyn
Nathan Bouk
Rob Cohen
Mike Dicuccio
Eric Engelson
Olga Ermoloeva
Wratko Hlavina
Lucian Ion
Avi Kimchi
Boris Kiryutin
David Managadze
Eyal Mozes
Terence Murphy
Daniel Rausch
Robert Smith
Sasha Souvorov
Craig Wallin
Alex Zasypkin
Eukaryotic annotation
setup & execution
Françoise Thibaud-Nissen
Jinna Choi
Patrick Masterson
Kim Pruitt and the “genome champions”
from the RefSeq group
Genomic Collections DB
Avi Kimchi
Victor Sapojnikov
Charlie Xiang
Andrey Zherikov
Genome assemblies with alt/patch to primary alignments
Genome Reference Consortium
The Wellcome Trust Sanger Institute
The Genome Institute at Washington University
The European Bioinformatics Institute
The National Center for Biotechnology Information
Eukaryotic Genome Annotation at NCBI: www.ncbi.nlm.nih.gov/genome/annotation_euk/
Editor's Notes
NCBI developed a genome annotation pipeline 14 years ago to annotate draft versions of the human genome assembly. Since then we have continued to develop and improve the annotation pipeline and to apply it to more and more genomes. So far, we have annotated the genomes for over 150 different eukaryotes, >> we last annotated the GRC’s mouse assembly in December, the GRC’s human assembly in February, and our annotation of the GRC’s new zebrafish assembly is humming along as I speak.
Since the theme of this workshop is using the full GRC assemblies, including the alternate loci and patch scaffolds, and making use of the known relationship between these scaffolds and the primary assembly, the focus of my talk will be on how NCBI uses this information in our annotation pipeline. I will begin by giving an overview of the NCBI eukaryotic genome annotation pipeline. Then I will raise the question of what to do with alternate loci and patch scaffolds, sketching out the consequences of some different choices. I will explain how we use the alt, patch, & PAR alignments to inform our annotation, and then show some examples of the results. I will finish by briefly highlighting some recent enhancements to our annotation pipeline.
Here is a flowchart of our genome annotation pipeline. The substrate for annotation is one or more genome assemblies (in grey). The genomic sequences are masked, and transcripts (blue), proteins (green) and RNA-Seq reads (orange) are aligned to the genome. If available for the organism being annotated, curated RefSeq genomic sequences (pink) are also aligned. The alignments of these different types of evidence go through a ranking step, which I will describe in more detail later, and a filtering step, before being fed into the gene model prediction step. The best models are selected from those known RefSeqs that aligned to the genome and from the predicted models. The selected models are then assigned to genes and named. After that, the annotation products are formatted and deployed to NCBI’s public resources: Nucleotide, Protein, Gene, BLAST & FTP. >> In order to talk about how we handle alt and patch scaffolds in our pipeline, I first need to give you more detail on how the alignments are ranked. >>
We rank alignments for each query sequence using a quality score that combines identity & coverage. So the best alignment gets rank-1, the next best alignment gets rank-2, and so on.… Some rank-1 alignments may not be used: because even rank-1 alignments may not be good enough quality, because curated input may exclude a transcript from being placed on certain assemblies or on certain chromosomes, or because the alignments may be rejected later at the best model selection step.
I am now going to show a cartoon of how we annotate a simple assembly using ranked alignments. The genomic sequence is shown in orange, with the locations of a couple of genes from a gene family shown in grey. We have mRNA sequences corresponding to these genes shown in green. >> We start by aligning the mRNAs. Alignments are shown in blue, with red stars indicating mismatches. The F1 mRNA aligns not only to the gene from which it is derived, but also to other genes in the same family, and even to less related sequences. >> The F2 mRNA also aligns to multiple locations. >> So then we rank the alignments of mRNA-F1, >> and the alignments of mRNA-F2. >> We filter out alignments that are not rank-1. >> Keeping only the rank-1 alignments allows us to correctly annotate the two genes in this family. So you see that annotating a simple assembly is pretty straightforward, at least in this cartoon! >> So then the question is…
Here I list three choices.… I will now sketch out the consequences of each of these three choices for gene annotation.
… >> align the mRNAs, rank them, and filter out the non-rank-1 alignments. >> The resulting annotation is good, with Gene1/Allele-A & Gene2 correctly annotated, but incomplete. Our annotation release would be missing Gene3 & Allele-B of Gene1. >> Neither we nor our users would be happy. Since the true home for mRNA-3A is only present on one of the alt loci scaffolds that was omitted, one of two things could happen. In the first scenario I showed you, mRNA-3A fails to align anywhere. >> An alternative scenario is that mRNA-3A aligns imperfectly to another gene or pseudogene. Because the true home for this mRNA is missing, this alignment would get assigned rank-1. >> As a result Gene3 would be annotated at the wrong location. >> Again, neither we nor our users would be happy.
… >> After aligning the mRNAs, ranking them, and filtering out the non-rank-1 alignments, the picture would look like this… >> Using these rank-1 alignments for annotation correctly annotates Gene2, Gene3 and Gene1/Allele-A. But Allele-B of Gene1 on the alt-scaffold is incorrectly annotated as being a different gene because we had nothing to relate alt-scaffold1 to the primary assembly, and consequently failed to recognize that these were variants of the same gene. >> We would not feel good about putting out such annotation.
…In some other pipelines, a sequence that aligns to equivalent locations on both the primary assembly and on an alt or patch scaffold may be penalized for having multiple or ambiguous placements.
The NCBI Eukaryotic Genome Annotation pipeline has been engineered to use alt-to-primary alignments in two steps: first, in the ranking of transcript & protein alignments; and second, in gene assignment & naming. >> We also use input on gene localization to particular chromosomes or assembly-units that our curators maintain. The challenge is how to do ranking across assembly-units. >>
Here is an outline of our algorithm for doing this.
…
All the alignments in the group on the right get assigned rank-1. This is what enables us to annotate a gene consistently when it appears in equivalent locations on more than one assembly unit.
Here I have added two additional assemblies into the picture, along with assembly-assembly alignments that relate segments of assembly 1 to assembly 2 or assembly 3.
>> the algorithm used to cluster related alignments and rank them can be applied to this more complex picture. This allows us to annotate the same gene consistently across multiple assemblies. For example in our most recent human annotation run, we annotated the HuRef and CHM1 assemblies as well as GRCh38.
The algorithm used to cluster related alignments and rank them is also applicable to pseudoautosomal regions. In this case, alignments between the pseudoautosomal regions on chromosome X and chromosome Y are used to cluster transcript alignments for ranking. Enough of the theory. How do we do in practice?
The NCBI Gene resource tags genes only annotated on the alts with the phrase “only annotated on alternate loci in reference assembly”, which makes it easy to retrieve this set of genes. Running this query shows that there are 153 genes only annotated on the alts. >> The search can be further constrained to just the protein coding genes from known RefSeqs, by adding terms to the query. >> Doing variations of this search shows that the genes only annotated on the alts include 20 protein coding genes with known RefSeqs, another 40 protein coding gene models, and a number of pseudogenes, non-coding RNAs and genes of other types.
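The same search can also be run programmatically. The sketch below uses the NCBI E-utilities ESearch endpoint with the query term from the slide; the helper names `build_esearch_url` and `fetch_count` are made up for this illustration, and the counts returned today may differ from the 2014 figures quoted in the talk.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query term taken from the slide (straight quotes).
ALT_ONLY_QUERY = (
    '"Homo sapiens"[orgn] AND '
    '"only annotated on alternate loci in reference assembly"[Text Word] AND '
    'gene_nucleotide_pos[filter]'
)


def build_esearch_url(term, retmax=0):
    """Build an ESearch URL against the Gene database (hypothetical helper)."""
    params = {"db": "gene", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)


def fetch_count(url):
    """Return the number of matching Gene records (requires network access)."""
    with urlopen(url) as response:
        payload = json.load(response)
    return int(payload["esearchresult"]["count"])
```

Appending `AND "genetype protein coding"[prop] AND srcdb_refseq_known[prop]` to the term narrows the search to protein coding genes with known RefSeqs, as in the second query on the slide.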
My next example is different alleles… On the top we have chromosome 6 of the primary-assembly around the HLA-C gene, lower down are a scaffold from ALT_REF_LOCI_2 and a scaffold from ALT_REF_LOCI_7. Between them are grey bars showing the alt-to-primary alignments with red lines indicating mismatches, and blue hour-glasses indicating insertions or deletions. The green bar shows the extent of the gene. The blue bar shows the mRNA, with thick parts representing exons and thin parts introns. The red bar is the coding sequence. The two boxes below point out that our annotation used different RefSeq mRNAs for the HLA-C gene on primary vs ALT_REF_LOCI_2, and the comments on these mRNA records identify them as representing different HLA-C alleles.
NT_113891.3 Homo sapiens chromosome 6 genomic scaffold, GRCh38 alternate locus group ALT_REF_LOCI_2 HSCHR6_MHC_COX_CTG1
NT_167249.2 Homo sapiens chromosome 6 genomic scaffold, GRCh38 alternate locus group ALT_REF_LOCI_7 HSCHR6_MHC_SSTO_CTG1
http://www.ncbi.nlm.nih.gov/nuccore/NC_000006.12?report=graph&from=31268240&to=31272643&strand=true&app_context=Gene&assm_context=GCF_000001405.26
The GRC has not yet released any patches to GRCh38, so I went back to our annotation of GRCh37.p13 for an example of how annotation can be improved by a patch scaffold. On the top is chromosome 8 from the primary assembly in the region of the EPPK1 gene. This primary assembly has an internal deletion in this gene that was corrected in the patch scaffold shown below. The patch-to-primary alignment, shown as this grey bar, helped us to annotate the EPPK1 gene at corresponding locations on the primary assembly chromosome & patch scaffold.
My final example, is…
I just quickly want to tell you about some recent enhancements to the genome annotation pipeline, the first of which is using RNA-Seq data as evidence for gene prediction. We have annotated 75 eukaryotes using RNA-Seq data over the last eighteen months. The chart on the left shows that adding RNA-Seq increased the number of genes predicted by about 20% on average for the organisms in this chart. The chart on the right shows that adding RNA-Seq resulted in an even bigger increase in the number of coding transcripts, as more transcript variants were annotated, an average increase of 88% for these reference assemblies. For many of the less well studied eukaryotes that we have annotated, RNA-Seq data has provided the primary source of evidence for generating gene models.
Here is a graphical view of the Xenopus tropicalis nbr1 gene. We generated four model transcript variants for the nbr1 gene, shown in green. Below the genes track are three tracks generated using RNA-Seq data. The first shows a histogram of the RNA-Seq coverage of the exons, the second track, which looks like a mirror image of the first, shows a histogram of the intron-spanning reads, and the third track shows the intron features inferred from the RNA-Seq alignments.
In April this year we enhanced our annotation pipeline by adding gap-filling of gene models using transcript sequences. What this means is that we can extend model RefSeqs into assembly gaps based on alignments of transcripts or Transcriptome Shotgun Assemblies. How this works is illustrated here. This genomic sequence contains a gap. A transcript aligns to the genomic sequence except for the 5’ end, which falls in the gap in the assembly. We can construct a model RefSeq transcript based on the transcript sequence for exon 1 and the genomic sequence for exons 2, 3 & 4. >> Gap-filled regions are reported in the flat file record of transcript and protein models. <Example is platypus model XM_007659754.1>
Last November we began making annotation reports for each annotation run that we do, both as a web page and as XML files. I will give you just a quick taste of what is in each report. At the top of the report is the Annotation Release information, which includes the date the input data was frozen, and the date the annotation was made public. The assemblies that were annotated are also shown in this section. >> Gene and feature statistics, for example compare Gene or CDS counts between different assemblies or assembly units. >> RefSeq transcript alignment quality report. RefSeq alignments can be used as a metric for relative assembly quality. >> RNA-Seq alignment details
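The gap-filling idea can be illustrated with a toy example: take the genomic sequence for every exon that aligns to real sequence, fall back to the transcript sequence for an exon that lands in an assembly gap, and record which exons were filled. This is a minimal sketch, not the pipeline's method; the `build_model` function, its exon tuple format, the forward-strand-only handling, and the use of `N` to detect gaps are all assumptions made for illustration.

```python
def build_model(genome, transcript, exons):
    """Assemble a model transcript, filling exons that fall in assembly gaps.

    exons: list of (t_start, t_end, g_start, g_end), half-open coordinates on
           the transcript and the genome; g_start/g_end are None when the exon
           has no genomic placement because it lies in a gap.
    Returns (model_sequence, gap_filled_exon_numbers).
    """
    parts = []
    gap_filled = []
    for number, (t0, t1, g0, g1) in enumerate(exons, start=1):
        # ASSUMPTION: a gap is an unplaced exon or a genomic span containing N.
        if g0 is None or "N" in genome[g0:g1]:
            parts.append(transcript[t0:t1])  # sequence taken from the transcript
            gap_filled.append(number)        # reported, as in the flat file record
        else:
            parts.append(genome[g0:g1])      # sequence taken from the genome
    return "".join(parts), gap_filled
```

Keeping the list of filled exons alongside the sequence mirrors the point made in the talk that gap-filled regions are reported on the model records rather than silently merged.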
The GRC has not released any patches to GRCh38 yet, so I went back to our annotation of GRCh37.p13 for an example of how annotation can be improved by a patch scaffold.
This patch scaffold adds a component that extends the chromosome 17 sequence <point in Tiling Path track>. This is reflected in the Patch to Primary alignment, which ends at the junction with the new component.
The DOC2B gene on the primary assembly chromosome is partial (as indicated by the double black arrows and the grey bar). It is missing the last three exons. The extra sequence in the patch included the missing exons, hence, the DOC2B gene on the patch scaffold is complete.
The patch-to-primary alignment helped us to annotate the DOC2B gene at corresponding locations on the primary assembly chromosome & patch scaffold.
NW_004070872.2 Homo sapiens chromosome 17 genomic patch of type FIX, GRCh37.p13 PATCHES HG417_PATCH
NC_000017.10 Homo sapiens chromosome 17, GRCh37.p13 Primary Assembly
http://www.ncbi.nlm.nih.gov/nuccore/NW_004070872.2?report=graph&from=81581&to=126850&strand=true&app_context=Gene&assm_context=GCF_000001405.25