This document summarizes a presentation given by Luke Hickey of Pacific Biosciences on human genome sequencing using PacBio systems. It discusses PacBio sequencing technology developments, sequencing and assembly of the NA12878 genome, and the role of the NIST Genome in a Bottle (GIAB) reference materials. Specifically, it notes that PacBio sequenced the GIAB Ashkenazim trio genomes to high coverage and made the data publicly available. The sequencing and assembly of these genomes helps validate and improve PacBio sequencing technologies and supports the development and release of the trio as new NIST reference materials.
Illumina Infinium sequencing is a next-generation sequencing technique that uses sequencing by synthesis. It involves randomly fragmenting DNA, ligating adapters, and amplifying fragments on a flow cell in clusters through bridge amplification. Sequencing occurs by adding fluorescently labeled, reversible terminator nucleotides one at a time while the fluorescence is detected to determine the sequence of each cluster. This allows for massively parallel sequencing of many DNA fragments simultaneously.
This document discusses different methods for genome sequencing and assembly, including restriction enzyme fingerprinting, marker sequences, and hybridization assays. It focuses on using marker sequences like sequence-tagged sites (STS), expressed sequence tags (ESTs), untranslated regions (UTRs), and single nucleotide polymorphisms (SNPs) to map genomes. Large-insert cloning vectors like BACs and PACs can be used with restriction enzyme fingerprinting and FPC software to assemble contigs and map genomes at a large scale. Marker sequences provide a dense set of physical markers to build accurate physical maps of genomes.
This document provides an overview of phylogenetic analysis concepts and methods. It begins with an introduction to phylogenetic trees and their components. It then covers two main approaches to building trees - using distance methods like neighbor-joining and using optimality criteria like maximum parsimony. Key steps in both approaches like multiple sequence alignment and tree-building algorithms are described. The document concludes with discussing tools for evaluating tree reliability through bootstrapping and exploring available phylogenetics programs.
The National Center for Biotechnology Information (NCBI) was established in 1988 as part of the National Library of Medicine. NCBI houses numerous biomedical databases including those related to genes, proteins, molecular structures, gene expression, and biomedical literature. Users can utilize various tools on the NCBI site to search databases, perform sequence alignments using BLAST, and submit new sequences. Some key databases include GenBank (nucleotide sequences), PubMed (biomedical literature), and RefSeq (non-redundant reference sequences).
BLAST is a program that compares a query DNA or protein sequence against a database to find sequences that resemble the query above a certain threshold. It works by breaking the query into short words and searching the database for those words, then extending any matches. The output includes alignments of high-scoring segment pairs and statistical measures like E-values and bit scores to indicate match significance. BLAST is faster than FASTA and more specific due to low complexity filtering.
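The word-matching and extension steps described above can be sketched in a few lines. This is a simplified illustration, not the BLAST implementation: it uses exact word matches only, ungapped extension, and a basic drop-off rule. The function names, the scoring values, and the `drop` parameter are assumptions chosen for the example.

```python
from collections import defaultdict


def build_word_index(subject, k):
    """Map every k-letter word in the subject to its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - k + 1):
        index[subject[i:i + k]].append(i)
    return index


def extend_hit(query, subject, qpos, spos, k, match=1, mismatch=-1, drop=2):
    """Extend an exact word hit in both directions without gaps.

    Extension in a direction stops once the running score falls more than
    `drop` below the best score seen so far (a simplified drop-off rule).
    Returns (score, query_start, subject_start, length).
    """
    def walk(qi, si, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > drop:
                break
            qi += step
            si += step
        return best, best_len

    right_score, right_len = walk(qpos + k, spos + k, 1)
    left_score, left_len = walk(qpos - 1, spos - 1, -1)
    score = k * match + left_score + right_score
    return score, qpos - left_len, spos - left_len, left_len + k + right_len


def seed_and_extend(query, subject, k=3):
    """Find word hits between query and subject, extend each, rank by score."""
    index = build_word_index(subject, k)
    hits = set()
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            hits.add(extend_hit(query, subject, q, s, k))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    for hit in seed_and_extend("ACGTACGT", "TTACGTACGTTT"):
        print(hit)
```

Each result is a high-scoring segment in the loose sense used above. A real search would additionally score words with a substitution matrix, allow gaps, and attach statistics such as E-values.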
This presentation covers the data and file formats commonly used in next-generation (high-throughput) sequencing analyses, from nucleotide ambiguity codes, FASTA, FASTQ, and quality scores to SAM and BAM, CIGAR strings, and the Variant Call Format (VCF). It was given as part of the EPIZONE Workshop on Next Generation Sequencing Applications and Bioinformatics in Brussels, Belgium, in April 2016.
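Two of the formats mentioned here lend themselves to a short illustration: FASTQ quality strings and CIGAR strings. The sketch below assumes the common Phred+33 quality encoding and the CIGAR operation letters from the SAM specification. The helper names are made up for this example and do not belong to any library.

```python
import re


def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes Phred+33 encoding by default.
    """
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Convert a Phred score Q into an error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar):
    """Count reference bases covered by an alignment.

    M, D, N, = and X consume the reference; I, S, H and P do not.
    """
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


if __name__ == "__main__":
    print(phred_scores("I5!"))
    print(error_probability(20))
    print(parse_cigar("3S10M2I5D4M"), reference_span("3S10M2I5D4M"))
```

The reference span is what a viewer needs in order to know where an alignment ends on the reference, given its start position.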
Next generation sequencing techniques allow for high-throughput DNA sequencing at a lower cost compared to Sanger sequencing. The document focuses on Illumina sequencing and 454 pyrosequencing. In Illumina sequencing, DNA fragments are attached to a flow cell and undergo bridge amplification and sequencing by synthesis using fluorescently labeled nucleotides. 454 pyrosequencing involves emulsion PCR to amplify DNA fragments attached to beads, followed by sequencing using DNA polymerase and a bioluminescent detection of incorporated nucleotides. Both techniques allow for massively parallel sequencing of millions of DNA fragments.
This document provides an overview of DNA sequencing technologies. It begins with a brief history of DNA sequencing, including the discovery of DNA's structure and Sanger sequencing. The document then focuses on next generation sequencing technologies, describing several platforms such as 454 sequencing, Illumina sequencing, Ion Torrent sequencing, and Pacific Biosciences sequencing. It also discusses third generation sequencing and compares the sequencing approaches, workflows, and applications of various sequencing technologies. In conclusion, the document notes the progress and future directions of sequencing, including increased clinical applications and reduced costs.
The next generation sequencing platform of Roche 454 | creativebiogene1
Roche 454 differs fundamentally from Illumina's Solexa and HiSeq platforms. Its main disadvantage is that it cannot accurately measure homopolymer length, so 454 sequencing introduces insertion and deletion errors into the results.
Single nucleotide polymorphism by kk sahu | KAUSHAL SAHU
Single nucleotide polymorphisms (SNPs) are DNA sequence variations that occur when a single nucleotide differs between individuals. SNPs are the most common type of genetic variation and can be found in both coding and non-coding regions. They may alter protein function or gene regulation. SNPs can act as biological markers for locating genes associated with diseases like cancer. Studying SNPs helps researchers understand genetic influences on disease susceptibility and drug responses.
Pathways and genomes databases in bioinformatics | sarwat bashir
The document discusses the PAGED database, which integrates various bioinformatics databases to enable molecular phenotype discovery. PAGED contains over 25,000 gene sets from sources like pathways, disease-gene associations, gene signatures, microRNA targets, and protein-protein interaction networks. It allows users to explore relationships between gene sets and identify pathways, signatures, and modules associated with specific human diseases. The database was designed to integrate data from several sources and allow comprehensive searches and analysis to further biological research.
Microarray technology allows researchers to analyze gene expression levels on a genomic scale. DNA microarrays contain many genes arranged on a slide that can be used to detect differences in gene expression between samples. The microarray workflow involves sample preparation, hybridization of labeled cDNA to the array, image scanning, data normalization and statistical analysis to identify differentially expressed genes between conditions. Multiple testing is a challenge and statistical methods must account for false positives and negatives.
A physical map of a chromosome or genome shows the physical locations of genes and other DNA sequences of interest. Physical maps help scientists identify and isolate genes by positional cloning.
According to the ICSM (Intergovernmental Committee on Surveying and Mapping), there are five different types of maps: General Reference, Topographical, Thematic, Navigation Charts and Cadastral Maps and Plans.
Bioinformatics involves the analysis of biological information using computers and statistical techniques.
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
A sequence alignment is made between a known sequence and an unknown sequence, or between two unknown sequences. The known sequence is called the reference sequence; the unknown sequence is called the query sequence.
BLAST stands for Basic Local Alignment Search Tool. It addresses a fundamental problem in bioinformatics research: comparing a query sequence with a library or database of sequences.
In bioinformatics, BLAST is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotide sequences of DNA and RNA.
BLAST was developed in 1990 on the basis of the stochastic model of Samuel Karlin and Stephen Altschul. They proposed “a method for estimating similarities between the known DNA sequence of one organism with that of another”.
A BLAST search enables a researcher to compare a subject protein or nucleotide sequence (called a query sequence) with a library or database of sequences and identify database sequences that resemble the query sequence above a certain threshold.
An updated version of the genome assembly slides, including mention of techniques such as Hi-C and Bionano. QC is also covered. These are the same slides used in the course for the UNL in Argentina.
The document discusses FASTA, a sequence alignment software tool. It describes the history and development of FASTA, which was originally designed for protein sequence similarity searching and later expanded to support DNA and translated DNA searches. FASTA uses local sequence alignment and heuristic methods to quickly search databases and find similar sequences. It supports various types of searches for protein, nucleotide, and translated sequences.
Gene prediction is the process of determining where a protein-coding gene might lie in a genomic sequence. A protein-coding region must begin with a start codon (where translation begins) and end with a stop codon (where translation ends).
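The start-and-stop-codon idea can be illustrated with a minimal open reading frame (ORF) scan. This is only a sketch under simplifying assumptions: it looks at the forward strand only, uses the standard ATG start and the TAA/TAG/TGA stops, and ignores introns, so it is far from a full gene predictor. The function name and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    `end` is exclusive and includes the stop codon. `min_codons` counts the
    codons before the stop codon, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGCC"))
```

Scanning the reverse complement as well would cover genes encoded on the opposite strand.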
This presentation explains genome sequencing, covering both the traditional method and modern methods. It focuses mainly on next-generation sequencing and its types.
This document provides an overview of downstream analyses that can be performed after variant identification and filtering in a typical variant calling pipeline. It discusses visualization of variant data in each gene to identify potential causative variants. It also mentions association studies as another type of downstream analysis where variants are tested for association with disease phenotypes. The goal of downstream analyses is to help prioritize variants for further investigation.
This document provides an overview of pairwise sequence alignment and BLAST. It discusses how pairwise alignment works using substitution matrices to assign homology between sites. It demonstrates the dynamic programming approach to pairwise alignment calculation and describes how local alignments are identified. The document also introduces BLAST and how it uses word matching to rapidly identify similar sequences in a database and then performs local alignments on matching regions.
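The dynamic programming approach to local alignment mentioned above can be sketched as follows. This is a minimal Smith-Waterman implementation under simplifying assumptions: a flat match/mismatch score stands in for a substitution matrix, and gaps carry a linear penalty. The scoring values are arbitrary illustrative choices.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of a and b by dynamic programming.

    Returns (best_score, aligned_a, aligned_b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below 0, which makes the
    # alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTACGTTT", "GGACGTGG"))
```

Filling the matrix takes time proportional to the product of the two sequence lengths, which is why database search tools use word matching to pick candidate regions before running this kind of alignment.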
- Multiple sequence alignment (MSA) is the alignment of three or more biological sequences, such as DNA, RNA, or protein sequences. It involves identifying regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
- Several algorithms and tools exist for generating MSAs, including ClustalW, T-Coffee, MUSCLE, MAFFT, and MSA. They employ different methods like progressive alignment and iterative refinement. The choice of algorithm depends on factors like number of sequences and computational time.
- MSAs aid in tasks like identifying conserved motifs and domains, phylogenetic analysis, and predicting protein structure and function. Formats like FASTA are used for representing MS
In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences.
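As an illustration of the format described above, here is a minimal FASTA reader. It assumes the conventional layout, in which each record starts with a `>` header line followed by one or more sequence lines. The function name is hypothetical, and the sketch does no validation of the sequence alphabet.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">seq1 example\nACGT\nACGT\n>seq2\nMKV\n"
    for name, seq in parse_fasta(example):
        print(name, seq)
```

Sequence lines are joined because a single sequence is commonly wrapped across several lines.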
2016. Daisuke Tsugama. Next generation sequencing (NGS) for plant research | FOODCROPS
This document provides an overview of next-generation sequencing (NGS) for plant research. It discusses the main NGS platforms, data analysis procedures including assembly, mapping, and applications such as RNA-seq, genome sequencing, RAD-seq, MutMap, and QTL-seq. The document aims to explain what NGS is, typical analysis workflows, and how NGS can be applied to questions in plant research.
The document discusses similarity searches of sequence databases using BLAST and FASTA. It describes the importance of identifying similar sequences to infer shared biological function. BLAST and FASTA use heuristic algorithms to rapidly identify local similarities between query and database sequences. The statistical significance of alignments is assessed using an extreme value distribution to calculate p-values and e-values, which estimate the probability of observing a given alignment score by chance. This allows filtering of random matches and identification of biologically meaningful similarities.
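The relationship between an alignment score and its E-value can be made concrete with the Karlin-Altschul formulas: E = K·m·n·e^(−λS), the bit score S' = (λS − ln K) / ln 2, and P = 1 − e^(−E). The sketch below is a hedged illustration. The λ and K values in the usage example are illustrative placeholders, since the real values depend on the scoring system, and the lengths m and n are made up.

```python
import math


def e_value(raw_score, m, n, lam, K):
    """Expected number of chance alignments scoring at least raw_score.

    m and n are the query and database lengths; lam and K are the
    Karlin-Altschul parameters of the scoring system.
    """
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, K):
    """Normalize a raw score so that E = m * n * 2^(-bit_score)."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def p_value(e):
    """Probability of at least one chance hit, given E: P = 1 - exp(-E)."""
    return -math.expm1(-e)


if __name__ == "__main__":
    lam, K = 0.267, 0.041      # illustrative placeholder parameters
    m, n = 250, 50_000_000     # hypothetical query and database lengths
    for score in (40, 80, 120):
        e = e_value(score, m, n, lam, K)
        print(score, round(bit_score(score, lam, K), 1), e, p_value(e))
```

For small E the P-value and E-value are nearly equal, which is why the E-value alone is usually used as the reporting threshold.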
This document provides an introduction and overview of common methods for processing and analyzing next generation sequencing (NGS) data, including mapping NGS reads and de novo assembly of NGS reads. It discusses various NGS applications such as RNA-Seq, epigenetics, structural variation detection, and metagenomics. Key steps in read alignment such as choosing an alignment program and viewing alignments are outlined. Considerations for choosing an alignment program based on library type, read type, and platform are also reviewed. Popular alignment programs including Bowtie, BWA, TopHat, and Novoalign are mentioned.
The document provides an overview of the history and development of DNA sequencing technologies. It discusses early methods like Sanger sequencing and Maxam-Gilbert sequencing. It then summarizes major next-generation sequencing platforms like Illumina, Pacific Biosciences, and Oxford Nanopore. The document also covers sequencing trends, costs, and considerations for choosing a sequencing platform.
This was presented on Mar 31, 2015 at Boyce Thompson Institute, Ithaca, NY at the 3rd BTI Bioinformatics Course http://btiplantbioinfocourse.wordpress.com/
This was presented on Mar 11, 2014 at Boyce Thompson Institute, Ithaca, NY at the 3rd BTI Bioinformatics Course http://btiplantbioinfocourse.wordpress.com/
High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting ... | Larry Smarr
11.05.13
Invited Presentation
Sanford Consortium for Regenerative Medicine
Salk Institute, La Jolla
Larry Smarr, Calit2 & Phil Papadopoulos, SDSC/Calit2
Title: High Performance Cyberinfrastructure Enabling Data-Driven Science Supporting Stem Cell Research
Cross-Kingdom Standards in Genomics, Epigenomics and Metagenomics | Christopher Mason
This document outlines plans for multi-site sequencing studies to generate standardized human and bacterial genome sequencing datasets. Samples include a human trio, bacterial isolates, and mixtures, which will be sequenced in triplicate across three sites on various platforms including Illumina HiSeq X Ten, HiSeq 4000, HiSeq 2500, NextSeq 500, Life Tech Ion Proton, Ion S5, Pacific Biosciences, Oxford Nanopore, and others. The goals are to measure intra- and inter-lab variation, sequencing performance at GC extremes, and establish molecular standards for assessing sequencing methods in DNA, RNA, and metagenomics. Data will be analyzed by a team to benchmark tools and published by October 2017.
Discovering Drugs with Kafka Streams (Ben Mabey & Scott Nielsen, Recursion Ph... | confluent
The document discusses Recursion Pharmaceuticals' migration of their image analysis pipeline from an on-premise Kafka Streams solution to a cloud-based workflow using Dagger, a workflow library built on Kafka Streams. The pipeline processes large volumes of high-throughput cell imaging data to extract features at different levels and perform downstream analysis. The migration addressed issues with the original system's batch processing approach and lack of real-time feedback. Examples are provided of implementing a sample workflow to extract site-level features using both low-level Kafka Streams APIs and the higher-level Dagger abstraction.
08.04.14
Invited Talk
National Astrobiology Institute Executive Council Meeting
Astrobiology Science Conference 2008
Santa Clara Convention Center
Title: High Performance Collaboration
Santa Clara, CA
The document summarizes the agenda for the Birds of a Feather session at SC12 in Salt Lake City on November 13, 2012. The agenda includes awarding the #1, #2, #1 Asian, and #1 European systems on the 40th TOP500 list. It also provides highlights of the 40th list and discussions on the #1 systems over time, first-time sites on the list, and challenges for exascale algorithms and software. Speakers include Hans Meuer, Erich Strohmaier, Jack Dongarra, and Horst Simon.
In this video from the 2017 Argonne Training Program on Extreme-Scale Computing, Pavan Balaji from Argonne presents an overview of system interconnects for HPC.
Watch the video: https://wp.me/p3RLHQ-hA4
Learn more: https://extremecomputingtraining.anl.gov/
OpenNebulaconf2017US: Rapid scaling of research computing to over 70,000 cor... | OpenNebula Project
Since 2008, Harvard Research Computing has undertaken a significant scaling challenge, increasing its available HPC and storage from 200 cores and 20TB to over 70,000 cores and 35PB of storage. James will discuss the journey and the highlights of extending the computing to support world-class research and education. During the evolution of the computing platforms at Harvard, the team also helped to support and build the Massachusetts Green High Performance Computing Center (MGHPCC), a dedicated high performance research computing facility in Holyoke, MA. This facility continues to support large-scale research computing with sustainable energy and advanced networking. Recently the NESE project (New England Storage Exchange) was funded by the National Science Foundation. This is a multi-petabyte object store that serves the region and is supported by the existing MGHPCC facility. The Data Science Initiative at Harvard has also been recently announced and will require even more advanced computation to support its research faculty. Now, as the world comes to grips with "cloud" and, more importantly, remotely provisioned infrastructure, hybrid models for compute and storage are required, along with the flexibility to further accelerate science. James will discuss the strategy going forward and the infrastructure currently in place to allow for seamless provisioning of research computing. Justin Riley, Team Lead at Harvard, will follow this talk with a deep technical discussion of the specific implementation of the systems that Harvard is designing in concert with the development teams and leadership at OpenNebula to support research computing, making its platforms more resilient and able to continue to scale.
TSUBAME3.0 at Tokyo Tech and ABCI at AIST: AI and big data infrastructures that exploit the latest HPC technologies (NVIDIA Japan)
- The document discusses the latest HPC technologies used in AI/Big Data infrastructures such as TSUBAME3.0 at Tokyo Institute of Technology and ABCI at AIST.
- It provides an overview of the capabilities and achievements of these supercomputers, including TSUBAME2.0 receiving the 2011 ACM Gordon Bell Prize.
- It emphasizes that future supercomputers need to focus on "BYTES" capabilities like bandwidth and capacity to better support large-scale data processing for AI/Big Data applications.
Data-driven design of cell factories and communities (Laura Berry)
Presented in the Synthetic Biology & Gene Editing strand of the 4Bio Summit. For more information, visit:
www.global-engage.com
Integration of omics data and systems biology models should optimise knowledge gain from synthetic biology experimentation. However, data is not leveraged effectively, due to a lack of readily available tools. In this presentation, Niko Sonnenschein from the Novo Nordisk Foundation describes a project which aims to make omics data useful to biotechnology and life science research by integrating systems biology with design in one platform.
Generating high-quality human reference genomes using PromethION nanopore seq... (Miten Jain)
The document describes using PromethION nanopore sequencing to generate high-quality human reference genomes. 11 reference genomes were sequenced in 9 days using PromethION, achieving high consensus accuracy (>99%) and continuity. The approach leverages long reads for assembly followed by polishing and scaffolding. This high-throughput and accurate method can generate reference genomes at an estimated cost of $10,000 per genome.
This document discusses some of the big challenges posed by large amounts of genomic data. It notes that while the Human Genome Project took 13 years, a single genome can now be sequenced in weeks due to projects like 1000 Genomes and advances in sequencing technologies from companies like Pacific Biosciences, Oxford Nanopore, and Complete Genomics. However, the large data outputs, for example 150TB/week from the Sanger Institute vs 20TB/week for an individual, far exceed available computing and storage resources. It also argues that the costs of long-term storage and resequencing will become comparable if data volumes continue increasing at their current rates.
An open access resource portal for arthropod vectors and agricultural pathosy... (Surya Saha)
AgriVectors.org is a systems biology resource for vector biologists that aims to provide omics resources and databases to identify targets for interdiction molecules. It utilizes a distributed data schema to rapidly release genome assemblies and transcriptomes. Undergraduate students manually curate genes and pathways of interest from NCBI gene models. The site also provides web-based tools to visualize and analyze high-dimensional experimental data like proteomics and gene expression networks. The goal is to build an ecosystem of integrated resources and tools to study vector-pathogen-host systems important for agriculture.
Functional annotation of invertebrate genomes (Surya Saha)
Functional annotation of the Asian citrus psyllid genome identified genes, assigned gene ontology terms, and mapped genes to pathways. Gene ontology and pathway analysis of differentially expressed genes between infected and uninfected psyllids identified enriched terms involved in the cytoskeleton, endocytosis, and mitochondrial dysfunction. Improved functional annotation using GOanna added depth to the gene ontology annotation and identified additional enriched pathways related to response to hypoxia and regulation of cytoskeletal remodeling.
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ... (Surya Saha)
Rapidly spreading invasive diseases in systems with little or no prior experimental data or resources pose a unique set of challenges for growers, scientists as well as regulators. As a part of a USDA NIFA CAPS project focused on the psyllid, Diaphorina citri, we have released improved genomics resources including high quality genome assemblies and annotation. We have also created an open access web portal for analyses around the Citrus Greening/Huanglongbing disease complex. Citrusgreening.org includes pathosystem-wide resources and bioinformatics tools for multiple Citrus spp. hosts, the Asian citrus psyllid vector (ACP, Diaphorina citri), and multiple pathogens including Candidatus Liberibacter asiaticus (CLas). To the best of our knowledge, this is the first example of a database to use the pathosystem as a holistic framework to understand an insect transmitted plant disease. Users can submit relevant data sets to enable sharing and allow the community to leverage their data within an integrated system. The system includes the metabolic pathway databases CitrusCyc and DiaphorinaCyc with organism specific pathways that can be used to mine metabolomics, transcriptomics and proteomics results to identify pathways and regulatory mechanisms involved in disease response. The Psyllid Expression Network (PEN) contains expression profiles of ACP genes from multiple life stages, tissues, conditions and hosts. The Citrus Expression Network (CEN) contains public expression data from multiple tissues and conditions for various citrus hosts. All tools connect to a central database. The portal also includes electrical penetration graph (EPG) recordings, information about citrus rootstock trials and metabolomics data in addition to traditional omics data types with a goal of combining and mining all information related to the Huanglongbing pathosystem. 
User-friendly manual curation tools will allow the continuous improvement of knowledge base as more experimental research is published. The portal can be accessed at https://citrusgreening.org/.
Updates on Citrusgreening.org database from USDA NIFA project meeting (Surya Saha)
The document discusses the citrusgreening.org portal and its resources for researching citrus greening disease. It provides pathway databases for the Asian citrus psyllid vector and citrus pathogens, as well as expression networks showing gene expression data. It outlines current and future work including a psyllid annotation update, new citrus and psyllid RNA-seq data, and potential methods for studying the insect-pathogen interaction like genomics, transcriptomics, and epigenomics. The document envisions an AgriVectors knowledge base to integrate pathosystem data from multiple sources.
Updates on the ACP v3 genome and annotation from USDA NIFA project meeting (Surya Saha)
ACP version 3 genome, official gene set version 3 and Isoseq transcriptome
Prashant Hosmani, Mirella Flores-Gonzalez, Lukas Mueller, Surya Saha
5th Annual Meeting
Indian River State College
Fort Pierce, FL
AgriVectors: A Data and Systems Resource for Arthropod Vectors of Plant Diseases (Surya Saha)
Arthropod vectors of pathogens cause enormous economic losses and are a fundamental challenge for sustainable increases in food production, yet agricultural pathosystems remain an underserved area of research. To more effectively fight plant diseases, data pertaining to a disease system needs to be consolidated, made searchable and amenable to data mining. The AgriVectors platform is an open access and comprehensive resource for growers, researchers and industry working on plant pathogens and pathosystems spread by arthropod vectors. The portal connects established public repositories with pathosystem-specific data repositories. The AgriVectors system will provide tools to enable technologies such as RNAi, CRISPR, screening bioassays, etc. to leverage current and emerging knowledge across disciplines. It will also include private and unpublished data, using passwords and secure protocols for restricted access. The portal will be based on the Citrusgreening.org (https://citrusgreening.org/) community resource that was developed as a model for systems biology of tritrophic disease complexes. Citrusgreening.org provides omics and biology resources for the Huanglongbing pathosystem. In addition, it includes a biochemical pathway database for each organism in this disease complex, and an expression atlas with proteomics and RNAseq data from psyllids (http://pen.citrusgreening.org) and citrus (http://cen.citrusgreening.org) across multiple infection states. The AgriVectors portal will extend this model beyond gene-centric omics data to the broader Pathosystem-wide information, with integrated pest management, behavioral, plant health, soil health and climate data to incorporate rapid phenotyping information from research trials, building a foundation for more effectively identifying solutions to combat plant diseases.
Visualization of insect vector-plant pathogen interactions in the citrus gree... (Surya Saha)
This document summarizes Surya Saha's presentation on using omics approaches to study the interactions between the Asian citrus psyllid vector, Candidatus Liberibacter asiaticus pathogen, and citrus plants in the citrus greening pathosystem. Key points include the generation of a new reference genome for the Asian citrus psyllid, assembly of genomes for its endosymbionts, development of an online annotation platform for manual gene curation, generation of an isoform-level psyllid transcriptome, analysis of gene expression networks in the psyllid in response to different conditions, and discovery of differences in how psyllid life stages respond transcriptionally to the citrus
Deciphering the genome of Diaphorina citri to develop solutions for the citru... (Surya Saha)
The Asian citrus psyllid (Diaphorina citri Kuwayama) is the insect vector of the bacterium Candidatus Liberibacter asiaticus (CLas), the causal agent for the citrus greening or Huanglongbing disease which threatens the citrus industry worldwide. This vector is the primary target of approaches to stop the transmission of the pathogen. Accurate structural and functional annotation of the psyllid’s gene models and an understanding of its interactions with the pathogenic bacterium, CLas, are required for precise targeting using molecular methods such as RNAi. We opted for manual curation of gene families in the draft genome of D. citri (Diaci v1.1, contig N50 34.4Kb) that have key functional roles in D. citri biology and pathology. The community effort resulted in Official Gene Set v1.0 with more than 500 manually curated gene models across developmental, RNAi regulatory, and immune-related pathways.
Single copy marker analysis of the current genome shows a significant proportion of 3,350 markers conserved in Hemipterans to be missing (25%) with only 74% present in full-length copies. The manual genome annotation also identified a number of misassemblies and missing genes in the current genome. This is, in part, due to the complexity introduced when assembling a heterogeneous sample containing DNA from multiple psyllids and exacerbated by the use of short reads. This challenge is common with insect genomes due to the size of individuals. To improve the quality of the genome assembly, we generated 36.2Gb of PacBio long reads with a coverage of 80X for the 450Mb psyllid genome. The Canu assembler followed by Dovetail Chicago-based scaffolding was used to create an improved assembly (Diaci v2.0) with a contig N50 of 758.7kb and 1906 contigs. The assembly was polished with PacBio and Illumina paired-end reads to remove indel and SNP errors. We are employing Dovetail Chicago and 10X Illumina libraries generated from a single psyllid in conjunction with Bionano optical maps to achieve long-range scaffolding of the genome. We have also generated full-length cDNA transcripts from diseased and healthy tissue from multiple life stages with the PacBio IsoSeq technology. This will be the first time all these methods have been applied to resolve a complex insect genome from a highly heterogeneous sample. The new assembly will be available on https://citrusgreening.org/ which is our portal for all omics resources for the citrus greening disease. We are continuing with the manual curation effort using the improved genome. We will also present how the improved genome and annotation are contributing to the development of molecular interdiction methods to disrupt the vectoring ability of D. citri.
The document discusses quality control of sequencing data. It covers exploration of data files using command line tools, evaluation of read quality metrics like quality scores and length distributions using FastQC, and preprocessing reads by trimming low quality ends and removing short reads using fastq-mcf. Exercises guide exploring a protein fasta file, evaluating quality of Illumina datasets for tomato, and preprocessing the reads.
CitrusCyc: Metabolic Pathway Databases for the C. clementina and C. sinensis... (Surya Saha)
CitrusCyc is a metabolic pathway database for the Citrus clementina and Citrus sinensis genomes. It was constructed using the Pathway Tools software and contains pathways, reactions, enzymes and genes derived from the annotated citrus genomes and the MetaCyc database. The database contains over 25,000 proteins and 40,000 transcripts with EC numbers for both citrus species. It provides visualizations of metabolic pathways and allows for overlay of RNA-seq expression data. Future work includes manual curation of pathways and development of a Meta-CitrusCyc database.
Using Long Reads, Optical Maps and Long-Range Scaffolding to improve the Diap... (Surya Saha)
The document discusses efforts to improve the genome assembly of the Asian citrus psyllid (Diaphorina citri), the insect vector of citrus greening disease. It describes using long read sequencing data from PacBio to generate a new assembly with an N50 of 83kb, a significant improvement over the previous N50 of 34kb. It further discusses additional efforts using technologies like Dovetail scaffolding, 10X Genomics, and optical mapping to further improve scaffolding and resolve haplotypes, with the goal of generating a high-quality reference genome for D. citri.
The document summarizes updates to the tomato genome sequence, including:
1) The tomato genome build SL3.0 integrated over 1000 BAC sequences into the previous build SL2.50, improving contiguity and reducing gaps.
2) The BAC sequences were assembled, aligned to SL2.50, and automatically integrated using a published workflow. Integrated BACs then underwent manual and NCBI validation.
3) Compared to SL2.50, the new build SL3.0 has fewer and smaller sequence gaps, representing an improved tomato genome assembly. Future plans include integrating additional sequences and producing new gene annotations.
This was presented on Mar 31, 2015 at Boyce Thompson Institute, Ithaca, NY at the 3rd BTI Bioinformatics Course http://btiplantbioinfocourse.wordpress.com/
The tomato reference genome is one of the most widely used genomic resources in the Solanaceae as well as the wider plant research community. We frequently receive questions from the community regarding the assembly versions. This session will explain the changes in the current version of the tomato genome (SL2.50). The current tomato genome build contains numerous inter-contig gaps (median 931bp, mean 1869bp) and inter-scaffold gaps (median 210Kbp, mean 525Kbp). Updates will be provided regarding the forthcoming tomato genome build (SL3.0) that will include finished BACs (HTGS phase 3) for closing the gaps.
This document outlines exercises for quality control of NGS data from an Illumina sequencing experiment on tomato ripening stages. The exercises include: 1) evaluating raw fastq files for format and number of sequences; 2) using FastQC to analyze read quality scores, lengths, duplication levels, and k-mer content; and 3) preprocessing the reads using fastq-mcf to trim low quality ends and remove short reads before reanalyzing with FastQC. The goal is to learn how to evaluate NGS read quality and preprocess data prior to downstream analysis.
Sequencing, Genome Assembly and the SGN Platform (Surya Saha)
This talk was presented at IASRI Pusa on June 13th, 2014.
Centre for Agricultural Bioinformatics
Indian Agricultural Statistics Research Institute
Library Avenue, Pusa, New Delhi - 110012 (INDIA)
http://cabgrid.res.in/cabin/
The ability to recreate computational results with minimal effort and actionable metrics provides a solid foundation for scientific research and software development. When people can replicate an analysis at the touch of a button using open-source software, open data, and methods to assess and compare proposals, it significantly eases verification of results, engagement with a diverse range of contributors, and progress. However, we have yet to fully achieve this; there are still many sociotechnical frictions.
Inspired by David Donoho's vision, this talk aims to revisit the three crucial pillars of frictionless reproducibility (data sharing, code sharing, and competitive challenges) with the perspective of deep software variability.
Our observation is that multiple layers — hardware, operating systems, third-party libraries, software versions, input data, compile-time options, and parameters — are subject to variability that exacerbates frictions but is also essential for achieving robust, generalizable results and fostering innovation. I will first review the literature, providing evidence of how the complex variability interactions across these layers affect qualitative and quantitative software properties, thereby complicating the reproduction and replication of scientific studies in various fields.
I will then present some software engineering and AI techniques that can support the strategic exploration of variability spaces. These include the use of abstractions and models (e.g., feature models), sampling strategies (e.g., uniform, random), cost-effective measurements (e.g., incremental build of software configurations), and dimensionality reduction methods (e.g., transfer learning, feature selection, software debloating).
I will finally argue that deep variability is both the problem and solution of frictionless reproducibility, calling the software science community to develop new methods and tools to manage variability and foster reproducibility in software systems.
Invited talk, Journées Nationales du GDR GPL 2024
ESPP presentation to EU Waste Water Network, 4th June 2024 “EU policies driving nutrient removal and recycling and the revised UWWTD (Urban Waste Water Treatment Directive)”
ESR spectroscopy in liquid food and beverages.pptx (PRIYANKA PATEL)
With an increasing population, people need to rely on packaged foodstuffs. Packaging food materials requires that the food be preserved. There are various treatments for preserving food, and irradiation is one of them. It is the most common and the most harmless method of food preservation, as it does not alter the necessary micronutrients of food materials. Although irradiated food does not harm human health, quality assessment of the food is still required to provide consumers with the necessary information about it. ESR spectroscopy is the most sophisticated way to investigate the quality of the food and the free radicals induced during its processing. The ESR spin trapping technique is useful for detecting highly unstable radicals in food. The antioxidant capability of liquid food and beverages is mainly assessed by the spin trapping technique.
The debris of the ‘last major merger’ is dynamically young (Sérgio Sacani)
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the
‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor
collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the
MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space,
because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia
DR3 have positive caustic velocities, making them fundamentally different than the phase-mixed chevrons found in simulations
at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based
on a simple phase-mixing model, the observed number of caustics are consistent with a merger that occurred 1–2 Gyr ago.
We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative
measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data
1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’
did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within
the last few Gyr, consistent with the body of work surrounding the VRM.
When I was asked to give a companion lecture in support of ‘The Philosophy of Science’ (https://shorturl.at/4pUXz) I decided not to walk through the detail of the many methodologies in order of use. Instead, I chose to employ a long standing, and ongoing, scientific development as an exemplar. And so, I chose the ever evolving story of Thermodynamics as a scientific investigation at its best.
Conducted over a period of >200 years, Thermodynamics R&D, and application, benefitted from the highest levels of professionalism, collaboration, and technical thoroughness. New layers of application, methodology, and practice were made possible by the progressive advance of technology. In turn, this has seen measurement and modelling accuracy continually improved at a micro and macro level.
Perhaps most importantly, Thermodynamics rapidly became a primary tool in the advance of applied science/engineering/technology, spanning micro-tech, to aerospace and cosmology. I can think of no better a story to illustrate the breadth of scientific methodologies and applications at their best.
Immersive Learning That Works: Research Grounding and Paths Forward (Leonel Morgado)
We will metaverse into the essence of immersive learning, into its three dimensions and conceptual models. This approach encompasses elements from teaching methodologies to social involvement, through organizational concerns and technologies. Challenging the perception of learning as knowledge transfer, we introduce a 'Uses, Practices & Strategies' model operationalized by the 'Immersive Learning Brain' and ‘Immersion Cube’ frameworks. This approach offers a comprehensive guide through the intricacies of immersive educational experiences and spotlighting research frontiers, along the immersion dimensions of system, narrative, and agency. Our discourse extends to stakeholders beyond the academic sphere, addressing the interests of technologists, instructional designers, and policymakers. We span various contexts, from formal education to organizational transformation to the new horizon of an AI-pervasive society. This keynote aims to unite the iLRN community in a collaborative journey towards a future where immersive learning research and practice coalesce, paving the way for innovative educational research and practice landscapes.
Authoring a personal GPT for your research and practice: How we created the Q... (Leonel Morgado)
Thematic analysis in qualitative research is a time-consuming and systematic task, typically done using teams. Team members must ground their activities on common understandings of the major concepts underlying the thematic analysis, and define criteria for its development. However, conceptual misunderstandings, equivocations, and lack of adherence to criteria are challenges to the quality and speed of this process. Given the distributed and uncertain nature of this process, we wondered if the tasks in thematic analysis could be supported by readily available artificial intelligence chatbots. Our early efforts point to potential benefits: not just saving time in the coding process but better adherence to criteria and grounding, by increasing triangulation between humans and artificial intelligence. This tutorial will provide a description and demonstration of the process we followed, as two academic researchers, to develop a custom ChatGPT to assist with qualitative coding in the thematic data analysis process of immersive learning accounts in a survey of the academic literature: QUAL-E Immersive Learning Thematic Analysis Helper. In the hands-on time, participants will try out QUAL-E and develop their ideas for their own qualitative coding ChatGPT. Participants that have the paid ChatGPT Plus subscription can create a draft of their assistants. The organizers will provide course materials and slide deck that participants will be able to utilize to continue development of their custom GPT. The paid subscription to ChatGPT Plus is not required to participate in this workshop, just for trying out personal GPTs during it.
Sequencing 2017
1. Surya Saha
Sol Genomics Network (SGN)
Boyce Thompson Institute, Ithaca, NY
suryasaha@cornell.edu // Twitter:@SahaSurya
BTI Plant Bioinformatics Course 2017
http://www.acgt.me/blog/2015/3/7/next-generation-sequencing-must-die
2. Earth BioGenome Project (EBP)
3/28/2017 BTI Plant Bioinformatics Course 2017 2
• Complete genome of 1 representative from each eukaryotic family (9000)
• Low coverage sequencing of a species from each of the 150,000 to 200,000 genera
• Budget estimate $4.8 billion
Maybe better to sequence less to higher quality and invest in interpretation???
http://omicsomics.blogspot.com/2017/02/earth-biogenome-project-ill-conceived.html
4. First generation sequencing
Sanger. Annu Rev Biochem. 1988;57:1-28.
Thanks to Nick Loman for the mention
7. Sanger method
Frederick Sanger
13 Aug 1918 – 19 Nov 2013
Won the Nobel Prize for Chemistry in 1958 and 1980. Published the dideoxy chain termination method or “Sanger method” in 1977
http://dailym.ai/1f1XeTB
9. First generation sequencing
• Very high quality sequences (99.999% or Q50)
• Very very low throughput
| Platform | Run Time | Read Length | Reads / Run | Total nucleotides sequenced | Cost / MB |
|---|---|---|---|---|---|
| Capillary Sequencing (ABI3730xl) | 20m-3h | 400-900 bp | 96 or 384 | 1.9-84 Kb | $2400 |
http://www.hindawi.com/journals/bmri/2012/251364/tab1/
11. Use the specific technology used to generate the data
– Illumina HiSeq/MiSeq/NextSeq
– Pacific Biosciences RS I/RS II
– Ion Torrent Proton/PGM
– SOLiD
– Oxford Nanopore
http://www.acgt.me/blog/2015/3/10/next-generation-sequencing-must-diepart-2
12. 454 Pyrosequencing
One purified DNA fragment, to one bead, to one read.
http://www.genengnews.com/
GS FLX Titanium
https://mariamuir.com/wp-content/uploads/2013/04/rip.gif
13. Illumina
| | MiSeq | NextSeq 500/550 | HiSeq 2500/3000/4000 | HiSeq X Ten |
|---|---|---|---|---|
| Output | 15 Gb | 120 Gb | 1500 Gb | 1800 Gb |
| Max number of reads per run | 25 million | 400 million | 5 billion | 6 billion |
| Max read length | 2x300 bp | 2x150 bp | 2x125 to 2x250 bp (RR mode) | 2x150 bp |
| Cost | $99K | $250K | $740K | $10M (10 units) |
Source: Illumina
18. Pacific Biosciences SMRT sequencing
Single Molecule Real Time sequencing
http://smrt.med.cornell.edu/images/pacbio_library_prep-1.gif
RS II
Sequel
19. Pacific Biosciences SMRT sequencing
Error correction methods
Hierarchical genome-assembly process (HGAP)
PBJelly (English et al., PLOS One. 2012)
24. 3/28/2017 BTI Plant Bioinformatics Course 2017 24
http://lab.loman.net/2017/03/09/ultrareads-for-nanopore/
E. coli K-12 MG1655 on a standard FLO-MIN106 (R9.4) flowcell
25. Next generation sequencing
| Platform | Run Time | Read Length | Quality | Total nucleotides sequenced | Cost / MB |
|---|---|---|---|---|---|
| 454 Pyrosequencing | 24h | 700 bp | Q20-Q30 | 1 GB | $10 |
| Illumina MiSeq | 27h | 2x300 bp | >Q30 | 15 GB | $0.15 |
| Illumina HiSeq 2500 | 1 - 10 days | 2x250 bp | >Q30 | 3000 GB | $0.05 |
| Ion Torrent | 2h | 400 bp | >Q20 | 50 MB - 1 GB | $1 |
| Pacific Biosciences | 30m - 4h | 10 kb - >40 kb | >Q50 consensus, >Q10 single | 500 - 1000 MB / SMRT cell | $0.13 - $0.60 |
http://www.hindawi.com/journals/bmri/2012/251364/
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431227
Note: Some figures might be out of date
28. 3/28/2017 BTI Plant Bioinformatics Course 2017 28
http://mms.businesswire.com/media/20150225005296/en/454639/5/GemCodePlatform.jpg
• Long read information from short reads using 14bp bar codes
• Very low input DNA (as low as 0.625 ng)
• Short library preparation time
• 1 ng of DNA is split across 100,000 Gel Bead-in-Emulsions (GEMs)
• Chromium instrument for single-cell RNAseq
GemCode
31. 3/28/2017 BTI Plant Bioinformatics Course 2017 31
Human MHC map
• Sample prep requires very high molecular weight DNA
• Nicks at 10 sites / 100kb
• Individual molecules are assembled into optical maps
• Optical maps and sequences are merged in a hybrid assembly
http://www.bionanogenomics.com/technology/why-genome-mapping/
38. Library Types
Single end
Paired end (PE, 150-300 bp, Fwd:/1, Rev:/2)
Mate pair (MP, 2 Kb to 20 Kb)
[Diagram: forward (F) and reverse (R) read orientations for each library type, with 454/Roche and Illumina labels]
Slide credit: Aureliano Bombarely
39. Implications of Choice of Library
Slide credit: Aureliano Bombarely
• Reads are assembled into a consensus sequence (contig)
• Pair read information links contigs into a scaffold (or supercontig), with gaps shown as Ns (NNNNN)
• Genetic information (markers) or optical maps order scaffolds into a pseudomolecule (or ultracontig)
BTI Plant Bioinformatics Course 2017
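The scaffold step on this slide, where contigs are joined with runs of N standing in for unsequenced gaps, can be illustrated with a small sketch. The function name `build_scaffold` and the toy contigs and gap sizes are hypothetical; in a real assembly the order and gap estimates would come from paired-read, marker, or optical-map evidence rather than being supplied by hand.

```python
from typing import List


def build_scaffold(contigs: List[str], gap_sizes: List[int]) -> str:
    """Join ordered contigs into one scaffold, padding each gap with Ns.

    gap_sizes[i] is the estimated gap (in bases) between contigs[i]
    and contigs[i + 1]. Both inputs are assumed to be already ordered.
    """
    if not contigs:
        raise ValueError("at least one contig is required")
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")

    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)  # unknown bases between two contigs
        parts.append(contig)
    return "".join(parts)


# Toy example: three contigs separated by gaps of 5 and 2 bases.
scaffold = build_scaffold(["ACGT", "GGA", "TTC"], [5, 2])
print(scaffold)
```

The same idea applies one level up: scaffolds ordered by markers or optical maps can be joined into a pseudomolecule with the same N-padding.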
40. Multiplexing Libraries
Use of different tags (4-6 nucleotides) to identify different samples in the same lane/sector.
Slide credit: Aureliano Bombarely
[Figure: reads from two samples, tagged AGTCGT and TGAGCA, pooled for sequencing]
BTI Plant Bioinformatics Course 2017
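A rough sketch of how multiplexed reads might be separated again after sequencing is shown below. It assumes the tag sits at the very start of each read, that all tags have the same length, and that only exact matches count; production demultiplexers usually also tolerate mismatches. The tags AGTCGT and TGAGCA come from the slide, while the function name `demultiplex`, the sample names, and the reads are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def demultiplex(reads: List[str], tag_to_sample: Dict[str, str]) -> Dict[str, List[str]]:
    """Group reads by their leading tag and strip the tag from each read.

    Assumes inline tags of equal length at the 5' end and exact matching.
    Reads with an unknown tag go to the "undetermined" bin.
    """
    tag_len = len(next(iter(tag_to_sample)))
    bins: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_len], read[tag_len:]
        sample = tag_to_sample.get(tag, "undetermined")
        bins[sample].append(insert)
    return dict(bins)


# Hypothetical sample sheet and pooled reads.
tags = {"AGTCGT": "sample_A", "TGAGCA": "sample_B"}
pooled = ["AGTCGTACGTAC", "TGAGCATTTGGA", "AGTCGTGGGCCC", "CCCCCCAAAA"]
by_sample = demultiplex(pooled, tags)
print(by_sample)
```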
42. Fasta files:
It is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes.
-Wikipedia
File Formats
Slide credit: Aureliano Bombarely
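As a minimal illustration of the FASTA format, the sketch below reads records into a dictionary. It relies on the standard convention, not spelled out on the slide, that each record starts with a header line beginning with ">" and that the sequence may wrap over several lines. The function name `parse_fasta` and the example records are hypothetical; for real work a maintained parser such as Biopython's is the safer choice.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record_id: sequence} from an iterable of FASTA text lines."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Header line: use the first word after '>' as the record ID.
            words = line[1:].split()
            current_id = words[0] if words else ""
            records[current_id] = ""
        elif current_id is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Sequence lines are concatenated, so wrapped sequences are rejoined.
            records[current_id] += line
    return records


# Hypothetical example with one nucleotide and one peptide record.
fasta_lines = [">seq1 tomato fragment", "ACGT", "GGCC", ">pep1", "MKV"]
fasta_records = parse_fasta(fasta_lines)
print(fasta_records)
```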
43. Fastq files:
FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.
-Wikipedia
• Single-line ID with the at symbol (“@”) in the first column.
• The sequence can span multiple lines after the ID line.
• Single line with the plus symbol (“+”) in the first column, marking the start of the quality values.
• The “+” line may also repeat the ID.
• Quality values can span multiple lines after the “+” line, but their total length is identical to that of the sequence.
Slide credit: Aureliano Bombarely
File Formats
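The FASTQ rules listed on this slide can be turned into a small parser sketch. Because "@" is also a legal quality character, the sketch decides where a record's quality block ends by comparing its length with the sequence length, not by looking for the next "@". The function name `parse_fastq` and the example reads are hypothetical, and most current FASTQ files simply use four lines per record.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) tuples from FASTQ text lines.

    Handles sequences and quality strings that wrap over several lines.
    """
    i, n = 0, len(lines)
    while i < n:
        header = lines[i].strip()
        i += 1
        if not header:
            continue  # skip blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        read_id = header[1:]

        # Sequence lines run until the '+' separator line.
        seq_parts = []
        while i < n and not lines[i].startswith("+"):
            seq_parts.append(lines[i].strip())
            i += 1
        if i >= n:
            raise ValueError(f"record {read_id!r} has no '+' line")
        i += 1  # skip the '+' line (it may repeat the ID)
        sequence = "".join(seq_parts)

        # Quality lines run until they cover the whole sequence.
        quality = ""
        while i < n and len(quality) < len(sequence):
            quality += lines[i].strip()
            i += 1
        if len(quality) != len(sequence):
            raise ValueError(f"record {read_id!r}: quality and sequence lengths differ")
        yield read_id, sequence, quality


# Hypothetical two-record example; the first record wraps over two lines.
fastq_lines = ["@read1", "ACGT", "TTGA", "+read1", "IIII", "@@II",
               "@read2", "GG", "+", "#I"]
fastq_records = list(parse_fastq(fastq_lines))
print(fastq_records)
```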
46. Quality control: Encoding
http://en.wikipedia.org/wiki/Phred_quality_score
Phred score of a base is:
Qphred = -10 log10 (e)
where e is the estimated error probability of a base
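The Phred formula can be checked with a few lines of Python. The sketch converts between error probability and Phred score, and decodes a FASTQ quality string under the assumption of an ASCII offset of 33 (the Sanger / Illumina 1.8+ convention, which is not stated on the slide; older Illumina pipelines used 64). The function names are hypothetical.

```python
import math
from typing import List


def phred_from_error(e: float) -> float:
    """Phred score for an estimated error probability e: Q = -10 * log10(e)."""
    return -10 * math.log10(e)


def error_from_phred(q: float) -> float:
    """Inverse of the formula above: e = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert FASTQ quality characters to Phred scores.

    The offset of 33 is an assumption (Sanger / Illumina 1.8+ encoding).
    """
    return [ord(ch) - offset for ch in quality]


print(phred_from_error(0.001))   # an error rate of 1 in 1000 is about Q30
print(error_from_phred(20))      # Q20 is an error rate of about 1 in 100
print(decode_quality("I#5"))     # [40, 2, 20]
```

Under this formula the "99.999% or Q50" quoted for Sanger reads corresponds to an error probability of 0.00001 per base.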