This document describes a project submitted by Ashish Singh Tomar to fulfill the requirements for a Bachelor's Degree in Bioinformatics. The project involves developing a next-generation sequencing (NGS) data analysis pipeline using the R statistical package. The document includes sections acknowledging contributions from colleagues and supervisors, as well as chapters outlining the aims, methodology, and results of the project.
1) The document discusses a study analyzing the impact of gene length on detecting differentially expressed genes using RNA-seq technology.
2) The study will first test the reproducibility of RNA-seq and the effect of normalization. It will then compare different statistical tests for identifying differentially expressed genes.
3) Finally, the study will specifically test how gene length impacts the likelihood of a gene being identified as differentially expressed, as longer genes are easier to map with short reads.
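The gene-length effect described above can be illustrated with a minimal sketch. It assumes RPKM-style normalization; the summary does not say which normalization the study uses, and the function name and toy counts below are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads (assumed metric)."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# Hypothetical library size and two hypothetical genes at the same expression level.
TOTAL_MAPPED_READS = 10_000_000

# The 4 kb gene collects four times as many reads as the 1 kb gene.
short_gene_rpkm = rpkm(read_count=100, gene_length_bp=1_000,
                       total_mapped_reads=TOTAL_MAPPED_READS)
long_gene_rpkm = rpkm(read_count=400, gene_length_bp=4_000,
                      total_mapped_reads=TOTAL_MAPPED_READS)

# After length normalization the two expression estimates agree.
print(short_gene_rpkm, long_gene_rpkm)
```

Even when the normalized values agree, the longer gene is backed by more reads, which is one plausible reason it may be called differentially expressed more readily.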
International Journal of Engineering Research and Development
IJERD Editor
This document discusses using artificial neural networks (ANN) and adaptive neuro-fuzzy inference systems (ANFIS) to predict promoter regions in genomic DNA sequences. It analyzes 106 DNA sequences from E. coli, each 57 nucleotides long, labeled as having a promoter region (+ label) or not (- label). ANN and ANFIS classifiers are trained on most of the data and tested on the remaining data using 5-fold cross-validation. The classifiers are evaluated based on accuracy, Matthews correlation coefficient, sensitivity, and specificity metrics. The results show that ANN and ANFIS are promising approaches for identifying promoter regions that compete with existing techniques.
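The four evaluation metrics named above can all be derived from a confusion matrix. The following is a minimal sketch using the standard definitions; the function name and the example counts are hypothetical and are not results from the paper.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, specificity and MCC from confusion counts.

    Assumes at least one positive and one negative example, so the
    sensitivity and specificity denominators are non-zero.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }

# Hypothetical counts that sum to 106 sequences (illustration only).
metrics = classification_metrics(tp=48, tn=47, fp=6, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With 5-fold cross-validation, one would typically compute these values per fold and report their averages.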
This document summarizes an ncRNA analysis pipeline called nc-aReNA. It describes how nc-aReNA can be used to classify and analyze small non-coding RNAs from deep RNA sequencing data. The pipeline performs tasks such as adapter removal, quality control checks, mapping reads to reference databases to identify known ncRNA classes, filtering of rRNA sequences, quantification of ncRNA expression, differential expression analysis, and identification of isomiRs. Two test cases demonstrating ncRNA classification and differential expression analysis using nc-aReNA on mouse datasets are also described.
The Genoox integrated SV platform combines optical genome mapping using Saphyr with next-generation sequencing data to detect structural variants with high sensitivity and specificity. It considers variants identified through both approaches simultaneously throughout the calling pipeline for refined breakpoint positions. Comparison to the GIAB v0.6 truth set showed the Genoox caller identified 1076 true positives out of 1285 variants in high-confidence regions, with 95% of putative false positives actually being correct comparisons and 87% of putative false negatives falling outside Bionano-covered regions.
Bioinformatics tools are essential for analyzing next-generation sequencing (NGS) data. The summary describes the typical stages of NGS data analysis:
1. Primary analysis involves demultiplexing, base calling and quality control to produce fastq files.
2. Secondary analysis maps reads to a reference genome to produce SAM/BAM files and calls variants to produce VCF files.
3. Tertiary analysis annotates and filters variants to prioritize those relevant to disease.
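As a small illustration of the quality-control part of primary analysis, the sketch below reads FASTQ records and computes a mean base quality per read. It assumes four-line records and Phred+33 quality encoding; the function names and the two example reads are hypothetical.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            break  # end of file (or a blank line)
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().strip()
        yield header[1:], sequence, quality  # drop the leading '@'

def mean_phred(quality, offset=33):
    """Average Phred score of a quality string (Phred+33 assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)

# Two made-up reads: 'I' encodes Phred 40 and '!' encodes Phred 0.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGA\n+\n!!II\n")
records = list(read_fastq(example))
for name, sequence, quality in records:
    print(name, sequence, mean_phred(quality))
```

Production pipelines would normally rely on dedicated tools for this step; the sketch only shows what the FASTQ fields contain.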
Analytical Study of Hexapod miRNAs using Phylogenetic Methods
cscpconf
MicroRNAs (miRNAs) are a class of non-coding RNAs that regulate gene expression. Identifying the total number of miRNAs, even in completely sequenced organisms, is still an open problem. However, researchers have been using techniques that can predict a limited number of miRNAs in an organism. In this paper, we have used a homology-based approach for comparative analysis of miRNAs of the Hexapoda group. We have used Apis mellifera, Bombyx mori, Anopheles gambiae and Drosophila melanogaster miRNA datasets from the miRBase repository. We have performed pairwise as well as multiple alignments of the miRNAs available in the repository to identify and analyse conserved regions among related species. Unfortunately, to the best of our knowledge, the miRNA literature does not provide an in-depth analysis of hexapods. We have made an attempt to derive the commonality among the miRNAs and to identify conserved regions that are still not available in miRNA repositories. The results are a good approximation with a small number of mismatches. They are encouraging and may facilitate miRNA biogenesis for hexapods.
Whole genome sequencing analysis involves aligning sequencing reads, calling variants like SNPs, indels and structural variants, and annotating the variants. The pipeline includes alignment with tools like BWA, variant calling with GATK for SNPs/indels and Lumpy/Manta for structural variants, and annotation with Ensembl VEP. WGS can detect more types of variants than exome sequencing and has increasing diagnostic utility for diseases.
The EPIGEN project aims to advance epigenetics research in Italy by creating centralized platforms providing researchers access to new technologies. It has established next generation sequencing, bioinformatics, and cell imaging platforms to overcome limitations in Italian research facilities. The project has over 70 participating labs organized into 8 subprojects. It has already produced over 1.5 trillion bases of sequencing data through its sequencing center and established databases and analysis pipelines through its bioinformatics platform to enable collaborative epigenomics research.
Genome in a bottle for ashg grc giab workshop 181016
GenomeInABottle
Genome in a Bottle (GIAB) provides benchmark genomes to evaluate the accuracy of variant calling from whole genome sequencing. GIAB has characterized 7 human genomes to date, including difficult variants. The benchmark calls continue to evolve as new data and methods are integrated. While current benchmarks enable validation of "easier" variants, GIAB is working to characterize more difficult variants and regions. This will allow validation of clinical tests focused on difficult sites. GIAB data and analyses are openly available to support method development and technology optimization.
This workshop is intended for those who are interested in, and are in the planning stages of, conducting an RNA-Seq experiment. Topics to be discussed include:
* Experimental Design of RNA-Seq experiment
* Sample preparation, best practices
* High throughput sequencing basics and choices
* Cost estimation
* Differential Gene Expression Analysis
* Data cleanup and quality assurance
* Mapping your data
* Assigning reads to genes and counting
* Analysis of differentially expressed genes
* Downstream analysis/visualizations and tables
The document summarizes new services offered by the Genomics and Epigenomics Shared Resource (GESR) at Georgetown, including quantitative DNA methylation analysis, custom SNP genotyping, microarray analysis of SNPs, copy number variation and gene expression, and microRNA profiling. It provides contact information for directors and staff of GESR.
Molecular insight into Gene Expression Using Digital RNAseq: Digital RNAseq W...
QIAGEN
Gene expression profiling is the key to understanding biological pathways and complex cellular systems. In this webinar we will discuss the challenges of targeted RNA-seq data analysis and present the solutions provided by the QIAGEN automated online data analysis tools. Using raw sequencing data from targeted sequencing, the output of the QIAseq primary data analysis tool and the options in QIAseq secondary analysis, such as normalization strategies, will be described. The use of Ingenuity Pathway Analysis (IPA) to unlock the molecular insights buried in experimental data by quickly identifying relationships, mechanisms, functions, and pathways of relevance will be shown with an example.
The document discusses various topics related to molecular profiling and personalized medicine. It describes first generation molecular profiling techniques like gene sequencing, microarrays, and PCR. It then covers next generation sequencing technologies like Roche 454, Illumina, and ABI SOLiD. It also discusses second generation techniques for DNA and RNA profiling including exome sequencing, ChIP-seq, and RNA-seq. Finally, it briefly mentions third generation sequencing and epigenetic profiling.
This document provides an introduction to next-generation sequencing (NGS) technology. It discusses the evolution of genomic science from Sanger sequencing to NGS. The basics of NGS chemistry including library preparation, cluster generation, sequencing, and data analysis are described. Advances in NGS such as paired-end sequencing, tunable coverage, library preparation improvements, and multiplexing are also summarized. Finally, common NGS methods like whole genome sequencing, RNA sequencing, and targeted sequencing are briefly introduced.
Presentation given by Sergi Beltran Agulló, from the CNAG, at the course: Identification and analysis of sequence variants in sequencing projects: fundamentals and tools.
The document describes a presentation given by Gunnar Rätsch on tools for RNA-seq analysis and isoform characterization. It discusses the increasing amounts of biological data and challenges in developing accurate analysis algorithms. The presentation covers multiple tools developed by Rätsch's group for analyzing RNA-seq data, including tools for transcript quantification, multiple read mapping, alternative splicing analysis and detection of novel isoforms. The tools aim to improve RNA-seq analysis for large datasets and characterization of transcript isoforms and splicing.
This document summarizes the Genome in a Bottle (GIAB) Consortium's efforts to characterize structural variants in human genomes to serve as benchmarks. The GIAB Consortium has generated structural variant calls for 7 human genomes using diverse data types and analysis methods. The document describes the GIAB Consortium's process for integrating these data to identify high-confidence structural variant calls to include in version 0.6 of the structural variant benchmark set. It provides examples of different types of structural variants characterized and evaluates the trustworthiness of the benchmark calls based on independent validation. The document also discusses ongoing efforts to further improve structural variant characterization using emerging long-read technologies.
Utilization of NGS to Identify Clinically-Relevant Mutations in cfDNA: Meet t...
QIAGEN
Pancreatic cancer is a uniquely lethal malignancy characterized by frequent mutations in KRAS, CDKN2A, SMAD4, TP53 and many others. We have shown that KRAS mutation can be detected in cell-free, circulating tumor DNA (ctDNA) isolated from the plasma in a subset of patients and is associated with poor prognosis. The ability to simultaneously detect multiple pancreatic cancer-specific mutations in ctDNA would open a new avenue for detection of clinically-relevant mutations. In this study, we performed ultra-deep sequencing of ctDNA from advanced pancreatic cancer patients prior to treatment with Gemcitabine and Erlotinib following target enrichment. Somatic, non-synonymous variants were identified in 29 different genes at allele frequencies typically less than 0.5%. Updated results of ultra-deep NGS analysis will be presented.
This document discusses analytical validation needs for next-generation sequencing, including somatic variants. It notes that targeted sequencing and whole genome sequencing have similar analytical validation requirements but reference data needs to cover all genomic regions. There is limited utility in benchmarking or reference pipelines as custom assay development often uses custom informatics. Increased transparency is needed on exact intervals tested and validation metrics based on variant type and allelic fraction.
This document discusses the history and evolution of DNA sequencing technologies. It begins with early manual sequencing methods developed in the 1970s by Sanger and others. Automated Sanger sequencing and the sequencing of larger genomes followed in the 1980s-1990s. Next generation sequencing (NGS) methods were developed starting in 1996 and became commercially available in 2005, enabling massively parallel sequencing. NGS platforms such as 454, Illumina, and SOLiD are discussed. Third generation real-time sequencing methods such as PacBio and nanopore sequencing are also introduced, providing longer read lengths. The document compares key parameters of different sequencing methods such as read length, accuracy, throughput, cost and advantages/disadvantages.
The document describes Phase II of the ABRF Next Generation Sequencing Study which aims to establish reference data sets for evaluating DNA sequencing performance across multiple platforms and laboratories. Phase II will sequence various human and bacterial genomic samples to assess accuracy, coverage, and limits of detection using different platforms and library preparation methods. A collaboration with NIST Genome in a Bottle will provide standardized samples to the participating laboratories. The study aims to provide a resource for ongoing method development and evaluation of sequencing performance.
This document summarizes a presentation about assembling the major histocompatibility complex (MHC) region of the human genome. It discusses the importance of accurately phasing HLA genes in the MHC region for organ transplantation matching. It describes using long reads, trio sequencing data, and other techniques to generate "perfect" haplotig assemblies of the MHC region with fully phased HLA genes. It acknowledges some remaining challenges like resolving repeats and integrating assembly and mapping-based variant calls to create the most accurate reference. The goal is to solve the complex MHC puzzle at scale using long read technologies to create a next-generation MHC database.
This document discusses various methods for annotating genomes after sequencing and assembly. Sequence analysis approaches like identifying open reading frames can rapidly and inexpensively find some genes, but have weaknesses like false positives and missing short genes. More accurate methods are needed to find non-coding RNAs, pseudogenes, and other elements. As sequencing technologies generate more data, the bottleneck has shifted to analysis, requiring skills in both biology and mathematics. The document provides an example sequence to annotate and poses questions about fast, cheap and accurate annotation methods.
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GenomeInABottle
This document provides an overview of the Genome in a Bottle (GIAB) Consortium's efforts to develop human genome reference materials and benchmarks for evaluating genome sequencing and variant calling. It summarizes the characterization of 7 human genomes, including developing variant calls, regions, and reference values. It also describes new efforts using linked and long reads to characterize structural variants and difficult genomic regions. The goal is to provide reference materials and benchmarks to help evaluate sequencing performance and accuracy across different technologies and algorithms.
Next generation-sequencing.ppt-converted
Shweta Tiwari
The advanced approach that sequences whole genomes efficiently, at high speed and high throughput and at reduced cost, is termed Next Generation Sequencing (NGS) or massively parallel sequencing (MPS).
Course: Bioinformatics for Biomedical Research (2014).
Session: 2.3- Introduction to NGS Variant Calling Analysis.
Statistics and Bioinformatics Unit (UEB) & High Technology Unit (UAT) from Vall d'Hebron Research Institute (www.vhir.org), Barcelona.
1) Pairwise sequence alignment is a method to compare two biological sequences like DNA, RNA, or proteins. It involves arranging the sequences in columns to highlight their similarities and differences.
2) There are many possible alignments between two sequences, but most imply too many mutations. The best alignment minimizes the number of mutations needed to explain the differences between the sequences.
3) For short protein sequences like "QKGSYPVRSTC" and "QKGSGPVRSTC", the optimal alignment implies one single mutation occurred since the sequences diverged from a common ancestor.
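The idea that the best alignment is the one implying the fewest mutations can be sketched with a small dynamic-programming routine. This is a minimal illustration that assumes unit costs for substitutions, insertions and deletions (real alignment tools typically use substitution matrices and gap penalties); the function name is hypothetical.

```python
def edit_distance(seq_a, seq_b):
    """Minimum number of substitutions, insertions and deletions turning seq_a into seq_b."""
    # previous[j] holds the distance between the processed prefix of seq_a
    # and the first j characters of seq_b.
    previous = list(range(len(seq_b) + 1))
    for i, char_a in enumerate(seq_a, start=1):
        current = [i]
        for j, char_b in enumerate(seq_b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

# The two short protein sequences from the summary differ at one position (Y vs G).
mutations = edit_distance("QKGSYPVRSTC", "QKGSGPVRSTC")
print(mutations)
```

Under these unit costs the two sequences are separated by a single substitution, which matches the one-mutation explanation given in the summary.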
The document discusses next generation sequencing methods and RNA sequencing. It covers topics like sequencing formats, data analysis workflows including mapping, clustering, assembly programs, finding new genes and correcting existing ones. It discusses input file types, calculating sequencing depth, available tools for alignment, output file formats, assembly programs, splice junction prediction, and applications of RNA sequencing like gene expression analysis and annotation.
LUGM-Update of the Illumina Analysis PipelineHai-Wei Yen
Illumina VariantStudio is a powerful annotation tool for analyzing and interpreting variants from NGS data. It allows users to import VCF or gVCF files, annotate variants using various databases, filter variants, classify variants, and generate customizable reports. VariantStudio streamlines the analysis workflow from raw data to meaningful biological insights.
IRJET-A System for Recognition of Indian Sign Language for Deaf People using ...IRJET Journal
Manisha D.Raut, Pallavi Dhok, Ketan Machhale, Jaspreet Manjeet Hora "A System for Recognition of Indian Sign Language for Deaf People using Otsu’s Algorithm", International Research Journal of Engineering and Technology (IRJET), Volume2,issue-01 April 2015.e-ISSN:2395-0056, p-ISSN:2395-0072. www.irjet.net
Abstract
Sign Language Recognition System is one of the important researches today in engineering field. Number of methods are been developed recently in the field of Sign Language Recognition for deaf and dumb people. It is very useful to the deaf and dumb people to convey their message to other people. In this paper we proposed some methods, through which the recognition of the signs becomes easy for peoples while communication. We use the different symbols of signs to convey the meanings. And the result of those symbols signs will be converted into the text. In this project, we are capturing hand gestures through webcam and convert this image into gray scale image. The segmentation of gray scale image of a hand gesture is performed using Otsu thresholding algorithm.. Total image level is divided into two classes one is hand and other is background. The optimal threshold value is determined by computing the ratio between class variance and total class variance. To find the boundary of hand gesture in image Canny edge detection technique is used.
This document provides an overview of DNA and protein sequence databases and bioinformatics tools for analyzing genetic and protein sequence data. It lists various databases for nucleotide and protein sequences including GenBank, EMBL, and UniProt. It also outlines tools for tasks like sequence submission, retrieval, assembly, alignment, homology searching, structure prediction, and more. The document serves as a comprehensive reference guide for accessing biological sequence data and analyzing sequences computationally.
1. Use Ensembl to locate the FXN gene on the human genome and identify other genes implicated in pancreatic cancer. Export sequence data for FXN and related genes in FASTA format.
2. Use BLAST to find similar sequences to FXN and related genes and make alignments to identify conserved regions.
3. Access protein structure databases like PDB and view the 3D structure of the frataxin protein to understand its molecular function.
This document provides an introduction to parsimony methods for phylogenetic tree analysis. It begins by defining phylogenetic trees and their use in representing evolutionary relationships. It then describes the basic structure of rooted vs unrooted trees and bifurcating trees. The document introduces parsimony analysis, which works by determining the best fit of character data to potential tree models and finding the most parsimonious trees that minimize evolutionary changes. It concludes by discussing some advantages of parsimony methods in being simple and having good statistical properties, as well as some disadvantages like underestimating branch lengths and having an implicit evolutionary model.
This document discusses sequence alignment and the differences between global and local alignment. It defines sequence alignment as comparing two or more sequences to find identical or similar characters in the same order. Global alignment attempts to align the entire sequences, while local alignment finds the regions of highest similarity that may only be part of the sequences. Dynamic programming is used to calculate optimal alignments through initialization of a scoring matrix, filling it, and tracing back the highest scores. The Needleman-Wunch algorithm performs global alignment, while Smith-Waterman performs local alignment by setting negative scores to zero to terminate early alignments.
To submit a sequence to NCBI, there are two main tools that can be used: BankIt and Sequin. BankIt is a web-based tool for simple submissions like single sequences or small batches. Sequin is an offline software for more complex submissions. The submission process involves providing contact information, release date, reference information, organism name, sequence data, and annotating any features. Valid sequences must be at least 200 nucleotides long unless they are complete exons or non-coding RNA.
Final Year Project-Gesture Based Interaction and Image ProcessingSabnam Pandey, MBA
This document summarizes a student's final year project report on developing a gesture recognition system for browsing pictures. The student aims to implement algorithms for skin and contour detection of a user's hand in real-time images from a webcam. The report includes chapters on literature review of gesture recognition and image processing techniques, methodology using the waterfall model, requirements analysis and design diagrams, implementation details using OpenCV, and testing and evaluation of the project objectives and aims.
Smart Health Prediction Using Data Mining.Data mining is a new powerful technology which is of high interest in computer world. It is a sub field of computer science that uses already existing data in different databases to transform it into new researches and results. It makes use of Artificial Intelligence, machine learning and database management to extract new patterns from large data sets and the knowledge associated with these patterns. The actual task is to extract data by automatic or semi-automatic means. The different parameters included in data mining includes clustering, forecasting, path analysis and predictive analysis.
Hand gesture recognition system(FYP REPORT)Afnan Rehman
This document is a final year project report submitted by three students - Afnan Ur Rehman, Haseeb Anser Iqbal, and Anwaar Ul Haq - for their bachelor's degree in computer science. The report describes the development of a hand gesture recognition system using computer vision and machine learning techniques. Key aspects of the project include image acquisition using a webcam, preprocessing the images using techniques like filtering and noise removal, detecting and cropping the hand region, extracting HU moments features, training a classifier on sample gesture images, and classifying new images using KNN. The system is also able to translate recognized gestures to speech using text-to-speech.
This document discusses the use of 16S ribosomal RNA (rRNA) gene sequencing for bacterial identification and phylogenetic analysis. It explains that the 16S rRNA gene is highly conserved, making it useful for comparing distantly related organisms. The document outlines the process of 16S rRNA gene sequencing, including PCR amplification using conserved primer regions and sequencing of variable regions. It also discusses various methods that have been developed using 16S rRNA, such as TRFLP profiling and ribotyping, to study microbial communities.
This document discusses identifying mutations in the filaggrin gene through sequence analysis. The filaggrin gene codes for filaggrin proteins that are essential for skin barrier function. Mutations in this gene are linked to conditions like eczema and asthma. The study aims to detect faulty filaggrin genes, identify other human and non-human proteins with similar function to filaggrin, and find identical protein sequences to help develop therapeutic options. Sequence alignment methods like pairwise alignment and BLAST will be used to analyze filaggrin genes and identify similar protein sequences.
Psdot 14 using data mining techniques in heartZTech Proje
The document proposes applying data mining techniques to identify suitable heart disease treatments. It discusses using single and hybrid data mining on diagnosis and treatment data to determine if models can reliably predict treatments as they do diagnoses. The proposed system would apply various data mining algorithms to both diagnosis and treatment data to investigate if hybrid models improve treatment prediction accuracy over single techniques.
1. O documento descreve os principais aspectos do projeto de experimento para RNAseq, incluindo o design experimental, a complexidade do transcriptoma, e as aplicações da transcriptômica de próxima geração.
2. Fatores como heterozigozidade, poliploidia, isoformas alternativas de splicing, estágios de desenvolvimento e partes do organismo afetam a complexidade do projeto.
3. O design experimental deve levar esses fatores em consideração para obter dados de RNAseq confiáveis.
The document discusses the history and evolution of DNA sequencing techniques. It describes first generation Sanger sequencing and how next generation sequencing (NGS) allows for massively parallel sequencing of entire human genomes in a single day. The principles of NGS involve fragmenting DNA, ligating adaptors, sequencing in parallel, and reassembling the results. Common NGS methods include sequencing by synthesis, pyrosequencing, and ion semiconductor sequencing. Applications of NGS include rapidly sequencing whole genomes, detecting rare mutations, studying gene expression, and analyzing the human microbiome.
This article investigates the relationship between accuracy, sensitivity, specificity, and positive predictive value (PPV) parameters in detecting differentially expressed genes using microarray data compared to RNA-seq data. The study finds:
1) Sensitivity and specificity parameters show a strong negative correlation, indicating microarrays have difficulty achieving high values for both.
2) Specificity, accuracy, and PPV show good correlation with each other and lower variance, suggesting they reliably estimate a microarray's ability to detect differentially expressed genes.
3) Sensitivity shows poor correlation and higher variance compared to the other parameters, indicating it is not a good sole indicator of a microarray's performance.
TITLE OF THE PROJECT
“Development of NGS data analysis pipeline using
R-statistical package”
A PROJECT SUBMITTED TO
DR. D.Y. PATIL VIDYAPEETH (DEEMED UNIVERSITY)
IN PARTIAL FULFILLMENT OF FOUR YEARS
FULL-TIME DEGREE PROGRAMME
B. TECH BIOINFORMATICS
SUBMITTED BY
Ashish Singh Tomar
UNDER THE GUIDANCE OF
Dr. R. Srivatsan
Institute of Bioinformatics and Applied Biotechnology
Biotech Park,
Electronics City Phase I,
Bangalore 560 100.
DR. D.Y. PATIL BIOTECHNOLOGY & BIOINFORMATICS
INSTITUTE, TATHAWADE, PUNE – 33
(MAY 2012)
CERTIFICATE
This is to certify that Mr. Ashish Singh Tomar has prepared this
project titled “Development of NGS data analysis pipeline using
R-statistical package” under my guidance and to my satisfaction, in
fulfillment of the requirements for the Bachelor's Degree in Bioinformatics.
Signature & Seal of Guide
Guided By
Dr. R. Srivatsan
(Address of the host Institute)
Institute of Bioinformatics and Applied Biotechnology
Biotech Park,
Electronics City Phase I,
Bangalore 560 100.
Director
Dr. D. Y. Patil Biotechnology & Bioinformatics Institute,
Tathawade, Pune - 33
ACKNOWLEDGMENT
First of all, I owe great thanks to my colleagues, who helped me
understand some ideas and contributed their efforts in tackling
problems.
I would especially like to thank my guide, Dr. R. Srivatsan, for his
impeccable support and guidance, without which this project would have been a
daunting task. He took pains to review our progress and made the necessary
corrections and suggestions whenever needed.
I would like to thank IBAB for providing a sound working environment with a
high-end server facility and for giving me the opportunity to work on a good project.
Finally, I am also thankful to Dr. D. Y. Patil Biotechnology and
Bioinformatics Institute, which sent me for this project work.
TABLE OF CONTENTS
CHAPTER I
1.1 INTRODUCTION
1.2 NEXT-GENERATION SEQUENCING TECHNOLOGIES
1.3 TYPES OF SEQUENCING METHODS
1.4 TYPES OF NEXT-GENERATION SEQUENCING TECHNOLOGIES
1.5 APPLICATIONS OF HIGH-THROUGHPUT SEQUENCING
1.6 ANALYSIS OF RNA SEQ DATA
1.7 R AND BIOCONDUCTOR
CHAPTER II
2.1 BACKGROUND
2.2 FILE FORMATS
2.3 ASSEMBLY
2.4 ASSEMBLY ALGORITHMS
2.5 MAPPING
2.6 DEFINITION OF TERMS
CHAPTER III
3.1 AIMS AND OBJECTIVES
3.2 METHODOLOGY
CHAPTER IV
4. RESULTS
CHAPTER V
5. CONCLUSIONS
6. REFERENCES
TABLE OF FIGURES
Fig. 1 Cost graph of different sequencing techniques
Fig. 2 Base calling
Fig. 3 Paired-end sequencing
Fig. 4 Mate-pair sequencing
Fig. 5 Overlap graph and de Bruijn graph
Fig. 6 Flow chart of pipeline
Fig. 7 Overall read quality
Fig. 8 Per-cycle quality score
Fig. 9 Read distribution
Fig. 10 Cycle-specific base calls and read quality
Fig. 11 Per-cycle read quality
Fig. 12 Histogram and weighted histogram of contig coverage
Fig. 13 Dinucleotide frequency
CHAPTER I
1.1 INTRODUCTION
NEXT-GENERATION SEQUENCING
Next-generation sequencing technologies are revolutionizing genomics, and their effects are
increasingly widespread. Genome-wide sequencing has enabled modern biomedical research
to discover more and more biomarkers in healthy as well as disease-affected cells and
tissues. The high demand for low-cost sequencing has driven the development of
high-throughput sequencing technologies that parallelize the sequencing process, producing
thousands or millions of sequences at once, an approach called massively parallel DNA sequencing.
Next-generation high-throughput DNA sequencing techniques are opening fascinating
opportunities in the life sciences. Novel fields and applications in biology and medicine are
becoming a reality, well beyond the original goal of genome sequencing. Examples include
personal genomics, with detailed analysis of individual genome stretches; precise analysis
of RNA transcripts for gene expression, surpassing and in several respects replacing
analysis by various microarray platforms; and precise analysis of DNA regions that interact
with regulatory proteins in the functional regulation of gene expression (ChIP-seq).
Next-generation sequencing technologies offer novel and rapid ways for genome-wide
characterization and profiling of mRNAs, small RNAs, transcription factor binding regions,
chromatin structure and DNA methylation patterns. In gene-expression studies, microarrays
are now being replaced by sequencing-based methods, which can identify and quantify rare
transcripts without prior knowledge of a particular gene and can provide information
regarding alternative splicing and sequence variation in identified genes.
The ability to sequence the whole genomes of many related organisms has allowed large-scale
comparative and evolutionary studies that were unimaginable just a few years ago, for
example metagenomics [1] and the HapMap project [2].
The broadest application of NGS is the resequencing of human genomes to enhance our
understanding of how genetic differences affect health and disease and to characterize the
differences between individuals at the genomic level. Understanding how a small change in
a genome gives rise to different phenotypes will lead to the development of personalized and
preventative medicine. The power of next-generation sequencing is increasingly exploited to
re-sequence strains and genomes of individuals for which reference genome sequences are
available, in order to understand genomic diversity. Such studies have identified mutations in
bacterial strains, polymorphisms in worms, structural variation in the human genome and
specific alleles involved in cancer. In addition to the analysis of genome sequences, NGS has
paved the way for new assays and applications such as ChIP-seq, Tn-seq and RNA-seq,
which will greatly advance our understanding of various phenomena at the genomic level.
The principle behind these alternative applications, which have been termed ‘sequence
census’ methods, is simple: complex DNA or RNA samples are directly sequenced to
determine their content, without bacterial cloning as a prerequisite.
Given the vast amount of data produced (currently more than a gigabase per run, and
constantly increasing), developing a sound data storage and management solution
and creating informatics tools to analyze the data effectively are essential to the successful
application of the technology.
Next-generation sequencing technologies allow genomes to be sequenced more quickly and
less expensively than previous techniques (Fig. 1) [3]. Next-generation sequencing has proven
to be an extremely effective technology for molecular counting applications, where the
number of sequence reads provides a digital readout for RNA-seq, ChIP-seq, Tn-seq and
other applications. Biological pathways consist of complex networks of interacting genes
that are responsible for the expression and regulation of other genes. It is therefore essential
to determine quantitative genetic interactions on a genome-wide scale to reveal the hidden
mechanisms of gene regulation in various diseases. For genome-wide annotation and
analysis, the main challenge of genome assembly is the repetitive regions present in most
mammalian genomes, which make it difficult to identify exons or regulatory regions. When a
reference genome is available, short sequence reads are sufficient to map their locations
(except in repeated regions), and once mapped, millions of sequence hits are simply counted
to determine their genomic distribution.
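The counting step described above can be illustrated with a minimal Python sketch. It is not the pipeline developed in this project: the function name count_reads_per_bin, the fixed bin size and the toy read coordinates are all assumptions made for illustration, and the input is a hypothetical stand-in for the output of a read mapper.

```python
from collections import Counter

def count_reads_per_bin(read_starts, bin_size=1000):
    """Count mapped reads per fixed-width genomic bin.

    read_starts: iterable of (chromosome, 0-based start position) tuples,
    a hypothetical stand-in for the output of a read mapper.
    """
    counts = Counter()
    for chrom, pos in read_starts:
        # Integer division assigns each read to the bin containing its start
        counts[(chrom, pos // bin_size)] += 1
    return counts

# Toy example with made-up coordinates (not real data)
reads = [("chr1", 120), ("chr1", 850), ("chr1", 1040), ("chr2", 15), ("chr1", 990)]
counts = count_reads_per_bin(reads, bin_size=1000)
for (chrom, bin_index), n in sorted(counts.items()):
    start = bin_index * 1000
    print(f"{chrom}:{start}-{start + 999}\t{n}")
```

The per-bin counts are the "digital readout" referred to above: in RNA-seq the bins would typically be genes or exons rather than fixed windows.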
Fig. 1 Cost graph of different sequencing techniques.
Next-generation sequencing technologies are now being exploited to analyze not only static
genomes but also dynamic transcriptomes, in an approach termed RNA-seq. With the
advancements in next-generation sequencing, RNA-seq has emerged as a powerful tool for
transcriptome analysis. It has been widely applied to both well-studied model
organisms and non-model organisms to determine their transcript profiles
and to give insights into biological processes. For organisms with an unavailable or incomplete
genome, a reference-based mapping strategy is not suitable. Thus, for organisms with
unsequenced genomes, or for cancer cells with widespread chimeric RNAs, de novo assembly is
essential for transcriptome analysis.
Next generation sequencing has made it possible to generate massively parallel and high
resolution DNA sequence data. Its usefulness in various genomic applications such as
genome-wide detection of SNPs, DNA methylation profiling, mRNA expression profiling
and whole-genome re-sequencing is now well recognized. SNPs and single nucleotide
insertions and deletions (INDELs) were detected by scanning the assembled contigs for
positions where the underlying reads significantly disagreed with the consensus base.
1.2 NEXT-GENERATION SEQUENCING TECHNOLOGIES:
Sequencing technologies include a number of steps that are broadly identified as template
preparation, sequencing, imaging and data analysis. The unique combination of specific
protocols distinguishes one technology from another and determines the type of data
produced from each platform.
Template preparation: In the first step, the DNA is sheared into small pieces
and the pieces are amplified by PCR. The amplified pieces are immobilized
on a solid surface to form templates, which allows millions of DNA templates to be
sequenced at the same time. NGS technologies differ in how templates are
prepared: some use clonally amplified templates and others single-molecule templates. [1]
Sequencing and imaging: The clonally amplified or single-molecule templates are then
sequenced and imaged using Cyclic Reversible Termination (CRT), Sequencing By
Ligation (SBL), Single Nucleotide Addition (SNA, also called pyrosequencing), or Real
Time Sequencing (RTS). [1]
Widely Used Platforms:
1. Pyrosequencing by Roche Diagnostics
2. Sequencing By Ligation (SBL) or SOLiD sequencing by Applied Biosystems
3. Real Time Sequencing by Pacific Biosciences
BASE CALLING:
Base-calling usually refers to the conversion of intensity data into sequences and quality
scores. Intensity information is extracted from images by the image analysis.
Base-calling has two aspects: Identifying the base-call and assigning a confidence
estimate to the call.
1. Identifying the base-call: Making a base-call is usually based on the intensity estimates.
Signal-processing needs to correct for confounding factors:
Frequency cross-talk (optical detection mechanism)
Phasing effects (imperfect chemistry)
Signal decay
2. Assignment of a confidence estimate: Assigning a confidence estimate, or quality
score, to each call is vital for downstream analysis. The Phred method can be extended to
next-generation technologies [4].
The figure below shows how base calls are made.
Fig. 2 Base calling
Although the data produced are similar between platforms, large differences in accuracy
and quality arise, which are reflected in the base-calling error probability given by the Phred score.
These differences in data output should be carefully considered when comparing different
platforms on the basis of data quality, depth of sequencing, number of reads produced and cost.
Phred quality scores were originally developed by the program Phred to help in the
automation of DNA sequencing in the Human Genome Project. Phred quality scores are
assigned to each base call in automated sequencer traces. Phred quality scores have become
widely accepted to characterize the quality of DNA sequences, and can be used to compare
the efficacy of different sequencing methods. Perhaps the most important use of Phred
quality scores is the automatic determination of accurate, quality-based consensus
sequences.
[Fig. 2 callouts: (a) a base for which no Phred score could be calculated; (b) an example of
a base that has been given a Phred score of 10, indicating a 90% probability that the base
is correctly assigned.]
PHRED QUALITY SCORES
A numeric Phred score represents the error probability of a given base call. When a
nucleotide sequence is produced by sequencing, random error results in the possibility that
any given base call may be incorrect. Thus, a quality score is provided for each base. The
phred score can be calculated from the error probability of a given base call:
phred score=-10*log(error probability)/log(10)
Error Probability    Phred Score
1                    0
0.1                  10
0.01                 20
0.001                30
0.0001               40
Phred Quality Table
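The relationship in the table can be checked with a few lines of code. The following is a minimal sketch of the formula given above; the function names are illustrative and do not come from the Phred program itself.

```python
import math

def phred_from_error(p_error):
    """Phred score Q = -10 * log10(P) for an error probability 0 < P <= 1."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10.0 * math.log10(p_error)

def error_from_phred(q):
    """Inverse relation: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)

# Reproduce the rows of the Phred quality table.
for p in (1, 0.1, 0.01, 0.001, 0.0001):
    print(p, round(phred_from_error(p)))
```

The inverse function is convenient when a quality threshold has to be interpreted as an expected error rate, for example Q20 corresponding to one error in 100 base calls.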
When quality scores are used to represent a long sequence (such as in a fastq file), they are
often represented using the ASCII alphabet, adding the number 33 to Phred scores, and 64 to
Illumina scores (The Illumina pipeline produces phred scores, but uses a different ASCII
offset). For example, a Phred score of 40 can be represented as the ASCII char "I"
(40+33=ASCII #73), and an Illumina score of 40 as "h" (40+64=ASCII #104) [12].
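The two ASCII offsets can be illustrated with a short sketch. The helper names below are assumptions made for this example; only the offsets 33 and 64 come from the text above.

```python
def encode_quality(scores, offset=33):
    """Turn a list of integer quality scores into an ASCII quality string."""
    return "".join(chr(q + offset) for q in scores)

def decode_quality(quality_string, offset=33):
    """Turn an ASCII quality string back into integer quality scores."""
    return [ord(c) - offset for c in quality_string]

print(encode_quality([40]))             # offset 33 (Phred): "I"
print(encode_quality([40], offset=64))  # offset 64 (Illumina): "h"
print(decode_quality("I"))              # [40]
```

Decoding a quality string with the wrong offset shifts every score by 31, so it is worth confirming which encoding a file uses before filtering reads on quality.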
PAIRED-END SEQUENCING
Paired-end sequencing is emerging as a key technique for assessing genome rearrangements
and structural variation on a genome-wide scale. Paired end sequencing is a simple
modification to the standard single-read DNA library preparation which facilitates reading
both the forward and reverse template strands. In addition to sequence information, both
reads contain long range positional information, allowing for highly precise alignment of
reads. This technique is particularly useful for detecting copy-neutral rearrangements, such
as inversions and translocations, which are common in cancer and can produce novel fusion
genes. Paired-end sequencing approach allows for a genome-wide survey of all potential
fusion genes and other rearrangements in a tumor.
Paired reads are invaluable for short-read data analysis, as a large fraction of short reads are
difficult to map uniquely to the genome, and the second read of a pair can be used to find the
correct location (it is said that the first read is ‘rescued’ by the second). [6]
MATE PAIR SEQUENCING
Mate Pair Library Sequencing makes it possible to create libraries with inserts from 2 to 5
kb in size. DNA is fragmented into 2-5kb segments that are end-repaired with biotin labeled
dNTPs. The labeled fragments are circularized and then fragmented again into 400-600bp
pieces. Fragments with the biotin labels are enriched, end-repaired, and ligated with adapters
used for downstream processes. The final mate pair library consists of fragments made up of
two DNA segments that were originally separated by 2-5kb. The mate pair library is
hybridized and amplified onto a flow cell followed by paired-end sequencing.
These long-insert Paired-End libraries are useful for a number of applications, including De
Novo Sequencing, genome finishing, and structural variant detection. Combining data
generated from Mate Pair library sequencing with that from short-insert paired-end reads
provides a powerful combination of read lengths for maximal genomic sequencing coverage
across the genome.
Mate pairs are also typically used to discover structural variants (SVs): regions of the
genome that have undergone large-scale mutations, such as inversions and large insertions
and deletions (INDELs). Mate pairs are particularly relevant in genome assembly, especially
for covering repetitive sequences [5].
The figures below illustrate the steps in paired-end and mate pair sequencing. The difference
between the two methods is that mate pair sequencing uses a specific type of library
(biotin labeled) and then follows the same steps as paired-end sequencing. Mate pairs allow
the two reads of a pair to be much farther apart, which can be more informative than the
standard paired-end protocol.
Fig. 3 Paired end sequencing Fig. 4 Mate pair sequencing
1.3 TYPES OF SEQUENCING METHODS:
1.3.1. Pyrosequencing: This method of DNA sequencing is based on the “sequencing by
synthesis” principle. Bases are identified from the intensity of light emitted when a
complementary nucleotide is incorporated opposite the template. The reaction takes a single strand of
the DNA to be sequenced and synthesizes its complementary strand enzymatically.
The method detects the activity of DNA polymerase through a chemiluminescent enzyme
(luciferase), while apyrase degrades unincorporated nucleotides. One nucleotide species is offered
at a time, and the method detects which base was actually added. The template DNA is immobile, and
solutions of A, C, G, and T nucleotides are sequentially added and removed from the reaction. Light is produced
only when the added nucleotide complements the first unpaired base of the template. The
previous nucleotide is degraded before the next nucleotide is added, so that the next base
can be revealed by the resulting intensity of light.
1.3.2. Sequencing by ligation:
This DNA sequencing method uses the enzyme DNA ligase to identify the nucleotide present at
a given position in an unknown DNA sequence. The method relies on the sensitivity of DNA ligase to
base-pairing mismatches. The DNA molecule to be sequenced is single-stranded DNA flanked by a known
sequence, to which an anchor oligonucleotide binds. Short oligonucleotides labeled with fluorescent
dyes are then offered, and DNA ligase joins one of them to the anchor only when
its bases match the unknown DNA sequence. Based on the fluorescence produced by the
ligated molecule, one can infer the identity of the nucleotide at this position in the unknown
sequence. The hybridized product is then cleaved and the same process is repeated.
1.3.3. Single molecule sequencing: Single molecule sequencing is a parallelized single
molecule DNA sequencing-by-synthesis technique. It utilizes a zero-mode waveguide (ZMW),
an optical waveguide that guides light energy into a volume that is small in all
dimensions compared to the wavelength of the light, at the bottom of which a single DNA
polymerase is immobilized with single-stranded DNA as template. The ZMW is a structure that creates an
illuminated observation volume that is small enough to observe only a single nucleotide of
DNA (also known as a base) being incorporated by DNA polymerase. Each of the four DNA
bases is attached to one of four different fluorescent dyes. When a nucleotide is incorporated
by the DNA polymerase, the fluorescent tag is cleaved off and diffuses out of the
observation area of the ZMW where its fluorescence is no longer observable. A detector
detects the fluorescent signal of the nucleotide incorporation, and the base call is made
according to the corresponding fluorescence of the dye.
1.3.4. Nanopore DNA sequencing: A Nanopore is simply a small hole, of the order of 1
nanometer in internal diameter. Certain porous transmembrane cellular proteins act as
nanopores, and nanopores have also been made by etching a somewhat larger hole (several
tens of nanometers) in a piece of silicon, and then gradually filling it in using ion-beam
sculpting methods which results in a much smaller diameter hole: the nanopore. The theory
behind nanopore sequencing is that when a nanopore is immersed in a conducting fluid and
a potential (voltage) is applied across it, an electric current due to conduction of ions
through the nanopore can be observed. The amount of current is very sensitive to the size
and shape of the nanopore. If single nucleotides (bases), strands of DNA or other molecules
pass through or near the nanopore, this can create a characteristic change in the magnitude
of the current through the nanopore.
1.3.5. Semiconductor Sequencing: Ion Semiconductor Sequencing is a method of DNA
sequencing based on the detection of hydrogen ions that are released during the
polymerization of DNA. This is a method of "sequencing by synthesis", during which a
complementary strand is built based on the sequence of a template strand. A microwell
containing a template DNA strand to be sequenced is flooded with a single species of
deoxyribonucleotide triphosphate (dNTP). If the introduced dNTP is complementary to the
leading template nucleotide, it is incorporated into the growing complementary strand. This
causes the release of a hydrogen ion that triggers an ISFET (ion-sensitive field-effect
transistor) ion sensor, which indicates that a reaction has occurred. If homopolymer repeats
are present in the template sequence, multiple dNTP molecules will be incorporated in a
single cycle. This leads to a corresponding number of released hydrogen ions and a
proportionally higher electronic signal.
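The flow-by-flow signal described above can be sketched as follows. This is a simplified illustration rather than a model of the real chemistry: the function names, the flow order "TACG" and the noise-free integer signal are all assumptions made for the example. The input is the complementary strand being synthesized, and each flow reports how many bases were incorporated.

```python
def simulate_flows(new_strand, flow_order="TACG"):
    """Return a list of (nucleotide, signal) pairs, one per dNTP flow.

    The signal is the number of bases incorporated in that flow, so a
    homopolymer run of length n gives a signal of n and a non-matching
    nucleotide gives a signal of 0.
    """
    unknown = set(new_strand) - set(flow_order)
    if unknown:
        raise ValueError("bases not covered by the flow order: %s" % sorted(unknown))
    signals = []
    position = 0
    flow = 0
    while position < len(new_strand):
        nucleotide = flow_order[flow % len(flow_order)]
        incorporated = 0
        # Keep incorporating while the next base matches the flowed dNTP.
        while position < len(new_strand) and new_strand[position] == nucleotide:
            incorporated += 1
            position += 1
        signals.append((nucleotide, incorporated))
        flow += 1
    return signals

def call_bases(signals):
    """Rebuild the sequence by repeating each nucleotide 'signal' times."""
    return "".join(nucleotide * count for nucleotide, count in signals)

flows = simulate_flows("TTAGC")
print(flows)
print(call_bases(flows))
```

In real instruments the signal is analog and noisy, which is why long homopolymer runs are a typical source of insertion and deletion errors in flow-based platforms.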
1.4 TYPES OF NEXT-GENERATION SEQUENCING TECHNOLOGIES
1.4.1 RNA-seq: Also called “Whole Transcriptome Shotgun Sequencing”, RNA-seq is a revolutionary
tool for transcriptomics. It refers to the use of high-throughput sequencing technologies to
sequence cDNA in order to obtain information about a cell's RNA content. RNA sequencing has
emerged as a powerful and cost-effective approach to transcriptome study. De novo assembly of
transcripts provides an important solution for transcriptome analysis in organisms with no
reference genome. RNA-seq provides an efficient way to measure transcriptome data
experimentally, allowing researchers to learn how different alleles of a gene are
expressed, detect post-transcriptional mutations and identify gene fusions.
1.4.2 ChIP-seq: Also known as ChIP-sequencing, ChIP-seq is used to analyze protein interactions
with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel
DNA sequencing to identify the binding sites of DNA-associated proteins. It can be used to
map global binding sites precisely for any protein of interest. It is also used to determine how
transcription factors and other chromatin-associated proteins influence phenotype-affecting
mechanisms. It determines how proteins interact with DNA to regulate gene expression and
is essential for understanding mechanism of biological processes and disease states.[10]
1.4.3 Bisulphite-seq: This is the use of bisulphite treatment of DNA to determine its pattern of
methylation. DNA methylation was the first discovered epigenetic mark, and remains the
most studied. In animals it predominantly involves the addition of a methyl group to the
carbon-5 position of cytosine residues of the dinucleotide CpG, and is implicated in
repression of transcriptional activity. [8]
1.4.4 Tn-seq: Tn-seq is used for accurately determining quantitative genetic interactions on
a genome-wide scale in microorganisms. Tn-seq is based on the assembly of a saturated
Mariner transposon insertion library. After library selection, changes in frequency of each
insertion mutant are determined by sequencing of the flanking regions. These changes are
used to calculate each mutant’s fitness. Due to the wide activity of the Mariner transposon,
Tn-seq has the potential to contribute to the exploration of complex pathways across many
different species [1].
1.5 APPLICATIONS OF HIGH-THROUGHPUT SEQUENCING
1.5.1. The 1000 Genomes Project: More genomes need to be sequenced to learn how
genotype correlates with phenotype. A project to sequence 1000 human genomes has been
prepared, which will allow creation of a reference standard for the analysis of human
genomic variations that is expected to contribute to studies of disease and how genotype
correlates with phenotype. [7]
1.5.2. Targeted sequencing: Currently the whole genome is sampled, which is wasteful if we
are interested in a particular genomic region. Targeted sequencing allows sequencing of only
those portions of the genome in which we are interested. [17]
1.5.3. Human Microbiome Project: Also called the Second Human Genome Project, this project
focuses on analyzing the collection of microbes in and on the human body, which will
contribute in understanding human health and disease. Changes in microbial communities
in the body have been generally linked to immune system function, obesity and cancer. In
future, each individual’s microbiome could eventually become a medical biometric. [18]
1.5.4. Metagenomics Project: The new sequencing technologies will also be useful in
microbial genomics, for example in metagenomics, which measures the genetic diversity
encoded by microbial life in organisms inhabiting a common environment.
An important application is planned by the US DOE Joint Genome Institute (JGI)
which will focus its sequencing efforts on new plant and microbial targets that may
be of use in the development of alternative energies.[19]
The JGI plans to sequence the genome of the marine red alga, which may play an
important environmental role in removing carbon dioxide from the atmosphere.
1.5.5. HapMap Project: This project aims to develop a haplotype map (HapMap) of the
human genome, which will describe the common patterns of human genetic variation. The
project will serve as a resource for researchers to find genetic variants affecting health, disease
and responses to drugs and environmental factors. [20]
1.6 ANALYSIS OF RNA Seq DATA
An RNA-seq experiment results in very large data files. The data analysis involves complex
steps, from FASTQ quality inspection to GO annotation (described later), which together form a
pipeline.
Analyzing high-throughput RNA-seq data requires high-end servers (CentOS) with
large RAM and fast computational speed.
Many tools, open source as well as commercial, exist for NGS data analysis. Commercial
tools for next generation sequencing include Avadis NGS by Strand [16], CLC bio Genomics
Workbench [13], DNAnexus [14], and GenomeQuest [15]. Globally, many
universities and consortia have created online as well as downloadable open source tools
for NGS data analysis.
Among the open source tools, R/Bioconductor based tools are very popular. As explained
below, R/Bioconductor provides a comprehensive framework consisting of thousands of
libraries and tools covering the entire spectrum of bioinformatics data analysis methods.
Detailed description and performance of these algorithms and libraries have been published
in open source journals. We developed a pipeline using R/Bioconductor libraries to analyze
data from RNA seq experiments.
1.7 R AND BIOCONDUCTOR
R is an open source scripting language and environment for statistical computing and
graphics. R provides libraries for performing a wide variety of statistical and mathematical
computations like linear and nonlinear modeling, classical statistical tests, time-series
analysis, classification, clustering, neural networks and many more. Armed with inbuilt
graphical libraries, it is highly versatile and extensible. R provides an Open Source
environment supported by a very large number of communities providing applications in
various fields like mathematics, engineering, business mathematics, education and biology.
One of R's strengths is the ease with which well-designed publication-quality plots can be
produced, including mathematical symbols and formulae where needed. Great care has been
taken over the defaults for the minor design choices in graphics, but the user retains full
control.
R is an integrated suite of software facilities for data manipulation, calculation and graphical
display. It includes
an effective data handling and storage facility,
a suite of operators for calculations on arrays, in particular matrices,
a large, coherent, integrated collection of intermediate tools for data analysis,
graphical facilities for data analysis and display either on-screen or on hardcopy, and
a well-developed, simple and effective programming language which includes
conditionals, loops, user-defined recursive functions and input and output facilities.
The term "environment" is intended to characterize it as a fully planned and coherent
system, rather than an incremental accretion of very specific and inflexible tools, as
is frequently the case with other data analysis software.
R is designed around a true computer language, and it allows users to add additional
functionality by defining new functions. For computationally-intensive tasks, C, C++ and
FORTRAN code can be linked and called at run time. Advanced users can write C code to
manipulate R objects directly.
1.7.1 BIOCONDUCTOR:
Bioconductor is an open development project, contributed by the global scientific
community. Within the framework of R package, developers create and add libraries for
specific applications following package guidelines to make it easier for others to use and
extend the software. Bioconductor [26,27] is an organized effort by the global biology
community that provides libraries and tools within the R framework for the comprehensive
analysis of data from bioinformatics experiments. Bioconductor uses the R statistical
programming language, and is open source and open development.
Bioconductor can import diverse sequence-related file types, including fasta, fastq, BAM,
gff, bed, and wig files, among others. Packages support common and advanced sequence
manipulation operations such as trimming, transformation, and alignment. Domain-specific
analyses include quality assessment, ChIP-seq, differential expression, RNA-seq, and other
approaches.
Bioconductor has extensive facilities for mapping between microarray probe, gene, pathway,
gene ontology, homology and other annotations. Bioconductor has built-in representations
of GO, KEGG, vendor, and other annotations, and can easily access NCBI, BioMart,
UCSC, and other sources. Bioconductor libraries make extensive use of R graphics facilities
for creating sophisticated plots required for NGS data display. Therefore, R/Bioconductor
framework is the natural choice for the developmental platform in our pipeline.
1.7.2 BIOCONDUCTOR PACKAGES USED IN THIS PIPELINE
Biostrings: The Biostrings package from Bioconductor provides an advanced environment
for efficient sequence management and analysis in R. It contains many speed- and memory-efficient
string containers, string matching algorithms, and other utilities, for fast
manipulation of large sets of biological sequences. The objects and functions provided by
Biostrings form the basis for many other sequence analysis packages [21].
ShortRead: The ShortRead package provides input, quality control, filtering, parsing, and
manipulation functionality for short read sequences produced by high throughput
sequencing technologies. While support is provided for many sequencing technologies, this
package is primarily focused on Solexa/Illumina reads [22].
goseq: The goseq package performs Gene Ontology (GO) analysis on RNA-seq data. It detects
GO or other user-defined categories that are over-represented or under-represented
in RNA-seq data. All GO categories associated with a set of genes can be obtained
using the relevant organism package. GO analysis is widely used to reduce
complexity and highlight biological processes in genome-wide expression studies, but
standard methods give biased results on RNA-seq data due to over-detection of differential
expression for long and highly expressed transcripts. Application of goseq to a prostate
cancer data set shows that goseq dramatically changes the results, highlighting categories
more consistent with the known biology [23].
SRAdb: High-throughput sequencing technologies have very rapidly become standard tools
in biology. The data that these machines generate are large and extremely rich. As such, the
Sequence Read Archive (SRA) has been set up to store these data in public repositories
in much the same spirit as microarray databases like NCBI GEO and EBI ArrayExpress.
Accessing data in SRA requires finding it first, and this R package provides a convenient and
powerful framework to do that. In addition, SRAdb features functionality to determine the
availability of sequence files and to download files of interest [24].
biomaRt package: In recent years a huge number of biological databases have become
available in public repositories. Easy access to these valuable data resources and firm
integration with data analysis are needed for comprehensive bioinformatics data analysis.
The package provides an interface to a growing collection of databases implementing the
BioMart software suite. It enables retrieval of large amounts of data in a
uniform way without the need to know the underlying database schemas or write complex
SQL queries. Examples of BioMart databases are Ensembl, UniProt and HapMap.
These major databases give biomaRt users direct access to a diverse set of data and enable a
wide range of powerful online queries from R. BioMart databases can contain several
datasets; for Ensembl, every species is a different dataset [25].
CHAPTER II
2.1 BACKGROUND
Prostate cancer Illumina NGS data were analyzed using the R statistical package. Short reads from
normal and cancerous prostate cells were retrieved from the NCBI SRA under accession numbers
SRX022060, SRX022061, SRX022063, SRX022080, SRX022081 and SRX022083 [28].
These SRA reads are in FASTQ format, with each base call accompanied by its assigned error
probability (Phred score). The FASTQ files are aligned with Bowtie to produce SAM files, from
which a counts file is generated. This counts file is used as the input for differential
expression analysis. This background chapter covers file formats, assembly methods,
assembly algorithms and mapping algorithms.
2.2. FILE FORMATS
2.2.1 FASTQ: FASTQ has emerged as a common file format for sharing sequencing read
data, combining both the sequence and an associated per-base quality score. It is a
text-based format for storing biological sequences obtained from NGS. Both nucleotides and
scores are encoded with a single ASCII character each. It has become the de facto standard format
for storing the output of high-throughput sequencing instruments such as the Illumina Genome
Analyzer.
A FASTQ file normally uses four lines per sequence. Line 1 begins with a '@' character and
is followed by a sequence identifier and an optional description (like a FASTA title line).
Line 2 is the raw sequence letters. Line 3 begins with a '+' character and is optionally
followed by the same sequence identifier (and any description) again. Line 4 encodes the
quality values for the sequence in Line 2, and must contain the same number of symbols as
letters in the sequence.
@HWUSI-EAS582_157:6:1:1:1501/1
NCACAGACACACACGAACACACAAAGACATGCCCATATGAAGAT
+
%.7786867:778556858746575058873/347777476035
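The four-line layout can be read with a small parser. The sketch below is an illustration only, not the reader used in the pipeline (which relies on the ShortRead package). The function name and the quality offset of 33 are assumptions; the record is the example shown above.

```python
import io

def read_fastq(handle, offset=33):
    """Yield (identifier, sequence, quality scores) for each four-line record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\n")
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], sequence, [ord(c) - offset for c in quality]

EXAMPLE = (
    "@HWUSI-EAS582_157:6:1:1:1501/1\n"
    "NCACAGACACACACGAACACACAAAGACATGCCCATATGAAGAT\n"
    "+\n"
    "%.7786867:778556858746575058873/347777476035\n"
)

records = list(read_fastq(io.StringIO(EXAMPLE)))
for name, sequence, scores in records:
    print(name, len(sequence), min(scores), max(scores))
```

The length check mirrors the rule that line 4 must contain the same number of symbols as line 2. Files that wrap sequences over several lines are not handled by this sketch.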
2.2.2 SAM: SAM, which stands for Sequence Alignment/Map, is a TAB-delimited text format
consisting of an optional header section and an alignment section. If present, the
header must precede the alignments. Header lines start with '@', while alignment lines do
not. Each alignment line has 11 mandatory fields for essential alignment information such as
mapping position, and a variable number of optional fields for flexible or aligner-specific
information.
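A minimal sketch of how alignment lines can be split into the 11 mandatory fields and counted per reference sequence is given below. The field names follow the SAM specification; the function names and the example lines are hypothetical and were written for this illustration. Counts here are per reference sequence, which is a simplification of the per-gene counts file used later.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual optional",
)

def parse_alignment(line):
    """Split one alignment line into its 11 mandatory fields plus optional tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("an alignment line needs 11 mandatory fields")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2], pos=int(fields[3]),
        mapq=int(fields[4]), cigar=fields[5], rnext=fields[6], pnext=int(fields[7]),
        tlen=int(fields[8]), seq=fields[9], qual=fields[10], optional=fields[11:],
    )

def count_mapped_reads(lines):
    """Count mapped reads per reference sequence, skipping header lines."""
    counts = {}
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # blank line or header line
        record = parse_alignment(line)
        if record.flag & 0x4:
            continue  # flag bit 0x4 marks an unmapped read
        counts[record.rname] = counts.get(record.rname, 0) + 1
    return counts

# Hypothetical example: two header lines and three alignment lines.
EXAMPLE_SAM = [
    "@HD\tVN:1.0\tSO:unsorted",
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t101\t255\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNM:i:0",
    "read2\t16\tchr1\t250\t255\t8M\t*\t0\t0\tTTGACCAA\tIIIIIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGCCCC\tIIIIIIII",
]
print(count_mapped_reads(EXAMPLE_SAM))
```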
2.2.3 BAM: BAM is a compressed binary version of the SAM format, a compact and indexable
representation of nucleotide sequence alignments. For fast random access, a BAM file can be
indexed, which produces a companion BAI index file.
2.3. ASSEMBLY:
Once sequencing reads have been produced, it is necessary to align them in a coherent
manner. The assembler detects reads which are consistently aligning with each other, thus
forming contiguous sequence known as contigs. Assembler attempts to arrange all the
contigs by their overlapping ends. Sets of contigs which can all be placed together in the
same region are sometimes called supercontigs or scaffolds.
2.3.1 De novo ASSEMBLY:
De novo assembly means assembling short reads without any reference genome by utilizing
the information contained in the short reads themselves, i.e. the details of their overlaps.
This overlapping property is used by the algorithms to form contiguous sequences, which can
then be mapped or aligned to a genome of interest to annotate those contigs. Various
algorithms have been developed to link such overlapping reads.
2.3.2 Reference-based assembly:
A reference genome (also known as a reference assembly) is a digital nucleic acid
sequence database, assembled by scientists as a representative example of a species' set of
genes. As they are often assembled from the sequencing of DNA from a number of donors,
reference genomes do not accurately represent the set of genes of any single individual.
Instead a reference provides a haploid mosaic of different DNA sequences from each donor.
Usually, a genome is chosen as the reference only if the similarity between it and the target
genome is close to 100%. This restriction leads to quite limited application of the
comparative assembly. In our study we chose NCBI36 (hg18) as the reference; most reads
aligned to the reference, but some were rejected.
2.4. ASSEMBLY ALGORITHMS:
There are two basic approaches in algorithms for short-read assemblers: overlap graphs and
de Bruijn graph.
2.4.1 OVERLAP GRAPH: Most assemblers that were developed for Sanger reads follow
the overlap-layout-consensus paradigm. They compute all pair-wise overlap between reads
and store this information as a graph. Each node in the graph corresponds to a read and an
edge denotes an overlap between two reads. The overlap graph is used to compute a layout
of reads and consensus sequence of contigs. This method works best when there is limited
number of reads with significant overlap. Some NGS assemblers use this technique, but the
method is computationally expensive because a large number of reads makes the overlap graph
very large. [Fig. 5][11]
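The pairwise overlap computation can be sketched as follows. This is a naive illustration with exact suffix-prefix matching, not the algorithm of any particular assembler; the function names, the minimum overlap of 3 bases and the example reads are assumptions. The nested loop over all read pairs also shows why the approach becomes expensive as the number of reads grows.

```python
def overlap_length(left, right, min_len=3):
    """Length of the longest suffix of 'left' that equals a prefix of 'right'.

    Returns 0 when no overlap of at least min_len bases exists.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    longest = min(len(left), len(right))
    for n in range(longest, min_len - 1, -1):
        if left[-n:] == right[:n]:
            return n
    return 0

def overlap_graph(reads, min_len=3):
    """Map (i, j) to the overlap length between read i and read j.

    Nodes are read indices; an edge i -> j means the end of read i
    overlaps the start of read j.
    """
    graph = {}
    for i, left in enumerate(reads):
        for j, right in enumerate(reads):
            if i != j:
                n = overlap_length(left, right, min_len)
                if n:
                    graph[(i, j)] = n
    return graph

READS = ["ATGGCGT", "GCGTACA", "TACAGGA"]
print(overlap_graph(READS))
```

Following the edges 0 -> 1 -> 2 and merging the overlapping ends gives the layout from which a consensus contig would be derived.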
2.4.2 de Bruijn GRAPH: As overlap graphs do not scale with an increasing number of reads,
most NGS assemblers use de Bruijn graphs. De Bruijn graphs reduce the computational
effort by breaking reads into smaller sequences of DNA called k-mers, where k denotes the
length in bases of these sequences. The de Bruijn graph captures overlaps of length k-1 between
these k-mers and not between the actual reads. The maximum efficient k-mer size for a
particular assembly is determined by the read length as well as the error rate. The value of the
parameter k has a significant influence on the quality of the assembly. Good values can be
estimated before assembly, but often the optimal value is best found by testing a small
range of values. Another property of de Bruijn graphs is that repeats in the genome are
collapsed in the graph and do not lead to many overlaps, although this does not mean that they
can be more easily bridged or resolved [Fig. 5][26].
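A minimal sketch of the k-mer decomposition is shown below. It uses one common formulation in which nodes are (k-1)-mers and every k-mer contributes an edge from its prefix to its suffix; the function name and example reads are assumptions, and real assemblers add error correction and graph simplification that are omitted here.

```python
def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph as a dict: (k-1)-mer -> set of successor (k-1)-mers."""
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Prefix and suffix of a k-mer overlap by k-2 bases; consecutive
            # k-mers in a read overlap by k-1 bases.
            graph.setdefault(kmer[:-1], set()).add(kmer[1:])
    return graph

READS = ["ATGGCGT", "GCGTACA"]
graph = de_bruijn_graph(READS, 4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

The k-mer GCGT occurs in both reads but produces a single edge, which illustrates how shared or repeated sequence is collapsed. The graph size depends on the number of distinct k-mers rather than on the number of reads.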
Fig. 5. Overlap graph and de Bruijn graph
2.5. Mapping:
Genome mapping is assigning/locating of a specific gene to particular region of a
chromosome and determining the location of and relative distances between genes on the
chromosome. One of the most basic tasks in NGS analysis is the alignment of reads to either
a reference genome or transcriptome.
RNA-seq reads can be mapped to a reference transcriptome with ‘unspliced read aligners’,
which align reads to a reference without allowing any large gaps. The unspliced read aligners
fall into two main categories: ‘seed methods’ and ‘Burrows-Wheeler transform methods’.
2.5.1. Seed methods such as Mapping and Assembly with Quality (MAQ) and Stampy find
matches for short subsequences, termed ‘seeds’, assuming that at least one seed in a read
will perfectly match the reference. Each seed is used to narrow down candidate regions where
more sensitive methods (such as Smith-Waterman) can be applied to extend seeds to full
alignments [1].
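The seed-and-extend idea can be sketched as follows. This is not how MAQ or Stampy are implemented: the function names, seed length and mismatch limit are assumptions, and the extension step simply counts mismatches in an ungapped comparison instead of running a Smith-Waterman alignment.

```python
def build_seed_index(reference, seed_len):
    """Index every seed_len-mer of the reference by its start positions."""
    index = {}
    for i in range(len(reference) - seed_len + 1):
        index.setdefault(reference[i:i + seed_len], []).append(i)
    return index

def map_read(read, reference, index, seed_len, max_mismatches=2):
    """Return (start, mismatches) of the best ungapped placement, or None."""
    candidates = set()
    # Cut the read into non-overlapping seeds and look each one up exactly.
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[offset:offset + seed_len]
        for hit in index.get(seed, []):
            start = hit - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    best = None
    # Extension step (simplified): compare the whole read at each candidate.
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best

REFERENCE = "ACGTTAGCCGATAGGCTTAACGGATC"
SEED_LEN = 4
INDEX = build_seed_index(REFERENCE, SEED_LEN)

# Take 12 bases starting at position 8 and introduce one mismatch.
READ = REFERENCE[8:10] + "T" + REFERENCE[11:20]
print(map_read(READ, REFERENCE, INDEX, SEED_LEN))  # (8, 1)
```

The first seed of the read contains the mismatch and hits an unrelated position, but the two remaining seeds match perfectly and point to the correct location, which is the assumption seed methods rely on.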
2.5.2. In contrast, the second category comprises Burrows-Wheeler transform methods
such as Burrows-Wheeler Aligner (BWA) and Bowtie, which compact the genome into a
data structure that is very efficient when searching for perfect matches. When allowing
mismatches, the performance of Burrows-Wheeler transform methods decreases
exponentially with the number of mismatches as they iteratively perform perfect searches.
Unspliced read aligners are ideal for mapping reads against a reference cDNA databases for
quantification purposes. If the exact reference transcriptome is available, Burrows-Wheeler
methods are faster than seed-based methods. In contrast, when only the reference
transcriptome of a distant species is available, ‘seed methods’ can result in a large increase
in sensitivity [1].
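A minimal sketch of exact matching with a Burrows-Wheeler transform is shown below. It builds the index naively by sorting all suffixes, which is only practical for toy inputs; BWA and Bowtie use far more compact structures. All function names and the toy reference are illustrative assumptions.

```python
def build_bwt_index(text):
    """Return suffix array, BWT string and first-column offsets for text."""
    text += "$"  # sentinel that sorts before A, C, G and T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix
    bwt = "".join(text[i - 1] for i in suffix_array)
    # first[c] = number of characters in text that sort before c
    first, total = {}, 0
    for ch in sorted(set(text)):
        first[ch] = total
        total += text.count(ch)
    return suffix_array, bwt, first

def exact_match(pattern, suffix_array, bwt, first):
    """Backward search: return sorted start positions of pattern in the text."""
    lo, hi = 0, len(bwt)  # half-open range of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + bwt[:lo].count(ch)
        hi = first[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])

reference = "GATTACATTAG"  # hypothetical toy reference
suffix_array, bwt, first = build_bwt_index(reference)
hits = exact_match("TTA", suffix_array, bwt, first)
print(hits)
```

Each pattern character narrows the row range with one lookup, so a perfect match costs time proportional to the read length. Allowing mismatches means repeating this search for every substituted base, which is the source of the exponential slowdown mentioned above.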
2.6 DEFINITION OF TERMS:
MPSS: Massively parallel sequencing encompasses several high-throughput approaches to
DNA sequencing; it is also called next-generation sequencing (NGS) or second-generation
sequencing.
Deep sequencing: Depth in DNA sequencing refers to the number of times a nucleotide is
read during the sequencing process. Deep sequencing indicates that the total number of bases
read is many times larger than the length of the sequence under study. The term
"deep" has been used for a wide range of depths (>7x) and the newer term "ultra-deep" has
appeared in the scientific literature to refer to even higher coverage (>100x).
Coverage: Coverage is the average number of reads representing a given nucleotide in the
reconstructed sequence.
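As a small worked example of the coverage definition, the sketch below computes mean coverage as (number of reads × read length) / target length and labels it using the thresholds quoted above. The function names, the "shallow" label and the run parameters are assumptions made for illustration only.

```python
def mean_coverage(n_reads, read_length, target_length):
    """Average number of reads covering each base of the target."""
    return n_reads * read_length / target_length

def depth_label(coverage):
    """Label a coverage value using the >7x and >100x thresholds from the text."""
    if coverage > 100:
        return "ultra-deep"
    if coverage > 7:
        return "deep"
    return "shallow"  # label chosen here for illustration

# Hypothetical run: one million 36 bp reads over a 3 Mb target
coverage = mean_coverage(n_reads=1_000_000, read_length=36, target_length=3_000_000)
print(coverage, depth_label(coverage))
```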
Contigs: A contig is a contiguous, overlapping sequence resulting from the
reassembly of the small DNA fragments generated by sequencing. The term contig also refers to
the overlapping clones that form a physical map of the genome that is used to guide sequencing
and assembly. Contigs can thus refer both to overlapping DNA sequence and to overlapping
physical segments (fragments) contained in clones, depending on the context.
Supercontigs: A supercontig, also known as a super or a scaffold, is the largest type of
object in an assembly. A supercontig consists of one or more contigs bound together. The
supercontig object includes information about the reads and contigs used to generate it, as
well as quality scores for each base.
Scaffolding: The process of creating supercontigs from contigs is called scaffolding.
N50 Value: The N50 statistic is a length-weighted summary of a set of sequences, with
greater weight given to longer sequences. It is used widely in genome assembly, especially
in reference to contig lengths within a draft assembly. Given a set of sequences of varying
lengths, the N50 length is defined as the largest length N for which at least half of all bases
in the sequences are in sequences of length L >= N.
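A minimal sketch of the N50 calculation follows, using the common convention that N50 is the length of the shortest contig in the smallest set of longest contigs that together contain at least half of all assembled bases. The function name and the contig lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards until half the bases are covered
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50() needs at least one sequence length")

contig_lengths = [80, 70, 50, 40, 30, 20]  # hypothetical contig lengths
print(n50(contig_lengths))
```

Here the total is 290 bases; the two longest contigs (80 + 70 = 150) already cover at least half of it, so the N50 is the length of the second contig.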
CHAPTER III
3.1 AIM AND OBJECTIVES
Next-generation sequencing generates huge numbers of DNA sequence reads, and the major
challenge is to handle such large data efficiently. In this work we aim to develop
a method that exploits all available information to accurately align as many spliced
sequence reads as possible to the genome.
The data contain not only the DNA sequence of the read and the genome, but also quality
information associated with the read and predictions about potential splice sites within the
genome. The pipeline produces plots of read and contig statistics. In
our work we extend the analysis method to also benefit from the read’s quality scores. We
also removed low-quality base calls from the reads by trimming the fastq files, and obtained
better alignment with genomic regions. The quality information helps to decide at which
positions mismatches can be expected and subsequently contributes to the identification of the
correct alignment.
We used R packages to apply powerful statistical methods for data processing, including
differential expression analysis, isoform analysis and small RNA profiling. We
also analyzed the short reads to determine whether de novo assembly can be performed using RNA
data. We designed a fully functional automated pipeline which uses Bioconductor libraries
to analyze HTS data. The analysis can be carried out with various statistical methods such as
the negative binomial model, Bayesian methods and the exact test. We assembled reads both de
novo and by mapping to the genome. After de novo assembly we analyzed the contigs for biological
mechanisms such as intron retention and alternative splicing. In the second method we mapped
the reads using Bowtie and aggregated the counts of reads that mapped uniquely to the genome
in order to find differentially expressed genes.
The pipeline also annotates reads and provides information on which biological pathway they
belong to and which part of it they interact with. The biomaRt package is
used for annotation and for describing KEGG pathways. The flowchart of the
pipeline is given in Fig. 6. We will now describe each component in detail.
FLOW CHART OF PIPELINE
Fig. 6. Flow chart of pipeline
[Flow chart boxes: Fastq files; Q.S. analysis on short reads; Trimming low quality reads;
Generating HTML report for reads; De novo assembly: Velvet; Analyzing statistics of contigs
file; Blastn using standalone BLAST; Comparing BLAST results of cancer & normal; Reference-based
alignment using Bowtie; Generating expression file using SAM file; Performing DGE analysis;
Analyzing GO & KEGG pathway]
3.2 METHODOLOGY
3.2.1. RETRIEVING FASTQ SEQUENCES
High throughput sequencing technologies have very rapidly become standard tools in
biology. The data that these machines generate are large and extremely rich. For this reason
the Sequence Read Archive (SRA) has been set up at NCBI to hold these data. NCBI GEO often
also contains aligned reads for sequencing experiments, and the SRAdb package can help to
provide links to these data.
Command in R to get SRA files:
library(SRAdb)
# Connect to the SRA metadata database (SRAmetadb.sqlite, downloaded beforehand)
sqlfile <- "SRAmetadb.sqlite"
sra_con <- dbConnect(SQLite(), sqlfile)
# Fetch the fastq files for the listed run accessions into the working directory
getFastq(in_acc = c("SRR000648", "SRR000657"), sra_con = sra_con, destdir = getwd())
3.2.2. FASTQ QUALITY INSPECTION USING ShortRead PACKAGE
Inspecting the short reads is necessary to decide the further strategy, i.e. whether we should
perform de novo assembly or map the reads to a reference genome.
The ShortRead package reports quality information; bases whose quality score is less than 20
are removed by trimming. It also reports read yield, base composition and the most common
base, and plots per-cycle quality.
3.2.3. DE-NOVO ASSEMBLY USING VELVET
The Velvet assembler is designed for de novo assembly of sequenced DNA, but it can also be used
for de novo assembly of transcriptomic sequence. De novo assembly of short sequence reads
into transcripts makes it possible to reconstruct the sequences of the full transcriptome,
identify and list all expressed genes, separate isoforms, and capture the expression levels of
transcripts. Velvet is generally used for the assembly of bacterial genomes but is also capable
of de novo assembly of mammalian genomes; here it is applied to short-read RNA-Seq data.
Velvet constructs a de Bruijn graph from large amounts of short-read sequences, then uses an
enumeration algorithm to score all possible paths and branches, and retains the plausible ones
as transcripts/isoforms. Velvet is designed to recover paths supported by actual reads and to
remove ambiguous or erroneous edges, which supports correct transcript reconstruction.
Command:
./velveth output_directory hash_length [[-file_format][-read_type] filename]
./velvetg output_directory -cov_cutoff <coverage_cutoff>
Hash length: 31
Input file: .fastq Output: contigs.fa
3.2.4. ANALYZING STATISTICS OF CONTIGS FILE:
Statistical analysis of the contigs file is necessary to judge whether the contigs produced by
the de novo assembler are useful, i.e. whether the assembled contigs are of good length with
a good quality score. It is an important step in de novo assembly, as it shows whether the
contigs produced can be used for further analysis or whether the reads should instead be mapped
to a reference genome.
We obtained the following plots:
Histogram, weighted histogram and dinucleotide frequency
3.2.5. PERFORMING STANDALONE BLAST
After running Velvet and analyzing its output, we mapped the assembled segments
(i.e. contigs) to the RefSeq database using standalone BLAST. First we downloaded the RefSeq
fasta file from NCBI and formatted it for use as a database.
Command:
For building database:
makeblastdb -in <fasta_file> -dbtype nucl -out <output_db_filename>
For performing BLAST:
blastn -query <fasta_file> -db <database_name> -out <output_file>
FOR BUILDING DATABASE:
Input file: .fasta Output file: index files (.nhr, .nin, .nsq)
FOR BLAST:
Input file: .fasta Output file: text file
We performed blastn against the NCBI RefSeq database with parameters chosen to report the top
hits, query name, matched sequence length, e-value and percent identity. To enforce strict
matching we required a percent identity of at least 80%. The input was the fasta file of
contigs (contigs.fa) produced by Velvet.
After performing blastn for both normal and cancer contigs, we matched the output text files
against each other on the basis of the mapped segment id.
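The filtering step can be sketched as below. This assumes blastn was run with tabular output (`-outfmt 6`), which the command shown earlier does not state; the contig and reference names are made up, and keeping the highest bit score per query is one possible reading of "top hits".

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6)
BLAST6_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                  "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def top_hits(handle, min_identity=80.0):
    """Keep, per query, the best-scoring hit with identity >= min_identity."""
    best = {}
    for row in csv.DictReader(handle, fieldnames=BLAST6_COLUMNS, delimiter="\t"):
        if float(row["pident"]) < min_identity:
            continue
        query = row["qseqid"]
        if query not in best or float(row["bitscore"]) > float(best[query]["bitscore"]):
            best[query] = row
    return best

# Hypothetical tabular output for two contigs
example = (
    "NODE_1\trefA\t98.50\t200\t3\t0\t1\t200\t501\t700\t1e-95\t350\n"
    "NODE_1\trefB\t85.00\t150\t22\t1\t1\t150\t10\t159\t2e-40\t170\n"
    "NODE_2\trefC\t75.00\t120\t30\t0\t1\t120\t33\t152\t5e-10\t60\n"
)
hits = top_hits(io.StringIO(example))
for query, row in hits.items():
    print(query, row["sseqid"], row["pident"], row["evalue"])
```

To run it on a real result file, pass an open file handle instead of the `io.StringIO` example.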
3.2.6. COMPARING BLAST RESULTS TO FIND INTRON RETENTION
After obtaining the mapped files for normal prostate and cancer prostate, we extracted those
segments which had an identical mapped RefSeq id. We manually compared the normal
and cancer contigs mapped to the sequence of each such RefSeq id and found mapping differences
between normal and cancer prostate contigs.
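The matching on shared RefSeq ids could be sketched as follows. The input layout (contig name mapped to subject id and subject start/end) and all names are assumptions for illustration; the actual comparison in this work was done manually.

```python
from collections import defaultdict

def intervals_by_subject(hits):
    """Group (start, end, contig) intervals by the reference id they hit."""
    grouped = defaultdict(list)
    for contig, (subject, start, end) in hits.items():
        grouped[subject].append((min(start, end), max(start, end), contig))
    return grouped

def shared_subjects(normal_hits, cancer_hits):
    """Return, for every reference hit in both samples, the covered intervals."""
    normal = intervals_by_subject(normal_hits)
    cancer = intervals_by_subject(cancer_hits)
    shared = sorted(set(normal) & set(cancer))
    return {s: {"normal": sorted(normal[s]), "cancer": sorted(cancer[s])} for s in shared}

# Hypothetical hits: contig -> (reference id, subject start, subject end)
normal_hits = {"N_contig_1": ("refA", 1, 300), "N_contig_2": ("refB", 50, 400)}
cancer_hits = {"C_contig_1": ("refA", 1, 120), "C_contig_2": ("refA", 200, 300),
               "C_contig_3": ("refC", 10, 90)}
comparison = shared_subjects(normal_hits, cancer_hits)
print(comparison)
```

In this toy case the normal contig covers refA continuously while the cancer contigs leave positions 121 to 199 uncovered; a difference of this kind is what would be examined by hand.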
3.2.7. PERFORMING MAPPING USING BOWTIE
In another strategy we mapped the fastq files of normal and cancer prostate using Bowtie,
i.e. we performed reference-based alignment of the short reads using NCBI36 cDNA as reference.
We mapped the short reads in the fastq file to the reference allowing at most 2 mismatches.
The output of Bowtie is a SAM file, which contains the read information, the portion of the
reference to which each read aligned, the start and end positions and the number of times it
aligned. The Bowtie output is used to generate a count file containing each sequence id and
the number of reads mapped to it.
Command:
bowtie -q -v 2 --sam <database_file_name> <fastq_file_name> <sam_output_filename>
Input file: fastq file Output file: SAM file
3.2.8. GENERATING EXPRESSION FILE
After running Bowtie we obtained a SAM file, which contains information on the mapping
position, the number of times reads mapped to the reference, the transcript/gene id,
etc. We used an R script to count the reads mapped to each transcript and sorted the counts
by transcript id. The resulting count file is used as input by various Bioconductor packages
for differential analysis of expression levels.
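The counting step was done with an R script in this work; the Python sketch below only illustrates the logic. It counts every mapped read per reference name and does not itself check that a read mapped uniquely. The function name and the SAM records are hypothetical.

```python
from collections import Counter

def count_reads_per_reference(sam_lines):
    """Count mapped reads per reference sequence name (SAM column 3)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, reference_name = int(fields[1]), fields[2]
        # Bit 0x4 of the FLAG field marks an unmapped read
        if flag & 0x4 or reference_name == "*":
            continue
        counts[reference_name] += 1
    return counts

# Hypothetical SAM records
sam_lines = [
    "@HD\tVN:1.0\tSO:unsorted",
    "@SQ\tSN:transcript_A\tLN:1200",
    "read1\t0\ttranscript_A\t101\t255\t36M\t*\t0\t0\t*\t*",
    "read2\t16\ttranscript_A\t340\t255\t36M\t*\t0\t0\t*\t*",
    "read3\t0\ttranscript_B\t12\t255\t36M\t*\t0\t0\t*\t*",
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
counts = count_reads_per_reference(sam_lines)
for transcript_id in sorted(counts):
    print(transcript_id, counts[transcript_id])
```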
3.2.9. USING BIOMART
We used biomaRt to build a lookup table of Ensembl gene ids and their matching transcript ids,
in order to replace the mapped transcript ids obtained from the SAM file with gene ids.
3.2.10. GENE ONTOLOGY USING goseq:
The goseq package provides methods for performing Gene Ontology analysis of RNA-seq data,
taking length bias into account. In order to perform a GO analysis of RNA-seq data, goseq
only requires a simple named vector, which contains two pieces of information.
1. Measured genes: all genes for which RNA-seq data was gathered for the experiment.
Each element of the vector should be named by a unique gene identifier.
2. Differentially expressed genes: each element of the vector should be either a 1 or a 0,
where 1 indicates that the gene is differentially expressed and 0 that it is not.
If the organism, gene identifier or category test is not natively supported by goseq, it
is also necessary to supply additional information regarding the gene lengths and/or the
association between categories and genes (for example by gene id or gene symbol).
Using this package we annotated the gene ids that were selected as differentially
expressed according to their p-value. The package also provides the pathway
information of genes.
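The shape of the input described above can be illustrated with a short sketch. goseq itself is an R package, so this Python dictionary only mirrors the named 0/1 vector; the gene names, p-values and the 0.05 threshold used here are assumptions for illustration.

```python
def de_indicator(p_values, alpha=0.05):
    """Map each measured gene to 1 (differentially expressed) or 0 (not)."""
    return {gene: int(p < alpha) for gene, p in p_values.items()}

# Hypothetical p-values for all measured genes
p_values = {"geneA": 0.0004, "geneB": 0.2100, "geneC": 0.0098, "geneD": 0.7300}
indicator = de_indicator(p_values)
print(indicator)
```

The keys play the role of the vector names (one per measured gene) and the values are the 0/1 flags.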
CHAPTER IV
RESULTS OF A REPRESENTATIVE ANALYSIS
We tested our pipeline on RNA-seq prostate cancer data with SRA accession numbers
SRX022060, SRX022061, SRX022063, SRX022080, SRX022081 and SRX022083 [28]. The plots,
expression profiling results and GO terms obtained as output of the pipeline are given below.
4.1. FASTQ QUALITY INSPECTION
4.1.1 OVERALL READ QUALITY:
Fig.7 Overall Read Quality
Lanes with consistently good quality reads have strong peaks at the right of the panel. Most
of the reads are above QS (Quality Score) 20, so they can be considered good quality reads. We
can trim low quality reads with a cutoff of QS below 10 rather than 20, because when we trimmed
reads with QS less than 20 we obtained fewer contigs, as some of the eliminated reads were
needed for filling gaps. We analyzed the QS for every fastq file and found a strong peak
above quality score 20.
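The effect of the trimming cutoff can be illustrated with a short sketch. It assumes Phred+33 encoded quality strings (older Illumina files may use an offset of 64) and trims only from the 3' end; the function names and the example read are hypothetical, and the pipeline itself performs trimming in R.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]

def trim_3prime(sequence, quality_string, cutoff=20, offset=33):
    """Remove trailing bases whose quality score is below the cutoff."""
    scores = phred_scores(quality_string, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < cutoff:
        end -= 1
    return sequence[:end], quality_string[:end]

# Hypothetical read: the last two bases have scores 15 and 2
sequence, quality = "ACGTACGTAC", "IIIIIIII0#"
strict = trim_3prime(sequence, quality, cutoff=20)
lenient = trim_3prime(sequence, quality, cutoff=10)
print(strict)
print(lenient)
```

With a cutoff of 20 both trailing bases are removed, while a cutoff of 10 keeps the base with score 15, so more of each read remains available to the assembler.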
4.1.2 PER-CYCLE QUALITY SCORE:
Fig. 8 Per-Cycle Quality Score
Reported quality scores are ‘calibrated’ i.e. incorporating phred-like adjustments following
sequence alignment. These typically decline with cycle, in an accelerating manner. Abrupt
transitions in quality between cycles toward the end of the read might result when only some
of the cycles are used for alignment: the cycles included in the alignment are calibrated more
effectively than the reads excluded from the alignment. Overall, the quality score falls as
the number of cycles increases.
The reddish lines are quartiles (solid: median; dotted: 25th and 75th percentiles); the green
line is the mean. Shading is proportional to the number of reads.
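A per-cycle summary of the kind plotted here can be computed as in the sketch below, which reports the mean and median quality for each cycle. It assumes Phred+33 encoding; the function name and the three quality strings are made up for illustration.

```python
from statistics import mean, median

def per_cycle_quality(quality_strings, offset=33):
    """Return (cycle, mean score, median score) for every sequencing cycle."""
    n_cycles = max(len(q) for q in quality_strings)
    summary = []
    for cycle in range(n_cycles):
        # Collect the score at this cycle from every read long enough to have one
        scores = [ord(q[cycle]) - offset for q in quality_strings if len(q) > cycle]
        summary.append((cycle + 1, mean(scores), median(scores)))
    return summary

# Hypothetical quality strings for three reads of three cycles each
quality_strings = ["II5", "I5#", "II+"]
summary = per_cycle_quality(quality_strings)
for cycle, mean_q, median_q in summary:
    print(cycle, round(mean_q, 1), median_q)
```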
4.1.3 READ DISTRIBUTION
Fig. 9 Read distribution
These curves show how coverage is distributed amongst reads. Ideally, the cumulative
proportion of reads will transition sharply from low to high. Portions to the left of the
transition might correspond roughly to sequencing or sample processing errors, and
correspond to reads that are represented relatively infrequently; 10-15% of reads fall into
this category. Reads to the right of the transition are over-represented relative to
expectation, which may be due to sequenced primer or adapter sequences, sequencing or base
calling artifacts (e.g., poly-A reads), or features of the sample DNA (highly repeated regions)
not adequately removed during sample preparation. About 5% of reads fall into this category.
Broad transitions from low to high cumulative proportion of reads may reflect sequencing
bias or (perhaps intentional) features of sample preparation resulting in non-uniform
coverage.
Common duplicate reads might provide clues to the source of over-represented sequences.
Some of these reads are filtered by the alignment algorithms; other duplicate reads might
point to sample preparation issues.
4.1.4 CYCLE-SPECIFIC BASE CALLS AND READ QUALITY
Fig. 10 Cycle-Specific Base Calls And Read Quality
Per-cycle base calls should usually be approximately uniform across cycles. Quality of A
increases as the number of cycles increases, and quality of T decreases as the number of cycles
increases. After 10 cycles the quality remains uniform and the base calls for each of the four
bases are stable, so we can rely on the base calls at later cycles.
4.1.5 PER CYCLE READ QUALITY
Fig.11 Per Cycle Read Quality
The per-cycle read quality plot shows that the Phred quality of reads decreases as the number
of cycles increases. The top line of each box represents the upper quartile, which is uniform
across the whole sequencing run; this shows that these reads can be used for de novo assembly.
The quartile is an important factor in deciding whether or not to perform de novo assembly.
Based on this plot we can remove bad reads from the fastq file by trimming short reads
with a Phred score below a desired cutoff, but in practice this cutoff should not exceed
20, as a higher cutoff would remove some reads that act as bridges between short reads.
4.2. ANALYZING STATISTICS OF CONTIGS FILE
4.2.1 Histogram and weighted histogram of contigs coverage
Fig. 12 Histogram and weighted histogram of contigs coverage
The histograms above show the coverage of contigs for the RNA-seq data of 3 normal and 3
cancer samples taken from NCBI SRA [28]. In the weighted histogram, low coverage is not
observed on the left side and all contigs have good coverage.
4.2.2 Dinucleotide frequency:
Fig. 13 Dinucleotide frequency
This plot describes dinucleotide frequency in samples.
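The quantity behind this plot can be sketched in a few lines: every overlapping pair of adjacent bases is counted and the counts are normalised to proportions. The function name and the two sequences are hypothetical; the pipeline computes this in R.

```python
from collections import Counter

def dinucleotide_frequency(sequences):
    """Return the proportion of each overlapping dinucleotide in the sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - 1):
            counts[seq[i:i + 2]] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}

# Hypothetical contig sequences
freq = dinucleotide_frequency(["ACGT", "CGCG"])
for pair in sorted(freq):
    print(pair, round(freq[pair], 3))
```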
4.2.3 CONTIGS N50 AND MAXIMUM LENGTH:
Sample Name    N50    Max_contig_length
SRX022060      153    2895
SRX022061      152    3116
SRX022063      143    3088
SRX022080      163    3426
SRX022081      171    3512
SRX022083      158    3468
N50 is the contig length such that contigs of equal or greater length contain half of the
assembled bases. Max length is the longest contig obtained by the Velvet assembler.
4.3. ANALYZING DIFFERENTIAL GENES EXPRESSION
4.3.1 Top Tags From DGE analysis:
Comparison of groups: normal-cancer
Gene id logFC logCPM PValue
ENSG00000100285 -14.657624 11.619148 0.0003688800
ENSG00000044574 -14.203956 10.415779 0.0008230058
ENSG00000211896 -11.635115 8.851908 0.0023336360
ENSG00000126709 -10.389798 7.864545 0.0045019755
ENSG00000187244 -11.650946 7.725334 0.0049487220
ENSG00000215034 9.636823 6.693072 0.0097990435
ENSG00000211893 -9.214604 6.685765 0.0098388364
ENSG00000211677 -9.223101 6.519094 0.0110031817
ENSG00000211892 -9.013760 6.319759 0.0125532529
ENSG00000101439 -9.689559 6.124585 0.0143609705
Top tags are the differentially expressed genes for which the null hypothesis is rejected with
a P value below 0.05, i.e. at the 5% significance level these genes are differentially
expressed in cancer compared with normal.
4.3.2 DIFFERENTIALLY EXPRESSED GENES:
Differentially expressed    0        1
Number of genes             19646    20
0 denotes genes that are not differentially expressed and 1 denotes genes that are
differentially expressed between the groups Normal-Cancer.
4.3.3 GO TERMS RETRIEVED BY GOSEQ PACKAGE
GOID: GO:0010466
Term: negative regulation of peptidase activity
Ontology: BP
Definition: Any process that stops or reduces the rate of peptidase
activity, the hydrolysis of peptide bonds within proteins.
--------------------------------------
GOID: GO:0051346
Term: negative regulation of hydrolase activity
Ontology: BP
Definition: Any process that stops or reduces the rate of hydrolase
activity, the catalysis of the hydrolysis of various bonds.
Synonym: down regulation of hydrolase activity
Synonym: down-regulation of hydrolase activity
Synonym: downregulation of hydrolase activity
Synonym: hydrolase inhibitor
Synonym: inhibition of hydrolase activity
--------------------------------------
GOID: GO:0004866
Term: endopeptidase inhibitor activity
Ontology: MF
Definition: Stops, prevents or reduces the activity of an
endopeptidase, any enzyme that hydrolyzes nonterminal peptide bonds
in polypeptides.
Synonym: alpha-2 macroglobulin
Synonym: endoproteinase inhibitor
Synonym: proteinase inhibitor
--------------------------------------
GOID: GO:0030414
Term: peptidase inhibitor activity
Ontology: MF
Definition: Stops, prevents or reduces the activity of a peptidase, any
enzyme that catalyzes the hydrolysis peptide bonds.
Synonym: protease inhibitor activity
--------------------------------------
GOID: GO:0052547
Term: regulation of peptidase activity
Ontology: BP
Definition: Any process that modulates the frequency, rate or extent of
peptidase activity, the hydrolysis of peptide bonds within
proteins.
Synonym: peptidase regulator activity
--------------------------------------
GOID: GO:0043086
Term: negative regulation of catalytic activity
Ontology: BP
Definition: Any process that stops or reduces the activity of an
enzyme.
Synonym: down regulation of enzyme activity
Synonym: down-regulation of enzyme activity
Synonym: downregulation of enzyme activity
Synonym: inhibition of enzyme activity
Synonym: negative regulation of enzyme activity
--------------------------------------
GOID: GO:0051336
Term: regulation of hydrolase activity
Ontology: BP
Definition: Any process that modulates the frequency, rate or extent of
hydrolase activity, the catalysis of the hydrolysis of various
bonds, e.g. C-O, C-N, C-C, phosphoric anhydride bonds, etc.
Hydrolase is the systematic name for any enzyme of EC class 3.
Synonym: hydrolase regulator
--------------------------------------
GOID: GO:0006952
Term: defense response
Ontology: BP
Definition: Reactions, triggered in response to the presence of a
foreign body or the occurrence of an injury, which result in
restriction of damage to the organism attacked or
prevention/recovery from the infection caused by the attack.
Synonym: antimicrobial peptide activity
Synonym: defence response
Synonym: defense/immunity protein activity
Synonym: physiological defense response
Synonym: GO:0002217
Synonym: GO:0042829
Secondary: GO:0002217
Secondary: GO:0042829
--------------------------------------
GOID: GO:0061134
Term: peptidase regulator activity
Ontology: MF
Definition: Modulates the activity of a peptidase, any enzyme that
catalyzes the hydrolysis peptide bonds.
--------------------------------------
GOID: GO:0061135
Term: endopeptidase regulator activity
Ontology: MF
Definition: Modulates the activity of a peptidase, any enzyme that
hydrolyzes nonterminal peptide bonds in polypeptides.
CHAPTER V
CONCLUSIONS:
This pipeline performs initial statistical analyses which help in understanding the short
reads and pave the way for further steps such as quality trimming, de novo
assembly and mapping. Fastq quality inspection allows us to inspect the reads and remove
bad base calls; it also indicates contamination if present. If the quartile in the “per cycle
read quality” plot is not uniform, de novo assembly should not be performed. Analysis of the
stats file obtained from Velvet gives a histogram and a weighted histogram of coverage, which
show low coverage regions if present. Any low coverage regions found can be
removed by setting a cutoff slightly above the mean of the weighted histogram.
We analyzed prostate cancer data against normal data to test the performance of the pipeline.
From the fastq quality inspection we concluded that the reads have good quality with some
adapter contamination. Adapter contamination may interfere with the Velvet assembly. The
analysis showed that the reads are suitable for de novo assembly.
We analyzed the BLAST results and found intron retention in Homo sapiens kallikrein-related
peptidase 3 with gi|225543369.
In further analysis the pipeline maps the short reads using Bowtie; on average
70% of the short reads mapped to NCBI36 (hg18).
The pipeline performs DGE analysis and reports the top 10 most differentially expressed genes
with a p-value less than 0.05, i.e. for these top 10 genes the null hypothesis is rejected at
the 5% significance level. After the DGE analysis, the pipeline performs Gene Ontology analysis
on the differentially expressed genes to obtain the related GO terms.
REFERENCES
[1] Elaine R. Mardis, “Next-Generation DNA Sequencing Methods”, Departments of
Genetics and Molecular Microbiology and Genome Sequencing Center, Washington
University School of Medicine, St. Louis
[2] “A haplotype map of the human genome”, The International HapMap Consortium,
Nature 437, 1299-1320 (27 October 2005) | doi:10.1038/nature04226; Received 11 August
2005; Accepted 12 September 2005
[3] Figure for decreasing cost From: National Human Research Institute
[4] Short note on base calling,
http://www.ebi.ac.uk/industry/Documents/workshop-materials/newsequence291009/Basecalling-Klaus_Maisinger.pdf
[5] http://www.illumina.com/technology/paired_end_sequencing_assay.ilmn
[6] http://www.illumina.com/technology/mate_pair_sequencing_assay.ilmn
[7] 1000 Genomes Project reveals human variation.
http://www.nature.com/news/2010/101027/full/news.2010.567.html
[8] http://www.biomedcentral.com/1471-2105/10/232
[9] Tim van Opijnen, Kip L. Bodi, and Andrew Camilli,“Tn-seq; high-throughput parallel
sequencing for fitness and genetic interaction studies in microorganisms.”
[10] http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0006589
[11] A memory-efficient data structure representing exact-match overlap graphs with
application for next generation DNA assembly
http://bioinformatics.oxfordjournals.org/content/early/2011/06/02/bioinformatics.btr321
[12] http://manuals.bioinformatics.ucr.edu/home/ht-seq
[13] http://www.clcbio.com/index.php?id=1240
[14] https://dnanexus.com/
[15] http://www.genomequest.com/
[16] http://www.strandsi.com/AvadisNGS
[17] http://www.raindancetechnologies.com/applications/next-generation-sequencing-technology.asp
[18] http://www.nature.com/nature/journal/v449/n7164/full/nature06244.html
[19] http://www.nature.com/nature/journal/v453/n7195/full/453687a.html
[20] http://www.nature.com/nature/journal/v437/n7063/edsumm/e051027-01.html
[21] http://www.bioconductor.org/packages/2.9/bioc/html/Biostrings.html
[22] http://www.bioconductor.org/packages/2.9/bioc/html/ShortRead.html
[23] http://bioinf.wehi.edu.au/software/goseq/
[24] http://www.bioconductor.org/packages/2.9/bioc/html/SRAdb.html
[25] http://www.bioconductor.org/packages/2.2/bioc/html/biomaRt.html
[26,27] http://www.bioconductor.org/
http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual
[28] Recurrent chimeric RNAs enriched in human prostate cancer identified by deep
sequencing. http://www.ncbi.nlm.nih.gov/pubmed/21571633