This presentation gives an easy introduction to genome assemblies from next-generation sequencing data and is part of a bioinformatics workshop. The accompanying websites are available at http://sschmeier.com/bioinf-workshop/#!genome-assembly/
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...Torsten Seemann
This document discusses de novo genome assembly, which is the process of reconstructing long genomic sequences from many short sequencing reads without the aid of a reference genome. It is challenging due to factors like short read lengths, repetitive sequences that complicate the assembly graph, and sequencing errors. The goals of assembly are to produce contiguous sequences with high completeness and correctness by resolving overlaps between reads into consensus sequences. Metrics like N50, core gene content, and read remapping are used to assess assembly quality.
The document discusses genome assembly from sequencing reads. It describes how reads can be aligned to a reference genome if available, but for a new genome the reads must be assembled without a reference. Two main assembly approaches are described: overlap-layout-consensus which builds an overlap graph, and de Brujin graph assembly which constructs a de Brujin graph from k-mers. Both approaches aim to find contiguous sequences (contigs) from the reads but face challenges from computational complexity and sequencing errors in the reads.
A workshop is intended for those who are interested in and are in the planning stages of conducting an RNA-Seq experiment. Topics to be discussed will include:
* Experimental Design of RNA-Seq experiment
* Sample preparation, best practices
* High throughput sequencing basics and choices
* Cost estimation
* Differential Gene Expression Analysis
* Data cleanup and quality assurance
* Mapping your data
* Assigning reads to genes and counting
* Analysis of differentially expressed genes
* Downstream analysis/visualizations and tables
Guest lecture on comparative genomics for University of Dundee BS32010, delivered 21/3/2016
Workshop/other materials available at DOI:10.5281/zenodo.49447
This document provides an overview of downstream analyses that can be performed after variant identification and filtering in a typical variant calling pipeline. It discusses visualization of variant data in each gene to identify potential causative variants. It also mentions association studies as another type of downstream analysis where variants are tested for association with disease phenotypes. The goal of downstream analyses is to help prioritize variants for further investigation.
De novo genome assembly - IMB Winter School - 7 July 2015Torsten Seemann
This document discusses de novo genome assembly, which is the process of reconstructing the original DNA sequence from short fragment reads alone. Due to limitations in sequencing technology, the DNA must be broken into short reads which must then be reassembled like a jigsaw puzzle. Challenges include sequencing errors, repeats, and heterozygosity. Various algorithms and techniques are used to assemble the reads, including overlap layout consensus and de Bruijn graphs. Long read technologies help resolve repeats and scaffold contigs. Software recommendations for de novo assembly include SPAdes, Velvet, and CLC Genomics Workbench.
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...Torsten Seemann
This document discusses de novo genome assembly, which is the process of reconstructing long genomic sequences from many short sequencing reads without the aid of a reference genome. It is challenging due to factors like short read lengths, repetitive sequences that complicate the assembly graph, and sequencing errors. The goals of assembly are to produce contiguous sequences with high completeness and correctness by resolving overlaps between reads into consensus sequences. Metrics like N50, core gene content, and read remapping are used to assess assembly quality.
The document discusses genome assembly from sequencing reads. It describes how reads can be aligned to a reference genome if available, but for a new genome the reads must be assembled without a reference. Two main assembly approaches are described: overlap-layout-consensus which builds an overlap graph, and de Brujin graph assembly which constructs a de Brujin graph from k-mers. Both approaches aim to find contiguous sequences (contigs) from the reads but face challenges from computational complexity and sequencing errors in the reads.
A workshop is intended for those who are interested in and are in the planning stages of conducting an RNA-Seq experiment. Topics to be discussed will include:
* Experimental Design of RNA-Seq experiment
* Sample preparation, best practices
* High throughput sequencing basics and choices
* Cost estimation
* Differential Gene Expression Analysis
* Data cleanup and quality assurance
* Mapping your data
* Assigning reads to genes and counting
* Analysis of differentially expressed genes
* Downstream analysis/visualizations and tables
Guest lecture on comparative genomics for University of Dundee BS32010, delivered 21/3/2016
Workshop/other materials available at DOI:10.5281/zenodo.49447
This document provides an overview of downstream analyses that can be performed after variant identification and filtering in a typical variant calling pipeline. It discusses visualization of variant data in each gene to identify potential causative variants. It also mentions association studies as another type of downstream analysis where variants are tested for association with disease phenotypes. The goal of downstream analyses is to help prioritize variants for further investigation.
De novo genome assembly - IMB Winter School - 7 July 2015Torsten Seemann
This document discusses de novo genome assembly, which is the process of reconstructing the original DNA sequence from short fragment reads alone. Due to limitations in sequencing technology, the DNA must be broken into short reads which must then be reassembled like a jigsaw puzzle. Challenges include sequencing errors, repeats, and heterozygosity. Various algorithms and techniques are used to assemble the reads, including overlap layout consensus and de Bruijn graphs. Long read technologies help resolve repeats and scaffold contigs. Software recommendations for de novo assembly include SPAdes, Velvet, and CLC Genomics Workbench.
This document outlines methods for constructing biological networks. It discusses the basic components and features of networks, including nodes, edges, degree, distance, and types of biological networks. It describes approaches for building networks from genome sequences, "omics" data, and literature. These include predicting interactions from gene neighbor/order, co-evolution, co-expression analysis, and integrating data sources to build meta-networks representing entire genomes or phenotypes.
This document discusses different methods for genome sequencing and assembly, including restriction enzyme fingerprinting, marker sequences, and hybridization assays. It focuses on using marker sequences like sequence-tagged sites (STS), expressed sequence tags (ESTs), untranslated regions (UTRs), and single nucleotide polymorphisms (SNPs) to map genomes. Large-insert cloning vectors like BACs and PACs can be used with restriction enzyme fingerprinting and FPC software to assemble contigs and map genomes at a large scale. Marker sequences provide a dense set of physical markers to build accurate physical maps of genomes.
This document provides a summary of a seminar on comparative genomics techniques. It discusses three levels of genome research: structural genomics, functional genomics, and comparative genomics. Comparative genomics involves analyzing and comparing different genomes to study gene content, function, organization, and evolution. Techniques discussed include genome sequencing, mapping, and bioinformatics tools. The document also outlines what can be compared between genomes and how comparative genomics has provided insights into evolution and gene function.
The document discusses RNA-Seq data analysis. Some key points:
- RNA-Seq involves sequencing steady-state RNA in a sample without prior knowledge of the organism. It can uncover novel transcripts and isoforms.
- Making sense of the large and complex RNA-Seq data depends on the scientific question, such as finding transcribed SNPs for allele-specific expression or novel transcripts in cancer samples.
- Common applications of RNA-Seq include abundance estimation, alternative splicing detection, RNA editing discovery, and finding novel transcripts and isoforms.
- Analysis steps include mapping reads to a reference genome/transcriptome, generating mapping statistics and quality metrics, differential expression analysis, clustering, and pathway analysis using tools like
Presentation to cover the data and file formats commonly used in next generation sequencing (high throughput sequencing) analyses. From nucleotide ambiguity codes, FASTA and FASTQ, quality scores to SAM and BAM, CIGAR strings and variant calling format. This was given as part of the EPIZONE Workshop on Next Generation Sequencing applications and Bioinformatics in Brussels, Belgium in April 2016.
This document outlines the process of performing whole exome sequencing (WES) analysis on an individual sample. It describes:
1) Downloading exome sequencing data for a sample from the Sequence Read Archive and running quality control using FastQC.
2) Using the SeqMule pipeline to align reads, call variants, and produce coverage plots.
3) Annotating variants found using wANNOVAR and filtering to exclude common variants.
4) The reader is then instructed to run the same analysis on a sample from the most sequenced individual, NA12878, to explore variants.
AGRF in conjunction with EMBL Australia recently organised a workshop at Monash University Clayton. This workshop was targeted at beginners and biologists who are new to analysing Next-Gen Sequencing data. The workshop also aimed to provide users with a snapshot of bioinformatics and data analysis tips on how to begin to analyse project data. An introduction to RNA-seq data analysis was presented by AGRF Senior Bioinformatician Dr. Sonika Tyagi.
Presented: 1st August 2012
Microarray technology allows researchers to analyze gene expression levels on a genomic scale. DNA microarrays contain many genes arranged on a slide that can be used to detect differences in gene expression between samples. The microarray workflow involves sample preparation, hybridization of labeled cDNA to the array, image scanning, data normalization and statistical analysis to identify differentially expressed genes between conditions. Multiple testing is a challenge and statistical methods must account for false positives and negatives.
This document provides an introduction and overview of common methods for processing and analyzing next generation sequencing (NGS) data, including mapping NGS reads and de novo assembly of NGS reads. It discusses various NGS applications such as RNA-Seq, epigenetics, structural variation detection, and metagenomics. Key steps in read alignment such as choosing an alignment program and viewing alignments are outlined. Considerations for choosing an alignment program based on library type, read type, and platform are also reviewed. Popular alignment programs including Bowtie, BWA, TopHat, and Novoalign are mentioned.
Whole genome sequencing is the process of determining the complete DNA sequence of an organism's genome. It involves sequencing all chromosomal and organellar DNA. Key methods include shotgun sequencing, which randomly fragments DNA for sequencing, and single molecule real time sequencing, which observes individual DNA polymerases incorporating nucleotides in real time using fluorescent tags. Whole genome sequencing has provided insights into evolutionary biology and may help predict disease susceptibility, though technical challenges remain such as fully sequencing repetitive regions.
This document discusses genetic analysis techniques including gene mapping and quantitative trait locus (QTL) analysis. It provides an overview of the basic requirements for gene mapping such as near isogenic lines and saturated physical maps. It describes the 5 step process for map-based cloning including gene tagging, high resolution mapping, physical mapping, sequencing, and transformation. It also discusses QTL analysis including principal objectives, factors considered, and methods to detect QTL like single marker analysis and interval mapping. Applications of QTL analysis in plant breeding and pre-breeding are mentioned. The key differences between traditional QTL mapping and association mapping are highlighted.
The document summarizes genomic comparisons across different organisms. It discusses:
1) The first sequencing of bacterial genomes including Haemophilus influenzae, which has 1.8 million base pairs and 1,749 genes.
2) The sequencing of eukaryotic genomes including Saccharomyces cerevisiae, which has 16 chromosomes and approximately 5,885 protein-coding genes.
3) Key animal and plant genomes sequenced including Arabidopsis thaliana, Drosophila melanogaster, Mus musculus, and Homo sapiens. The human genome is approximately 3 billion base pairs long and contains around 30,000 genes.
An update version of the genome assembly including the mention of techniques such as HiC and Bionano. Also include the QC. These are the same slides used in the course for the UNL in Argentina.
The document discusses genome assembly and gene prediction from sequencing data. It describes how short DNA sequence reads are assembled into longer contiguous sequences or contigs. It also explains different approaches used for gene prediction, including ab initio prediction using statistical models, homology-based prediction using known genes from related organisms, and transcript-based prediction using cDNA or RNA-seq data. Key steps involve repeat masking, identifying open reading frames, and dealing with complications from introns in eukaryotic genomes. The challenges of gene prediction and determining which predictions are correct are also addressed.
This document discusses sequence alignment, which involves placing two or more biological sequences in an optimal alignment to identify regions of similarity and deduce evolutionary relationships. It defines key terms like similarity, identity, conservation, and optimal alignment. It also describes the rationale for alignment, which is to compare sequences and find similarities that provide insights into biological function and evolutionary history. Finally, it outlines different types of alignment like global, local, pairwise, and multiple sequence alignment.
Mapping and sequencing genomes: Genetic and physical mapping, Sequencing genomes different strategies, High-throughput sequencing, next-generation sequencing technologies, comparative genomics, population genomics, epigenetics, Human genome project, pharmacogenomics, genomic medicine, applications of genomics to improve public health.
The document discusses RNA-seq analysis. It begins with an introduction to Mikael Huss, a bioinformatics scientist, and provides an overview of how genomics, RNA profiles, protein profiles, and interactomics relate within systems biology. The document then discusses how gene expression analysis can provide insights into basic research questions regarding tissue and cell identity, as well as insights into diseases by identifying genes that are over- or under-expressed in patients. Finally, it provides a brief overview of the typical workflow for RNA-seq analysis, which involves mapping RNA sequencing reads to a reference genome or transcriptome.
Three groups annotated the genome of Mycoplasma genitalium and found inconsistencies in their annotations. Of the 468 genes, 318 were annotated consistently by all three groups but 45 had conflicting annotations. Errors likely arose from insufficient sequence similarity to determine homology accurately or incorrectly inferring function based on homology alone. Database curation is needed to prevent propagation of erroneous annotations.
This document discusses how mate-pair information can be used to effectively increase read lengths and overcome challenges posed by short read lengths in de novo genome assembly. It describes how the Eulerian assembly algorithm transforms mate-pair reads into long "mate-reads" by connecting the two reads in a pair via the estimated insert length. This has the effect of increasing the overall read length and improving assembly, to the point where read length may no longer be a major limitation as long as the span between mate-pairs is large enough. The repeat graph data structure enables easy processing of mate-pair information in this way.
This document summarizes the sequencing and analysis of the peanut genome. Key points include:
- The genomes of Arachis duranensis and Arachis ipaensis, the proposed ancestral species of cultivated peanut, were sequenced and assembled.
- Both genomes were around 1-1.5Gb in size and contained a high percentage of repetitive elements. Comparative analysis revealed extensive synteny but also regions of rearrangement between the two genomes.
- Over 36,000 and 41,000 genes were annotated in A. duranensis and A. ipaensis respectively. Gene families involved in disease resistance were expanded through duplication events.
- Transcriptome assembly of cultivated peanut aligned closely with the ancestral genomes
This document outlines methods for constructing biological networks. It discusses the basic components and features of networks, including nodes, edges, degree, distance, and types of biological networks. It describes approaches for building networks from genome sequences, "omics" data, and literature. These include predicting interactions from gene neighbor/order, co-evolution, co-expression analysis, and integrating data sources to build meta-networks representing entire genomes or phenotypes.
This document discusses different methods for genome sequencing and assembly, including restriction enzyme fingerprinting, marker sequences, and hybridization assays. It focuses on using marker sequences like sequence-tagged sites (STS), expressed sequence tags (ESTs), untranslated regions (UTRs), and single nucleotide polymorphisms (SNPs) to map genomes. Large-insert cloning vectors like BACs and PACs can be used with restriction enzyme fingerprinting and FPC software to assemble contigs and map genomes at a large scale. Marker sequences provide a dense set of physical markers to build accurate physical maps of genomes.
This document provides a summary of a seminar on comparative genomics techniques. It discusses three levels of genome research: structural genomics, functional genomics, and comparative genomics. Comparative genomics involves analyzing and comparing different genomes to study gene content, function, organization, and evolution. Techniques discussed include genome sequencing, mapping, and bioinformatics tools. The document also outlines what can be compared between genomes and how comparative genomics has provided insights into evolution and gene function.
The document discusses RNA-Seq data analysis. Some key points:
- RNA-Seq involves sequencing steady-state RNA in a sample without prior knowledge of the organism. It can uncover novel transcripts and isoforms.
- Making sense of the large and complex RNA-Seq data depends on the scientific question, such as finding transcribed SNPs for allele-specific expression or novel transcripts in cancer samples.
- Common applications of RNA-Seq include abundance estimation, alternative splicing detection, RNA editing discovery, and finding novel transcripts and isoforms.
- Analysis steps include mapping reads to a reference genome/transcriptome, generating mapping statistics and quality metrics, differential expression analysis, clustering, and pathway analysis using tools like
Presentation to cover the data and file formats commonly used in next generation sequencing (high throughput sequencing) analyses. From nucleotide ambiguity codes, FASTA and FASTQ, quality scores to SAM and BAM, CIGAR strings and variant calling format. This was given as part of the EPIZONE Workshop on Next Generation Sequencing applications and Bioinformatics in Brussels, Belgium in April 2016.
This document outlines the process of performing whole exome sequencing (WES) analysis on an individual sample. It describes:
1) Downloading exome sequencing data for a sample from the Sequence Read Archive and running quality control using FastQC.
2) Using the SeqMule pipeline to align reads, call variants, and produce coverage plots.
3) Annotating variants found using wANNOVAR and filtering to exclude common variants.
4) The reader is then instructed to run the same analysis on a sample from the most sequenced individual, NA12878, to explore variants.
AGRF in conjunction with EMBL Australia recently organised a workshop at Monash University Clayton. This workshop was targeted at beginners and biologists who are new to analysing Next-Gen Sequencing data. The workshop also aimed to provide users with a snapshot of bioinformatics and data analysis tips on how to begin to analyse project data. An introduction to RNA-seq data analysis was presented by AGRF Senior Bioinformatician Dr. Sonika Tyagi.
Presented: 1st August 2012
Microarray technology allows researchers to analyze gene expression levels on a genomic scale. DNA microarrays contain many genes arranged on a slide that can be used to detect differences in gene expression between samples. The microarray workflow involves sample preparation, hybridization of labeled cDNA to the array, image scanning, data normalization and statistical analysis to identify differentially expressed genes between conditions. Multiple testing is a challenge and statistical methods must account for false positives and negatives.
This document provides an introduction and overview of common methods for processing and analyzing next generation sequencing (NGS) data, including mapping NGS reads and de novo assembly of NGS reads. It discusses various NGS applications such as RNA-Seq, epigenetics, structural variation detection, and metagenomics. Key steps in read alignment such as choosing an alignment program and viewing alignments are outlined. Considerations for choosing an alignment program based on library type, read type, and platform are also reviewed. Popular alignment programs including Bowtie, BWA, TopHat, and Novoalign are mentioned.
Whole genome sequencing is the process of determining the complete DNA sequence of an organism's genome. It involves sequencing all chromosomal and organellar DNA. Key methods include shotgun sequencing, which randomly fragments DNA for sequencing, and single molecule real time sequencing, which observes individual DNA polymerases incorporating nucleotides in real time using fluorescent tags. Whole genome sequencing has provided insights into evolutionary biology and may help predict disease susceptibility, though technical challenges remain such as fully sequencing repetitive regions.
This document discusses genetic analysis techniques including gene mapping and quantitative trait locus (QTL) analysis. It provides an overview of the basic requirements for gene mapping such as near isogenic lines and saturated physical maps. It describes the 5 step process for map-based cloning including gene tagging, high resolution mapping, physical mapping, sequencing, and transformation. It also discusses QTL analysis including principal objectives, factors considered, and methods to detect QTL like single marker analysis and interval mapping. Applications of QTL analysis in plant breeding and pre-breeding are mentioned. The key differences between traditional QTL mapping and association mapping are highlighted.
The document summarizes genomic comparisons across different organisms. It discusses:
1) The first sequencing of bacterial genomes including Haemophilus influenzae, which has 1.8 million base pairs and 1,749 genes.
2) The sequencing of eukaryotic genomes including Saccharomyces cerevisiae, which has 16 chromosomes and approximately 5,885 protein-coding genes.
3) Key animal and plant genomes sequenced including Arabidopsis thaliana, Drosophila melanogaster, Mus musculus, and Homo sapiens. The human genome is approximately 3 billion base pairs long and contains around 30,000 genes.
An update version of the genome assembly including the mention of techniques such as HiC and Bionano. Also include the QC. These are the same slides used in the course for the UNL in Argentina.
The document discusses genome assembly and gene prediction from sequencing data. It describes how short DNA sequence reads are assembled into longer contiguous sequences or contigs. It also explains different approaches used for gene prediction, including ab initio prediction using statistical models, homology-based prediction using known genes from related organisms, and transcript-based prediction using cDNA or RNA-seq data. Key steps involve repeat masking, identifying open reading frames, and dealing with complications from introns in eukaryotic genomes. The challenges of gene prediction and determining which predictions are correct are also addressed.
This document discusses sequence alignment, which involves placing two or more biological sequences in an optimal alignment to identify regions of similarity and deduce evolutionary relationships. It defines key terms like similarity, identity, conservation, and optimal alignment. It also describes the rationale for alignment, which is to compare sequences and find similarities that provide insights into biological function and evolutionary history. Finally, it outlines different types of alignment like global, local, pairwise, and multiple sequence alignment.
Mapping and sequencing genomes: Genetic and physical mapping, Sequencing genomes different strategies, High-throughput sequencing, next-generation sequencing technologies, comparative genomics, population genomics, epigenetics, Human genome project, pharmacogenomics, genomic medicine, applications of genomics to improve public health.
The document discusses RNA-seq analysis. It begins with an introduction to Mikael Huss, a bioinformatics scientist, and provides an overview of how genomics, RNA profiles, protein profiles, and interactomics relate within systems biology. The document then discusses how gene expression analysis can provide insights into basic research questions regarding tissue and cell identity, as well as insights into diseases by identifying genes that are over- or under-expressed in patients. Finally, it provides a brief overview of the typical workflow for RNA-seq analysis, which involves mapping RNA sequencing reads to a reference genome or transcriptome.
Three groups annotated the genome of Mycoplasma genitalium and found inconsistencies in their annotations. Of the 468 genes, 318 were annotated consistently by all three groups but 45 had conflicting annotations. Errors likely arose from insufficient sequence similarity to determine homology accurately or incorrectly inferring function based on homology alone. Database curation is needed to prevent propagation of erroneous annotations.
This document discusses how mate-pair information can be used to effectively increase read lengths and overcome challenges posed by short read lengths in de novo genome assembly. It describes how the Eulerian assembly algorithm transforms mate-pair reads into long "mate-reads" by connecting the two reads in a pair via the estimated insert length. This has the effect of increasing the overall read length and improving assembly, to the point where read length may no longer be a major limitation as long as the span between mate-pairs is large enough. The repeat graph data structure enables easy processing of mate-pair information in this way.
This document summarizes the sequencing and analysis of the peanut genome. Key points include:
- The genomes of Arachis duranensis and Arachis ipaensis, the proposed ancestral species of cultivated peanut, were sequenced and assembled.
- Both genomes were around 1-1.5Gb in size and contained a high percentage of repetitive elements. Comparative analysis revealed extensive synteny but also regions of rearrangement between the two genomes.
- Over 36,000 and 41,000 genes were annotated in A. duranensis and A. ipaensis respectively. Gene families involved in disease resistance were expanded through duplication events.
- Transcriptome assembly of cultivated peanut aligned closely with the ancestral genomes
de Bruijn Graph Construction from Combination of Short and Long ReadsSikder Tahsin Al-Amin
This document describes the de Bruijn graph approach for genome assembly using a combination of short and long reads. It discusses key terminology, the motivation for this approach, and how an A-Bruijn graph is constructed and used to find the genomic path. It also covers error correction in draft genomes assembled using this method and potential areas for further development, such as calculating the likelihood ratio of different solid string sets and applying bridging and merging techniques.
Combining de Bruijn graph, overlap graph and microassembly for de novo genome...Anton Alexandrov
This document summarizes a genome assembly algorithm that combines de Bruijn graphs, overlap graphs, and microassembly. It first performs error correction on k-mers, then uses de Bruijn graphs to assemble quasicontigs. Overlap graphs are used in the initial contig assembly. Contig microassembly further joins contigs by analyzing paired reads that map between contigs. When applied to an E. coli dataset, this algorithm reduced the number and increased the size of contigs compared to another assembler.
O documento descreve o uso de grafos de De Bruijn coloridos para detectar variantes genéticas em múltiplas amostras simultaneamente. Os grafos de De Bruijn coloridos adicionam cores aos nós e arestas para representar diferentes amostras, permitindo a análise conjunta de vários genomas sem referência. Algoritmos como bubble-calling e path divergence são usados para identificar variantes observando divergências entre as cores. Isso permite a detecção precisa de variantes em qualquer espécie sem necessidade de uma referência adequada.
De Bruijn sequences are cyclic sequences where every possible substring of a given length appears exactly once. They have applications in areas like combination locks, indexing bits in computer words, and fault tolerant systems. De Bruijn sequences can be generated by finding a Hamiltonian path in a De Bruijn graph or by using shift registers to iteratively generate each digit. They provide a space-efficient way to enumerate all possible strings of a given length over a finite alphabet.
This presentation gives an introduction to analysing ChIP-seq data and is part of a bioinformatics workshop. The accompanying websites are available at http://sschmeier.github.io/bioinf-workshop/#!galaxy-chipseq/
Next-generation sequencing and quality control: An Introduction (2016)Sebastian Schmeier
This lecture is part is an introductory bioinformatics workshop. It gives a background to what sequencing is, what the results of a sequencing experiment are, how to assess the quality of a sequencing run, what error sources exist and how to deal with errors. The accompanying websites are available at http://sschmeier.com/bioinf-workshop/
PCR and primer design require primers to meet certain criteria. Primers should be unique in the target sequence and non-unique in potential contaminants. The length should be 17-28 bases and the melting temperature between 52-65°C. The GC content should be around 50-60% and avoid secondary structures like hairpins or dimers. Computer programs can assist in optimizing primers. Advanced designs include universal primers for multiple targets and "guessmer" primers when sequences are unavailable.
This document provides an overview of DNA sequencing, sequence alignment, and sequence assembly. It discusses different sequencing technologies like Sanger, 454, Illumina, and Pacific Biosciences. It covers global, local, and glocal sequence alignment. It also describes various assembly algorithms like greedy assembly, overlap-layout-consensus assembly using De Bruijn graphs or Burrows-Wheeler transforms. Specific assemblers discussed include ABySS, Velvet, SGA, and others. The document provides examples of how assembly can find structural variants like deletions that are difficult to detect from read alignments alone.
Primer design is important for PCR, cDNA synthesis, and DNA sequencing. There are general guidelines for primer design such as 18-30 base pairs in length, 40-60% GC content, melting temperature between 55-66°C. Primers should avoid secondary structures and mismatches at the 3' end. Common programs for primer design are Primer-BLAST and Primer3, which check specificity and secondary structures. Manual design is used for universal primers from multiple sequence alignments. The primers are then checked and optimized before being ordered and used in PCR experiments.
This document discusses next generation sequencing technologies. It provides details on several massively parallel sequencing platforms and describes their advantages over traditional Sanger sequencing such as higher throughput, lower costs, and ability to process millions of reads in parallel. It then outlines several applications of next generation sequencing like mutation discovery, transcriptome analysis, metagenomics, epigenetics research and discovery of non-coding RNAs.
This document provides an overview of next generation sequencing (NGS) technologies. It discusses the history and evolution of DNA sequencing, from early manual methods developed by Sanger to modern high-throughput NGS approaches. Key NGS methods described include Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, 454 pyrosequencing, and SOLiD ligation sequencing. Compared to Sanger, NGS allows massively parallel sequencing of many samples at lower cost and higher throughput. While NGS has advanced biological research, each method still has advantages and limitations related to read length, accuracy, and cost.
This document provides an introduction to next generation sequencing (NGS) technologies. It begins with an outline of topics to be covered, including the evolution of NGS technologies, their descriptions and comparisons, bioinformatics challenges of NGS data analysis, and some aspects of NGS data analysis workflows and tools. The document then delves into explanations of specific NGS platforms, their performance characteristics, and the sequencing processes. It discusses the large computational infrastructure and data management needs of NGS, as well as quality control, preprocessing of NGS data, and popular analysis tools and workflows.
Multiple sequence alignment (MSA) involves aligning more than two biological sequences, such as DNA, RNA, or protein sequences. There are several approaches to MSA, including progressive alignment and greedy algorithms. Progressive alignment works by first constructing pairwise alignments between the most similar sequences and then adding other sequences in order of similarity. This is guided by a phylogenetic tree. Greedy algorithms iteratively combine the most similar pair of sequences or profiles into a new sequence/profile to reduce the alignment problem size. Scoring MSA involves quantifying how well-conserved each column is, such as through matches, entropy, or sum-of-pairs scores.
1. The document discusses connectivity preservation problems in graph theory and their parameterized complexity.
2. It presents new results showing that the problems of λ-connectivity deletion and biconnectivity deletion are fixed-parameter tractable when parameterized by the size of the edge deletion set k.
3. The key ideas involve showing that if a graph has many "non-critical" edges, then either the problem instance can be answered or reduced in size. Otherwise, the structure of critical edges can be exploited to find an irrelevant edge or solution.
This document summarizes a presentation given by Nesreen K. Ahmed on graph sampling techniques. It discusses previous work on sampling large graphs to estimate properties like triangle counts. Existing methods either require multiple passes over the data or make assumptions about the graph stream order. The presentation introduces a new single-pass Graph Priority Sampling framework that can estimate properties in an unbiased way using a fixed-size sample. It assigns edge weights and priorities to sample edges proportional to their contribution to graph structures. Estimates can be updated incrementally during the stream or retrospectively after it ends. The framework is evaluated on real-world graphs with billions of edges to estimate triangle counts, wedge counts, and clustering coefficients with low variance.
This document discusses the design of slender columns and two-way slabs. It provides an example problem for designing a long rectangular braced column. The steps include computing factored loads, determining the k value, checking for slenderness effects, calculating the required moment strength, and designing the reinforcement. It also compares one-way and two-way slab behavior, discussing different slab systems like flat plates, waffle slabs, and ribbed slabs.
This document provides an overview of graphs and graph algorithms. It begins with an introduction to graphs, including definitions of vertices, edges, directed/undirected graphs, and graph representations using adjacency matrices and lists. It then covers graph traversal algorithms like depth-first search and breadth-first search. Minimum spanning trees and algorithms for finding them like Kruskal's algorithm are also discussed. The document provides examples and pseudocode for the algorithms. It analyzes the time complexity of the graph algorithms. Overall, the document provides a comprehensive introduction to fundamental graph concepts and algorithms.
This document summarizes several graph algorithms, including:
1) Prim's algorithm for finding minimum spanning trees, which grows a minimum spanning tree by successively adding the closest vertex;
2) Dijkstra's algorithm for single-source shortest paths, which is similar to Prim's and finds shortest paths from a source vertex to all others;
3) An algorithm for all-pairs shortest paths based on repeated squaring of the weighted adjacency matrix, with multiplication replaced by minimization.
The document discusses the conjugate beam method for analyzing beams. It begins by defining the conjugate beam method and explaining that the load on the conjugate beam is equal to the bending moment diagram of the real beam divided by EI. It provides examples of how to determine the supports and loading of the conjugate beam based on the real beam. It then gives the procedure for using the method to find the slope and deflection at any point on the real beam, which involves solving the conjugate beam for reactions and cutting at the desired point to get the shear and moment values. Finally, it provides an example problem demonstrating the full procedure.
This document is a lecture on shear flow and shear stresses in beams and thin-walled members. It begins by defining shear flow in prismatic beams and relating it to the shear force and first moment of area. Examples are then given to calculate shear stresses in different beam cross-sections. The document extends these concepts to beams with arbitrary cross-sections and thin-walled members like flanges. It provides an example calculation of shear stress in the flange of a wide-flange beam. Diagrams and equations are presented throughout to illustrate the relationships between shear flow, shear stress, shear force, and moment of area.
This document provides information about graphs and graph algorithms. It discusses different graph representations including adjacency matrices and adjacency lists. It also describes common graph traversal algorithms like depth-first search and breadth-first search. Finally, it covers minimum spanning trees and algorithms to find them, specifically mentioning Kruskal's algorithm.
대용량의 동적인 그래프 및 텐서 마이닝 (Mining Large Dynamic Graphs and Tensors)NAVER Engineering
발표자: 신기정(CMU 박사 과정)
발표일: 2018.1.
웹, 소셜 미디어, 리뷰 데이터 등 어디서든 쉽게 그래프 데이터를 찾아볼 수 있습니다. 그중 많은 데이터는 테라바이트 혹은 그 이상의 대용량이고, 계속 변화합니다. 또한, 많은 부가 정보를 포함하고 있어 텐서, 즉 다차원 행렬의 형태로 표현됩니다.
본 발표에서는 이러한 대용량의 동적인 그래프 및 텐서 데이터의 구조를 분석하고 이상점을 탐지하기 위한 알고리즘들을 소개합니다. 특히, 그래프 내의 삼각형의 개수를 정확하게 예측하는 기법과 비이상적으로 밀집된 지점을 탐색하는 기법을 중점적으로 소개할 예정입니다.
소개될 알고리즘들은 샘플링과 근사 기법을 활용하며 분산 환경과 스트림 환경에 적합하게 설계되었습니다. 또한, 소셜 미디어의 가짜 영향력자 탐색, 가짜 리뷰 탐색, 네트워크 공격 탐색 등에 적용될 수 있습니다.
- The document discusses geometric shapes and their applications in art, design, engineering and other fields. It provides examples of how triangles, quadrilaterals and other polygons are used.
- Geometric shapes like triangles are commonly used in bridge design because they are structurally strong. The document explores how truss bridges use triangular structures to support weight.
- Artists, animators, jewelers and other professionals rely on an understanding of geometry in their work. Computers now allow designers to experiment more easily with geometric forms.
This document section summarizes information about Euler paths and circuits as well as Hamilton paths and circuits. It discusses:
- Euler's solving of the Königsberg bridge problem and defining Euler paths/circuits
- Necessary and sufficient conditions for a graph to have an Euler path/circuit
- Algorithms for constructing Euler circuits
- Applications of Euler paths/circuits
- Defining Hamilton paths/circuits and discussing properties like Dirac's and Ore's theorems for Hamilton circuits
- Differences between Euler and Hamilton circuits
1) The document discusses inverse trigonometric functions such as arcsin, arccos, and arctan. It reviews special angle identities and defines the inverse trig functions.
2) Examples are provided to demonstrate evaluating inverse trig functions and writing equations in inverse form. The principal values of inverse trig functions are defined as being between -90 and 90 degrees.
3) The document contains examples of using inverse trig functions to find missing side lengths in 30-60-90 and 45-45-90 triangles based on special angle identities, as well as evaluating expressions involving inverse trig functions.
The document summarizes research on the parameterized complexity of the graph MOTIF problem. It begins by defining the problem and providing an example. It then discusses how graph MOTIF can be solved efficiently using different parameters, such as cluster editing, distance to clique, and vertex cover number. The document also analyzes parameters for which graph MOTIF remains NP-hard, such as the deletion set number parameter. In conclusion, it provides references for the algorithms and results discussed.
This document provides a summary of reinforced concrete columns (RCC columns). It defines a column and describes different types of columns based on reinforcement and length. Short columns are less than 12 times the minimum thickness, while long columns are greater than 12 times the thickness. The document outlines preliminary sizing of columns and the functions of tie/spiral reinforcement. It includes design equations for axially loaded columns in working stress design (WSD) and ultimate stress design (USD). Two sample problems are worked through demonstrating column design using both methods.
Curve clipping involves using polygon clipping to test if a curved object's bounding rectangle overlaps a clipping window. If there is no overlap, the object is discarded. If there is overlap, the simultaneous curve and boundary equations are solved to find intersection points. Special cases like circles are considered, such as discarding a circle if its center is outside the clipping window plus/minus the radius. Bezier and spline curves can also be clipped by approximating them as polylines or using their convex hull properties.
Incremental and parallel computation of structural graph summaries for evolvi...Till Blume
This document presents an incremental and parallel algorithm for computing structural graph summaries of evolving graphs. The algorithm incrementally updates graph summaries when the input graph changes, which is often faster than recomputing from scratch. The algorithm partitions the graph and computes summaries in parallel. Experimental results on real-world and benchmark graphs show the incremental algorithm outperforms batch computation even when 50% of the graph changes. The algorithm runs in linear time with respect to graph changes and degree.
How to Fix the Import Error in the Odoo 17Celine George
An import error occurs when a program fails to import a module or library, disrupting its execution. In languages like Python, this issue arises when the specified module cannot be found or accessed, hindering the program's functionality. Resolving import errors is crucial for maintaining smooth software operation and uninterrupted development processes.
A workshop hosted by the South African Journal of Science aimed at postgraduate students and early career researchers with little or no experience in writing and publishing journal articles.
How to Manage Your Lost Opportunities in Odoo 17 CRMCeline George
Odoo 17 CRM allows us to track why we lose sales opportunities with "Lost Reasons." This helps analyze our sales process and identify areas for improvement. Here's how to configure lost reasons in Odoo 17 CRM
Macroeconomics- Movie Location
This will be used as part of your Personal Professional Portfolio once graded.
Objective:
Prepare a presentation or a paper using research, basic comparative analysis, data organization and application of economic information. You will make an informed assessment of an economic climate outside of the United States to accomplish an entertainment industry objective.
Assessment and Planning in Educational technology.pptxKavitha Krishnan
In an education system, it is understood that assessment is only for the students, but on the other hand, the Assessment of teachers is also an important aspect of the education system that ensures teachers are providing high-quality instruction to students. The assessment process can be used to provide feedback and support for professional development, to inform decisions about teacher retention or promotion, or to evaluate teacher effectiveness for accountability purposes.
Strategies for Effective Upskilling is a presentation by Chinwendu Peace in a Your Skill Boost Masterclass organisation by the Excellence Foundation for South Sudan on 08th and 09th June 2024 from 1 PM to 3 PM on each day.
How to Build a Module in Odoo 17 Using the Scaffold MethodCeline George
Odoo provides an option for creating a module by using a single line command. By using this command the user can make a whole structure of a module. It is very easy for a beginner to make a module. There is no need to make each file manually. This slide will show how to create a module using the scaffold method.
Main Java[All of the Base Concepts}.docxadhitya5119
This is part 1 of my Java Learning Journey. This Contains Custom methods, classes, constructors, packages, multithreading , try- catch block, finally block and more.
A review of the growth of the Israel Genealogy Research Association Database Collection for the last 12 months. Our collection is now passed the 3 million mark and still growing. See which archives have contributed the most. See the different types of records we have, and which years have had records added. You can also see what we have for the future.
Thinking of getting a dog? Be aware that breeds like Pit Bulls, Rottweilers, and German Shepherds can be loyal and dangerous. Proper training and socialization are crucial to preventing aggressive behaviors. Ensure safety by understanding their needs and always supervising interactions. Stay safe, and enjoy your furry friends!
2. • De novo genome assembly
• The fragment assembly problem
• Shortest superstring problem
• Seven bridges of Königsberg
• Assembly as a graph theoretical problem
• We construct a de Bruijn graph
• Underlying assumptions of genome assemblies
2Sebastian Schmeier
3. • The process of generating a new genome sequence from NGS genome
sequence reads based on assembly algorithms
• Assembly involves joining short sequence fragments together into long
pieces – contigs
3
reads
genome
sequencing
Sebastian Schmeier
10. • Given:A set of reads (strings) {s1, s2, … , sn }
• Do: Determine a large string s that “best explains” the reads
• What do we mean by “best explains”?
• What assumptions might we require?
10Sebastian Schmeier
11. • Objective: Find a string s such that
• all reads s1, s2, … , sn are substrings of s
• s is as short as possible
• Assumptions:
§ Reads are 100% accurate
§ Identical reads must come from the same location on the genome
§ “best” = “simplest”
11Sebastian Schmeier
13. 13
ATGCG
GCGTG
GTGGC
TGGCA
• Thus, people generally use k-mers of certain length
ß Here we use 3-mers by cutting the original reads into reads of length 3
ATG, TGC, GCG,
GCG, CGT, GTG
GTG,TGG GGC
TGG, GGC, GCA
3-mers
Sebastian Schmeier
14. 14
ATGCG
GCGTG
GTGGC
TGGCA
• Thus, people generally use k-mers of certain length
ß Here we use 3-mers by cutting the original reads into reads of length 3
ATG, TGC, GCG,
GCG, CGT, GTG
GTG,TGG GGC
TGG, GGC, GCA
3-mers
make them unique
Sebastian Schmeier
15. 15
ATGCG
GCGTG
GTGGC
TGGCA
• Thus, people generally use k-mers of certain length
ß Here we use 3-mers by cutting the original reads into reads of length 3
ATG, TGC, GCG,
GCG, CGT, GTG
GTG,TGG GGC
TGG, GGC, GCA
3-mers
ATG
TGC
GCG
CGT
GTG
TGG GGC
GCA Draw edge from x to y
where
suffix from x overlaps prefix from y
Sebastian Schmeier
16. 16
ATGCG
GCGTG
GTGGC
TGGCA
• Thus, people generally use k-mers of certain length
ß Here we use 3-mers by cutting the original reads into reads of length 3
ATG, TGC, GCG,
GCG, CGT, GTG
GTG,TGG GGC
TGG, GGC, GCA
3-mers
ATG
TGC
GCG
CGT
GTG
TGG GGC
GCA Draw edge from x to y
where
suffix from x overlaps prefix from y
Sebastian Schmeier
17. 17
ATGCG
GCGTG
GTGGC
TGGCA
• Thus, people generally use k-mers of certain length
ß Here we use 3-mers by cutting the original reads into reads of length 3
ATG, TGC, GCG,
GCG, CGT, GTG
GTG,TGG GGC
TGG, GGC, GCA
3-mers
ATG
TGC
GCG
CGT
GTG
TGG GGC
GCA Draw edge from x to y
where
suffix from x overlaps prefix from y
Sebastian Schmeier
18. 18
ATGCG
GCGTG
GTGGC
TGGCA
• Thus, people generally use k-mers of certain length
ß Here we use 3-mers by cutting the original reads into reads of length 3
ATG, TGC, GCG,
GCG, CGT, GTG
GTG,TGG GGC
TGG, GGC, GCA
3-mers
ATG
TGC
GCG
CGT
GTG
TGG GGC
GCA Draw edge from x to y
where
suffix from x overlaps prefix from y
Sebastian Schmeier
19. 19
ATGCG
GCGTG
GTGGC
TGGCA
• Thus, people generally use k-mers of certain length
ß Here we use 3-mers by cutting the original reads into reads of length 3
ATG, TGC, GCG,
GCG, CGT, GTG
GTG,TGG GGC
TGG, GGC, GCA
3-mers
ATG
TGC
GCG
CGT
GTG
TGG GGC
GCA
Find Hamiltonian path, that is, a
path that visits every vertex exactly once
Record the First le0er of each vertex +
All le0ers of last vertex
Sebastian Schmeier
20. 20
ATGCG
GCGTG
GTGGC
TGGCA
• Thus, people generally use k-mers of certain length
ß Here we use 3-mers by cutting the original reads into reads of length 3
ß UNFORTUNATELY:The Hamiltonian path problem is very difficult to solve (np-complete)
ATG, TGC, GCG,
GCG, CGT, GTG
GTG,TGG GGC
TGG, GGC, GCA
3-mers
ATG
TGC
GCG
CGT
GTG
TGG GGC
GCA
Find Hamiltonian path, that is, a
path that visits every vertex exactly once
Record the First le0er of each vertex +
All le0ers of last vertex
ATGCGTGGCA
Sebastian Schmeier
21. 21
ATGCG
GCGTG
GTGGC
TGGCA
• Thus, people generally use k-mers of certain length
ß Here we use 3-mers by cutting the original reads into reads of length 3
ß UNFORTUNATELY:The Hamiltonian path problem is very difficult to solve (np-complete)
ATG, TGC, GCG,
GCG, CGT, GTG
GTG,TGG GGC
TGG, GGC, GCA
3-mers
ATG
TGC
GCG
CGT
GTG
TGG GGC
GCA
Find Hamiltonian path, that is, a
path that visits every vertex exactly once
Record the First le0er of each vertex +
All le0ers of last vertex
ATGCGTGGCA
ATGGCGTGCA
Sebastian Schmeier
22. • In 1735 Leonhard Euler was presented with the following problem:
§ find a walk through the city that would cross each bridge once and only once
§ He proved that a connected graph with undirected edges contains an Eulerian cycle
exactly when every node in the graph has an even number of edges touching it.
§ For the Königsberg Bridge graph, this is not the case because each of the four nodes
has an odd number of edges touching it and so the desired stroll through the city does
not exist.
22
https://en.wikipedia.org/wiki/Leonhard_Euler
Sebastian Schmeier
23. 23
A B
D
Vertex
Directed edge
• The degree of a vertex: # of edges connected to it
• outdegree: # of outgoing edges
• indegree: # of ingoing edges
• degree(B)?
• outdegree(B)?
• indegree(D)?
C
Sebastian Schmeier
24. • The case of directed graphs is similar:
§ A graph in which indegrees are equal to outdegrees for all nodes is
called 'balanced'.
§ Euler's theorem states that a connected directed graph has an Eulerian
cycle if and only if it is balanced.
• Mathematically/computationally finding Eulerian path is much easier than
Hamiltonian
à we need to reformulate our assembly problem
24Sebastian Schmeier
25. We construct a de Bruijn graph:
§ edges represent k-mers
§ vertices correspond to (k-1)-mers
1. Form a node for every distinct prefix or suffix of a k-mer
2. Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x
(e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer.
25
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
Sebastian Schmeier
26. We construct a de Bruijn graph:
§ edges represent k-mers
§ vertices correspond to (k-1)-mers
1. Form a node for every distinct prefix or suffix of a k-mer
2. Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x
(e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer.
26
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT
Sebastian Schmeier
27. We construct a de Bruijn graph:
§ edges represent k-mers
§ vertices correspond to (k-1)-mers
1. Form a node for every distinct prefix or suffix of a k-mer
2. Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x
(e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer.
27
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT TG
Sebastian Schmeier
28. We construct a de Bruijn graph:
§ edges represent k-mers
§ vertices correspond to (k-1)-mers
1. Form a node for every distinct prefix or suffix of a k-mer
2. Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x
(e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer.
28
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT TG
GG
Sebastian Schmeier
29. We construct a de Bruijn graph:
§ edges represent k-mers
§ vertices correspond to (k-1)-mers
1. Form a node for every distinct prefix or suffix of a k-mer
2. Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x
(e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer.
29
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT TG
GG
GC CA
CG GT
Sebastian Schmeier
30. We construct a de Bruijn graph:
§ edges represent k-mers
§ vertices correspond to (k-1)-mers
1. Form a node for every distinct prefix or suffix of a k-mer
2. Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x
(e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer.
30
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT TG
GG
GC CA
CG GT
ATG
Sebastian Schmeier
31. We construct a de Bruijn graph:
§ edges represent k-mers
§ vertices correspond to (k-1)-mers
1. Form a node for every distinct prefix or suffix of a k-mer
2. Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x
(e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer.
31
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT TG
GG
GC CA
CG GT
ATG
TGG
Sebastian Schmeier
32. We construct a de Bruijn graph:
§ edges represent k-mers
§ vertices correspond to (k-1)-mers
1. Form a node for every distinct prefix or suffix of a k-mer
2. Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x
(e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer.
32
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT TG
GG
GC CA
CG GT
ATG
TGG GGC
TGC
GTG GCG
CGT
GCA
Sebastian Schmeier
33. Can we find a DNA sequence containing all k-mers?
à In a de Bruijn graph, can we find a path that visits every edge of the graph exactly once?
à Eulerian path
• a vertex v is semibalanced if |indegree(v) – outdegree(v)| = 1
• a connected graph has an Eulerian path if and only if it contains at most two semibalanced
vertices
33
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT TG
GG
GC CA
CG GT
ATG
TGG GGC
TGC
GTG GCG
CGT
GCA
Sebastian Schmeier
34. Can we find a DNA sequence containing all k-mers?
à In a de Bruijn graph, can we find a path that visits every edge of the graph exactly once?
à Eulerian path
• a vertex v is semibalanced if |indegree(v) – outdegree(v)| = 1
• a connected graph has an Eulerian path if and only if it contains at most two semibalanced
vertices
34
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT TG
GG
GC CA
CG GT
ATG
TGG GGC
TGC
GTG GCG
CGT
GCA
Sebastian Schmeier
35. Can we find a DNA sequence containing all k-mers?
à In a de Bruijn graph, can we find a path that visits every edge of the graph exactly once?
à Eulerian path
35
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT TG
GG
GC CA
CG GT
ATG
TGG GGC
TGC
GTG GCG
CGT
GCA
ATGGCGTGCA
Sebastian Schmeier
36. Can we find a DNA sequence containing all k-mers?
à In a de Bruijn graph, can we find a path that visits every edge of the graph exactly once?
à Eulerian path
36
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
AT TG
GG
GC CA
CG GT
ATG
TGG GGC
TGC
GTG GCG
CGT
GCA
ATGGCGTGCA
ATGCGTGGCA
Sebastian Schmeier
37. • Four hidden assumptions that do not hold for next-generation sequencing
We took for granted that:
1. we can generate all k-mers present in the genome
2. all k-mers are error free
3. each k-mer appears at most once in the genome
4. the genome consists of a single chromosome
37Sebastian Schmeier
38. • Four hidden assumptions that do not hold for next-generation sequencing
We took for granted that:
1. we can generate all k-mers present in the genome
2. all k-mers are error free
• Due to these reasons we do NOT choose the longest possible k-mer
• The smaller the k-mer the higher the possibility that we see all k-mers
• Errors:
38
ATGGCGTGCA
ATG TGG GGC GCG CGT GTG TGC GCA
Mostly unaffected k-mers
ATGGCGTGCA
100% affected k-mers
K=3
K=10
Sebastian Schmeier
39. Each k-mer appears at most once in the genome à repeats
• This is most often not true
• This is known as k-mer multiplicity
39
k-mers:
ATG, GCA,
TGC,TGC, GTG, GTG, GCG, GCG, CGT, CGT
Distinct (k-1)-mers:
AT TG GC CA
CG GT
ATG TGC
GTG GCG
CGT
GCA
ATGCGTGCGTGCA
Sebastian Schmeier
40. ‹#› 40
References
How to apply de Bruijn graphs to genome assembly. Phillip E C Compeau, Pavel A Pevzner & GlennTesler. Nature Biotechnology
29, 987–991 (2011) doi:10.1038/nbt.2023 Published online 08 November 2011
Sequence Assembly. Lecture by Mark Craven (craven@biostat.wisc.edu). BMI/CS 576 (www.biostat.wisc.edu/bmi576/), Fall 2011
Sebastian Schmeier
s.schmeier@gmail.com
http://sschmeier.com/bioinf-workshop/