Why and how to clean Illumina genome sequencing reads. Includes illustrative examples, and a case where a project was saved by using Nesoni clip: to discover the cause of non-mapping reads.
De novo genome assembly - IMB Winter School - 7 July 2015 - Torsten Seemann
This document discusses de novo genome assembly, which is the process of reconstructing the original DNA sequence from short fragment reads alone. Due to limitations in sequencing technology, the DNA must be broken into short reads which must then be reassembled like a jigsaw puzzle. Challenges include sequencing errors, repeats, and heterozygosity. Various algorithms and techniques are used to assemble the reads, including overlap layout consensus and de Bruijn graphs. Long read technologies help resolve repeats and scaffold contigs. Software recommendations for de novo assembly include SPAdes, Velvet, and CLC Genomics Workbench.
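The de Bruijn approach mentioned above can be illustrated with a toy sketch (not any particular assembler's algorithm): break reads into k-mers, link each k-mer's prefix to its suffix, and walk the resulting path. This assumes error-free reads and a repeat-free sequence; the function and example data are invented for illustration.

```python
from collections import defaultdict

def kmers(seq, k):
    """All k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def debruijn_assemble(reads, k):
    """Toy de Bruijn assembly: assumes error-free reads and no repeats."""
    graph = defaultdict(list)   # (k-1)-mer prefix -> list of (k-1)-mer suffixes
    indeg = defaultdict(int)
    seen = set()
    for read in reads:
        for km in kmers(read, k):
            if km in seen:      # collapse duplicate k-mers from overlapping reads
                continue
            seen.add(km)
            graph[km[:-1]].append(km[1:])
            indeg[km[1:]] += 1
    # A repeat-free genome yields a simple path; start where nothing points in.
    node = next(n for n in graph if indeg[n] == 0)
    contig = node
    while graph[node]:
        node = graph[node].pop()
        contig += node[-1]
    return contig

reads = ["ATGGC", "GGCGT", "CGTGC", "GTGCA"]
print(debruijn_assemble(reads, 4))  # -> ATGGCGTGCA
```

Real assemblers must additionally handle sequencing errors (tips and bubbles in the graph) and repeats (branching nodes), which is where the hard work lies.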
This document provides an overview of Illumina sequencing, including:
- Illumina sequencing uses a sequencing by synthesis (SBS) approach with reversible terminator chemistry. All four fluorescently labeled bases are present in each sequencing cycle.
- Key steps include library construction, cluster generation via bridge amplification on the flow cell, and imaging after each single-base incorporation.
- Multiplexing allows indexing of multiple samples by attaching barcodes during library preparation. This enables pooled sequencing of many samples.
- Run statistics like number of reads, percentage of high-quality bases, and alignment rates provide information about run quality and performance.
Snippy - Rapid bacterial variant calling - UK - Tue 5 May 2015 - Torsten Seemann
Using Snippy to call variants in bacterial short read datasets via alignment to reference, and then using these alignments to produce core SNP alignments for phylogenomics.
Introducing data analysis: reads to results - AGRF_Ltd
Some reads could align to multiple locations:
Reference: AGTCTTAGGGACTTTATAC
Reads:     AGTC, TAGG, TTAC, CTTT, GGGA
This is ambiguous - TTAC and GGGA could each align in two places.
We need more information (longer reads or paired reads) to resolve.
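The ambiguity above can be made concrete in a few lines of Python: a read that occurs at more than one position in the reference cannot be placed uniquely. This toy function (`find_all` is an invented name) does exact matching only; real mappers also tolerate mismatches and indels.

```python
def find_all(reference, read):
    """All 0-based start positions where read matches reference exactly."""
    return [i for i in range(len(reference) - len(read) + 1)
            if reference[i:i + len(read)] == read]

ref = "AGTCTTAGGGACTTTATAC"
print(find_all(ref, "GGGA"))  # -> [7]          unique placement
print(find_all(ref, "TT"))    # -> [4, 12, 13]  ambiguous: three placements
```

Longer reads shrink the list of candidate positions, and paired reads constrain placement by insert size — which is exactly the extra information the text calls for.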
Overview of methods for variant calling from next-generation sequence data - Thomas Keane
This document provides an overview of methods for variant calling from next-generation sequencing data. It discusses data formats and workflows, including SNP calling, short indels, and structural variation. The document describes alignment, BAM improvement through realignment and base quality recalibration, library merging, and duplicate removal. It also reviews software tools for these processes and introduces the variant call format (VCF) standard.
Part 2 of RNA-seq for DE analysis: Investigating raw data - Joachim Jacob
Second part of the training session 'RNA-seq for Differential expression' analysis. We explain the characteristics of RNA-seq data that allow us to detect differential expression. Interested in following this session? Please contact http://www.jakonix.be/contact.html
How to cluster and sequence an NGS library (James Hadfield 160416) - James Hadfield
A presentation for people interested in understanding how Illumina adapter ligation, clustering and SBS sequencing work. Follow core-genomics: http://core-genomics.blogspot.co.uk/
High data quality and accuracy are recognized characteristics of Sanger re-sequencing projects and are primary reasons that next generation sequencing projects complement their results with capillary electrophoresis data validation. We have developed an on-line tool called Primer Designer™ to streamline the NGS-to-Sanger sequencing workflow by taking the laborious task of PCR primer design out of the hands of the researcher by providing pre-designed assays for the human exome. The primer design tool has been created to enable scientists using next generation sequencing to quickly confirm variants discovered in their work by providing the means to quickly search, order and receive suitable pre-designed PCR primers for Sanger sequencing. Using the Primer Designer™ tool to design M13-tailed and non-tailed PCR primers for Sanger sequencing, we will demonstrate validation of 28 variants across 24 amplicons and 19 genes using the BDD, BDTv1.1 and BDTv3.1 sequencing chemistries on the 3500xl Genetic Analyzer capillary electrophoresis platform.
Primer design is important for PCR, cDNA synthesis, and DNA sequencing. General guidelines include a length of 18-30 bases, 40-60% GC content, and a melting temperature between 55 and 66°C. Primers should avoid secondary structures and mismatches at the 3' end. Common programs for primer design are Primer-BLAST and Primer3, which check specificity and secondary structures. Manual design is used for universal primers derived from multiple sequence alignments. The primers are then checked and optimized before being ordered and used in PCR experiments.
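Those rules of thumb are easy to script. Below is a minimal checker using the Wallace rule (Tm ≈ 4·(G+C) + 2·(A+T), a rough estimate valid only for short oligos) as a stand-in for the nearest-neighbour thermodynamic models that real tools like Primer3 use; the function name and thresholds mirror the guidelines quoted above.

```python
def primer_stats(primer):
    """GC%, Wallace-rule Tm, and a pass/fail against common guidelines."""
    primer = primer.upper()
    gc = sum(primer.count(b) for b in "GC")
    at = sum(primer.count(b) for b in "AT")
    gc_pct = 100.0 * gc / len(primer)
    tm = 4 * gc + 2 * at  # Wallace rule: rough Tm for short oligos only
    ok = (18 <= len(primer) <= 30) and (40 <= gc_pct <= 60) and (55 <= tm <= 66)
    return gc_pct, tm, ok

print(primer_stats("ATGCGTACGTTAGCCTAGCA"))  # -> (50.0, 60, True)
```

A production checker would also screen for hairpins, self-dimers, and 3'-end complementarity, which this sketch omits.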
The document discusses function prediction for unknown proteins. It begins with an overview of common methods for function prediction, including sequence and structure similarity, domains and motifs, gene expression, and interactions. It then uses a protein called Msa as a case study, analyzing it with various tools and finding evidence it may function as a signal transducer in bacterial response to environment. Finally, it briefly discusses another protein M46 and challenges in evaluating prediction accuracy.
The document discusses sources affecting next-generation sequencing (NGS) quality and how to identify problematic NGS samples. It analyzes base sequencing quality, quality trimming, biases from base composition, potential contaminations, and gene content of two samples (A and B). Sample B showed poorer base quality, more unmapped reads, and evidence of Proteobacteria contamination compared to Sample A. Further quality control is recommended to identify issues before assembly.
Next-Generation Sequencing an Intro to Tech and Applications: NGS Tech Overvi... - QIAGEN
This slidedeck provides a technical overview of DNA/RNA preprocessing, template preparation, sequencing and data analysis. It covers the applications for NGS technologies, including guidelines for how to select the technology that will best address your biological question.
This document discusses primer design concepts and applications of real-time PCR. It covers guidelines for optimal primer design such as length, GC content, and melting temperature. Longer primers are more specific but risk secondary structure formation. Primer dimers can decrease efficiency. Real-time PCR has applications in biomedical research, genetic disease diagnosis, cancer research, and forensics. It allows monitoring amplification in real-time and precise quantification of DNA or RNA.
PerkinElmer provides end-to-end next generation sequencing (NGS) services from sample intake to data analysis. Their CLIA-certified sequencing laboratory is staffed by expert scientists with decades of experience in genomics who deliver consistently high quality sequencing results. PerkinElmer offers sequencing, library preparation, capture, bioinformatics analysis, and professional consulting services to build customized NGS solutions that meet customers' specific needs and requirements.
The document discusses RNA-Seq data analysis. Some key points:
- RNA-Seq involves sequencing steady-state RNA in a sample without prior knowledge of the organism. It can uncover novel transcripts and isoforms.
- Making sense of the large and complex RNA-Seq data depends on the scientific question, such as finding transcribed SNPs for allele-specific expression or novel transcripts in cancer samples.
- Common applications of RNA-Seq include abundance estimation, alternative splicing detection, RNA editing discovery, and finding novel transcripts and isoforms.
- Analysis steps include mapping reads to a reference genome/transcriptome, generating mapping statistics and quality metrics, differential expression analysis, clustering, and pathway analysis using tools like
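One small step from the analysis pipeline above — library-size normalisation of a count table — can be sketched directly. Counts-per-million (CPM) is the simplest scheme; dedicated tools such as edgeR and DESeq2 use more robust normalisation, so treat this as illustrative only.

```python
def cpm(counts):
    """Counts-per-million: scale each gene's raw count by total library size."""
    total = sum(counts)
    return [1e6 * c / total for c in counts]

# Toy count table for three genes in one sample
print(cpm([100, 300, 600]))  # -> [100000.0, 300000.0, 600000.0]
```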
This document discusses technical variability in PacBio full-length cDNA sequencing (Iso-Seq). It summarizes the Iso-Seq experimental and informatics pipelines, and analyzes read count variation between technical replicates and tissues. While technical variation is minimal, amplification biases from different enzymes and detection limits remain areas for improvement. Combining Iso-Seq with short-read data may help address these challenges.
This document provides an improved protocol for preparing inexpensive Nextera mate pair libraries with enhanced yield and detection of junction adaptors. Key optimizations include adjusting the tagmentation condition to maximize DNA within the targeted size range, decreasing required reagents for strand displacement by performing it after size selection, and optimizing the Covaris shearing condition to achieve a narrow library size distribution suitable for read lengths. The protocol emphasizes using fewer PCR cycles to reduce amplification, and retaining 100-200 ng of DNA after size selection for optimal results.
The document discusses PCR (polymerase chain reaction) and primer design. It explains that each PCR cycle contains 3 steps: denaturation, primer annealing, and primer extension. These cycles are usually repeated 25-40 times. It also discusses various factors that must be programmed and optimized for PCR, including connection time and temperature, extension time and temperature, buffer concentration, primer characteristics like length, melting temperature, sequence, and potential for hairpin formations or primer-dimer formations. The document emphasizes that primers should be designed to be specific and avoid these unwanted structures.
This presentation discusses mapping RNA-seq reads to genes in order to construct a count table. There are two scenarios - mapping to a reference genome sequence or performing a de novo assembly. When a reference is available, reads are mapped using gapped alignment to account for reads spanning introns. It is important to use genome annotations and check mapping quality using tools like RSeQC and BamQC to visualize coverage and duplication rates.
Struggling with low editing efficiency or delivery problems? IDT has developed a simple and affordable CRISPR-Cas9 solution that outperforms other methods. In this presentation we present the advantages of using a Cas9:tracrRNA:crRNA ribonucleoprotein (RNP) complex in genome editing experiments, and explain why it is the most efficient driver for genome editing compared to alternative methods, such as expression plasmids or the use of sgRNAs. We also review RNP delivery using cationic lipids and electroporation, and provide tips for optimized transfection in your system.
This document discusses PCR and primer design. It explains that primers should be unique sequences between 15-28 bases in length with a GC content around 50-60% to have an appropriate melting temperature between 52-65°C. Primers should not be able to form hairpins, dimers, or secondary structures. When using a primer pair, their annealing temperatures should be within 3°C of each other. The document provides guidelines for primer design and considerations for multiplex PCR.
The document discusses polymerase chain reaction (PCR). It describes PCR as a technique that amplifies specific DNA sequences using DNA polymerase. The key components of a PCR reaction are a template DNA, primers, DNA polymerase, nucleotides, and buffer. Through repeated heating and cooling cycles, the target DNA is amplified exponentially. The document outlines the three phases of PCR - exponential, linear, and plateau. It also discusses various types of PCR like real-time PCR, nested PCR, and their applications in fields like genetics, forensics, and disease diagnosis.
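The exponential phase described above is simple arithmetic: with per-cycle efficiency E (E = 1 meaning perfect doubling), the copy number grows as start·(1+E)^n, which is why 25-40 cycles suffice to amplify a single template to detectable amounts. A hypothetical helper:

```python
def pcr_copies(start, cycles, efficiency=1):
    """Theoretical copy number after the given cycles at per-cycle efficiency E."""
    return start * (1 + efficiency) ** cycles

print(pcr_copies(1, 10))  # -> 1024 copies from a single template
print(pcr_copies(1, 30))  # -> 1073741824 (about 1e9)
```

In practice efficiency falls below 1 and the reaction eventually plateaus as reagents deplete, which is what the linear and plateau phases capture.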
Primers are short strands of RNA or DNA that serve as starting points for DNA synthesis during DNA replication or PCR. In DNA replication, primers are required for DNA polymerases to add new nucleotides to DNA. Primers are built by primase in short bursts on the lagging strand and allow DNA polymerases to synthesize DNA fragments in the 5' to 3' direction. For PCR, primers must be uniquely designed to target a single region, be 18-24 base pairs long, have a melting temperature of 52-60°C and minimal self-complementarity to avoid unwanted structures and ensure specific amplification.
Introduction to Real-Time Quantitative PCR (qPCR) - Download the slides - QIAGEN
This slidedeck introduces the concepts of real-time PCR and how to conduct a real-time PCR assay. The topics that are covered include an overview of real-time PCR chemistries, protocols, quantification methods, real-time PCR applications and factors for success.
This document provides an improved protocol for preparing Nextera Mate Pair libraries with the following key modifications:
1) Optimizing tagmentation and Covaris shearing conditions to increase the yield of DNA fragments in the targeted size range.
2) Reducing volumes in library preparation steps to decrease usage of costly reagents while maintaining performance.
3) Recommending sequencing strategies like read lengths that maximize the proportion of read pairs containing junction adaptors, important for scaffolding.
Assembling NGS Data - IMB Winter School - 3 July 2012 - Torsten Seemann
This document discusses assembling next-generation sequencing (NGS) data using de novo assembly. De novo assembly involves reconstructing original DNA sequences from short fragment read sequences without a reference genome. It involves finding overlaps between reads and tracing paths in a graph representation while dealing with sequencing errors and repeats. The document outlines the assembly process including finding overlaps, building graphs, simplifying graphs, traversing graphs, and assessing assemblies. It provides examples of tools used for genome, metagenome, transcriptome, and metatranscriptome assembly including Velvet, Abyss, and Trinity.
1) The document discusses comparing bacterial isolates using whole genome sequencing and bioinformatics techniques.
2) Key steps include isolating bacteria from samples, sequencing their genomes, performing de novo assembly and annotation, and clustering homologous genes to determine the pan-genome and core genome.
3) Comparing single nucleotide polymorphisms (SNPs) in the core genome allows construction of phylogenetic trees to infer relationships between isolates.
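The SNP-comparison step behind such trees reduces to pairwise Hamming distance over the core-SNP alignment. A toy version (sequence labels and data invented; real pipelines such as Snippy feed the alignment itself to dedicated tree builders):

```python
from itertools import combinations

def snp_dist(a, b):
    """Number of differing sites between two equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

core_snps = {"isolateA": "ACGTA", "isolateB": "ACGTT", "isolateC": "TCGAT"}
for p, q in combinations(core_snps, 2):
    print(p, q, snp_dist(core_snps[p], core_snps[q]))
# isolateA-isolateB differ at 1 site; both are 3 and 2 sites from isolateC
```

The resulting distance matrix (or the alignment directly, via maximum likelihood) is what phylogenetic software turns into a tree.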
WGS in public health microbiology - MDU/VIDRL Seminar - Wed 17 Jun 2015 - Torsten Seemann
How genomics is changing the practice of public health microbiology. The role of whole genome sequencing as the "one true assay". Another powerful tool for the epidemiologist.
A presentation to a lay audience at Melbourne Knowledge Week on how bacteria are a part of our life and what we are doing with genomics to manage them.
Pipeline or pipe dream - Midlands Micro Meeting UK - Mon 15 Sep 2014 - Torsten Seemann
This document summarizes a presentation about transitioning a public health microbiology laboratory from traditional bacterial typing methods like PFGE to whole genome sequencing (WGS) and bioinformatics. Key points include:
1) WGS allows for higher resolution analysis of bacterial isolates compared to traditional methods and can identify genomic variations, plasmids, and resistance/virulence genes.
2) Implementing WGS requires developing automated pipelines for processing large volumes of sequencing data using techniques like read mapping, de novo assembly, and k-mer analysis.
3) The goals are to modernize operations, increase research output, and develop shared online systems and data sharing between laboratories internationally.
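Of the techniques listed, k-mer analysis is the simplest to sketch: counting fixed-length substrings is the basis of reference-free comparison and species identification. A minimal counter (function name invented for illustration):

```python
from collections import Counter

def kmer_counts(seq, k):
    """Multiset of all k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

print(kmer_counts("ATATAT", 3))  # -> Counter({'ATA': 2, 'TAT': 2})
```

Comparing two genomes' k-mer profiles (e.g. by shared-k-mer fraction) gives a fast similarity estimate without any alignment.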
A peek inside the bioinformatics black box - DCAMG Symposium - Mon 20 July 2015 - Torsten Seemann
An introduction to basic genomics bioinformatics concepts in 20 minutes for an audience of clinicians, epidemiologists and other public health officials.
This document contains information about genome assemblies and annotations for multiple mouse strains. It lists the strains and key details about their genome assemblies such as length, N50 size, and largest scaffold. It also lists the organizations and researchers involved in generating the assemblies and annotations over multiple years from 2014 to 2018. Manual curation and integration of gene sets and annotations is described.
Assessing the impact of transposable element variation on mouse phenotypes an... - Thomas Keane
This document summarizes research assessing the impact of transposable element (TE) variation on mouse phenotypes and traits. Over 100,000 TE variants were detected among 17 laboratory mouse strains by whole genome sequencing, including both insertions present or absent compared to the reference genome. Validation experiments confirmed the accuracy of the calls. Analysis showed the distribution and structure of TE families varies between strains, and some TE classes like ERVs are expanding. The goal is to better understand how TE variation contributes to phenotypic diversity and complex traits in mice.
Presentation from the 3rd Joint Meeting of the Antimicrobial Resistance and Healthcare-Associated Infections (ARHAI) Networks, organised by the European Centre of Disease Prevention and Control - Stockholm, 11-13 February 2015
Long read sequencing - LSCC lab talk - Fri 5 June 2015 - Torsten Seemann
Long read sequencing technologies such as Pacific Biosciences and Oxford Nanopore offer several advantages over short read technologies. They can generate reads of over 100 kb, which allows untangling of repeats, completion of genomes, and phasing of haplotypes. While PacBio is more established, Nanopore offers the potential for real-time, portable sequencing. Both require adaptation of bioinformatics tools and analysis approaches. The new technologies will change genomics jobs by moving more toward streaming analysis and requiring skills in adapting to changing technologies.
Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed...Torsten Seemann
I describe the three levels of parallelism that can be exploited in bioinformatics software (1) clusters of multiple computers; (2) multiple cores on each computer; and (3) vector machine code instructions.
This document summarizes a presentation on using whole genome sequencing (WGS) for rapid characterization of bacterial outbreaks. The presenter discusses transitioning public health labs from traditional typing methods to WGS-based approaches. Key points include developing automated analysis pipelines to identify bacteria, determine antimicrobial resistance and virulence genes, and construct phylogenomic trees from core genome SNPs. The goal is a cloud-based system allowing labs to securely upload and analyze sequencing data with open source tools integrated in modular pipelines.
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...Torsten Seemann
This document discusses de novo genome assembly, which is the process of reconstructing long genomic sequences from many short sequencing reads without the aid of a reference genome. It is challenging due to factors like short read lengths, repetitive sequences that complicate the assembly graph, and sequencing errors. The goals of assembly are to produce contiguous sequences with high completeness and correctness by resolving overlaps between reads into consensus sequences. Metrics like N50, core gene content, and read remapping are used to assess assembly quality.
Rapid automatic microbial genome annotation using Prokka
Dr Torsten Seemann presents on Prokka, a tool he developed for rapid automatic annotation of microbial genomes. Prokka uses existing gene prediction tools like Prodigal and Infernal along with database searches to identify features like protein coding genes, tRNAs, and rRNAs. Prokka aims to annotate genomes quickly in under 15 minutes while providing standardized GFF3 and Genbank output files along with provenance on the sources of annotations. Prokka has been used to annotate over 50,000 draft genomes and is an ongoing project aimed at improving accuracy, modularity, and performance.
Multiple mouse reference genomes and strain specific gene annotationsThomas Keane
This document discusses multiple efforts related to developing reference genomes and gene annotations for laboratory mouse strains:
1) Genome assemblies have been improved for several strains using techniques like Illumina sequencing, Dovetail scaffolding, and PacBio alignments.
2) Gene predictions are being developed using a combination of annotation lifting from C57BL/6J, local refinement with strain-specific RNA-seq data, and de novo prediction.
3) Resources have been created for viewing and accessing these new reference genomes and annotations.
This document summarizes a presentation on mouse genomic variation and its effect on phenotypes and gene regulation. It discusses the Mouse Genomes Project which sequenced 18 laboratory mouse strains to catalog genetic variants like SNPs and structural variations. It also analyzed RNA-sequencing data to identify over 36,000 candidate RNA editing sites, with most being adenosine-to-inosine edits. Some edits were found to alter protein coding sequences or be conserved across species, potentially impacting gene regulation and phenotypes.
Antimicrobial resistance (AMR) in N. gonorrhoeae (GC) - global problem but v...Игорь Шадеркин
Antimicrobial resistance in Neisseria gonorrhoeae is a global problem, but valid data are lacking in many areas. Gonorrhea surveillance is crucial for public health to help prevent untreatable infections and inform treatment guidelines. However, resistance to traditional antibiotics is very high in most countries, and multi-drug resistant strains have emerged. Improved diagnostic testing and surveillance of antibiotic resistance according to WHO standards is needed worldwide, especially in low-resource areas.
this is the project regarding the detection and analysis of DNA sequences,it provide the fascility to find the repets from the hudge data set.we can find tha all repeats which is occured in human body.
This document discusses instrumentation and analysis of the NAS Parallel Benchmarks (NPB) application using the Extrae tracing library. It summarizes the tests performed on local and remote machines using 2, 4, 8, 16, and 32 processes. Key metrics like computation time, communication time, load imbalance, and bottlenecks are measured. The analysis shows the NPB application scales well on the remote server but not the local laptop beyond 16 processes due to increased communication and wait times.
PCR is used to amplify specific DNA sequences. It involves repeated cycles of heating and cooling of the DNA sample to cause DNA replication between two primers that flank the target sequence. Each cycle doubles the amount of target DNA. After many cycles, the target is amplified exponentially into billions of copies. Key steps are denaturation to separate DNA strands, annealing to allow primers to bind, and extension to replicate the target using a DNA polymerase. Proper primer design is important for specificity of the amplification.
The document describes the polymerase chain reaction (PCR) process for amplifying a specific segment of DNA. PCR makes millions of copies of the target DNA segment in a few hours by using DNA polymerase enzyme and primers. It involves repeated cycles of heating and cooling of the DNA sample to denature and renature the DNA. The target DNA segment is amplified selectively between the forward and reverse primers. Key requirements for successful PCR include appropriate primer design and optimization of reaction conditions.
Polymerase chain reaction (PCR) is a technique used to amplify a specific region of DNA without cloning. It involves repeated cycles of separating the DNA strands by heating and synthesizing new strands with DNA polymerase. Two primers that flank the region of interest are used to determine the boundaries of the target sequence. During each cycle of PCR, the DNA is denatured by heating and the primers anneal to the single-stranded DNA. DNA polymerase then extends from the primers to synthesize new DNA strands. After multiple cycles, the target sequence is amplified exponentially into millions of copies. PCR is widely used in research, forensics, and medical diagnosis due to its simplicity, sensitivity, and ability to amplify specific DNA
PCR (polymerase chain reaction) is a technique that allows millions of copies of a specific DNA fragment to be produced. It involves repeated cycles of heating and cooling of the DNA sample in the presence of primers and a DNA polymerase. This allows for the targeted amplification of the DNA region between the primers. Real-time PCR (qPCR) allows for quantitative analysis of the amount of DNA present in samples and is commonly used for gene expression analysis and detection of pathogens.
qRT-PCR is a technique that allows quantification of RNA transcripts. It involves reverse transcribing RNA to cDNA, then amplifying and detecting the cDNA using PCR. There are two main detection methods - fluorescent probes like TaqMan probes which fluoresce upon cleavage during PCR, and fluorescent dyes like SYBR Green which bind double stranded DNA. Analysis of the amplification curve allows quantification of initial transcript levels based on the cycle threshold. Controls are important for quality assurance and normalization to account for differences in input RNA and reaction efficiency. qRT-PCR is useful for studying gene expression levels and transcriptional changes.
Polymerase chain reaction (PCR) is a technique used in molecular biology to amplify a single copy or a few copies of a segment of DNA across several orders of magnitude, generating thousands to millions of copies of a particular DNA sequence. It is an easy, cheap, and reliable way to repeatedly replicate a focused segment of DNA, a concept which is applicable to numerous fields in modern biology and related sciences.
OpenFOAM benchmark for EPYC server -Influence of coarsestLevelCorr in GAMG so...takuyayamamoto1800
Parallel efficiency of GAMG solver in OpenFOAM is evaluated for EPYC server. Especially, in this study, the influence of coarsestLevelCorr on the calculation time is evaluated in lid driven cavity flow.
This document summarizes a presentation about RNase H2 PCR (rhPCR), a new molecular technology that uses an RNA residue in PCR primers and a thermostable RNase H2 enzyme. It describes how rhPCR works, the advantages it provides over traditional PCR including reduced primer-dimer formation and improved specificity for rare allele detection. Two generations of cleavable primer designs - GEN1 and GEN2 - are discussed, along with their different applications. Examples are provided that demonstrate how rhPCR can improve assays for SNP genotyping, multiplex PCR, and detection in complex backgrounds.
BLAST and FASTA are algorithms for searching sequence databases to find local alignments between a query sequence and database sequences, with BLAST providing faster searches and improved statistical analysis compared to FASTA. Both algorithms work by first identifying short exact matches between sequences and then extending these matches to identify longer regions of similarity. The algorithms model DNA and protein sequence alignments as coin tosses to determine the expected length of the longest matching region between random sequences.
Primers are short DNA sequences used to initiate DNA replication via polymerase chain reaction (PCR). Good primer design is important for PCR to work properly. Primers should be 18-24 base pairs in length and have a GC content and melting temperature that allows for specific annealing. Software tools can help design primers that meet characteristics like avoiding primer-dimer formations and complementarity at the 3' end. Common steps in primer design include specifying the target DNA, selecting primer length and melting temperature parameters, and filtering results.
This document discusses a paradigm shift in wireless sensor network design from peer-to-peer to network-based approaches using the CEO problem framework. It proposes a wireless sensor network model where sensors observe noisy versions of a random source and transmit to a fusion center. An iterative algorithm is introduced to estimate the observation error probabilities at each sensor and update log-likelihood ratios for decoding. Simulation results show the estimated probabilities achieve near-optimal performance and the algorithm works well even when probabilities vary. Some open questions remain around multiple access techniques, source-channel separation, deriving rate-distortion bounds, and short block length cases.
Using neon for pattern recognition in audio dataIntel Nervana
This document provides an outline and overview of a recurrent neural network meetup focused on using the neon deep learning framework. The meetup covers an introduction to neon, setting up the workshop environment, convolutional neural network theory and hands-on examples, recurrent neural network theory and hands-on examples including for audio data processing. Benchmark results are shown indicating neon's performance. Examples demonstrated include music genre classification using CNNs and whale call identification using RNNs.
Generating random numbers in a highly parallel program is surprising non-trivial. A lot of good generators have lots of state and is purely serial. Simple generators like LCG can leapfrog ahead but of limited quality and depends on #cores. We want our code to be independent of the degree of parallelism.
This document discusses parallel random number generation techniques. It reviews serial random number generators like linear congruential generators and lagged Fibonacci generators. For parallel generation, it describes methods like leapfrogging where each thread independently generates a subset of the sequence, and sequence splitting where the serial sequence is divided among threads. Cryptographic hashing of incremental inputs is also proposed as a parallel-friendly approach that generates independent and high-quality random streams for each thread.
Genetic engineering involves manipulating DNA through techniques like selective breeding, hybridization, genetic bottlenecks, inbreeding, and genetic engineering. Genetic engineering uses vectors to insert genes into host organisms. Key steps include isolating the gene, inserting it into a host using a vector, producing copies of the host, and purifying the gene product. Restriction enzymes and ligases are important tools that cut and join DNA. PCR is used to amplify DNA, and sequencing methods like Sanger sequencing determine the DNA sequence. Primer design is important for techniques like PCR, cloning, and discovery of unknown sequences through degenerate primers.
This document discusses ring-based homomorphic encryption schemes and compares the efficiency of four schemes: BGV, FV, NTRU, and YASHE. The schemes are analyzed by measuring ciphertext size under varying parameters like plaintext modulus size and circuit depth. For small plaintext sizes, YASHE is most efficient, but BGV generally performs best as plaintext size increases. The analysis provides a starting point for comparing ring-based schemes but could be improved with a stricter security analysis.
Similar to Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012 (20)
How to write bioinformatics software no one will useTorsten Seemann
This document provides guidance on how to write bioinformatics software that others will find useful. It recommends choosing a descriptive name, keeping documentation and installation simple, following standards for command line interfaces and file formats, publishing the software on repositories, and supporting users through updates, documentation, and issue tracking. The document concludes by discussing the presenter's work on various bioinformatics tools, including improvements planned.
Snippy is a command line tool for rapidly calling bacterial variants and performing core genome alignments from sequencing data. It runs quickly using multiple CPU cores and produces standard output files like VCF, BAM, and TXT. Snippy can also combine results from multiple runs into a phylogenetic tree of aligned sequences.
Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobi...Torsten Seemann
Torsten Seemann discussed bioinformatic tools for diagnostic laboratories using whole genome sequencing (WGS). He explained that WGS generates large amounts of sequencing reads that can be assembled de novo or aligned to references to identify single nucleotide polymorphisms (SNPs) and characterize genomes. Key applications of WGS include diagnostic identification, antimicrobial resistance profiling, virulence factor detection, and high-resolution epidemiological typing through SNP analysis and phylogenetic trees. Seemann emphasized that WGS analysis requires metadata, domain expertise, and open data sharing for maximum public health benefit.
Sequencing your poo with a usb stick - Linux.conf.au 2016 miniconf - mon 1 ...Torsten Seemann
This talk introduces a Linux Professional audience to bacterial genomics and modern sequencing technology. The title is slightly misleading and is a bit of clickbait. The diagrams are good.
Long read sequencing - WEHI bioinformatics seminar - tue 16 june 2015Torsten Seemann
Long read sequencing - the good, the bad, and the really cool. Covers Illumina SLR, Pacbio RSII and Oxford Nanopore as of June 2015. Discusses bioinformatics differences of long reads over short reads.
Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...Torsten Seemann
Invited talk at the Australian Society for Microbiology Annual Conference 2014 on "FriPan" our tool for visualizing bacterial pan genomes across 10-100s of isolates.
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...Sérgio Sacani
Context. With a mass exceeding several 104 M⊙ and a rich and dense population of massive stars, supermassive young star clusters
represent the most massive star-forming environment that is dominated by the feedback from massive stars and gravitational interactions
among stars.
Aims. In this paper we present the Extended Westerlund 1 and 2 Open Clusters Survey (EWOCS) project, which aims to investigate
the influence of the starburst environment on the formation of stars and planets, and on the evolution of both low and high mass stars.
The primary targets of this project are Westerlund 1 and 2, the closest supermassive star clusters to the Sun.
Methods. The project is based primarily on recent observations conducted with the Chandra and JWST observatories. Specifically,
the Chandra survey of Westerlund 1 consists of 36 new ACIS-I observations, nearly co-pointed, for a total exposure time of 1 Msec.
Additionally, we included 8 archival Chandra/ACIS-S observations. This paper presents the resulting catalog of X-ray sources within
and around Westerlund 1. Sources were detected by combining various existing methods, and photon extraction and source validation
were carried out using the ACIS-Extract software.
Results. The EWOCS X-ray catalog comprises 5963 validated sources out of the 9420 initially provided to ACIS-Extract, reaching a
photon flux threshold of approximately 2 × 10−8 photons cm−2
s
−1
. The X-ray sources exhibit a highly concentrated spatial distribution,
with 1075 sources located within the central 1 arcmin. We have successfully detected X-ray emissions from 126 out of the 166 known
massive stars of the cluster, and we have collected over 71 000 photons from the magnetar CXO J164710.20-455217.
The binding of cosmological structures by massless topological defectsSérgio Sacani
Assuming spherical symmetry and weak field, it is shown that if one solves the Poisson equation or the Einstein field
equations sourced by a topological defect, i.e. a singularity of a very specific form, the result is a localized gravitational
field capable of driving flat rotation (i.e. Keplerian circular orbits at a constant speed for all radii) of test masses on a thin
spherical shell without any underlying mass. Moreover, a large-scale structure which exploits this solution by assembling
concentrically a number of such topological defects can establish a flat stellar or galactic rotation curve, and can also deflect
light in the same manner as an equipotential (isothermal) sphere. Thus, the need for dark matter or modified gravity theory is
mitigated, at least in part.
ESPP presentation to EU Waste Water Network, 4th June 2024 “EU policies driving nutrient removal and recycling
and the revised UWWTD (Urban Waste Water Treatment Directive)”
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...University of Maribor
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
Or: Beyond linear.
Abstract: Equivariant neural networks are neural networks that incorporate symmetries. The nonlinear activation functions in these networks result in interesting nonlinear equivariant maps between simple representations, and motivate the key player of this talk: piecewise linear representation theory.
Disclaimer: No one is perfect, so please mind that there might be mistakes and typos.
dtubbenhauer@gmail.com
Corrected slides: dtubbenhauer.com/talks.html
Nucleophilic Addition of carbonyl compounds.pptxSSR02
Nucleophilic addition is the most important reaction of carbonyls. Not just aldehydes and ketones, but also carboxylic acid derivatives in general.
Carbonyls undergo addition reactions with a large range of nucleophiles.
Comparing the relative basicity of the nucleophile and the product is extremely helpful in determining how reversible the addition reaction is. Reactions with Grignards and hydrides are irreversible. Reactions with weak bases like halides and carboxylates generally don’t happen.
Electronic effects (inductive effects, electron donation) have a large impact on reactivity.
Large groups adjacent to the carbonyl will slow the rate of reaction.
Neutral nucleophiles can also add to carbonyls, although their additions are generally slower and more reversible. Acid catalysis is sometimes employed to increase the rate of addition.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich on features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and quality to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization. 
To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
BREEDING METHODS FOR DISEASE RESISTANCE.pptxRASHMI M G
Plant breeding for disease resistance is a strategy to reduce crop losses caused by disease. Plants have an innate immune system that allows them to recognize pathogens and provide resistance. However, breeding for long-lasting resistance often involves combining multiple resistance genes
This presentation explores a brief idea about the structural and functional attributes of nucleotides, the structure and function of genetic materials along with the impact of UV rays and pH upon them.
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...University of Maribor
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...AbdullaAlAsif1
The pygmy halfbeak Dermogenys colletei, is known for its viviparous nature, this presents an intriguing case of relatively low fecundity, raising questions about potential compensatory reproductive strategies employed by this species. Our study delves into the examination of fecundity and the Gonadosomatic Index (GSI) in the Pygmy Halfbeak, D. colletei (Meisner, 2001), an intriguing viviparous fish indigenous to Sarawak, Borneo. We hypothesize that the Pygmy halfbeak, D. colletei, may exhibit unique reproductive adaptations to offset its low fecundity, thus enhancing its survival and fitness. To address this, we conducted a comprehensive study utilizing 28 mature female specimens of D. colletei, carefully measuring fecundity and GSI to shed light on the reproductive adaptations of this species. Our findings reveal that D. colletei indeed exhibits low fecundity, with a mean of 16.76 ± 2.01, and a mean GSI of 12.83 ± 1.27, providing crucial insights into the reproductive mechanisms at play in this species. These results underscore the existence of unique reproductive strategies in D. colletei, enabling its adaptation and persistence in Borneo's diverse aquatic ecosystems, and call for further ecological research to elucidate these mechanisms. This study lends to a better understanding of viviparous fish in Borneo and contributes to the broader field of aquatic ecology, enhancing our knowledge of species adaptations to unique ecological challenges.
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills MN
Travis Hills of Minnesota developed a method to convert waste into high-value dry fertilizer, significantly enriching soil quality. By providing farmers with a valuable resource derived from waste, Travis Hills helps enhance farm profitability while promoting environmental stewardship. Travis Hills' sustainable practices lead to cost savings and increased revenue for farmers by improving resource efficiency and reducing waste.
hematic appreciation test is a psychological assessment tool used to measure an individual's appreciation and understanding of specific themes or topics. This test helps to evaluate an individual's ability to connect different ideas and concepts within a given theme, as well as their overall comprehension and interpretation skills. The results of the test can provide valuable insights into an individual's cognitive abilities, creativity, and critical thinking skills
3. Illumina reads
● Usually 100 or 150 bp
○ 250bp rolling out now, 400bp next year
● Indel errors rare
● Homopolymer errors rare
● Substitution errors < 1%
○ Error rate higher at 3' end
● Adaptor issues
○ rare in HiSeq (TruSeq prep)
○ more common in MiSeq (Nextera prep)
● Very high quality overall
4. Illumina libraries
●"single end"
○just a shotgun read sequenced from one end
●"paired end"
○~500bp fragments sequenced at both ends
○very reliable
●"mate pair"
○2-10 kbp fragments are circularized
○then sequenced with the "paired end" protocol
○reliability varies
5. Sequences have errors
● garbage reads
○ instrument weirdness
● duplicate reads
○ low complexity library, PCR duplicate
● adaptor read-through
○ fragment too short
● indel errors
○ skipping bases, inserting extra bases
● uncalled base
○ couldn't reliably estimate, replace with "N"
● substitution errors
○ reading wrong base
(an arrow on the slide ranks these from more common to less common)
6. Why clean reads?
● Erroneous data may cause software to:
○ run more slowly
○ use more RAM
○ produce poor / biased / incorrect results
● Cleaning can:
○ improve overall average quality of the reads
■ hopefully giving a reliable result
○ reduce the volume of reads
■ some algorithms are O(N log N) or O(N²)
■ enables processing that otherwise wouldn't be feasible
● (some software does handle them appropriately)
8. DNA sequence quality
● DNA sequences often have a
quality value associated with each
nucleotide
● A measure of reliability for each base
○ as it is derived from physical process
■ chromatogram (Sanger sequencing)
■ pH reading (Ion Torrent sequencing)
● Formalised by the Phred software for the
Human Genome Project
9. Phred qualities
● Q = -10 × log10(P)  <=>  P = 10^(-Q/10)
○ Q = Phred quality score
○ P = probability of base call being incorrect
Quality value   Chance it is wrong   Accuracy
Q10             1 in 10              90%
Q20             1 in 100             99%
Q30             1 in 1,000           99.9%
Q40             1 in 10,000          99.99%
Q50             1 in 100,000         99.999%
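As a sanity check, the conversion both ways is a one-liner in Python (an illustrative sketch, not part of the slides):

```python
import math

def phred_to_prob(q):
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)

def prob_to_phred(p):
    """Phred quality score for a given error probability p."""
    return -10 * math.log10(p)

# Reproduces the table above: Q10 -> 1 in 10, ..., Q50 -> 1 in 100,000
for q in (10, 20, 30, 40, 50):
    p = phred_to_prob(q)
    print(f"Q{q}: 1 in {round(1 / p):,}, accuracy {100 * (1 - p):g}%")
```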
12. Anatomy of a FASTQ entry
@read00179 ← "@" start symbol, then sequence ID
AGTCTGATATGCTGTACCTATTATAATTCTAGGCGCTCATGCCCGCGGATATCGTAGCTATATGCTTCA
+ ← separator line
8;ACCCD?DD???@B9<9<CAC@=AAAA8A;B<A@882,+495;;3990,02..-&-&-*,,,,(0**#
↑ encoded quality values, one symbol per nucleotide
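The fields above can be read with a minimal parser; this sketch assumes the common four-lines-per-record layout and Phred+33 quality encoding (FASTQ records can legally wrap sequence and quality over multiple lines, which is ignored here):

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality_scores) from FASTQ lines.
    Assumes 4 lines per record and Phred+33 quality encoding."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                              # '+' separator line
        qual = next(it).strip()
        scores = [ord(c) - 33 for c in qual]  # decode Phred+33
        yield header.strip().lstrip("@"), seq, scores

record = ["@read00179", "AGTCTGATAT", "+", "8;ACCCD?DD"]
rid, seq, scores = next(parse_fastq(record))
print(rid, scores[:2])   # first two qualities: '8' -> 23, ';' -> 26
```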
16. Ambiguous bases
● If there is ambiguity in the base call, an "N" is used
@ILLUMINA:6:1:964:115#GATCAG/1
GGACCTGAGAGTGTGCATGAAGAGGGCAGCGCGCACNGCATCA
+
HHHGFGEEECDEBA@BBBA<=<;:98743720&,+**_%$#"!
● Possible software responses:
○ Crash!
○ Ignore it
○ Silently convert to a fixed or random base (Velvet → 'A')
○ Handle it appropriately (Bowtie2)
● Small proportion overall, safer to discard
17. Homopolymers
● A read consisting of all the same base
@ILLUMINA:6:1:964:115#GATCAG/1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
HHHGFGEEECDEBA@BBBA<=<;:98743720&,+**_%$#"!
● Often occur from clusters at edge of flowcell lane
● Early Illumina software called '?' as 'A' rather than N
● Unlikely to be present in real DNA
● Best to discard
● Less common with newer Illumina OLB software
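The two discard rules above (reads containing uncalled bases, and homopolymer reads) are cheap to apply; a sketch:

```python
def is_homopolymer(seq):
    """True if the read is a single repeated base (e.g. all 'A')."""
    return len(set(seq)) == 1

def keep_read(seq):
    """Discard reads containing 'N' or consisting of one base."""
    return "N" not in seq and not is_homopolymer(seq)

print(keep_read("AAAAAAAAAAAA"))              # False: homopolymer
print(keep_read("GGACCTGAGNGTGTGCATGAAGAG"))  # False: uncalled base
print(keep_read("GGACCTGAGAGTGTGCATGAAGAG"))  # True
```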
18. Quality trimming
●Remove low quality sequence
○Q=13 corresponds to 5% error (p=0.05)
○Q=0..13 encoded by: !"#$%&'()*+,-.
@ILLUMINA:6:1:9646:1115#GATCAG/1
GGACCTGAGAGTGTGCATGAAGAGGGCAGCCCCGCACTGCATG
+
HHHGFGEEECDEBA@BBBA<=<;:98743720&,+**_%$#"!
●Trimming can be applied per:
○each base
○moving-average window, e.g. 3-base mean
○minimum % good per window, e.g. need 4 of 5
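A windowed 3' quality trim along these lines can be sketched as follows (Q13 threshold and 3-base window taken from the slide; real tools vary in the exact rule):

```python
def quality_trim(seq, quals, min_q=13, window=3):
    """Trim the 3' end until a `window`-base window has mean
    quality >= min_q. A sketch of windowed quality trimming."""
    end = len(seq)
    while end >= window:
        if sum(quals[end - window:end]) / window >= min_q:
            break                 # last window is good enough: stop
        end -= 1
    else:
        end = 0                   # whole read was below threshold
    return seq[:end], quals[:end]

# Qualities decay toward the 3' end, as is typical for Illumina reads
seq = "ACGTACGTAC"
quals = [30, 30, 30, 30, 28, 20, 10, 5, 2, 2]
trimmed, _ = quality_trim(seq, quals)
print(trimmed)   # ACGTACG
```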
19. Illumina Adaptors
● Used in the sequencing chemistry
● Can appear at ends of read sequences
● Worse for mate-pair than for paired-end reads
● PCR Primer
CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT
● Genomic DNA Sequencing Primer
CACTCTTTCCCTACACGACGCTCTTCCGATCT
● All other TruSeq & Nextera Adaptors
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
20. Adaptor clipping
● Method
○ Align 3' and 5' read end against all adaptor sequences
○ If there is an anchored "match", trim the read
● Minimum length of match?
○ want to remove adaptor, but not real sequence [10 bp]
● Allow substitutions in match?
○ as reads have errors, need some tolerance [1 sub]
● Allow gaps/indels in match?
○ indels are unlikely in Illumina reads [no]
● Slow to perform compared to other preprocessing steps
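The anchored-match method above can be sketched in a few lines (parameters from the slide: minimum 10 bp match, up to 1 substitution, no gaps; production clippers such as cutadapt are considerably more sophisticated):

```python
def clip_adaptor(seq, adaptor, min_match=10, max_subs=1):
    """Clip an adaptor anchored at the 3' end of the read: try each
    read suffix against the adaptor prefix, allowing up to max_subs
    substitutions and no gaps; trim at the first anchored match."""
    for start in range(len(seq) - min_match + 1):
        tail = seq[start:]
        n = min(len(tail), len(adaptor))
        subs = sum(1 for a, b in zip(tail, adaptor) if a != b)
        if n >= min_match and subs <= max_subs:
            return seq[:start]    # drop the matched adaptor suffix
    return seq                    # no anchored match: keep read as-is

# Genomic DNA sequencing primer from the slide above
adaptor = "CACTCTTTCCCTACACGACGCTCTTCCGATCT"
read = "GGACCTGAGAGTGTG" + adaptor[:12]   # read-through into adaptor
print(clip_adaptor(read, adaptor))        # GGACCTGAGAGTGTG
```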
21. Decloning
●Illumina "mate pair" sequencing
○Requires a lot of starting DNA
○Challenging protocol to implement reliably
○Not enough final DNA leads to PCR duplicates
○Coverage is highly non-uniform and sporadic
○Causes bias in analyses
●Decloning
○Replace clones with a single representative
○Choose representative with highest quality
○Helps salvage usable information content
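The decloning step described above reduces, for each distinct sequence, all copies to the single highest-quality representative; a toy version:

```python
def declone(reads):
    """Collapse duplicate reads: for each distinct sequence keep the
    copy with the highest summed quality (one per clone). Relies on
    dict insertion order (Python 3.7+) to keep output order stable."""
    best = {}
    for name, seq, quals in reads:
        if seq not in best or sum(quals) > sum(best[seq][2]):
            best[seq] = (name, seq, quals)
    return list(best.values())

reads = [
    ("r1", "ACGT", [30, 30, 30, 30]),
    ("r2", "ACGT", [40, 40, 40, 40]),   # PCR duplicate, higher quality
    ("r3", "TTTT", [20, 20, 20, 20]),
]
kept = declone(reads)
print([name for name, _, _ in kept])    # ['r2', 'r3']
```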
22. Read length
●Enforce a minimum read length L
●k-mer based tools
○break reads into k-mers, so L < k is pointless
○Velvet, Trinity, Gossamer, ...
●Uniqueness
○desire reasonable uniqueness of sequence,
otherwise multiple mapping will take forever!
○L=24 is bare minimum (I reckon)
○BWA, Bowtie, Shrimp, ...
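A minimum-length filter along these lines, using the slide's L=24 rule of thumb (reads are assumed to be (name, sequence, qualities) tuples, as in the sketches above):

```python
def length_filter(reads, min_len=24):
    """Keep reads at least min_len bp long after trimming; shorter
    reads map ambiguously, and k-mer tools need L >= k anyway."""
    return [r for r in reads if len(r[1]) >= min_len]

reads = [("r1", "ACGT" * 8, None),   # 32 bp: kept
         ("r2", "ACGT" * 3, None)]   # 12 bp: dropped
print([name for name, seq, _ in length_filter(reads)])   # ['r1']
```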
23. Other strategies
●Digital normalisation
○remove low frequency k-mers
○remove reads w/ too many low freq k-mers
●Error correction
○replace low frequency k-mers with their
"closest" high-frequency k-mer
○other methods I don't understand yet
●Just use local alignment to "solve" the problem
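Both digital normalisation and error correction rest on k-mer frequencies: k-mers seen only once or twice are usually sequencing errors. A toy version of the "too many low-frequency k-mers" test (k and thresholds here are illustrative, not from the slides):

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count every k-mer across a set of reads."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def too_many_rare_kmers(seq, counts, k, min_freq=2, max_rare=2):
    """Flag a read whose k-mers are mostly rare (likely error-laden)."""
    rare = sum(1 for i in range(len(seq) - k + 1)
               if counts[seq[i:i + k]] < min_freq)
    return rare > max_rare

reads = ["ACGTACGTAC"] * 5 + ["ACGTACGAAC"]   # last read has one error
counts = kmer_counts(reads, k=5)
print(too_many_rare_kmers("ACGTACGAAC", counts, k=5))   # True
print(too_many_rare_kmers("ACGTACGTAC", counts, k=5))   # False
```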
28. The GKP FFPE BWA PE mystery
●FFPE sample
○formalin-fixed, paraffin embedded
○long term tissue archival storage method
●Sequencing
○HiSeq-2000
○22 million x 100bp paired-end reads
○quality looks good ... but not mapping!
32. Nesoni
●Implemented primarily by Paul Harrison @ VBC
○swiss army knife
○SNP calling, phylogenetics, DGE, ...
○extensible pipeline system (Python)
●We used the "nesoni clip:" module
○adaptor clipping = on
○min quality = Q10
○allow Ns = no
○min length = 2
38. Nesoni clip: summary
nesoni 0.92
FASTQ offset seems to be 33
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 22,611,994 read-pairs
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 121,236 read-1 too short after quality clip
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 483,729 read-1 too short after adaptor clip
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 22,007,029 read-1 kept
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 100.000 read-1 average input length
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 75.426 read-1 average output length
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 673,175 read-2 too short after quality clip
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 419,343 read-2 too short after adaptor clip
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 21,519,476 read-2 kept
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 100.000 read-2 average input length
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 77.201 read-2 average output length
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 21,147,301 pairs kept after clipping
(> SM_AdMP1_ID_07B26948_L004_CLIPPED 1,231,903 reads kept after clipping
started 21 November 2012 11:02 AM
finished 21 November 2012 11:43 AM
run time 0:40:06