Next-generation sequencing and SNPs. The presentation discusses the factors involved in finding real SNPs, including sequencing technology, mapping algorithms, variant calling, and filtering, drawing on experience from exome resequencing. Accurate variant detection requires optimizing each step, from sequencing through filtering out false positives and false negatives.
Face it, backticks are a pain. BASH's $() construct provides a simpler, more effective approach. This talk uses examples from automating git branches and command-line processing with getopt(1) to show how $() works in shell scripts.
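As a taste of the argument, a minimal sketch (my own, not taken from the talk's slides) of why $() beats backticks; the git line is a hypothetical illustration:

```shell
# Backticks require awkward escaping when nested; $() nests cleanly.
# A hypothetical git example in the talk's spirit might look like:
#   branch=$(git rev-parse --abbrev-ref HEAD)

# Nesting with $() -- each level reads naturally:
outer=$(echo "one $(echo "two $(echo three)")")
echo "$outer"          # one two three

# The same idea with backticks needs escaped inner backticks:
outer=`echo "one \`echo two\`"`
echo "$outer"          # one two
```

The quoting inside $() is also independent of the quoting outside it, which is what makes deep nesting practical.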
Starting with the system call getrusage(2), which returns synchronous, process-level information (mainly the maximum RSS used), this talk describes the output of getrusage, the rusage formatting utility in ProcStats, and several examples of using it to examine time and memory use.
Optional first and final outputs give baseline and total status, differencing avoids extraneous output, and user messages allow arbitrary stats and tracking content.
The combination makes this nice for tracking both long-lived and shorter, more intensive processing.
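A rough shell-level analogue of what getrusage reports (a Linux-only sketch reading /proc, not the talk's ProcStats rusage utility): VmHWM in /proc/<pid>/status is the peak resident set size, the figure getrusage mainly returns.

```shell
# Peak RSS of the current shell, in kB (Linux-only; uses /proc).
rss_kb() { awk '/^VmHWM:/ { print $2 }' /proc/$$/status; }

before=$(rss_kb)
big=$(printf 'x%.0s' $(seq 1 100000))   # do some work that allocates memory
after=$(rss_kb)

# Differencing, as the talk suggests, avoids extraneous output:
echo "peak RSS grew by $((after - before)) kB"
```

Because VmHWM is a high-water mark, the difference is zero unless the work actually pushed the peak higher, which is exactly why differencing keeps the output quiet for uninteresting intervals.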
This talk describes refactoring FindBin::libs from Perl5 to Raku: breaking the module up into functional pieces, writing the tests in Raku, and testing and releasing the module with mi6.
We have all seen repetitive code, maintained by cut+paste, that creates an object, calls a method, checks a return, calls a method, checks a return... all of it difficult to maintain because of its sheer size.
Object::Exercise replaces the pasted loops with data-driven code, the operation controlled by a data structure of methods, arguments, and expected return values. This replaces cut+paste with declarative data.
This talk describes O::E and shows a few ways to apply it for testing the MadMongers' Adventure game.
Variable interpolation is a standard way to BASH your head. This talk looks at interpolation, eval, ${} handling and "set -vx" to debug basic variable handling.
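A minimal sketch (my own, not from the talk) of the constructs it covers:

```shell
name="world"
greeting="hello, ${name}"      # ${} delimits the variable name explicitly
echo "$greeting"               # hello, world

# ${} also protects against ambiguous name boundaries:
file="report"
echo "${file}_v2.txt"          # report_v2.txt ("$file_v2" would expand to nothing)

# eval forces a second round of expansion (indirect variable access):
varname="greeting"
eval "echo \$$varname"         # hello, world

# "set -vx" echoes each command before (-v) and after (-x) expansion,
# which makes quoting and interpolation bugs visible:
# set -vx
# echo "$greeting"
# set +vx
```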
The $path to knowledge: What little it takes to unit-test Perl (Workhorse Computing)
Metadata-driven laziness, Perl, and Jenkins provide a nice mix for automated testing. With Perl, the only thing required to start testing is a file's path; from there the possibilities are endless. Using Symbol's qualify_to_ref makes it easy to validate @EXPORT and @EXPORT_OK, and knowing the path makes it easy to run "perl -wc" for diagnostics.
The beautiful thing is all of it can be lazy... er, "automated". And repeatable. And simple.
(originally presented at YAPC::Europe::2007)
No one is as critical of something as those who love it dearly. Mark Fowler has been collecting complaints from professional Perl developers for years about what warts still remain in the language when strict and warnings are turned on.
Are these problems unsolvable? A veteran Perl programmer himself, Mark attempted to solve these issues, and then turned to the experts, the people who write books on Perl and the people who maintain the perl interpreter itself, for help.
This is what he learned...
vfsStream - a better approach for file-system-dependent tests (Frank Kleine)
Have you ever been annoyed by testing classes or functions that operate on the file system? Be it tests that rely on the presence of physical files, the problem of not cleaning up correctly after the test run, or checking that your algorithm creates the correct directories and files with the correct permissions: this is for you. vfsStream to the rescue!
Perl6 regular expression ("regex") syntax has a number of improvements over the Perl5 syntax. The inclusion of grammars as first-class entities in the language makes many uses of regexes clearer, simpler, and more maintainable. This talk looks at a few improvements in the regex syntax and also at how grammars can help make regex use cleaner and simpler.
Building a Perl5 smoketest environment in Docker using CPAN::Reporter::Smoker. Includes an overview of "smoke testing" and the shell commands to construct a hybrid environment with an underlying O/S image and data volumes for /opt and /var/lib/CPAN. This allows maintaining the Perly smoke environment without having to rebuild it.
It's quite hard to write cross-platform CPAN modules, especially when you use XS to interface with C libraries. Luckily, CPAN Testers tests your modules on many platforms for you. Come see how CPAN Testers helped me to create a fully portable module.
Presented at YAPC::Europe 2011.
Enabling Biobank-Scale Genomic Processing with Spark SQL (Databricks)
With the size of genomic data doubling every seven months, existing tools in the genomic space designed for the gigabyte scale tip over when used to process the terabytes of data being made available by current biobank-scale efforts. To enable common genomic analyses at massive scale while remaining flexible enough for ad hoc analysis, Databricks and the Regeneron Genetics Center have partnered to launch an open-source project.
The project includes optimized DataFrame readers for loading genomics data formats, as well as Spark SQL functions to perform statistical tests and quality control analyses on genomic data. We discuss a variety of real-world use cases for processing genomic variant data, which represents how an individual’s genomic sequence differs from the average human genome. Two use cases we will discuss are: joint genotyping, in which multiple individuals’ genomes are analyzed as a group to improve the accuracy of identifying true variants; and variant effect annotation, which annotates variants with their predicted biological impact. Enabling such workflows on Spark follows a straightforward model: we ingest flat files into DataFrames, prepare the data for processing with common Spark SQL primitives, perform the processing on each partition or row with existing genomic analysis tools, and save the results to Delta or flat files.
Ruby is amazing. It has a huge standard library and a core chock-full of weird and wonderful things. In this talk, given at the Ipswich Ruby User Group, I give a whimsical nonstop tour through some of the more obscure parts of Ruby.
A description of the API concept in engineering and how it can be useful, particularly with respect to genomics data. Finally, an analogy to the API concept in synthetic biology, and how evolution allows encapsulation.
Slides for my talk at SkyCon'12 in Limerick.
Here I've squeezed four talks into one, covering a lot of ground quickly, so I've included links to more detailed presentations and other resources.
The fundamentals and advanced applications of Node will be covered. We will explore the design choices that make Node.js unique, how these change the way applications are built, and how systems of applications work most effectively in this model. You will learn how to create modular code that's robust, expressive, and clear, and understand when to use callbacks, event emitters, and streams.
Adding 1.21 Gigawatts to Applications with RabbitMQ, PHPNW Dec 2014 Meetup (James Titcumb)
As your application grows, you soon realise you need to break it up into smaller chunks that talk to each other. You could just use web services to interact, or you could take a more robust approach and use the message broker RabbitMQ. In this talk, we will look at techniques you can use to vastly enhance inter-application communication, learn the core concepts of RabbitMQ, cover how you can scale different parts of your application separately, and modernise your development using a message-oriented architecture.
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi... (QIAGEN)
This slidedeck discusses the most biologically efficient, cost-effective method for successful NGS. The GeneRead DNA QuantiMIZE Kits enable determination of the optimum conditions for targeted enrichment of DNA isolated from biological samples, while the GeneRead DNAseq Panels V2 allow you to quickly and reliably deep sequence your genes of interest. Applications in translational and clinical research are highlighted.
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module... (Jan Aerts)
Presentation at BOSC2012 by J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module for distributed analysis of large-scale biological data
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu... (Jan Aerts)
Presentation at BOSC2012 by P Rocca-Serra - The open source ISA metadata tracking framework: from data curation and management at the source, to the linked data universe
2. Aim
To identify the SNP that causes a disease phenotype:
– Find them all, so you don't miss it (false negatives)
– Not find too many, so it's useful (false positives)
3. General principle
Map reads to the reference sequence
Convert from read-based to base-based (i.e. pileup)
Look at differences
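The map, pileup, diff principle can be sketched with a toy, self-contained example (the pileup-style table is invented inline here; a real pipeline would generate it with a tool such as samtools mpileup):

```shell
# Toy pileup-style table: position, reference base, observed read bases.
printf '%s\n' \
  '100 A AAAAA' \
  '101 C CCTCC' \
  '102 G GGGGG' \
  '103 T TTCTT' > pileup.txt

# Report positions where any read base differs from the reference:
awk '{
    diff = 0
    for (i = 1; i <= length($3); i++)
        if (substr($3, i, 1) != $2) diff++
    if (diff > 0)
        printf "%s ref=%s nonref=%d/%d\n", $1, $2, diff, length($3)
}' pileup.txt
# 101 ref=C nonref=1/5
# 103 ref=T nonref=1/5
```

Whether those two candidate sites are real SNPs or sequencing errors is exactly what the rest of the deck (calling and filtering) is about.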
4. This presentation
Factors in finding real SNPs:
– Sequencing technology
– Mapping algorithms and initial calling
– Post-mapping tweaking
– Calling
– Filtering
Based on experiences in exome resequencing; "experiment 5" on the last slide (Thomas)
5. 1. Sequencing
• Provides the raw data
• Different technologies:
– Different accuracy (critical!)
– Different types of errors
8. Homopolymer runs
• Especially 454: 39% of errors are homopolymers
• A5 motifs: 3.3% error rate
• A8 motifs: 50% error rate!
Reason: signal intensity is used as a measure of homopolymer length
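The homopolymer-run length that drives these error rates (and that the HRun annotation later in the deck reports) is easy to compute; a small sketch with an invented sequence:

```shell
# Longest homopolymer run in a sequence; 454-style error rates grow
# sharply with this length (A5 vs A8 above).
seq="GATTACAAAAAGTC"
awk -v s="$seq" 'BEGIN {
    best = run = 1
    for (i = 2; i <= length(s); i++) {
        if (substr(s, i, 1) == substr(s, i - 1, 1)) run++; else run = 1
        if (run > best) best = run
    }
    print best
}'
# prints 5 (the AAAAA run)
```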
29. VCF file
##fileformat=VCFv3.3
##FILTER=DP,"DP < 3 || DP > 1200"
##FILTER=QUAL,"QUAL < 25.0"
##FILTER=SnpCluster,"SNPs found in clusters"
##FORMAT=DP,1,Integer,"Read Depth"
##FORMAT=GQ,1,Integer,"Genotype Quality"
##FORMAT=GT,1,String,"Genotype"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=DB,0,Flag,"dbSNP Membership"
##INFO=DP,1,Integer,"Total Depth"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=HaplotypeScore,1,Float,"Consistency of the site with two (and only two) segregating haplotypes"
##INFO=LowMQ,3,Integer,"3-tuple: <fraction of reads with MQ=0>,<fraction of reads with MQ<=10>,<total number of reads>"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##annotatorReference=human_b36_plus.fasta
##reference=human_b36_plus.fasta
##source=VariantAnnotator
##source=VariantFiltration
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a_a:bwa057_b:picard.bam
1 856182 rs9988021 G A 36.00 0;TARGET DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00
1 866362 rs4372192 A G 45.00 0;TARGET DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE GT:DP:GQ 1/1:6:45.00
. . .
30. VCF file
(The same VCF file as on slide 29, annotated to mark its three parts: the "##" header lines, the "#CHROM" column-header line, and the data records.)
32. Pileup => VCF
Custom scripts, then annotate:
java -Xmx10g \
  -jar GenomeAnalysisTK.jar \
  -T VariantAnnotator \
  --assume_single_sample_reads sample \
  -R human_b36_plus.fasta \
  -D dbsnp_129_b36_plus.rod \
  -I input.bam \
  -B variant,VCF,unannotated.vcf \
  -o annotated.vcf \
  -A AlleleBalance \
  -A MappingQualityZero \
  -A LowMQ \
  -A RMSMappingQuality \
  -A HaplotypeScore \
  -A QualByDepth \
  -A DepthOfCoverage \
  -A HomopolymerRun
33. 5. Filtering
• Aim: reduce the number of false positives
• Options:
– Depth of coverage
– Mapping quality
– SNP clusters
– Allelic balance
– Number of reads with MQ0
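A minimal sketch of depth and quality filtering, using the DP and QUAL thresholds from the VCF header earlier in the deck (DP < 3 || DP > 1200; QUAL < 25.0). The two data lines are abbreviated from that example file; a real pipeline would use VariantFiltration or bcftools rather than awk:

```shell
cat > calls.vcf <<'EOF'
#CHROM POS ID REF ALT QUAL FILTER INFO
1 856182 rs9988021 G A 36.00 . DP=3;MQ=60.00
1 866362 rs4372192 A G 12.00 . DP=2;MQ=60.00
EOF

awk '/^#/ { next }
{
    qual = $6 + 0
    dp = 0
    if (match($8, /DP=[0-9]+/))
        dp = substr($8, RSTART + 3, RLENGTH - 3) + 0
    if (dp < 3 || dp > 1200) filt = "DP"
    else if (qual < 25.0)    filt = "QUAL"
    else                     filt = "PASS"
    print $1, $2, $3, filt
}' calls.vcf
# 1 856182 rs9988021 PASS
# 1 866362 rs4372192 DP
```

Note that filtering marks records rather than deleting them, so downstream steps can still see why a call was rejected.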
41. Indels
Still trickier than SNPs
– samtools / dindel / GATK
– Sample of 10 individuals, on average per individual:
• 2 novel functional high-quality SNPs
• 18 novel functional high-quality indels
"I trust manual interpretation of the reads more than the basic quality parameters we use"
44. Conclusions
Different tools exist and new ones are being created
Best to combine (intersect) the results from different pipelines
The Genome Analysis Toolkit (GATK) provides useful BAM-file processing tools:
– Realignment around indels
– Base quality recalibration
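"Combine (intersect) the results" can be done with standard tools once each pipeline's calls are reduced to a sorted list of chrom:pos keys (the lists below are invented for illustration):

```shell
# One sorted variant list per caller; comm -12 keeps only shared calls.
sort > caller_a.txt <<'EOF'
1:856182
1:866362
2:104532
EOF
sort > caller_b.txt <<'EOF'
1:856182
2:104532
2:998877
EOF

comm -12 caller_a.txt caller_b.txt
# -> 1:856182 and 2:104532, the calls both pipelines agree on
```

This assumes both pipelines call against the same reference build and normalize variant representations identically; otherwise identical variants can fail to match on position alone.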
45. Use in resequencing
• Identify SNPs/indels
• Consequences (loss of function?)
• Prevalence in cases/controls
• Model:
– Dominant: any het
– Recessive: hom nonref or compound het
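The dominant and recessive models above can be sketched as a genotype classifier (a hypothetical helper; GT strings as in the VCF example earlier):

```shell
# Classify a single VCF GT field against the two inheritance models.
classify() {
    case "$1" in
        0/0) echo "hom ref" ;;
        0/1|1/0) echo "het (fits dominant model)" ;;
        1/1) echo "hom nonref (fits recessive model)" ;;
        *)   echo "other" ;;
    esac
}

classify 0/1   # het (fits dominant model)
classify 1/1   # hom nonref (fits recessive model)
```

A compound het cannot be seen in one GT field: it needs two het variants aggregated at the gene level (ideally with phasing to confirm they hit different copies).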
46. References
• Chan E. In: Single Nucleotide Polymorphisms, Methods in Molecular Biology 578 (2009)
• McKenna et al. Genome Res 20:1297-1303 (2010)
• Li H & Durbin R. Bioinformatics 25:1754-1760 (2009)
• Li H et al. Bioinformatics 25:2078-2079 (2009)
• Li H et al. Genome Res 18:1851-1858 (2008)