Next-generation sequencing and SNPs
Jan Aerts
Wellcome Trust Sanger Institute
jan.aerts@gmail.com
Aim
To identify the SNP that causes a disease or phenotype
  – Find them all, so you don’t miss it (avoid false negatives)
  – Don’t find too many, so the list stays useful (avoid false positives)
General principle
Map reads to reference sequence
Convert from read-based to base-based
 (i.e. pileup)
Look at differences
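The three steps above can be sketched on toy data. This is a minimal illustration, assuming ungapped reads given as hypothetical (start, sequence) pairs with 0-based coordinates; real mappers and pileup tools also handle indels, strands and qualities.

```python
from collections import defaultdict

def pileup(reads):
    """Convert read-based data into base-based columns.

    reads: iterable of (start, sequence) pairs, 0-based, ungapped (toy assumption).
    """
    columns = defaultdict(list)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            columns[start + offset].append(base)
    return columns

def differences(reference, columns):
    """Yield (position, ref_base, observed_bases) where any read disagrees."""
    for pos in sorted(columns):
        ref = reference[pos]
        if any(base != ref for base in columns[pos]):
            yield pos, ref, "".join(columns[pos])

reference = "ACGTACGT"
reads = [(0, "ACGT"), (2, "GTAC"), (3, "TTCG")]  # third read has a mismatch at position 4
candidate_sites = list(differences(reference, pileup(reads)))
```

Here `candidate_sites` holds the single position where a read differs from the reference; deciding whether it is a real SNP is what the rest of the talk is about.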
This presentation
Factors in finding real SNPs
   – Sequencing technology
   – Mapping algorithms and initial calling
   – Post-mapping tweaking
   – Calling
   – Filtering

Based on experiences in exome resequencing
  (“experiment 5” on Thomas’s last slide)
1. Sequencing
• Provides raw data
• Different technologies
  Different accuracy (critical!)
  Different types of errors
Accuracy
Base quality drops along the read
Sanger > SOLiD > Illumina > 454 > Helicos
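Base quality is conventionally expressed on the Phred scale, Q = -10 log10(p), where p is the estimated probability that the base call is wrong. A small sketch of the conversion follows; the ASCII offset of 33 is an assumption (Sanger-style encoding) and the function names are illustrative only.

```python
import math

def phred_to_error(q):
    """Error probability for a Phred quality score."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred quality score (rounded) for an error probability."""
    return round(-10 * math.log10(p))

def decode_qualities(quality_string, offset=33):
    """Decode an ASCII quality string; offset 33 is assumed here."""
    return [ord(char) - offset for char in quality_string]
```

For example, Q20 corresponds to a 1% chance of a wrong base and Q30 to 0.1%.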
Base calling errors
Main source of error for Illumina, less in
  SOLiD & 454
Homopolymer runs
• Especially 454
  39% of errors are homopolymers
    • A5 motifs: 3.3% error rate
    • A8 motifs: 50% error rate!
    Reason: use signal intensity as a measure for
      homopolymer length
Is it 4? Is it 5? Is it 4?
Consensus accuracy
Increase accuracy for SNP calling by
  increasing coverage
  – Illumina: 20X
  – SOLiD: 12X
  – 454: 7.4X
  – Sanger: 3X

Factors: raw accuracy + read length
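Why coverage helps can be illustrated with a deliberately simple model: assume every read at a site is wrong independently with the same probability, and ask how often more than half of them are wrong. This is an assumption-laden sketch, not how the coverage figures above were derived.

```python
from math import comb

def consensus_error(coverage, base_error):
    """Probability that a strict majority of reads at a site are wrong.

    Assumes independent errors with one shared error rate (toy model).
    """
    first_majority = coverage // 2 + 1
    return sum(
        comb(coverage, k) * base_error**k * (1 - base_error) ** (coverage - k)
        for k in range(first_majority, coverage + 1)
    )
```

Under this model a 1% raw error rate gives roughly a 0.03% consensus error at 3X, and it keeps dropping as coverage grows. Real data have correlated errors, so more coverage is needed in practice.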
2. Mapping: fastq => bam
• Maq and bwa: report only 1 mapping per read
  If multiple equally good mappings exist: mapQ = 0
  In contrast, mosaik & mrFAST report the alternatives
• Maq and bwa: use paired-end
  information => might prefer correct
  distance over correct alignment
3. Post-mapping tweaking
Improve quality of mapped data:
  –   duplicate removal
  –   baseQ recalibration
  –   read clipping
  –   local realignment around indels

Genome Analysis Toolkit (GATK)
 http://bit.ly/9zIn4b
Duplicate removal
PCR amplification bias
 multiple reads with same start/stop =>
 keep only one (with highest mapping Q)
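The rule on this slide can be sketched as follows. The read records are hypothetical dictionaries, and this is not the actual algorithm of Picard or samtools, which use their own criteria for picking the read to keep.

```python
def remove_duplicates(reads):
    """Keep one read per (chrom, start, end, strand): the one with highest mapQ."""
    best = {}
    for read in reads:
        key = (read["chrom"], read["start"], read["end"], read["strand"])
        if key not in best or read["mapq"] > best[key]["mapq"]:
            best[key] = read
    return list(best.values())

reads = [
    {"name": "r1", "chrom": "1", "start": 100, "end": 175, "strand": "+", "mapq": 30},
    {"name": "r2", "chrom": "1", "start": 100, "end": 175, "strand": "+", "mapq": 60},
    {"name": "r3", "chrom": "1", "start": 120, "end": 195, "strand": "+", "mapq": 20},
]
kept = [read["name"] for read in remove_duplicates(reads)]
```

Reads r1 and r2 share start and stop, so only r2 (the higher mapping quality) survives alongside r3.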
Picard:
java -Xmx2048m \
  -jar /path_to_picardtools/MarkDuplicates.jar \
  INPUT=input.bam \
  OUTPUT=output.bam \
  METRICS_FILE=output.metrics \
  VALIDATION_STRINGENCY=LENIENT

samtools:
samtools rmdup input.bam output.bam
baseQ recalibration
• Why?
  – correct for variation in quality with machine
    cycle, sequence context, lane, baseQ…
• Steps:
  – Identify what to correct for (create plots)
  – Calculate covariates
  – Apply covariates
  – Check (create plots)
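The idea behind the "calculate" and "apply" steps can be sketched with a single covariate, the reported quality itself. This is a simplified illustration, not GATK's implementation: the input format is hypothetical, mismatches are assumed to be counted at sites not known to be variant, and the +1/+2 smoothing is an assumption.

```python
import math
from collections import defaultdict

def build_recal_table(observations):
    """Map each reported quality to an empirical quality.

    observations: iterable of (reported_q, is_mismatch) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [observations, mismatches]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += 1
        counts[reported_q][1] += int(is_mismatch)
    table = {}
    for reported_q, (total, mismatches) in counts.items():
        error_rate = (mismatches + 1) / (total + 2)  # smoothed estimate (assumption)
        table[reported_q] = round(-10 * math.log10(error_rate))
    return table

def recalibrate(qualities, table):
    """Replace reported qualities by empirical ones where the table has an entry."""
    return [table.get(q, q) for q in qualities]
```

If bases reported as Q30 mismatch the reference about 1% of the time, the table maps 30 to 20, and every Q30 base is rewritten as Q20.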
java -Xmx4g -jar GenomeAnalysisTK.jar \
  -l INFO \
  -R resources/Homo_sapiens_assembly18.fasta \
  --DBSNP resources/dbsnp_129_hg18.rod \
  -I my_reads.bam \
  -T CountCovariates \
  -cov ReadGroupCovariate \
  -cov QualityScoreCovariate \
  -cov DinucCovariate \
  -recalFile my_reads.recal_data.csv

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -l INFO \
  -R resources/Homo_sapiens_assembly18.fasta \
  -I my_reads.bam \
  -T TableRecalibration \
  -outputBam my_reads.recal.bam \
  -recalFile my_reads.recal_data.csv
Read clipping
Remove:
 – low quality strings of bases
 – sections of reads
 – reads containing user-provided sequences
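As an illustration of the first bullet, a minimal sketch that clips a low-quality tail from a read. The threshold of 20 and the function name are assumptions chosen for the example; actual clipping tools offer more options.

```python
def clip_low_quality_tail(seq, quals, min_q=20):
    """Trim trailing bases whose quality is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

clipped = clip_low_quality_tail("ACGTAC", [30, 30, 30, 30, 10, 5])
```

The last two bases (qualities 10 and 5) are removed; a read with only low-quality bases would be clipped to empty.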
Local realignment near indels
java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R /path/to/reference.fasta \
  -o /path/to/output.intervals

java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \
  -jar /path/to/GenomeAnalysisTK.jar \
  -I input.bam \
  -R ref.fasta \
  -T IndelRealigner \
  -targetIntervals /path/to/output.intervals \
  -o realignedBam.bam
4. SNP calling
• Different callers:
  – samtools
  – GATK UnifiedGenotyper
  – SOAPsnp
  –…
• Read-based => base-based
pileup

1   272   T   24   ,.$.....,,.,.,...,,,.,..^+.   <<<+;<<<<<<<<<<<=<;<;7<&
1   273   T   23   ,.....,,.,.,...,,,.,..A       <<<;<<<<<<<<<3<=<<<;<<+
1   274   T   23   ,.$....,,.,.,...,,,.,...      7<7;<;<<<<<<<<<=<;<;<<6
1   275   A   23   ,$....,,.,.,...,,,.,...^l.    <+;9*<<<<<<<<<=<<:;<<<<
1   276   G   22   ...T,,.,.,...,,,.,....        33;+<<7=7<<7<&<<1;<<6<
1   277   T   22   ..CCggC,C,.C.,,CC,..g.        +7<;<<<<<<<&<=<<:;<<&<
1   278   G   23   ....,,.,.,...,,,.,....^k.     %38*<<;<7<<7<=<<<;<<<<<
1   279   C   23   A..T,,.,.,...,,,.,.....       ;75&<<<<<<<<<=<<<9<<:<<
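The read-bases column of a pileup line can be tallied with a short parser. In this column `.` and `,` mean a match to the reference on the forward and reverse strand, letters are mismatches, `^` plus the following character marks a read start, `$` marks a read end, and `+n`/`-n` introduce indels. The sketch below is illustrative and ignores the quality column.

```python
from collections import Counter

def count_bases(ref, bases):
    """Count observed bases in the read-bases column of one pileup line."""
    counts = Counter()
    i = 0
    while i < len(bases):
        char = bases[i]
        if char == "^":        # read start; the next character encodes mapping quality
            i += 2
        elif char == "$":      # read end
            i += 1
        elif char in "+-":     # indel: sign, length digits, then that many bases
            i += 1
            digits = ""
            while bases[i].isdigit():
                digits += bases[i]
                i += 1
            i += int(digits)
        else:
            if char in ".,":
                counts[ref.upper()] += 1
            elif char == "*":  # placeholder for a deleted base
                counts["*"] += 1
            else:
                counts[char.upper()] += 1
            i += 1
    return counts

site_277 = count_bases("T", "..CCggC,C,.C.,,CC,..g.")
```

For position 277 above this gives 12 reads supporting the reference T, 7 supporting C and 3 supporting G, which adds up to the depth of 22 and is exactly the kind of column a SNP caller has to interpret.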
GATK:
java \
  -Xmx6g \
  -jar /path_to/GenomeAnalysisTK.jar \
  -l INFO \
  -R human_b36_plus.fasta \
  -I input.bam \
  -T UnifiedGenotyper \
  --heterozygosity 0.001 \
  -pl Solexa \
  -varout output.vcf \
  -vf VCF \
  -mbq 20 \
  -mmq 10 \
  -stand_call_conf 30.0 \
  --DBSNP dbsnp_129_b36_plus.rod
samtools:
samtools pileup \
  -vcs \
  -r 0.001 \
  -l CCDS.txt \
  -f human_b36_plus.fasta \
  input.bam \
  > output.pileup
VCF file
##fileformat=VCFv3.3
##FILTER=DP,"DP < 3 || DP > 1200"
##FILTER=QUAL,"QUAL < 25.0"
##FILTER=SnpCluster,"SNPs found in clusters"
##FORMAT=DP,1,Integer,"Read Depth"
##FORMAT=GQ,1,Integer,"Genotype Quality"
##FORMAT=GT,1,String,"Genotype"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=DB,0,Flag,"dbSNP Membership"
##INFO=DP,1,Integer,"Total Depth"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=HaplotypeScore,1,Float,"Consistency of the site with two (and only two) segregating haplotypes"
##INFO=LowMQ,3,Integer,"3-tuple: <fraction of reads with MQ=0>,<fraction of reads with MQ<=10>,<total nubmer of reads>"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##annotatorReference=human_b36_plus.fasta
##reference=human_b36_plus.fasta
##source=VariantAnnotator
##source=VariantFiltration
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a_a:bwa057_b:picard.bam
1 856182 rs9988021 G A 36.00 0;TARGET DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00
1 866362 rs4372192 A G 45.00 0;TARGET DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE          GT:DP:GQ 1/1:6:45.00
. . .
VCF file layout
  – lines starting with ## : header
  – line starting with #CHROM : column header
  – remaining lines : data
VCF file

INFO
DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE
DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE




FORMAT     a_a:bwa057_b:picard.bam
GT:DP:GQ   1/1:3:36.00
GT:DP:GQ   1/1:6:45.00
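The INFO and FORMAT/sample columns shown above are easy to take apart. A minimal sketch with illustrative function names: INFO entries are separated by `;`, entries without `=` are flags, and the sample column follows the key order given in FORMAT. Values are left as strings.

```python
def parse_info(info):
    """Split an INFO column into a dict; flags such as DB become True."""
    fields = {}
    for item in info.split(";"):
        key, separator, value = item.partition("=")
        fields[key] = value if separator else True
    return fields

def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with the values of one sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))

info = parse_info("DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE")
sample = parse_sample("GT:DP:GQ", "1/1:3:36.00")
```

For the first record this yields a dbSNP flag, a depth of 3 and a homozygous non-reference genotype (1/1).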
Pileup => VCF
Custom scripts, then annotate
java -Xmx10g \
  -jar GenomeAnalysisTK.jar \
  -T VariantAnnotator \
  --assume_single_sample_reads sample \
  -R human_b36_plus.fasta \
  -D dbsnp_129_b36_plus.rod \
  -I input.bam \
  -B variant,VCF,unannotated.vcf \
  -o annotated.vcf \
  -A AlleleBalance \
  -A MappingQualityZero \
  -A LowMQ \
  -A RMSMappingQuality \
  -A HaplotypeScore \
  -A QualByDepth \
  -A DepthOfCoverage \
  -A HomopolymerRun
5. Filtering
• Aim: to reduce number of false positives
• Options:
  – Depth of coverage
  – Mapping quality
  – SNP clusters
  – Allelic balance
  – Number of reads with mq0
java \
  -Xmx4g \
  -jar GenomeAnalysisTK.jar \
  -T VariantFiltration \
  -R human_b36_plus.fasta \
  -o output.vcf \
  -B variant,VCF,input.vcf \
  --clusterWindowSize 10 \
  --filterExpression 'DP < 3 || DP > 1200' \
  --filterName 'DP' \
  --filterExpression 'QUAL < #{qual_cutoff}' \
  --filterName 'QUAL' \
  --filterExpression 'AB > 0.75 && DP > 40' \
  --filterName 'AB'
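What the three filter expressions do to a single record can be mimicked in a few lines. This is a sketch only: it does not reproduce how GATK evaluates its expressions, the record is a hypothetical dictionary with numeric values, and the default cutoff of 25.0 is taken from the QUAL filter line in the VCF header shown earlier.

```python
def failed_filters(record, qual_cutoff=25.0):
    """Return the names of the filters a variant record fails."""
    failed = []
    depth = record["DP"]
    if depth < 3 or depth > 1200:
        failed.append("DP")
    if record["QUAL"] < qual_cutoff:
        failed.append("QUAL")
    allele_balance = record.get("AB")  # only annotated for heterozygous calls
    if allele_balance is not None and allele_balance > 0.75 and depth > 40:
        failed.append("AB")
    return failed
```

A record with an empty result passes; a record failing several filters gets all their names, which mirrors the FILTER column of the VCF file.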
Filtering - QC metrics (1)
Transition/transversion ratio
  – Random: Ti/Tv = 0.5
  – Whole genome: 2.0-2.1
  – Exome: 3-3.5
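The ratio is simple to compute from a list of (ref, alt) pairs, and the random expectation of 0.5 follows from counting: of the 12 possible substitutions, 4 are transitions (A<->G, C<->T) and 8 are transversions. A minimal sketch with illustrative names:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """Transition: purine <-> purine or pyrimidine <-> pyrimidine."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES

def ti_tv_ratio(snps):
    """Ti/Tv for a list of (ref, alt) pairs; needs at least one transversion."""
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    return transitions / transversions

all_substitutions = [(ref, alt) for ref in "ACGT" for alt in "ACGT" if ref != alt]
random_expectation = ti_tv_ratio(all_substitutions)
```

A call set whose Ti/Tv sits well below the expected range for its target is likely to contain many false positives.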
Filtering - QC metrics (2)
Number of novel SNPs
  Exome: total 20k-25k; novel 1k-3k
Combining discovery pipelines
• Mapper: MAQ/bwa/stampy/…
• BaseQ recalibration? Local
  realignment?
• SNP caller: GATK/samtools/SOAPsnp
• Priors for SNP calling: heterozygosity
  (whole genome, exome, dbSNP)
• Filtering
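One simple way to combine the output of different pipelines is to intersect their call sets. A sketch, assuming each call is a hypothetical dictionary and that two calls are "the same" when chromosome, position, reference and alternate allele all agree:

```python
def call_key(call):
    """Identity of a SNP call for comparison across pipelines."""
    return (call["chrom"], call["pos"], call["ref"], call["alt"])

def intersect_calls(*call_sets):
    """Keys of the calls reported by every pipeline (needs at least one set)."""
    keys = [set(map(call_key, calls)) for calls in call_sets]
    return set.intersection(*keys)

pipeline_a = [
    {"chrom": "1", "pos": 856182, "ref": "G", "alt": "A"},
    {"chrom": "1", "pos": 866362, "ref": "A", "alt": "G"},
]
pipeline_b = [
    {"chrom": "1", "pos": 866362, "ref": "A", "alt": "G"},
    {"chrom": "1", "pos": 900000, "ref": "C", "alt": "T"},
]
shared = intersect_calls(pipeline_a, pipeline_b)
```

Intersecting trades sensitivity for specificity: only calls that every pipeline agrees on are kept.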
Combining discovery pipelines
[Figure: ROC plot; x-axis: false positives, y-axis: true positives]
Combining discovery pipelines
[Figure: plot comparing “combinations” of pipelines with “single” pipelines; an arrow marks the “better” direction]
Indels
Still more tricky than SNPs
  – samtools/dindel/GATK
  – Sample of 10 individuals: on average per
    individual:
     • 2 novel functional high-quality SNPs
     • 18 novel functional high-quality indels

               “I trust manual interpretation of the reads more
               than the basic quality parameters we use”
4 snp_1         STOP_GAINED
1 snp_2         STOP_LOST
1 snp_3         STOP_GAINED
1 snp_4         ESSENTIAL_SPLICE_SITE,INTRONIC
1 snp_5         ESSENTIAL_SPLICE_SITE,INTRONIC
2 snp_6         STOP_GAINED
2 snp_7         STOP_GAINED
1 snp_8         STOP_GAINED
1 snp_9         STOP_GAINED
1 snp_10        STOP_GAINED
1 snp_11        STOP_GAINED
1 snp_12        STOP_GAINED
1 snp_13        STOP_LOST
1 snp_14        STOP_GAINED

           178 indels    FRAMESHIFT_CODING
Conclusions
Different tools exist and new ones keep being created
Best to combine (intersect) the results from
  different pipelines
Genome Analysis ToolKit (GATK) provides
  useful bam-file processing tools:
  – Realignment around indels
  – Base quality recalibration
Use in resequencing
•   Identify SNPs/indels
•   Consequences (loss-of-function?)
•   Prevalence in cases/controls
•   Model:
    – Dominant: any het
    – Recessive: homozygous non-reference or compound het
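The two models can be written as a check on the genotypes one individual carries within one gene. This is a sketch under assumptions: genotypes are VCF-style strings such as 0/1, and without phasing "compound het" is approximated as two or more heterozygous variants in the same gene.

```python
def alleles(genotype):
    """Split a VCF-style genotype such as '0/1' or '1|1' into its alleles."""
    return genotype.replace("|", "/").split("/")

def is_het(genotype):
    first, second = alleles(genotype)
    return first != second

def is_hom_nonref(genotype):
    first, second = alleles(genotype)
    return first == second and first not in ("0", ".")

def fits_model(genotypes, model):
    """genotypes: the calls of one individual at the variants of one gene."""
    if model == "dominant":
        return any(is_het(gt) for gt in genotypes)
    if model == "recessive":
        het_count = sum(is_het(gt) for gt in genotypes)
        return any(is_hom_nonref(gt) for gt in genotypes) or het_count >= 2
    raise ValueError(f"unknown model: {model}")
```

Running this per gene over cases and controls gives the prevalence counts mentioned above.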
References
• Chan E. In: Single Nucleotide Polymorphisms,
  Methods in Molecular Biology 578 (2009)
• McKenna et al. Genome Res 20:1297-1303 (2010)
• Li H & Durbin R. Bioinformatics 25:1754-1760 (2009)
• Li H et al. Bioinformatics 25:2078-2079 (2009)
• Li H et al. Genome Res 18:1851-1858 (2008)
Questions?

ECCB10 talk - Nextgen sequencing and SNPs

  • 1. Next-generation sequencing and SNPs Jan Aerts Wellcome Trust Sanger Institute jan.aerts@gmail.com
  • 2. Aim To identify the SNP that causes a disease or phenotype – Find them all, so you don’t miss it (avoid false negatives) – Don’t find too many, so the list stays useful (avoid false positives)
  • 3. General principle Map reads to reference sequence Convert from read-based to base-based (i.e. pileup) Look at differences
  • 4. This presentation Factors in finding real SNPs – Sequencing technology – Mapping algorithms and initial calling – Post-mapping tweaking – Calling – Filtering Based on experiences in exome resequencing; “experiment 5” on last slide Thomas
  • 5. 1. Sequencing • Provides raw data • Different technologies Different accuracy (critical!) Different types of errors
  • 6. Accuracy Base quality drops along read Sanger > SOLiD > Illumina > 454 > Helicos
  • 7. Base calling errors Main source of error for Illumina, less in SOLiD & 454
  • 8. Homopolymer runs • Especially 454 39% of errors are homopolymers • A5 motifs: 3.3% error rate • A8 motifs: 50% error rate! Reason: use signal intensity as a measure for homopolymer length
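To make the homopolymer problem concrete, here is a minimal Python sketch that reports the longest run of identical bases in a read. The function name `longest_homopolymer` is illustrative only and is not part of any tool mentioned in this talk.

```python
from itertools import groupby


def longest_homopolymer(seq):
    """Return (base, length) for the longest run of identical bases in seq."""
    best_base, best_len = "", 0
    for base, run in groupby(seq.upper()):
        length = sum(1 for _ in run)
        if length > best_len:
            best_base, best_len = base, length
    return best_base, best_len
```

Reads containing long runs (the A5 and A8 motifs above) could then be flagged for closer inspection.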
  • 10. Is it 4? Is it 5? Is it 4?
  • 11. Consensus accuracy Increase accuracy for SNP calling by increasing coverage – Illumina: 20X – SOLiD: 12X – 454: 7.4X – Sanger: 3X Factors: raw accuracy + read length
  • 12. 2. Mapping: fastq => bam • Maq and bwa: report only 1 mapping per read; if a read maps to multiple locations, mapQ = 0. In contrast, mosaik & mrFAST report the alternative mappings • Maq and bwa: use paired-end information => might prefer correct distance over correct alignment
  • 13. 3. Post-mapping tweaking Improve quality of mapped data: – duplicate removal – baseQ recalibration – read clipping – local realignment around indels Genome Analysis Toolkit (GATK) http://bit.ly/9zIn4b
  • 14. Duplicate removal PCR amplification bias multiple reads with same start/stop => keep only one (with highest mapping Q)
  • 15. Picard: java -Xmx2048m -jar /path_to_picardtools/MarkDuplicates.jar INPUT=input.bam OUTPUT=output.bam METRICS_FILE=output.metrics VALIDATION_STRINGENCY=LENIENT
       samtools: samtools rmdup input.bam output.bam
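The rule from the duplicate-removal slide (same start/stop, keep the read with the highest mapping quality) can be sketched in a few lines of Python. This is not how Picard or samtools implement it; the dictionary layout of a read (`chrom`, `start`, `end`, `mapq`) is an assumption made for illustration.

```python
def remove_duplicates(reads):
    """Keep one read per (chrom, start, end), preferring the highest mapQ.

    Each read is a dict with keys chrom, start, end and mapq
    (a hypothetical layout, not a real BAM record).
    """
    best = {}
    for read in reads:
        key = (read["chrom"], read["start"], read["end"])
        if key not in best or read["mapq"] > best[key]["mapq"]:
            best[key] = read
    return list(best.values())
```

Real tools also take strand and mate position into account, which this sketch ignores.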
  • 16. baseQ recalibration • Why? – correct for variation in quality with machine cycle, sequence context, lane, baseQ… • Steps: – Identify what to correct for (create plots) – Calculate covariates – Apply covariates – Check (create plots)
  • 18. java -Xmx4g -jar GenomeAnalysisTK.jar -l INFO -R resources/Homo_sapiens_assembly18.fasta --DBSNP resources/dbsnp_129_hg18.rod -I my_reads.bam -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov DinucCovariate -recalFile my_reads.recal_data.csv
  • 19. java -Xmx4g -jar GenomeAnalysisTK.jar -l INFO -R resources/Homo_sapiens_assembly18.fasta -I my_reads.bam -T TableRecalibration -outputBam my_reads.recal.bam -recalFile my_reads.recal_data.csv
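The idea behind the two recalibration steps can be illustrated with a simplified Python sketch: count mismatches per covariate bin, then convert the observed error rate into an empirical Phred quality. This is not GATK's algorithm. The choice of covariates (reported quality and machine cycle), the add-one smoothing, and the function name are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Compute an empirical Phred quality per (reported_q, cycle) bin.

    observations: iterable of (reported_q, cycle, is_mismatch) tuples,
    ideally taken from sites that are not known SNPs.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total]
    for reported_q, cycle, is_mismatch in observations:
        bin_counts = counts[(reported_q, cycle)]
        bin_counts[0] += int(is_mismatch)
        bin_counts[1] += 1

    table = {}
    for key, (mismatches, total) in counts.items():
        # Add-one smoothing (an assumption) avoids infinite qualities
        # in bins without any mismatch.
        error_rate = (mismatches + 1) / (total + 2)
        table[key] = round(-10 * math.log10(error_rate))
    return table
```

Applying the table would then mean replacing each base's reported quality with the empirical value of its bin.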
  • 20. Read clipping Remove: – low quality strings of bases – sections of reads – reads containing user-provided sequences
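As an example of the first kind of clipping (low-quality strings of bases), the following Python sketch trims the 3' end of a read until it reaches a base at or above a quality threshold. The threshold of 20 and the function name are assumptions, not values from the talk.

```python
def clip_low_quality_tail(seq, quals, min_q=20):
    """Trim bases from the 3' end until one has quality >= min_q.

    seq is the read sequence, quals a list of integer Phred scores
    of the same length.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]
```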
  • 23. java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /path/to/reference.fasta -o /path/to/output.intervals
       java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir -jar /path/to/GenomeAnalysisTK.jar -I input.bam -R ref.fasta -T IndelRealigner -targetIntervals /path/to/output.intervals -o realignedBam.bam
  • 24. 4. SNP calling • Different callers: – samtools – GATK UnifiedGenotyper – SOAPsnp –… • Read-based => base-based
  • 25. pileup
       1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
       1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
       1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
       1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
       1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
       1 277 T 22 ..CCggC,C,.C.,,CC,..g. +7<;<<<<<<<&<=<<:;<<&<
       1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
       1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
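The read-bases column of a pileup line is what turns read-based data into base-based counts. A minimal Python sketch of that conversion follows, assuming the samtools pileup conventions: `.` and `,` match the reference on the forward and reverse strand, `^` plus one character marks a read start, `$` marks a read end, `+n`/`-n` introduce indels, and `*` is a deletion placeholder. The function name is hypothetical.

```python
import re
from collections import Counter


def count_pileup_bases(ref, bases):
    """Count base calls in the read-bases column of one pileup line."""
    # Drop read-start markers (^ plus a mapping-quality character)
    # first, then read-end markers ($).
    bases = re.sub(r"\^.", "", bases).replace("$", "")
    counts = Counter()
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch in "+-":
            # Indel written as +<len><seq> or -<len><seq>: skip it.
            length = re.match(r"\d+", bases[i + 1:]).group()
            i += 1 + len(length) + int(length)
            continue
        if ch in ".,":
            counts[ref.upper()] += 1
        elif ch != "*":
            counts[ch.upper()] += 1
        i += 1
    return counts
```

For position 277 in the pileup above (reference T), this gives 12 reference calls, 7 C and 3 G, which adds up to the reported depth of 22.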
  • 27. GATK: java -Xmx6g -jar /path_to/GenomeAnalysisTK.jar -l INFO -R human_b36_plus.fasta -I input.bam -T UnifiedGenotyper --heterozygosity 0.001 -pl Solexa -varout output.vcf -vf VCF -mbq 20 -mmq 10 -stand_call_conf 30.0 --DBSNP dbsnp_129_b36_plus.rod
  • 28. samtools: samtools pileup -vcs -r 0.001 -l CCDS.txt -f human_b36_plus.fasta input.bam > output.pileup
  • 30. VCF file
       Header:
       ##fileformat=VCFv3.3
       ##FILTER=DP,"DP < 3 || DP > 1200"
       ##FILTER=QUAL,"QUAL < 25.0"
       ##FILTER=SnpCluster,"SNPs found in clusters"
       ##FORMAT=DP,1,Integer,"Read Depth"
       ##FORMAT=GQ,1,Integer,"Genotype Quality"
       ##FORMAT=GT,1,String,"Genotype"
       ##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
       ##INFO=DB,0,Flag,"dbSNP Membership"
       ##INFO=DP,1,Integer,"Total Depth"
       ##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
       ##INFO=HaplotypeScore,1,Float,"Consistency of the site with two (and only two) segregating haplotypes"
       ##INFO=LowMQ,3,Integer,"3-tuple: <fraction of reads with MQ=0>,<fraction of reads with MQ<=10>,<total nubmer of reads>"
       ##INFO=MQ,1,Float,"RMS Mapping Quality"
       ##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
       ##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
       ##annotatorReference=human_b36_plus.fasta
       ##reference=human_b36_plus.fasta
       ##source=VariantAnnotator
       ##source=VariantFiltration
       Column header:
       #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a_a:bwa057_b:picard.bam
       Data:
       1 856182 rs9988021 G A 36.00 0;TARGET DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00
       1 866362 rs4372192 A G 45.00 0;TARGET DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE GT:DP:GQ 1/1:6:45.00
       . . .
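Most of the per-variant annotations sit in the semicolon-separated INFO column. A small Python sketch for unpacking it is given below; `parse_info` is a hypothetical helper, and values are deliberately left as strings rather than converted according to the types declared in the header.

```python
def parse_info(info):
    """Turn a VCF INFO string into a dict; flags without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields
```

Applied to the first data line of the VCF file above, `parse_info("DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE")` yields `DB` as a flag and `DP` as the string `"3"`.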
  • 32. Pileup => VCF: custom scripts, then annotate
       java -Xmx10g -jar GenomeAnalysisTK.jar -T VariantAnnotator --assume_single_sample_reads sample -R human_b36_plus.fasta -D dbsnp_129_b36_plus.rod -I input.bam -B variant,VCF,unannotated.vcf -o annotated.vcf -A AlleleBalance -A MappingQualityZero -A LowMQ -A RMSMappingQuality -A HaplotypeScore -A QualByDepth -A DepthOfCoverage -A HomopolymerRun
  • 33. 5. Filtering • Aim: to reduce number of false positives • Options: – Depth of coverage – Mapping quality – SNP clusters – Allelic balance – Number of reads with mq0
  • 34. java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantFiltration -R human_b36_plus.fasta -o output.vcf -B variant,VCF,input.vcf --clusterWindowSize 10 --filterExpression 'DP < 3 || DP > 1200' --filterName 'DP' --filterExpression 'QUAL < #{qual_cutoff}' --filterName 'QUAL' --filterExpression 'AB > 0.75 && DP > 40' --filterName 'AB'
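The three filter expressions in the command above can be mirrored in plain Python to see what each one does to a single variant. This is only a sketch: the record is assumed to be a dict with numeric `DP`, `QUAL` and (for hets) `AB` values, the default cutoff of 25.0 is taken from the VCF header shown earlier, and the SNP cluster filter is left out.

```python
def apply_filters(record, qual_cutoff=25.0):
    """Return the names of the filters a variant fails (empty list = pass)."""
    failed = []
    dp = record["DP"]
    if dp < 3 or dp > 1200:
        failed.append("DP")
    if record["QUAL"] < qual_cutoff:
        failed.append("QUAL")
    # AB (allele balance) is only annotated for heterozygous calls.
    ab = record.get("AB")
    if ab is not None and ab > 0.75 and dp > 40:
        failed.append("AB")
    return failed
```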
  • 35. Filtering - QC metrics (1) Transition/transversion ratio Random: Ti/Tv = 0.5 Whole genome: 2.0-2.1 Exome: 3-3.5
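The Ti/Tv ratio is easy to compute from a list of (ref, alt) pairs. The Python sketch below uses hypothetical function names and assumes single-base substitutions with ref different from alt. Of the 12 possible substitutions, 4 are transitions and 8 are transversions, which is where the random expectation of 0.5 comes from.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snps):
    """snps: iterable of (ref, alt) single-base pairs with ref != alt."""
    ti = tv = 0
    for ref, alt in snps:
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")
```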
  • 36. Filtering - QC metrics (2) Number of novel SNPs Exome: total 20k - 25k; novel 1-3k
  • 38. Combining discovery pipelines • Mapper: MAQ/bwa/stampy/… • BaseQ recalibration? Local realignment? • SNP caller: GATK/samtools/SOAPsnp • Priors for SNP calling: heterozygosity (whole genome, exome, dbSNP) • Filtering
  • 39. Combining discovery pipelines [Figure: ROC plot; axes: false positives, true positives]
  • 40. Combining discovery pipelines [Figure labels: combinations, single, better]
  • 41. Indels Still trickier than SNPs – Callers: samtools/dindel/GATK – Sample of 10 individuals, on average per individual: • 2 novel functional high-quality SNPs • 18 novel functional high-quality indels “I trust manual interpretation of the reads more than the basic quality parameters we use”
  • 43.
       4    snp_1   STOP_GAINED
       1    snp_2   STOP_LOST
       1    snp_3   STOP_GAINED
       1    snp_4   ESSENTIAL_SPLICE_SITE,INTRONIC
       1    snp_5   ESSENTIAL_SPLICE_SITE,INTRONIC
       2    snp_6   STOP_GAINED
       2    snp_7   STOP_GAINED
       1    snp_8   STOP_GAINED
       1    snp_9   STOP_GAINED
       1    snp_10  STOP_GAINED
       1    snp_11  STOP_GAINED
       1    snp_12  STOP_GAINED
       1    snp_13  STOP_LOST
       1    snp_14  STOP_GAINED
       178  indels  FRAMESHIFT_CODING
  • 44. Conclusions Different tools exist, and new ones keep being created Best to combine (intersect) the results from different pipelines Genome Analysis ToolKit (GATK) provides useful bam-file processing tools: – Realignment around indels – Base quality recalibration
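Intersecting the results of different pipelines comes down to a set operation once every call is reduced to a comparable key. A minimal Python sketch, assuming each call is represented as a (chrom, pos, ref, alt) tuple and using a hypothetical function name:

```python
def intersect_calls(*call_sets):
    """Keep only the variants reported by every pipeline.

    Each call set is an iterable of (chrom, pos, ref, alt) tuples.
    """
    sets = [set(calls) for calls in call_sets]
    return set.intersection(*sets) if sets else set()
```

The intersection trades sensitivity for specificity: a variant missed by any one pipeline is dropped.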
  • 45. Use in resequencing • Identify SNPs/indels • Consequences (loss-of-function?) • Prevalence in cases/controls • Model: – Dominant: any het – Recessive: homozygous non-reference or compound het
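The two inheritance models can be written as a per-gene check on one individual's candidate variants. The Python sketch below follows the slide literally. The genotype labels (`"het"`, `"homnonref"`) and the function name are assumptions, and two hets in the same gene are treated as a compound het without checking phase.

```python
def gene_fits_model(genotypes, model):
    """Check one gene in one individual against an inheritance model.

    genotypes: list of "het" / "homnonref" labels, one per candidate
    variant in the gene (hypothetical labels).
    """
    hets = genotypes.count("het")
    homs = genotypes.count("homnonref")
    if model == "dominant":
        # Any heterozygous candidate variant is enough.
        return hets >= 1
    if model == "recessive":
        # One homozygous non-reference variant, or two hets
        # (assumed to lie on different haplotypes).
        return homs >= 1 or hets >= 2
    raise ValueError(f"unknown model: {model}")
```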
  • 46. References • Chan E. In: Single Nucleotide Polymorphisms, Methods in Molecular Biology 578 (2009) • McKenna et al. Genome Res 20:1297-1303 (2010) • Li H & Durbin R. Bioinformatics 25:1754-1760 (2009) • Li H et al. Bioinformatics 25:2078-2079 (2009) • Li H et al. Genome Res 18:1851-1858 (2008)