5. What are SNPs and why are they important?
• SNP = single nucleotide polymorphism
• It’s the differences that matter:
• Human vs chimp: 98% identical (<2 differences every 100bp)
• Between any 2 individuals: 1 difference every 1000bp
• Disease: A or G == life or death
• Mutations can result in:
• change in level of transcription or translation (loss/gain)
• change in protein structure
16. pileup
1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
1 277 T 22 ..CCggC,C,.C.,,CC,..g. +7<;<<<<<<<&<=<<:;<<&<
1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
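Annotations on the slide: alignment (the read bases column: "." and "," = match to the reference on the forward/reverse strand, letters = mismatches); mapping quality (the character following "^" at the start of a read)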
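To make the read bases column concrete, here is a minimal Python sketch that tallies the alleles seen at one pileup position. The function name `count_pileup_bases` is made up for this illustration. It assumes the samtools pileup conventions: "^" plus one character marks a read start, "$" a read end, and "+N"/"-N" followed by N bases marks an indel.

```python
import re
from collections import Counter

def count_pileup_bases(ref_base, bases):
    """Count alleles in the read-bases column of one pileup line."""
    # "^" is followed by one character encoding the mapping quality: drop both
    bases = re.sub(r"\^.", "", bases)
    # "$" marks the end of a read: drop it
    bases = bases.replace("$", "")
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c in "+-":
            # indel: sign, length, then that many inserted/deleted bases
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            length = int(bases[i + 1:j]) if j > i + 1 else 0
            i = j + length
            continue
        if c in ".,":
            counts[ref_base.upper()] += 1   # match on forward/reverse strand
        else:
            counts[c.upper()] += 1          # mismatch (or "*" for a deleted base)
        i += 1
    return counts

# Position 277 from the pileup above (reference base T, coverage 22)
print(count_pileup_bases("T", "..CCggC,C,.C.,,CC,..g."))
```

For position 277 this gives 12 reads supporting T, 7 supporting C and 3 supporting G, which adds up to the coverage of 22 in column 4.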
17. Intermezzo: quality scores
“Phred-score”: used for sequence quality as well as mapping quality
phred-score = -10 x log10(probability of error)
Chance of 1/1000 that read is mapped at wrong position = 10^-3 => phred-score = 30
Chance of 1/100 that read is mapped at wrong position = 10^-2 => phred-score = 20
Sanger encoding (ASCII code = quality score + 33): quality score 30 = “?”
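The conversions on this slide can be sketched in a few lines of Python. The function names are illustrative only; the offset of 33 is the Sanger (Phred+33) convention.

```python
import math

def phred_from_error_prob(p):
    """Phred score for an error probability p."""
    return -10 * math.log10(p)

def error_prob_from_phred(q):
    """Error probability for a Phred score q."""
    return 10 ** (-q / 10)

def sanger_char(q):
    """Sanger (Phred+33) character for an integer quality score."""
    return chr(q + 33)

def sanger_score(ch):
    """Quality score encoded by a Sanger (Phred+33) character."""
    return ord(ch) - 33

print(round(phred_from_error_prob(1e-3)))  # 30
print(sanger_char(30))                     # ?
print(sanger_score("<"))                   # 27
```

The last line decodes "<", the most frequent character in the quality column of the pileup example, to a base quality of 27.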
19. Heterozygous SNPs and the binomial distribution
SNPs are bi-allelic => allele counts for a heterozygous SNP follow a binomial distribution
outcome = binary (red/white, head/tail, yes/no, A/G)
probability p of the outcome of a single draw is the same for all draws
E.g. 8 A’s + 12 G’s = SNP?
hypothesis: heterozygous => nr of draws = 20; nr of “successes” = 8; probability p of outcome in single draw = 0.5
table with cumulative binomial probabilities: http://bit.ly/cumul_binom_prob
8 A’s given coverage of 20 => cumulative probability P(X <= 8) = 0.252 > 0.05
=> the heterozygous hypothesis is not rejected => call as heterozygote
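Instead of looking the value up in a table, the cumulative probability can be computed directly. This is a minimal sketch using only the Python standard library; the 0.05 threshold is the one used on the slide.

```python
from math import comb

def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# 8 A's out of 20 reads, testing the hypothesis "heterozygous" (p = 0.5)
p_value = binom_cdf(8, 20)
print(round(p_value, 3))  # 0.252
print("heterozygous not rejected" if p_value > 0.05 else "heterozygous rejected")
```

With only 2 A's in 20 reads the same function returns about 0.0002, so the heterozygous hypothesis would be rejected at the 0.05 level.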
22. VCF file
##fileformat=VCFv3.3
##FILTER=DP,"DP < 3 || DP > 1200"
##FILTER=QUAL,"QUAL < 25.0"
##FILTER=SnpCluster,"SNPs found in clusters"
##FORMAT=DP,1,Integer,"Read Depth"
##FORMAT=GQ,1,Integer,"Genotype Quality"
##FORMAT=GT,1,String,"Genotype"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=DB,0,Flag,"dbSNP Membership"
##INFO=DP,1,Integer,"Total Depth"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=HaplotypeScore,1,Float,"Consistency of the site with two (and only two) segregating haplotypes"
##INFO=LowMQ,3,Integer,"3-tuple: <fraction of reads with MQ=0>,<fraction of reads with MQ<=10>,<total nubmer of reads>"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##annotatorReference=human_b36_plus.fasta
##reference=human_b36_plus.fasta
##source=VariantAnnotator
##source=VariantFiltration
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a_a:bwa057_b:picard.bam
1 856182 rs9988021 G A 36.00 0;TARGET DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00
1 866362 rs4372192 A G 45.00 0;TARGET DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE GT:DP:GQ 1/1:6:45.00
. . .
23. VCF file: structure
(same example file as on slide 22, annotated)
• file header: the lines starting with ## (file format, FILTER/FORMAT/INFO definitions, reference, source)
• column header: the line starting with #CHROM
• actual data: all following lines, one per variant
27. We have: chromosome + position + alleles
We need:
• in gene?
• damaging?
These annotations will be the basis for filtering
Tools: SIFT (http://sift.bii.a-star.edu/sg), annovar, PolyPhen, ...
33. VCF file
Same file header (## lines) as in the VCF example above; the FILTER column is now filled in:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a_a:bwa057_b:picard.bam
1 856182 rs9988021 G A 36.00 DP DB;DP=2;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00
1 866362 rs4372192 A G 45.00 PASSED DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE GT:DP:GQ 1/1:6:45.00
. . .
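How the FILTER column could be derived from the ##FILTER definitions in the header (DP < 3 || DP > 1200, QUAL < 25.0) can be sketched as follows. This is an illustration only, not the code of the tool that produced the file; the function names are made up, and the SnpCluster filter is left out because it needs neighbouring records.

```python
def parse_info(info):
    """Turn an INFO string such as 'DB;DP=2;MQ=60.00' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True   # flags such as DB have no value
    return fields

def failed_filters(qual, info, min_dp=3, max_dp=1200, min_qual=25.0):
    """Names of the header-defined filters that a record fails."""
    depth = int(parse_info(info)["DP"])
    failed = []
    if depth < min_dp or depth > max_dp:
        failed.append("DP")
    if qual < min_qual:
        failed.append("QUAL")
    return failed

# The two records shown on the slide
print(failed_filters(36.00, "DB;DP=2;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE"))  # ['DP']
print(failed_filters(45.00, "DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE"))   # []
```

An empty list corresponds to the record marked PASSED on the slide; the first record fails the DP filter because its INFO depth is 2.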
36. Factors that influence SNP accuracy
• sequencing technology
• mapping algorithms and parameters
• post-mapping manipulation: duplicate removal, base quality recalibration, local realignment around indels, ...
• SNP calling algorithms and parameters
38. Filtering to find likely candidates
Which are the most interesting?
• only high quality: DP, QUAL, AB, but keep an eye on Ti/Tv (transition/transversion ratio)
• novel
• loss-of-function (stop gained, splice site, ...) or predicted to be damaging (non-synonymous)
• found in multiple individuals
• conserved
• homozygous non-reference or compound heterozygous
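The Ti/Tv check mentioned in the first bullet can be sketched as below. Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes; all other substitutions are transversions. The function names and the example SNP list are illustrative, not taken from the lecture data.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for A<->G or C<->T substitutions."""
    pair = {ref.upper(), alt.upper()}
    return pair == PURINES or pair == PYRIMIDINES

def ti_tv_ratio(snps):
    """Ratio of transitions to transversions for (ref, alt) pairs."""
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")

example = [("G", "A"), ("A", "G"), ("C", "T"), ("A", "C")]
print(ti_tv_ratio(example))  # 3.0
```

Recomputing the ratio after each filtering step shows whether a filter is mainly removing false positives or is skewing the set of remaining SNPs.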
39. Disease model
• dominant: a single heterozygous SNP is damaging
• recessive: either homozygous non-reference or compound heterozygous necessary to lead to disease phenotype
(e.g. phenylketonuria: cannot convert phenylalanine to tyrosine. Can lead to: intellectual disability, microcephaly, ...)
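A recessive-model filter could look like the sketch below: keep genes with at least one homozygous non-reference variant, or with two or more heterozygous variants (a possible compound heterozygote). The gene names and the input layout are hypothetical. Note the assumption that two heterozygous variants in one gene lie on different haplotypes; without phasing or parental data this cannot be confirmed.

```python
from collections import defaultdict

def recessive_candidate_genes(variants):
    """variants: iterable of (gene, genotype), genotype like '0/1' or '1/1'."""
    homozygous = set()
    het_counts = defaultdict(int)
    for gene, genotype in variants:
        if genotype == "1/1":
            homozygous.add(gene)
        elif genotype in ("0/1", "1/0"):
            het_counts[gene] += 1
    # two or more heterozygous hits: candidate compound heterozygote (unphased!)
    compound = {gene for gene, n in het_counts.items() if n >= 2}
    return homozygous | compound

calls = [("GENE_A", "0/1"), ("GENE_A", "0/1"), ("GENE_B", "1/1"), ("GENE_C", "0/1")]
print(sorted(recessive_candidate_genes(calls)))  # ['GENE_A', 'GENE_B']
```

Under a dominant model the filter would instead keep every gene with at least one damaging heterozygous variant, so GENE_C would also be retained.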
41. Why bother?
Iafrate et al, Nat Genet 2004 & Sebat et al, Science 2004
Redon et al, Nature 2006: 12% of genome is covered by copy number variable regions (270 individuals) => more nucleotide content per genome than SNPs
• colour vision in primates
• CCL3L1 copy number -> susceptibility to HIV
• AMY1 copy number -> diet
=> “the dynamic genome”
47. Types of structural variation
[Figure: types of structural variation. Aerts & Tyler-Smith, In: Encyclopedia of Life Sciences, 2009]
CNV = Copy Number Variation
49. Copy number variation (CNV)
Not equally distributed over genome: more pericentromeric and subtelomeric (especially in primates)
Pericentromeric & subtelomeric regions: bias towards interchromosomal rearrangements; interstitial regions: bias towards intrachromosomal
Generation of duplications:
pericentromeric: 2-stage model (Sharp & Eichler, 2006)
1. series of seeding events: one or more progenitor loci transpose together to pericentromeric receptor => generates mosaic block of duplicated segments derived from different loci
2. inter- & intrachromosomal duplication => large blocks are duplicated near other centromeres
subtelomeric: due to normal recombination: cross-overs lead to translocation of distal sequences between chromosomes
50. Copy number variation and segmental duplications
Close relationship between CNVs and segmental duplications (aka low-copy repeats aka LCRs; genomic regions with >1 copy that are at least 1kb long and have at least 90% sequence similarity):
• Copy number variation that is fixed in population = segmental duplication (in other words: segmental duplications started out themselves as copy number variations)
• Segmental duplications can stimulate formation of new CNVs due to NAHR (see later)
➡ In human + chimp: 70-80% of inversions and 40% of insertions/deletions overlap with segmental duplications
➡ 80% of human segmental duplications arose after the divergence of Great Apes from the rest of the primates
53. Mechanisms: NAHR
NAHR = non-allelic homologous recombination
often between segmental duplications
• can recur
• clustered breakpoints
• larger
Hastings et al, 2009
54. Mechanisms: NHEJ
NHEJ = non-homologous end-joining
pathway to repair double-strand breaks, but may lead to translocations and telomere fusion
not associated with segmental duplications
• more scattered
• unique origins
• smaller
Gu et al, 2008
55. Mechanisms: FoSTeS
FoSTeS = DNA replication fork-stalling and template switching
can occur multiple times in series => can generate very complex rearrangements
Hastings et al, 2009
56. Discovery of structural variation
Feuk et al, 2006
57. Approaches for discovery
• karyotyping, fluorescent in situ hybridization (FISH)
• array comparative genomic hybridization (aCGH)
• next-generation sequencing: combination of:
• read pair information
• read depth information
• split read information
• for fine-mapping breakpoints: local assembly
=> identify signatures
58. Structural variation discovery using FISH
FISH = fluorescence in situ hybridization
[Figure panels: duplication, inversion, duplication. Feuk et al, 2006]
65. General principle
• Similar to aCGH: using reference RD file (e.g. from 1000Genomes Project)
• In theory: higher resolution, but noisier than aCGH
• Algorithms not mature yet
• More complex steps
➡Data binned
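The binning step can be illustrated with a minimal sketch: count read start positions per fixed-size window in the test and reference sample, then take the log2 ratio after scaling for the total number of reads. The window size, function names and toy positions are assumptions made for this example; real tools choose the window size from the coverage.

```python
import math
from collections import Counter

def bin_counts(positions, window):
    """Number of reads starting in each window (window index = pos // window)."""
    return Counter(pos // window for pos in positions)

def log2_ratios(test_positions, ref_positions, window):
    """log2(test/ref) read depth per window, scaled by total read counts."""
    test = bin_counts(test_positions, window)
    ref = bin_counts(ref_positions, window)
    scale = len(ref_positions) / len(test_positions)
    ratios = {}
    for b in sorted(ref):
        if test[b] == 0:
            continue  # no test reads in this window: log2 ratio undefined
        ratios[b] = math.log2(test[b] * scale / ref[b])
    return ratios

test_reads = [1, 2, 3, 101]
ref_reads = [1, 2, 101, 102]
print(log2_ratios(test_reads, ref_reads, window=100))
```

In this toy example window 0 has a positive log2 ratio (possible gain) and window 1 a ratio of -1 (possible loss). With so few reads per window the ratios are dominated by noise, which is why the slide calls the approach noisier than aCGH.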
69. Combining CNV data for >1 individuals/samples
CNV = copy number variation
70. CNVR = copy number variation region
CNVR = any region covered by at least 1 CNV
71. CNVE = copy number variation event
CNVE = subgroups of CNVR with >= 50% reciprocal overlap
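Both definitions translate into short interval operations. The sketch below treats CNVs as half-open (start, end) intervals on one chromosome; the function names are illustrative. `merge_into_cnvrs` builds CNVRs as the union of overlapping CNVs, and `reciprocal_overlap` gives the value that is compared against the 50% threshold when grouping CNVs into CNVEs.

```python
def reciprocal_overlap(a, b):
    """Smallest fraction of either interval that is covered by the overlap."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))

def merge_into_cnvrs(cnvs):
    """Merge overlapping CNVs (start, end) into CNV regions."""
    regions = []
    for start, end in sorted(cnvs):
        if regions and start < regions[-1][1]:
            # overlaps the current region: extend it
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(region) for region in regions]

cnvs = [(100, 200), (150, 300), (500, 600)]
print(merge_into_cnvrs(cnvs))                       # [(100, 300), (500, 600)]
print(reciprocal_overlap((100, 200), (150, 250)))   # 0.5
```

Two CNVs can fall in the same CNVR without belonging to the same CNVE: (100, 200) and (150, 1000) overlap, but their reciprocal overlap is far below 50%.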
72. Data normalization
• Mainly: GC content
• Other: repeat-rich regions, mapping quality, ...
• Fit linear model of read depth (RD) on GC content => noise decreases
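A minimal sketch of the GC correction: fit a straight line of read depth against GC content per window by least squares, then remove the fitted trend while keeping the overall mean depth. This is one simple way to do it, written with the standard library only; the function names and toy numbers are assumptions for illustration.

```python
def fit_line(x, y):
    """Least-squares slope and intercept of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def gc_normalise(gc, rd):
    """Remove the linear GC trend from read depth, keeping the mean depth."""
    slope, intercept = fit_line(gc, rd)
    mean_rd = sum(rd) / len(rd)
    return [r - (slope * g + intercept) + mean_rd for g, r in zip(gc, rd)]

gc_content = [0.3, 0.4, 0.5, 0.6]   # GC fraction per window
read_depth = [30, 40, 50, 60]       # depth that depends only on GC here
print([round(v, 6) for v in gc_normalise(gc_content, read_depth)])
```

In this toy case the depth is fully explained by GC content, so every window ends up at the mean depth of 45 after normalisation; in real data the residual variation is what the segmentation step works on.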
73. Segmentation
• Identify spikes
• Many segmentation algorithms, e.g. GADA
• Issues: setting parameters: when to cut off peaks?
• Combine outputs from different runs with different parameters
• Compare to known CNVs
78. Drawbacks
• Can only find unbalanced structural variation (i.e. CNVs)
• How to assess specificity and sensitivity? => compare with known CNVs
• Database of Genomic Variants DGV (http://projects.tcag.ca/variation/)
• Decipher (http://decipher.sanger.ac.uk/)
• Breakpoints: unknown
• Different parameters for rare vs common CNVs => which?
80. Discordant read pairs
• Orientation
• Distance
• Plot insert size distribution for chromosome
• Very long tail!! => difficult to set cutoff: 4 MAD (median absolute deviations) or 0.01%?
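The "4 MAD" cutoff can be written down in a few lines. The sketch uses the median and the median absolute deviation because, unlike mean and standard deviation, they are barely affected by the long tail. The function names and insert sizes are made up for this example.

```python
from statistics import median

def mad(values):
    """Median absolute deviation from the median."""
    m = median(values)
    return median([abs(v - m) for v in values])

def insert_size_bounds(sizes, n_mad=4):
    """Lower and upper insert-size cutoff at median +/- n_mad x MAD."""
    m = median(sizes)
    d = mad(sizes)
    return m - n_mad * d, m + n_mad * d

def discordant_by_distance(sizes, n_mad=4):
    """Insert sizes outside the cutoffs."""
    low, high = insert_size_bounds(sizes, n_mad)
    return [s for s in sizes if s < low or s > high]

sizes = [190, 195, 200, 200, 205, 210, 5000]
print(insert_size_bounds(sizes))      # (180, 220)
print(discordant_by_distance(sizes))  # [5000]
```

The single pair with an insert size of 5000 is flagged, while it has no influence on the cutoffs themselves.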
89. Clustering
• “standard clustering strategy”
• only consider mate pairs that do not have concordant mappings
• ignore read pairs that have more than one good mapping
• clustering: use insert size distribution (e.g. 2x4 MAD)
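A very simple version of such a clustering strategy is sketched below: sort the discordant, uniquely mapped pairs by the position of the left read and start a new cluster whenever either end is further than `max_gap` from the previous pair. This is an illustration, not the algorithm of any particular tool; `max_gap` is an assumed parameter that would be derived from the insert size distribution (e.g. 2 x 4 MAD).

```python
def cluster_pairs(pairs, max_gap):
    """Greedy clustering of (left_pos, right_pos) discordant read pairs."""
    clusters = []
    for pair in sorted(pairs):
        if clusters:
            last = clusters[-1][-1]
            close_left = pair[0] - last[0] <= max_gap
            close_right = abs(pair[1] - last[1]) <= max_gap
            if close_left and close_right:
                clusters[-1].append(pair)
                continue
        clusters.append([pair])
    return clusters

pairs = [(1000, 6000), (1010, 6020), (1030, 5990), (50000, 90000)]
for cluster in cluster_pairs(pairs, max_gap=40):
    print(len(cluster), cluster)
```

The first three pairs support the same event and end up in one cluster; the fourth pair forms a cluster of its own.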
90. Clustering: issues
• Ignores pairs that have >1 good mapping => no detection within repetitive regions (segmental duplications)
• What cutoff for what is considered abnormal distance? (4 MAD? 0.01%? 2 stdev?)
• Low library quality or mix of libraries => multiple peaks in size distribution
91. Filtering
• On number of RPs (read pairs) per cluster
• normally: n = 2
• for high coverage (e.g. 1000Genomes pilot 2: 80X): n = 5
• On drop in read depth and split reads
• On (mappingQ x nrRP)
• if published data available: look at specificity and sensitivity for different cutoffs mQ x nrRP
• if not: very difficult
92. Filtering: issues
• Large insert size: low resolution for detecting breakpoints
• Small insert size: low resolution for detecting complex regions
94. Mapping
• short subsequences => many possible mappings
• solution: “anchored split mapping” (e.g. Pindel)
Medvedev et al, 2009
95. Local reassembly
• Aim: to determine breakpoints
• Which reads?
• for deletions: local reads
• for insertions: hanging reads for read pairs with only one read mapped
• (rather not: unmapped reads)
• For large region: split up
98. General conclusions NGS & structural variation (1)
              +                                   -
read depth    conceptually simple                 only unbalanced (CNVs); low resolution
read pairs    wide range of types of variation    complicated
split reads   basepair resolution                 very small reads
99. General conclusions NGS & structural variation (2)
• Available algorithms: more to demonstrate technique than comprehensive solution
• Difficult => different software = different results => “consensus set”
• based on read pairs and split reads: many sets agree
• based on read depth: totally different
• sometimes drop in read depth, but no aberrant read pairs spanning the region => why???
• Mapper = critical; maq/bwa: only 1 mapping (=> many false negatives); mosaik, mrFAST: return more results
104. References and software
• Medvedev P et al. Nat Methods 6(11):S13-S20 (2009)
• Lee S et al. Bioinformatics 24:i59-i67 (2008)
• Hormozdiari F et al. Genome Res 19:1270-1278 (2009)
• Campbell P et al. Nat Genet 40:722-729 (2008)
• Ye K et al. Bioinformatics 25(21):2865-2871 (2009)
• Chen K et al. Genome Res 19:1527-1741 (2009)
• Yoon S et al. Genome Res 19:1586-1592 (2009)
• Du J et al. PLoS Comp Biol 5(7):e1000432 (2009)
• Aerts J & Tyler-Smith C. In: Encyclopedia of Life Sciences (2009)
• Hastings P et al. Nat Rev Genet 10:551-564 (2009)
106. Finding SNPs using Galaxy
Based on the SAM-file you created in Galaxy in the last lecture, create a list of
SNPs. You’ll first have to convert the SAM file to BAM, then create a pileup and
finally filter the pileup (using “Filter pileup on coverage and SNPs”). Let this filter
only return variants where the coverage is larger than 3 and the base quality is
larger than 20.
How many SNPs do you find?
Calculate a histogram of the coverage over all SNPs (= column 4 in the filtered
file you just created)
107. Finding SNPs using samtools
Using the SAM file you created in the last lecture on the linux command line:
Generate a BAM file and sort it. Next, generate a pileup for that BAM file using
~jaerts/i0d51a/chr9.fa as the reference sequence. When doing this: only print
the variant sites and also compute the reference sequence (run “samtools
pileup” without arguments to get more info).
How many SNPs are identified? Is the SNP at position 139,391,636
heterozygous or homozygous-non-reference? And the one at 139,399,365? Do
you trust the SNP at 139,401,304?
108. Annotating and filtering SNPs
Download ~jaerts/i0d51a/sift.input to your own machine and then upload it to
the SIFT website at http://bit.ly/siftsnp. Positions in this file are on Homo
sapiens build NCBI36. Make sure to let SIFT send the results by email.
How many SNPs are in/near genes?
How many are in exons?
What percentage of the SNPs is predicted damaging?
109. Structural variation
We’ll be looking at copy number variation using the cnv-seq package. This software is available from http://tiger.dbs.nus.edu.sg/cnv-seq/
We’ll be running the example from the cnv-seq tutorial at http://tiger.dbs.nus.edu.sg/cnv-seq/doc/manual.pdf. (Read that!)
•Log into the server mentioned on Toledo.
•Calculate CNVs in the file ~jaerts/i0d51a/test_1.hits compared to ~jaerts/i0d51a/ref_1.hits:
/mnt/apps/cnv-seq/current/cnv-seq.pl --test ~jaerts/i0d51a/test_1.hits --ref ~jaerts/i0d51a/ref_1.hits --genome chrom1 --log2 0.6 -p 0.001 --bigger-window 1.5 --annotate --minimum-windows 4
•Finally investigate in R. Start R by typing “R”. Then:
library(cnv)
data <- read.delim('test_1.hits-vs-ref_1.hits.log2-0.6.pvalue-0.001.miw-4.cnv')
cnv.print(data)
cnv.summary(data)
plot.cnv(data, CNV=4, upstream=4e+6, downstream=4e+6)
ggsave('sample_1.pdf')
•Describe the main features in the plot.