This presentation summarizes methods for detecting identity by descent (IBD) tracts from sequence data and compares the performance of existing algorithms. It introduces IBD and current population-based methods like GERMLINE, FastIBD, and RefinedIBD. A coalescence-based method called SMCSD is described that uses hidden Markov models and can infer shorter IBD tracts. Simulation results on European and African populations show that while existing methods perform well for long tracts, SMCSD has higher recall, precision, and F-score for shorter tracts due to its probabilistic modeling of recombination breakpoints.
1. Introduction Methods Results Conclusions
Sequence Based Identity by Descent Detection
Joint work with Jasmine Nirody & Yun S. Song
@ University of California, Berkeley
Paula Tataru
Mols Meeting
August 15, 2013
Sequence Based IBD detection 1
3. Identity By Descent (IBD) tracts
[Figure: two aligned sequences, GACTGAGC and GACTGAAC, differing at one site]
DNA segments that are inherited from a common ancestor
recombination disrupts them
expected length depends on the TMRCA
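The link between tract length and TMRCA can be made concrete with a small calculation. This is a minimal sketch, not taken from the slides: it assumes crossovers fall as a Poisson process over the 2t meioses separating two haplotypes whose common ancestor lived t generations ago, so that tract length is exponential with mean 1/(2t) Morgans. The function name is hypothetical.

```python
def expected_tract_length_cm(tmrca_generations: float) -> float:
    """Expected IBD tract length in centimorgans for a pair of haplotypes.

    Assumption (not from the slides): crossovers occur as a Poisson process
    over the 2*t meioses separating the pair, so the tract length is
    exponential with mean 1 / (2*t) Morgans, i.e. 100 / (2*t) cM.
    """
    if tmrca_generations <= 0:
        raise ValueError("TMRCA must be positive")
    return 100.0 / (2.0 * tmrca_generations)


# Older common ancestors give shorter expected tracts.
for t in (10, 50, 500, 5000):
    print(f"TMRCA {t:>5} generations -> about {expected_tract_length_cm(t):.3f} cM")
```

Under this assumption, a tract of about 1 cM corresponds to a common ancestor roughly 50 generations back, and older ancestry gives shorter tracts.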
4. IBD is fundamental in genetics
selection
phasing
imputation
association studies
5. Current methods use population-wide SNP genotype data
work best for recent IBD (longer than 1 cM)
different IBD definitions
pairwise SNPs disrupt predicted IBD tracts
probabilistic, deterministic
6. GERMLINE
Gusev et al., 2009
Identical By State (IBS)
Deterministic
Linear in number of samples
Phased SNP data
Sliding window to find IBS
Allows for genotyping error
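The window-based search for IBS can be illustrated with a toy example. This is a simplified sketch of the general idea (hash fixed windows of phased haplotypes, then join consecutive matching windows), not GERMLINE's actual implementation: it requires exact matches and so omits the tolerance for genotyping error. All function names and the example haplotypes are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def ibs_seed_windows(haplotypes, window):
    """Return {(name_a, name_b): [window indices]} for pairs identical in a window."""
    n_sites = len(next(iter(haplotypes.values())))
    matches = defaultdict(list)
    for w, start in enumerate(range(0, n_sites - window + 1, window)):
        # Bucket haplotypes by their content in this window (the hashing step).
        buckets = defaultdict(list)
        for name, hap in haplotypes.items():
            buckets[hap[start:start + window]].append(name)
        # Every pair inside one bucket is identical by state in this window.
        for names in buckets.values():
            for a, b in combinations(sorted(names), 2):
                matches[(a, b)].append(w)
    return dict(matches)


def merge_windows(window_ids, window):
    """Join runs of consecutive window indices into (start_site, end_site) segments."""
    segments = []
    for w in window_ids:
        start, end = w * window, (w + 1) * window
        if segments and segments[-1][1] == start:
            segments[-1] = (segments[-1][0], end)  # extend the previous segment
        else:
            segments.append((start, end))
    return segments


# Toy phased haplotypes (hypothetical data), 12 sites, windows of 4 sites.
haps = {"h1": "GGAACCTTGGAA", "h2": "GGAACCTTGGAG", "h3": "TTAACCTTCCAA"}
seeds = ibs_seed_windows(haps, window=4)
for pair, ids in sorted(seeds.items()):
    print(pair, merge_windows(ids, window=4))
```

Because each haplotype is hashed once per window, the work grows with the number of samples rather than the number of pairs as long as the buckets stay small, which is the intuition behind the linear scaling noted on the slide.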
7. FastIBD
Browning & Browning, 2011
IBD inside IBS
Deterministic
Quadratic in number of samples
Unphased SNP data; phasing done with Beagle
Accounts for phase uncertainty and background levels of LD
Models shared haplotype frequencies
8. RefinedIBD
Browning & Browning, 2013
IBD inside IBS
Probabilistic
Quadratic in number of samples
Very similar to FastIBD
Identifies candidate IBD segments using GERMLINE
Filter candidates based on a probabilistic model
9. SMCSD
Paul et al., 2011, Sheehan et al., 2013
same TMRCA
Probabilistic: HMM
Quadratic in number of samples
Phased sequence data
Based on coalescence theory
Predicts recombination breakpoints that change TMRCA
10. SMCSD in a nutshell
Designed to estimate demographic history
partition time in discrete intervals
assume constant population size per time interval
use EM to train model
Use decoding to infer IBD
assume demography given
run posterior decoding
changes of TMRCA reveal recombination breakpoints
use posterior probabilities to trim tracts’ endpoints
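The decoding step can be sketched with a toy HMM. This is an illustration under strong assumptions, not the SMCSD model: the hidden states stand in for discretized TMRCA intervals, the observations are 0 (the pair matches at a site) or 1 (the pair differs), and the initial, transition, and emission probabilities are made-up numbers rather than values derived from coalescent theory or a demography. Function names are hypothetical.

```python
def posterior_decode(obs, init, trans, emit):
    """Scaled forward-backward; returns per-site posterior state probabilities."""
    n, k = len(obs), len(init)
    fwd, scale = [], []
    for t in range(n):
        if t == 0:
            raw = [init[s] * emit[s][obs[0]] for s in range(k)]
        else:
            raw = [emit[s][obs[t]] * sum(fwd[t - 1][u] * trans[u][s] for u in range(k))
                   for s in range(k)]
        c = sum(raw)  # scaling constant keeps the numbers from underflowing
        scale.append(c)
        fwd.append([x / c for x in raw])
    bwd = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        bwd[t] = [sum(trans[s][u] * emit[u][obs[t + 1]] * bwd[t + 1][u] for u in range(k))
                  / scale[t + 1] for s in range(k)]
    return [[fwd[t][s] * bwd[t][s] for s in range(k)] for t in range(n)]


def tracts_from_posterior(post):
    """Maximal runs of sites with the same most-probable state: (start, end, state)."""
    labels = [max(range(len(row)), key=row.__getitem__) for row in post]
    tracts, start = [], 0
    for i in range(1, len(labels) + 1):
        # A change of decoded state marks an inferred recombination breakpoint.
        if i == len(labels) or labels[i] != labels[start]:
            tracts.append((start, i, labels[start]))
            start = i
    return tracts


# Made-up parameters: state 0 = recent TMRCA (few differences), state 1 = old TMRCA.
init = [0.5, 0.5]
trans = [[0.95, 0.05], [0.05, 0.95]]
emit = [[0.99, 0.01], [0.70, 0.30]]
obs = [0] * 25 + [1, 0, 1, 1, 0, 1, 0, 0, 1, 1] + [0] * 25

for start, end, state in tracts_from_posterior(posterior_decode(obs, init, trans, emit)):
    print(f"sites [{start}, {end}) decoded as TMRCA state {state}")
```

The per-site posterior probabilities returned by `posterior_decode` are the quantities that could also be used to trim uncertain tract endpoints; that step is not implemented here.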
12. Data simulation
Simulate trees in ms
µ = 1.25 × 10^-8
r = 10^-8
sequences of length 10 Mb
10 sequences (45 pairs)
10 replicates
Collect recombination breakpoints from ms output
Reconstruct pairwise IBD tracts
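The reconstruction step can be sketched as follows. This is a minimal illustration that assumes the pair's TMRCA has already been extracted for each non-recombining segment of the simulated sequence (parsing the ms output itself is not shown), and that a tract is a maximal run of adjacent segments with the same TMRCA. The input format and function name are hypothetical.

```python
from math import comb, isclose


def pairwise_ibd_tracts(segments, tol=1e-12):
    """Merge adjacent segments with equal TMRCA into tracts.

    `segments` is a list of (length_in_bases, tmrca) tuples in left-to-right
    order for one pair of sequences (assumed input format).
    Returns a list of (start, end, tmrca) with half-open coordinates.
    """
    tracts, pos = [], 0
    for length, tmrca in segments:
        if tracts and isclose(tracts[-1][2], tmrca, rel_tol=0.0, abs_tol=tol):
            # The breakpoint did not change this pair's TMRCA: extend the tract.
            start, _, t = tracts[-1]
            tracts[-1] = (start, pos + length, t)
        else:
            tracts.append((pos, pos + length, tmrca))
        pos += length
    return tracts


# Hypothetical segments for one pair: the first breakpoint leaves the TMRCA unchanged.
segments = [(1200, 0.35), (800, 0.35), (500, 1.10), (2500, 0.35)]
print(pairwise_ibd_tracts(segments))

# 10 sequences give 10 choose 2 = 45 pairs to process per replicate.
print(comb(10, 2))
```

A breakpoint that leaves the pair's TMRCA unchanged does not end a tract under this definition, which is why the first two segments are merged.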
13. Human Population
Tennessen et al., 2012, Simons et al., 2013
[Figure: population size (PopSize, 10^3 to 10^6) and cumulative probability (CumProb, 0.0 to 1.0) versus generations back in time (0 to 6000); legend: EA, EA Watt, A]
17. Conclusion
Simulated data from outbred populations
Existing programs are strong performers for long tracts
SMCSD performs better on shorter tracts
SMCSD uses a more robust IBD definition
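Comparisons like these rest on recall, precision, and F-score of predicted tracts against the true ones. The slides do not say how the metrics were computed, so the sketch below assumes a per-base definition: a base counts as a true positive when it lies in both a true and a predicted tract. Function names and the example intervals are hypothetical.

```python
def covered(intervals):
    """Set of base positions covered by half-open (start, end) intervals."""
    return {p for start, end in intervals for p in range(start, end)}


def tract_scores(true_tracts, predicted_tracts):
    """Per-base precision, recall, and F-score (assumed metric definition)."""
    truth, pred = covered(true_tracts), covered(predicted_tracts)
    tp = len(truth & pred)  # bases that are in both a true and a predicted tract
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical tracts: 200 true bases, 150 predicted bases, 100 in common.
precision, recall, f_score = tract_scores([(0, 100), (200, 300)], [(50, 150), (200, 250)])
print(f"precision={precision:.3f} recall={recall:.3f} F={f_score:.3f}")
```

Sets of positions keep the example short; for sequences of 10 Mb an interval-overlap computation would be the more practical choice.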