In this research paper from the Spring 2015 semester, I describe my analysis of gaps within certain scaffolds of the Malaclemys terrapin genome. I examined seven of these scaffold gaps and determined their approximate sizes by Polymerase Chain Reaction (PCR) and gel electrophoresis. The DNA was then prepared for sequencing by an external facility. The resulting chromatograms were inconclusive as to the exact sequences spanning these gaps.
Cancer therapies that target specific pathways can be more effective than established, nonspecific chemotherapy and radiation treatments, and may prevent side effects on healthy tissues. Such targeted therapies can only be applied after underlying gene mutations have been identified. However, detecting low frequency variants from clinically relevant samples poses significant challenges. Specimens are routinely formalin-fixed and paraffin-embedded (FFPE) for histology, which can decrease the efficiency of NGS library preparation. In this presentation, we discuss approaches for extraction of DNA from FFPE samples, and recommend quality control assays to guide parameter selection for library construction and sequencing depth.
This document summarizes genetic variation projects like HapMap and 1000 Genomes that aimed to catalog common human genetic variants. It describes types of variation like SNPs and how factors like selection and recombination influence their distribution. It provides overviews of the HapMap and 1000 Genomes projects, including their goals, populations studied, methods, and data formats. The information from these projects can be used to study traits, diseases, and human history.
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic... - Elia Brodsky
This workshop will address critical issues related to Transcriptomics data:
Processing raw Next Generation Sequencing (NGS) data:
1. Next Generation Sequencing data preprocessing:
Trimming technical sequences
Removing PCR duplicates
2. RNA-seq based quantification of expression levels:
Conventional pipelines (looking at known transcripts)
Identification of novel isoforms
Analysis of Expression Data Using Machine Learning:
3. Unsupervised analysis of expression data:
Principal Component Analysis
Clustering
4. Supervised analysis:
Differential expression analysis
Classification, gene signature construction
5. Gene set enrichment analysis
The workshop will include hands-on exercises utilizing public domain datasets:
breast cancer cell lines transcriptomic profiles (https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-10-r110),
patient-derived xenograft (PDX) mouse model of tumor and stroma transcriptomic profiles (http://www.oncotarget.com/index.php?journal=oncotarget&page=article&op=view&path[]=8014&path[]=23533), and
processed data from The Cancer Genome Atlas samples (https://cancergenome.nih.gov/).
Team: The workshops are designed by the researchers at the Tauber Bioinformatics Research Center at University of Haifa, Israel in collaboration with academic centers across the US. Technical support for the workshops is provided by the Pine Biotech team. https://edu.t-bio.info/a-critical-approach-to-transcriptomic-data-analysis/
Microarrays allow researchers to analyze gene expression across thousands of genes simultaneously. DNA probes are arrayed on a small glass or nylon slide, and labeled mRNA from samples is hybridized to the probes. Fluorescent scanning detects which genes are expressed. Data analysis includes normalization, distance metrics, clustering, and visualization to group genes with similar expression profiles and identify patterns of co-regulated genes. Microarrays enable functional genomics studies of development, disease, response to drugs or environmental factors, and more.
Modeling DNA Amplification by Polymerase Chain Reaction (PCR) - Danielle Snowflack
The objective of this lesson is for students to gain hands-on experience with the principles and practice of the Polymerase Chain Reaction (PCR). At the completion of this activity, students should understand the process by which PCR amplifies DNA.
Moving Towards a Validated High Throughput Sequencing Solution for Human Iden... - Thermo Fisher Scientific
Presented by Jennifer D. Churchill, PhD, during a special Lunch and Learn session at the American Academy of Forensic Sciences (AAFS) 67th annual conference, February 2015. Conclusions:
• Robust panels of identity and ancestry SNPs
• Robust STR panel
• Whole genome mtDNA sequencing
• Highly informative
• Sensitive
• Quantitative – scaling comparison
• A low-density chip is not necessarily a bad chip
• A wide range of densities can still yield high-quality data
• Continue development and validation based on these results
The document provides an introduction to epistasis detection in genome-wide association studies (GWAS). It defines epistasis as the detection of causal SNPs for a disease through their interactions, rather than their individual effects. It outlines the problem of epistasis detection as analyzing large genotype datasets to find combinations of SNPs that maximize an association measure with binary disease status. Popular measures discussed are chi-squared and mutual information statistics. The document reviews computational methods for epistasis detection, including Multifactor Dimensionality Reduction, SNPHarvester, and SNPRuler. It notes the challenges of reducing computational burden and detecting higher-order epistatic interactions.
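As an illustration of the association measures named above, the sketch below computes a chi-squared statistic for the joint genotypes of a SNP pair against binary case/control status. This is a toy, stdlib-only version under my own assumptions (genotypes coded 0/1/2, status coded 0/1); the function name and data coding are illustrative, not taken from the reviewed methods.

```python
from collections import Counter

def chi_squared_pair(geno_a, geno_b, status):
    """Chi-squared statistic for the joint genotypes of two SNPs
    against binary disease status (toy contingency-table version)."""
    # Count observations per (genotype_a, genotype_b, status) cell
    cells = Counter(zip(geno_a, geno_b, status))
    combos = sorted({(a, b) for a, b, _ in cells})
    n = len(status)
    n_case = sum(status)
    n_ctrl = n - n_case
    stat = 0.0
    for combo in combos:
        # Row total over both status values for this genotype combination
        row_total = cells.get(combo + (1,), 0) + cells.get(combo + (0,), 0)
        for s, col_total in ((1, n_case), (0, n_ctrl)):
            observed = cells.get(combo + (s,), 0)
            expected = row_total * col_total / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat
```

An exhaustive pairwise scan would apply this to every SNP pair, which is exactly the computational burden the reviewed methods try to reduce.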
The 1000 Genomes Project aims to model sequencing error processes to improve SNP calling accuracy. Features like coverage, strand bias, trinucleotide mutation rates, and quality scores are explored as error predictors. Validation of over 1000 SNPs genotyped in multiple individuals found that coverage over 100 has a high false positive rate. Non-reference alleles also show strand imbalance. Mutation rates are highest for CpG dinucleotides. The trinucleotide GT* has a 2-3 times higher observed miscall rate than expected based on quality scores, suggesting the need for sequence context calibration in the SNP calling algorithm. Future work will incorporate error rate predictors and explore a new 4-probability model-based approach.
Improving the accuracy of k-means algorithm using genetic algorithmKasun Ranga Wijeweera
This document proposes using a genetic algorithm to improve the accuracy of k-means clustering by selecting better initial centroids. It describes generating an initial population of random centroids, evaluating their fitness using a k-means objective function, and evolving the population over generations using selection, crossover and mutation to converge on high-fitness initial centroids. Testing showed this genetic k-means approach produced more accurate and globally optimized clustering results than random initial centroids.
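The scheme described can be sketched as follows. This is a minimal 1-D toy version assuming tournament selection, one-point crossover, and Gaussian mutation; all names and parameter values are illustrative, not taken from the paper.

```python
import random

def kmeans_sse(centroids, points):
    """Fitness: total squared distance of each point to its
    nearest centroid (lower is better)."""
    return sum(min((p - c) ** 2 for c in centroids) for p in points)

def ga_init_centroids(points, k=2, pop_size=20, generations=40, seed=0):
    """Evolve a population of candidate centroid sets for 1-D data:
    tournament selection, one-point crossover, Gaussian mutation."""
    rng = random.Random(seed)
    # Initial population: random picks from the data points
    pop = [[rng.choice(points) for _ in range(k)] for _ in range(pop_size)]
    for _ in range(generations):
        nxt = []
        for _ in range(pop_size):
            # Tournament selection of two parents
            p1 = min(rng.sample(pop, 3), key=lambda c: kmeans_sse(c, points))
            p2 = min(rng.sample(pop, 3), key=lambda c: kmeans_sse(c, points))
            cut = rng.randrange(1, k) if k > 1 else 0
            child = p1[:cut] + p2[cut:]
            # Gaussian mutation with small probability
            child = [c + rng.gauss(0, 0.1) if rng.random() < 0.1 else c
                     for c in child]
            nxt.append(child)
        pop = nxt
    return min(pop, key=lambda c: kmeans_sse(c, points))
```

The returned centroid set would then seed a standard k-means run, which is the accuracy improvement the document reports over random initialization.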
The document provides an overview of real-time PCR (polymerase chain reaction). It discusses extracting RNA from tissue, converting the RNA to cDNA using reverse transcriptase, performing real-time PCR, and analyzing the results. Several key steps are described, including the importance of RNA quality, using appropriate reverse transcriptase primers and PCR primers, including necessary controls, and selecting appropriate reference standards for normalization.
Two-Tailed PCR - New Ultrasensitive and Ultraspecific Technique for the Quant... - Kate Barlow
This document discusses methods for quantifying and analyzing microRNAs (miRNAs) using quantitative PCR (qPCR). It presents a new two-tailed RT-qPCR method that provides high sensitivity and specificity for detecting miRNAs, including discrimination of miRNA isoforms. The method allows unlimited multiplexing in the reverse transcription step followed by singleplex qPCR. The document benchmarks the two-tailed RT-qPCR method on biological samples, showing it can sensitively detect fewer than 10 molecules and maintain specificity across the entire miRNA sequence. It also demonstrates two-tube multiplexing of the method to profile expression levels of several miRNAs in different tissues.
Do Fractional Norms and Quasinorms Help to Overcome the Curse of Dimensiona... - Alexander Gorban
The document discusses whether fractional norms and quasinorms can help overcome the curse of dimensionality. It analyzes three measures of classification accuracy for different values of p in the Minkowski distance. The results show that fractional quasinorms with small p have higher relative contrast and variation, but do not necessarily improve KNN classification performance. In fact, values of p around 0.5, 1, and 2 generally perform best, while extremely small or large p values perform worse. Therefore, the conclusion is that fractional quasinorms do not overcome the curse of dimensionality for classification problems.
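The relative contrast being compared can be computed directly. The sketch below is my own illustration (the definition assumed here is the usual (Dmax - Dmin)/Dmin over distances from a query to a point cloud; the variable names and data are mine):

```python
import random

def minkowski(u, v, p):
    """Minkowski distance between two vectors (a quasinorm for p < 1)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

def relative_contrast(points, query, p):
    """(Dmax - Dmin) / Dmin: how well the farthest and nearest
    neighbours of `query` are separated under this distance."""
    d = [minkowski(query, x, p) for x in points]
    return (max(d) - min(d)) / min(d)

# Uniform random cloud in 100 dimensions, one query point
random.seed(1)
dim = 100
cloud = [[random.random() for _ in range(dim)] for _ in range(200)]
q = [random.random() for _ in range(dim)]
rc = {p: relative_contrast(cloud, q, p) for p in (0.5, 1, 2)}
```

Comparing `rc` across values of p reproduces the kind of contrast measurement the document analyzes; as it concludes, a higher contrast for small p does not by itself guarantee better KNN classification.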
Generalizing phylogenetics to infer patterns predicted by processes of divers... - Jamie Oaks
This document summarizes a presentation about generalizing phylogenetics to infer patterns predicted by diversification processes. The presentation discusses how phylogenetics is becoming the statistical foundation of biology, and how "big data" presents opportunities to study biology in light of phylogeny. It notes that current models assume independent diversification across lineages, but many processes like biogeography or genome evolution can affect multiple lineages simultaneously. Accounting for shared divergences between lineages could improve inference and provide a framework for studying co-diversification processes. Challenges include developing likelihoods for genomic data and handling large numbers of possible tree topologies. The presentation describes approaches like using the coalescent to integrate over gene trees analytically rather than sampling them.
This document discusses genome sequencing and three approaches to sequencing genomes: hierarchical shotgun sequencing, shotgun sequencing, and de novo whole genome sequencing. It describes key concepts in genome assembly such as contigs, supercontigs/scaffolds, sequence coverage, and physical coverage. It also provides an example of a sequencing read.
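The coverage concepts mentioned can be made concrete with the classical Lander-Waterman formulas; the functions below are my own illustrative sketch, not taken from the document.

```python
import math

def sequence_coverage(n_reads, read_len, genome_len):
    """Average sequence coverage c = N * L / G."""
    return n_reads * read_len / genome_len

def expected_contigs(n_reads, read_len, genome_len):
    """Lander-Waterman estimate: roughly N * exp(-c) contigs remain
    after assembly, so higher coverage means fewer, longer contigs."""
    c = sequence_coverage(n_reads, read_len, genome_len)
    return n_reads * math.exp(-c)
```

For example, one million 100 bp reads on a 10 Mb genome give 10x coverage, and the model predicts only a few dozen remaining contigs.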
Real-time PCR is a technique that monitors DNA amplification during the PCR process using fluorescence detection. It allows both quantification of the DNA present and detection of amplification as it occurs. Real-time PCR has advantages over traditional PCR, such as higher sensitivity, higher specificity, and the ability to provide quantitative results. It uses sequence-specific DNA probes labeled with fluorescent dyes and quenchers to detect amplification of target DNA sequences. Data analysis can provide both absolute and relative quantification of DNA targets. Real-time PCR has many applications, including gene expression analysis, disease diagnosis, and food and environmental testing.
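Relative quantification from real-time PCR data is commonly done with the standard 2^-ddCt method (the document mentions relative quantification without naming a method; the function below and its parameter names are my own sketch of that standard calculation):

```python
def fold_change_ddct(ct_target_treated, ct_ref_treated,
                     ct_target_control, ct_ref_control):
    """Relative quantification by the 2^-ddCt method: normalize the
    target gene's Ct to a reference gene within each sample, then
    compare treated vs control. Assumes ~100% PCR efficiency."""
    dct_treated = ct_target_treated - ct_ref_treated
    dct_control = ct_target_control - ct_ref_control
    ddct = dct_treated - dct_control
    return 2 ** (-ddct)
```

For instance, a target Ct of 22 vs reference 18 in the treated sample, against 25 vs 18 in the control, corresponds to an 8-fold increase in expression.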
Genetic algorithms are a type of evolutionary algorithm that mimics natural selection. They operate on a population of potential solutions applying operators like selection, crossover and mutation to produce the next generation. The algorithm iterates until a termination condition is met, such as a solution being found or a maximum number of generations being produced. Genetic algorithms are useful for optimization and search problems as they can handle large, complex search spaces. However, they require properly defining the fitness function and tuning various parameters like population size, mutation rate and crossover rate.
This talk outlines the general steps for project management in the rapid development (design to data in two weeks) of a novel digital PCR assay. The assay is used to validate and quantify low-frequency variants discovered by next-generation sequencing (NGS) of a targeted comprehensive cancer gene panel on the Ion Torrent PGM, applied to PDX models of metastatic colon cancer and spheroid (3-D) cell cultures.
Goal: if the potential driver mutation is validated, treat both the PDX model and the cell culture with small-molecule drugs and investigate whether their responses coincide.
Characterization of Novel ctDNA Reference Materials Developed using the Genom... - Thermo Fisher Scientific
This document summarizes the development and characterization of novel circulating tumor DNA (ctDNA) reference materials. Fragmented DNA containing single or multiple cancer hotspot mutations was spiked into normal human plasma at defined allelic frequencies ranging from 0.1-50%. The size, concentration, and stability of the reference materials were analyzed. Results showed the materials had a mean size of ~160bp and allelic frequencies matched the expected values. Stability testing demonstrated the ctDNA controls were stable in plasma for up to 15 months. The reference materials were developed to enable simpler validation and quality control of ctDNA detection tests.
As increasing numbers of people choose to have their genomes sequenced and made available for research, more genomic data is available for analysis by machine learning approaches. Single Nucleotide Polymorphisms (SNPs) are known to be a major factor influencing many physical traits, diseases and other phenotypes. Using publicly available data and tools we predict phenotype from genotype using SNP data (1 to 2 million SNPs). We utilize data analysis and machine learning approaches only, no domain knowledge, so that our automated approach may be generally used to predict different phenotypes from genotype. In the first application of our method we predicted eye color with 87% accuracy.
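A minimal stand-in for the kind of genotype-to-phenotype classifier described is a k-nearest-neighbour vote over SNP vectors. The abstract does not specify the study's actual model, so the sketch below, its distance function, and the toy data are entirely hypothetical; a real pipeline would first reduce 1-2 million SNPs to an informative subset.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k nearest genotype vectors.
    `train` is a list of (snp_vector, phenotype) pairs; SNPs are
    coded 0/1/2 (copies of the minor allele)."""
    def allele_distance(u, v):
        # Sum of absolute allele-count differences across SNPs
        return sum(abs(a - b) for a, b in zip(u, v))
    nearest = sorted(train, key=lambda tp: allele_distance(tp[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Because the approach uses only data and labels, with no domain knowledge, the same code path could be pointed at any phenotype column, which mirrors the generality the abstract claims.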
Algorithm Implementation of Genetic Association Analysis for Rheumatoid Arth... - Fatma Sayed Ibrahim
My M.Sc. dissertation defense. The title is "Algorithm Implementation of Genetic Association Analysis for Rheumatoid Arthritis Data Based on Haplotype Blocks"
This document summarizes the process used to benchmark large deletion calls from multiple sequencing technologies and bioinformatics pipelines. Researchers merged deletion calls from 14 datasets into regions and evaluated call size accuracy. Calls supported by two or more technologies were identified as draft benchmark calls. Sensitivity to these calls was calculated for each method. The results provide insight into strengths and weaknesses of different approaches to structural variant detection.
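The merge-and-support step described can be sketched as interval merging with a per-region record of supporting technologies. This is an illustrative simplification under my own assumptions (single chromosome, simple overlap criterion); the actual benchmarking pipeline is more involved.

```python
def merge_calls(calls, max_gap=0):
    """Merge overlapping deletion calls into regions and record which
    technologies support each region. `calls` is a list of
    (start, end, technology) tuples on one chromosome."""
    regions = []
    for start, end, tech in sorted(calls):
        if regions and start <= regions[-1][1] + max_gap:
            # Overlaps the previous region: extend it and add support
            prev_start, prev_end, techs = regions[-1]
            regions[-1] = (prev_start, max(prev_end, end), techs | {tech})
        else:
            regions.append((start, end, {tech}))
    return regions

def draft_benchmark(regions, min_support=2):
    """Keep regions supported by at least `min_support` technologies,
    mirroring the two-or-more-technologies criterion."""
    return [r for r in regions if len(r[2]) >= min_support]
```

Sensitivity per method would then be the fraction of draft benchmark regions that each method's call set recovers.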
This document provides information about measuring cell cycle by flow cytometry using DNA staining. Propidium iodide (PI) and DAPI are fluorescent dyes that bind DNA in a stoichiometric manner, allowing identification of cells in different phases of the cell cycle based on DNA content. Doublet discrimination is important to exclude cell aggregates and ensure accurate cell cycle analysis. Parameters such as area, height, and width measured by the flow cytometer help distinguish single cells from doublets based on their DNA staining characteristics. Proper sample preparation and instrument setup are essential for obtaining high quality cell cycle data.
Human identification from DNA is typically based on 13 short tandem repeat (STR) alleles. Commercial kits used in forensic casework rely on the detection of these alleles in DNA samples acquired from an individual. However, the process itself is slow (it can take up to 2 days for a laboratory analysis, or 1 hour with Rapid DNA systems) and has been designed to operate on pristine DNA samples. The need for fast and accurate DNA processing has spurred efforts to develop portable systems that can reduce the processing time to less than 1 hour. Such systems, however, are expected to operate on degraded DNA samples due to the architecture and process used by the instrument, and detecting alleles in degraded samples can be challenging. In this paper, we present an algorithm to detect allelic peaks from degraded DNA signals based on an adaptive signal processing scheme.
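A simplified stand-in for adaptive peak detection on a degraded trace is a locally adaptive threshold: flag local maxima that rise well above their neighbourhood statistics. This is not the paper's actual algorithm, only an illustrative sketch with made-up parameter defaults.

```python
def detect_peaks(signal, window=5, k=1.0):
    """Flag local maxima that exceed a locally adaptive threshold:
    mean + k * stdev of the surrounding window. A toy stand-in for
    adaptive allelic-peak calling on an electropherogram trace."""
    peaks = []
    n = len(signal)
    for i in range(1, n - 1):
        # Only consider local maxima
        if not (signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]):
            continue
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neigh = signal[lo:hi]
        mean = sum(neigh) / len(neigh)
        var = sum((x - mean) ** 2 for x in neigh) / len(neigh)
        if signal[i] > mean + k * var ** 0.5:
            peaks.append(i)
    return peaks
```

Because the threshold adapts to the local baseline, the same code tolerates the uneven signal decay typical of degraded samples better than a single global cutoff would.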
Real-time quantitative PCR (qPCR) is a preferred platform for high throughput gene expression profiling, where large numbers of samples are characterized for hundreds of expression markers. Technically, the qPCR measurements are performed in the same way as when classical qPCR is used to analyze only a few targets per sample, but the higher throughput introduces additional sources of potential confounding variation that must be controlled for. In this presentation, Dr Kubista describes how high throughput qPCR profiling studies are designed. He covers assay optimization and validation, sample quality testing, and how to merge multi-plate measurements into a common analysis. Dr Kubista also discusses how to cost-effectively measure and compensate for background due to genomic DNA.
This document provides an introduction to RNA sequencing (RNA-Seq) applications using next-generation sequencing technologies. It discusses how RNA-Seq can be used to identify which genes are expressed, detect differential gene expression between samples, identify splicing isoforms, and detect genetic variants and structural variations. The document reviews Illumina sequencing by synthesis, the most common platform, outlining the work flow from sample acquisition, RNA extraction and library preparation to sequencing. It also discusses considerations for different sample types and extraction methods.
The 1000 Genomes Project aims to model sequencing error processes to improve SNP calling accuracy. Features like coverage, strand bias, trinucleotide mutation rates, and quality scores are explored as error predictors. Validation of over 1000 SNPs genotyped in multiple individuals found that coverage over 100 has a high false positive rate. Non-reference alleles also show strand imbalance. Mutation rates are highest for CpG dinucleotides. The trinucleotide GT* has a 2-3 times higher observed miscall rate than expected based on quality scores, suggesting the need for sequence context calibration in the SNP calling algorithm. Future work will incorporate error rate predictors and explore a new 4-probability model-based approach.
Improving the accuracy of k-means algorithm using genetic algorithmKasun Ranga Wijeweera
This document proposes using a genetic algorithm to improve the accuracy of k-means clustering by selecting better initial centroids. It describes generating an initial population of random centroids, evaluating their fitness using a k-means objective function, and evolving the population over generations using selection, crossover and mutation to converge on high-fitness initial centroids. Testing showed this genetic k-means approach produced more accurate and globally optimized clustering results than random initial centroids.
The document provides an overview of real-time PCR (polymerase chain reaction). It discusses extracting RNA from tissue, converting the RNA to cDNA using reverse transcriptase, performing real-time PCR, and analyzing the results. Several key steps are described, including the importance of RNA quality, using appropriate reverse transcriptase primers and PCR primers, including necessary controls, and selecting appropriate reference standards for normalization.
Two-Tailed PCR - New Ultrasensitive and Ultraspecific Technique for the Quant...Kate Barlow
This document discusses methods for quantifying and analyzing microRNAs (miRNAs) using quantitative PCR (qPCR). It presents a new two-tailed RT-qPCR method that provides high sensitivity and specificity for detecting miRNAs, including discrimination of miRNA isoforms. The method allows unlimited multiplexing in the reverse transcription step followed by singleplex qPCR. The document benchmarks the two-tailed RT-qPCR method on biological samples, showing it can sensitively detect less than 10 molecules and maintain specificity across the entire miRNA sequence. It also demonstrates two-tube multiplexing of the method to profile expression levels of several miRNAs in different tissues.
Do Fractional Norms and Quasinorms Help to Overcome the Curse of Dimensiona...Alexander Gorban
The document discusses whether fractional norms and quasinorms can help overcome the curse of dimensionality. It analyzes three measures of classification accuracy for different values of p in the Minkowski distance. The results show that fractional quasinorms with small p have higher relative contrast and variation, but do not necessarily improve KNN classification performance. In fact, values of p around 0.5, 1, and 2 generally perform best, while extremely small or large p values perform worse. Therefore, the conclusion is that fractional quasinorms do not overcome the curse of dimensionality for classification problems.
Generalizing phylogenetics to infer patterns predicted by processes of divers...Jamie Oaks
This document summarizes a presentation about generalizing phylogenetics to infer patterns predicted by diversification processes. The presentation discusses how phylogenetics is becoming the statistical foundation of biology, and how "big data" presents opportunities to study biology in light of phylogeny. It notes that current models assume independent diversification across lineages, but many processes like biogeography or genome evolution can affect multiple lineages simultaneously. Accounting for shared divergences between lineages could improve inference and provide a framework for studying co-diversification processes. Challenges include developing likelihoods for genomic data and handling large numbers of possible tree topologies. The presentation describes approaches like using the coalescent to integrate over gene trees analytically rather
This document discusses genome sequencing and three approaches to sequencing genomes: hierarchical shotgun sequencing, shotgun sequencing, and de novo whole genome sequencing. It describes key concepts in genome assembly such as contigs, supercontigs/scaffolds, sequence coverage, and physical coverage. It also provides an example of a sequencing read.
Real-time PCR is a technique that monitors DNA amplification during the PCR process in real-time using fluorescence detection. It allows for both quantification of DNA present and detection of DNA amplification as it occurs. Real-time PCR has advantages over traditional PCR such as higher sensitivity, specificity, and ability to provide quantitative results. It uses sequence-specific DNA probes labeled with fluorescent dyes and quenchers to detect amplification of target DNA sequences. Data analysis can provide both absolute and relative quantification of DNA targets. Real-time PCR has many applications including gene expression analysis, disease diagnosis, and food and environmental testing.
Genetic algorithms are a type of evolutionary algorithm that mimics natural selection. They operate on a population of potential solutions applying operators like selection, crossover and mutation to produce the next generation. The algorithm iterates until a termination condition is met, such as a solution being found or a maximum number of generations being produced. Genetic algorithms are useful for optimization and search problems as they can handle large, complex search spaces. However, they require properly defining the fitness function and tuning various parameters like population size, mutation rate and crossover rate.
This talk outlines the general steps for project management in rapid development (design to data in two weeks) of a novel digital PCR assay to validate and quantify low frequency variants discovered by sequencing (NGS) of a targeted comprehensive cancer gene panel by the Ion Torrent PGM on PDX models of metastatic colon cancer and spheroid (3-D) cell cultures.
Goal: If the potential driver mutation is validated, treat both (PDX model & cell culture) with small molecule drugs, investigate coincident response.
Characterization of Novel ctDNA Reference Materials Developed using the Genom...Thermo Fisher Scientific
This document summarizes the development and characterization of novel circulating tumor DNA (ctDNA) reference materials. Fragmented DNA containing single or multiple cancer hotspot mutations was spiked into normal human plasma at defined allelic frequencies ranging from 0.1-50%. The size, concentration, and stability of the reference materials were analyzed. Results showed the materials had a mean size of ~160bp and allelic frequencies matched the expected values. Stability testing demonstrated the ctDNA controls were stable in plasma for up to 15 months. The reference materials were developed to enable simpler validation and quality control of ctDNA detection tests.
As increasing numbers of people choose to have their genomes sequenced and made available for research, more genomic data is available for analysis by machine learning approaches. Single Nucleotide Polymorphisms (SNPs) are known to be a major factor influencing many physical traits, diseases and other phenotypes. Using publicly available data and tools we predict phenotype from genotype using SNP data (1 to 2 million SNPs). We utilize data analysis and machine learning approaches only, no domain knowledge, so that our automated approach may be generally used to predict different phenotypes from genotype. In the first application of our method we predicted eye color with 87% accuracy.
Algorithm Implementation of Genetic Association Analysis for Rheumatoid Arth...Fatma Sayed Ibrahim
My M.Sc. dissertation defense. The title is "Algorithm Implementation of Genetic Association Analysis for Rheumatoid Arthritis Data Based on Haplotype Blocks"
This document summarizes the process used to benchmark large deletion calls from multiple sequencing technologies and bioinformatics pipelines. Researchers merged deletion calls from 14 datasets into regions and evaluated call size accuracy. Calls supported by two or more technologies were identified as draft benchmark calls. Sensitivity to these calls was calculated for each method. The results provide insight into strengths and weaknesses of different approaches to structural variant detection.
This document provides information about measuring cell cycle by flow cytometry using DNA staining. Propidium iodide (PI) and DAPI are fluorescent dyes that bind DNA in a stoichiometric manner, allowing identification of cells in different phases of the cell cycle based on DNA content. Doublet discrimination is important to exclude cell aggregates and ensure accurate cell cycle analysis. Parameters such as area, height, and width measured by the flow cytometer help distinguish single cells from doublets based on their DNA staining characteristics. Proper sample preparation and instrument setup are essential for obtaining high quality cell cycle data.
Human identification from DNA is typically based on 13 short-tandem repeat (STR) alleles. Commercial kits used in forensic casework rely on detecting these alleles in DNA samples acquired from an individual. However, the process is slow (up to 2 days for a laboratory analysis, or about 1 hour with Rapid DNA systems) and was designed to operate on pristine DNA samples. The need for fast and accurate DNA processing has spurred efforts to develop portable systems that can reduce the processing time to less than 1 hour. Such systems, however, are expected to operate on degraded DNA samples due to the architecture and process used by the instrument, and detecting alleles in degraded samples can be a challenging problem. In this paper, we present an algorithm to detect allelic peaks from degraded DNA signals based on an adaptive signal processing scheme.
Real-time quantitative PCR (qPCR) is a preferred platform for high throughput gene expression profiling, where large numbers of samples are characterized for hundreds of expression markers. Technically, the qPCR measurements are performed in the same way as when classical qPCR is used to analyze only a few targets per sample, but the higher throughput introduces additional sources of potential confounding variation that must be controlled for. In this presentation, Dr Kubista describes how high throughput qPCR profiling studies are designed. He covers assay optimization and validation, sample quality testing, and how to merge multi-plate measurements into a common analysis. Dr Kubista also discusses how to cost-effectively measure and compensate for background due to genomic DNA.
This document provides an introduction to RNA sequencing (RNA-Seq) applications using next-generation sequencing technologies. It discusses how RNA-Seq can be used to identify which genes are expressed, detect differential gene expression between samples, identify splicing isoforms, and detect genetic variants and structural variations. The document reviews Illumina sequencing by synthesis, the most common platform, outlining the work flow from sample acquisition, RNA extraction and library preparation to sequencing. It also discusses considerations for different sample types and extraction methods.
Mechanics:- Simple and Compound PendulumPravinHudge1
A compound pendulum is a physical system with a more complex structure than a simple pendulum, incorporating its mass distribution and dimensions into its oscillatory motion around a fixed axis. Understanding its dynamics involves principles of rotational mechanics and the interplay between gravitational potential energy and kinetic energy. Compound pendulums are used in various scientific and engineering applications, such as seismology for measuring earthquakes, in clocks to maintain accurate timekeeping, and in mechanical systems to study oscillatory motion dynamics.
Evidence of Jet Activity from the Secondary Black Hole in the OJ 287 Binary S...Sérgio Sacani
We report the study of a huge optical intraday flare on 2021 November 12 at 2 a.m. UT in the blazar OJ 287. In the binary black hole model, it is associated with an impact of the secondary black hole on the accretion disk of the primary. Our multifrequency observing campaign was set up to search for such a signature of the impact based on a prediction made 8 yr earlier. The first I-band results of the flare have already been reported by Kishore et al. (2024). Here we combine these data with our monitoring in the R-band. There is a big change in the R–I spectral index, by 1.0 ± 0.1, between the normal background and the flare, suggesting a new component of radiation. The polarization variation during the rise of the flare suggests the same. The limits on the source size place it most reasonably in the jet of the secondary black hole. We then ask why we have not seen this phenomenon before. We show that OJ 287 was never before observed with sufficient sensitivity on the night when the flare should have happened according to the binary model. We also study the probability that this flare is just an oversized example of intraday variability using the Krakow data set of intense monitoring between 2015 and 2023. We find that the occurrence of a flare of this size and rapidity is unlikely. In machine-readable Tables 1 and 2, we give the full orbit-linked historical light curve of OJ 287 as well as the dense monitoring sample of Krakow.
The Limited Role of the Streaming Instability during Moon and Exomoon FormationSérgio Sacani
It is generally accepted that the Moon accreted from the disk formed by an impact between the proto-Earth and the impactor, but its details are highly debated. Some models suggest that a Mars-sized impactor formed a silicate melt-rich (vapor-poor) disk around Earth, whereas other models suggest that a highly energetic impact produced a silicate vapor-rich disk. Such a vapor-rich disk, however, may not be suitable for the Moon formation, because moonlets, the building blocks of the Moon, of 100 m–100 km in radius may experience strong gas drag and fall onto Earth on a short timescale, failing to grow further. This problem may be avoided if large moonlets (≳100 km) form very quickly by streaming instability, which is a process to concentrate particles enough to cause gravitational collapse and rapid formation of planetesimals or moonlets. Here, we investigate the effect of the streaming instability in the Moon-forming disk for the first time and find that this instability can quickly form ∼100 km-sized moonlets. However, these moonlets are not large enough to avoid strong drag, and they still fall onto Earth quickly. This suggests that the vapor-rich disks may not form the large Moon, and therefore the models that produce vapor-poor disks are supported. This result is applicable to impact-induced moon-forming disks in general, supporting the previous suggestion that small planets (<1.6 R⊕) are good candidates to host large moons because their impact-induced disks would likely be vapor-poor. We find a limited role of streaming instability in satellite formation in an impact-induced disk, whereas it plays a key role during planet formation.
Unified Astronomy Thesaurus concepts: Earth-moon system (436)
BIRDS DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptxgoluk9330
Ahota Beel, nestled in Sootea Biswanath Assam , is celebrated for its extraordinary diversity of bird species. This wetland sanctuary supports a myriad of avian residents and migrants alike. Visitors can admire the elegant flights of migratory species such as the Northern Pintail and Eurasian Wigeon, alongside resident birds including the Asian Openbill and Pheasant-tailed Jacana. With its tranquil scenery and varied habitats, Ahota Beel offers a perfect haven for birdwatchers to appreciate and study the vibrant birdlife that thrives in this natural refuge.
Physics Investigatory Project on transformers. Class 12thpihuart12
Physics investigatory project on transformers with the required details for Class 12 students: index, theory, types of transformers (with relevant images), procedure, sources of error, aim and apparatus, along with a bibliography🗃️📜. Please try to add your own imagination rather than just copy-pasting... Hope you all, friends and juniors, like it. peace out✌🏻✌🏻
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi...Sérgio Sacani
We present the JWST discovery of SN 2023adsy, a transient object located in the host galaxy JADES-GS+53.13485−27.82088 with a host spectroscopic redshift of 2.903 ± 0.007. The transient was identified in deep James Webb Space Telescope (JWST)/NIRCam imaging from the JWST Advanced Deep Extragalactic Survey (JADES) program. Photometric and spectroscopic followup with NIRCam and NIRSpec, respectively, confirm the redshift and yield UV-NIR light-curve, NIR color, and spectroscopic information all consistent with a Type Ia classification. Despite its classification as a likely SN Ia, SN 2023adsy is fairly red (E(B−V) ∼ 0.9) despite a host galaxy with low extinction, and has a high Ca II velocity (19,000 ± 2,000 km/s) compared to the general population of SNe Ia. While these characteristics are consistent with some Ca-rich SNe Ia, particularly SN 2016hnk, SN 2023adsy is intrinsically brighter than the low-z Ca-rich population. Although such an object is too red for any low-z cosmological sample, we apply a fiducial standardization approach to SN 2023adsy and find that the SN 2023adsy luminosity distance measurement is in excellent agreement (≲ 1σ) with ΛCDM. Therefore, unlike low-z Ca-rich SNe Ia, SN 2023adsy is standardizable and gives no indication that SN Ia standardized luminosities change significantly with redshift. A larger sample of distant SNe Ia is required to determine if SN Ia population characteristics at high-z truly diverge from their low-z counterparts, and to confirm that standardized luminosities nevertheless remain constant with redshift.
Evaluation and Identification of J'BaFofi the Giant Spider of Congo and Moke...MrSproy
ABSTRACT
The J'BaFofi, or "Giant Spider," is a largely legendary arachnid reportedly inhabiting the dense rain forests of the Congo. Despite numerous anecdotal accounts and cultural references, scientific validation remains elusive. This study aims to evaluate the existence of the J'BaFofi through the analysis of historical reports, indigenous testimonies, and modern exploration efforts.
Compositions of iron-meteorite parent bodies constrain the structure of the pr...Sérgio Sacani
Magmatic iron-meteorite parent bodies are the earliest planetesimals in the Solar System, and they preserve information about conditions and planet-forming processes in the solar nebula. In this study, we include comprehensive elemental compositions and fractional-crystallization modeling for iron meteorites from the cores of five differentiated asteroids from the inner Solar System. Together with previous results of metallic cores from the outer Solar System, we conclude that asteroidal cores from the outer Solar System have smaller sizes, elevated siderophile-element abundances, and simpler crystallization processes than those from the inner Solar System. These differences are related to the formation locations of the parent asteroids because the solar protoplanetary disk varied in redox conditions, elemental distributions, and dynamics at different heliocentric distances. Using highly siderophile-element data from iron meteorites, we reconstruct the distribution of calcium-aluminum-rich inclusions (CAIs) across the protoplanetary disk within the first million years of Solar-System history. CAIs, the first solids to condense in the Solar System, formed close to the Sun. They were, however, concentrated within the outer disk and depleted within the inner disk. Future models of the structure and evolution of the protoplanetary disk should account for this distribution pattern of CAIs.
1. INTERDEPARTMENTAL POSTGRADUATE PROGRAM
"INFORMATION TECHNOLOGIES IN MEDICINE AND BIOLOGY"
MASTER THESIS
Splice site recognition among
different organisms
Despoina I. Kalfakakou
Supervisors:
Stavros Perantonis, Research Director, NCSR Demokritos
George Paliouras, Research Director, NCSR Demokritos
Anastasia Krithara, Post-Doctoral Researcher, NCSR Demokritos
5. RNA Splicing process
• Donor, Acceptor: splice sites, the boundaries between exons and introns.
• Donor site: GU dinucleotide; acceptor site: AG dinucleotide.
6. Importance of Accurate Splice Site Prediction
• A typical mammalian gene has 7-8 exons spread out over ~16 kb.
• Splice site prediction leads to identification of these exons.
• Exon identification is the first step to accurate genome annotation.
• Currently, hundreds of genomes have been annotated, but thousands more remain unknown.
• Moreover, many of the already annotated genomes are incorrectly annotated.
7. Existing Splice Site Prediction Techniques
• Models based on SVMs, HMMs, and artificial neural networks.
• Various DNA sequence representations, most using a large neighborhood (~150 nt) around the donor and acceptor dimers.
• Existing techniques using traditional machine learning methods perform well.
8. Issues of Existing Methods
• Ab initio splice site prediction is a time-consuming and expensive process.
• Poorly annotated genomes.
• Lack of labeled data.
• Idea: transfer knowledge from already annotated genomes of other organisms.
This kind of knowledge transfer is used every day by biologists during their experiments. In machine learning, it is called transfer learning.
9. Transfer Learning
• Introduced in 1995.
• Goal: to reduce the need for collecting and classifying new training data.
• Applications: Sentiment classification, speech recognition, machine
vision etc.
10. Transfer Learning Categorization
Category                       | Source Domain Labels | Target Domain Labels
Inductive Transfer Learning    | Available            | Available
Transductive Transfer Learning | Available            | Unavailable
Unsupervised Transfer Learning | Unavailable          | Unavailable
11. Transfer Learning Categorization
Category                       | Source Domain Labels | Target Domain Labels
Inductive Transfer Learning    | Available            | Available
Transductive Transfer Learning | Available            | Unavailable
Unsupervised Transfer Learning | Unavailable          | Unavailable
• Transferring the knowledge of instances: importance sampling.
• Transferring the knowledge of feature representation: find "good" feature representations to minimize domain divergence and classification error.
12. Proposed Approach
• Bioinformatics analysis in order to extract the most significant patterns between organisms.
• Four DNA sequence representations.
• Evaluation of DNA sequence representations using traditional machine learning.
• Development of two transfer learning models.
• Development of two transfer learning models.
13. Data – Evaluation methods
A. Thaliana C. Elegans D. Melanogaster D. Rerio H. Sapiens
In each classification experiment:
• Training data: 10,000 decoy and 5,000 true splice sites
• Test data: 10,000 decoy and 5,000 true splice sites
Evaluation methods: accuracy, area under the receiver operating characteristic curve (auROC)
For the statistical analysis, we used the DNA sequences of the splice sites of each organism's complete genome.
14. PPM and Consensus Calculation
• Features based on bioinformatics analysis of the sequences of the true splice sites.
• Calculation of Position Probability Matrices (PPMs) and Consensus sequences for each organism in order to extract patterns.
• PPM calculation: M_{k,j} = (1/N) Σ_{i=1}^{N} I(X_{i,j} = k), where N is the number of sequences and I(·) is the indicator function.
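The PPM formula counts, for each position j, the fraction of true-site sequences carrying nucleotide k. A minimal sketch (the toy aligned sequences are invented for illustration):

```python
def position_probability_matrix(sequences):
    """M[k][j] = (1/N) * sum_i I(X[i][j] == k) over N aligned sequences."""
    n = len(sequences)
    length = len(sequences[0])
    return {
        k: [sum(seq[j] == k for seq in sequences) / n for j in range(length)]
        for k in "ACGT"
    }

# Toy aligned donor-site sequences (made up for illustration).
seqs = ["AGGT", "AGGT", "ACGT", "AGGA"]
ppm = position_probability_matrix(seqs)
print(ppm["G"])   # probability of G at each position: [0.0, 0.75, 1.0, 0.0]
```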
15. Important Positions
• For the next steps, we consider as "important" positions the positions in the neighborhood around the splice site dimer where a nucleotide occurs with a probability > 0.3.
• For the donor splice site, the important positions are in a neighborhood of 11 nt around the donor dimer, with the latter at positions 3 and 4 of the neighborhood.
• For the acceptor splice site, the important positions are in a neighborhood of 21 nt around the acceptor dimer, with the latter at positions 19 and 20 of the neighborhood.
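Given a PPM in the form above, picking out the "important" positions is a simple filter; a small sketch (the PPM values below are illustrative, not from the thesis):

```python
def important_positions(ppm, threshold=0.3):
    """Positions where some nucleotide occurs with probability > threshold."""
    length = len(next(iter(ppm.values())))
    return [j for j in range(length)
            if max(ppm[k][j] for k in ppm) > threshold]

# Illustrative 5-position PPM (each column sums to 1).
ppm = {"A": [0.25, 0.10, 0.90, 0.25, 0.05],
       "C": [0.25, 0.30, 0.05, 0.25, 0.05],
       "G": [0.25, 0.30, 0.03, 0.25, 0.85],
       "T": [0.25, 0.30, 0.02, 0.25, 0.05]}
print(important_positions(ppm))  # [2, 4]: only these exceed the 0.3 cutoff
```

Note the strict inequality: a perfectly uniform column (all 0.25) or one exactly at the cutoff is never "important".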
18. Consensus Sequences
Donor (GT dinucleotide at positions 3-4 of the examined sequence):
pos 1  2  3  4  5  6  7  8  9   10  11
AT  A  G  G  T  A  A  G  T  AT  T   T
CE  A  G  G  T  A  A  G  T  T   T   T
DM  A  G  G  T  A  A  G  T  AT  AT  AT
DR  A  G  G  T  A  A  G  T  A   AT  AT
HS  A  G  G  T  A  A  G  T  X   X   X
Acceptor (AG dinucleotide at positions 19-20 of the examined sequence):
pos 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21
AT  T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  G  C  A  G  G
CE  AT AT AT AT AT AT AT AT AT AT AT AT AT T  T  T  T  C  A  G  A
DM  AT AT AT T  T  T  T  T  T  T  T  T  T  T  T  T  T  C  A  G  A
DR  T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  C  A  G  G
HS  T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  X  C  A  G  G
22. DNA Sequence Representation
• Per se representation:
  f(x_i) = 0 if x_i = A; 1 if x_i = T; 2 if x_i = G; 3 if x_i = C
• Binary representation:
  f_{N_i}(x_i) = ||N_i = x_i|| (1 if x_i matches the consensus nucleotide N_i, else 0)
• Score Matrix representation:
  f_{N_i}(x_i) = 2 if x_i = N_i; 1 if x_i = M_i (the same-family nucleotide); 0 otherwise
Score Matrix:
    A  T  C  G
A   2  0  0  1
T   0  2  1  0
C   0  1  2  0
G   1  0  0  2
A and G belong to the purine family; T and C belong to the pyrimidine family.
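The three representations can be sketched in a few lines of Python. This is our own illustration, not the thesis code; the helper names (`per_se`, `binary`, `score_matrix`) are invented, and it handles single-nucleotide consensus entries only (not the ambiguous "AT"/"X" entries of slide 18):

```python
# Same-family partner: purines A/G score 1 against each other,
# pyrimidines T/C likewise; a match scores 2, anything else 0.
FAMILY = {"A": "G", "G": "A", "T": "C", "C": "T"}

def per_se(seq):
    """Per se representation: A -> 0, T -> 1, G -> 2, C -> 3."""
    return ["ATGC".index(x) for x in seq]

def binary(seq, consensus):
    """Binary representation: 1 where the sequence matches the consensus."""
    return [int(x == n) for x, n in zip(seq, consensus)]

def score_matrix(seq, consensus):
    """Score matrix representation against the consensus nucleotides."""
    return [2 if x == n else 1 if x == FAMILY[n] else 0
            for x, n in zip(seq, consensus)]

print(per_se("AGGT"))                # [0, 2, 2, 1]
print(binary("AGCT", "AGGT"))        # [1, 1, 0, 1]
print(score_matrix("AGCT", "AGGT"))  # [2, 2, 0, 2]
```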
25. Feature Evaluation
• Traditional machine learning classification using SVM and kNN.
• The values of the parameters used were tested experimentally.
• SVM: linear kernel.
• kNN: 5 neighbors, Manhattan distance.
26. Feature Evaluation Results
Donor Splice Site
[Bar charts of accuracy per representation (Per Se, Binary, Score Matrix, Weights) with SVM and kNN; test organism D. Melanogaster; training organisms A. Thaliana, C. Elegans, D. Melanogaster, D. Rerio and H. Sapiens.]
27. Feature Evaluation Results
Acceptor Splice Site
[Analogous accuracy bar charts for the acceptor splice site with SVM and kNN.]
28. Proposed Models: kNN based
• Iterative algorithm.
• Train a kNN classifier with the data of an organism, e.g. train on C. Elegans (source domain) and predict on A. Thaliana (target domain).
• In each iteration, recalculate the features for both the source and the target domain based on the predicted true splice sites.
• Objective: both source and target domain features approach the target domain distribution.
29. Proposed Models: kNN based
Algorithm 1. kNN-based approach
- Represent all sequences in one of the three representations, based on the source domain data.
- Repeat:
  • Train a kNN classifier with the source domain data.
  • Classify the target domain data.
  • Recalculate the PPM and/or the consensus.
  • Represent all sequences based on the new PPM or consensus.
- Until convergence or a fixed number of iterations.
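Algorithm 1 can be sketched end-to-end in plain Python. This is a simplified illustration, not the thesis implementation: it uses the binary-vs-consensus encoding, a small k, invented toy sequences, and updates only the consensus (not a full PPM):

```python
from collections import Counter

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train, x, k=3):
    """Majority vote among the k Manhattan-nearest training points."""
    nearest = sorted(train, key=lambda item: manhattan(item[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def consensus(seqs):
    """Most frequent nucleotide per position."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*seqs)]

def encode(seq, cons):
    """Binary representation against a consensus (1 = matches consensus)."""
    return [int(x == n) for x, n in zip(seq, cons)]

def knn_transfer(source_seqs, source_labels, target_seqs, iterations=3, k=3):
    """Sketch of Algorithm 1: classify the target domain, re-derive the
    consensus from the targets predicted as true sites, re-encode, repeat."""
    cons = consensus([s for s, y in zip(source_seqs, source_labels) if y == 1])
    preds = []
    for _ in range(iterations):
        train = [(encode(s, cons), y) for s, y in zip(source_seqs, source_labels)]
        preds = [knn_predict(train, encode(t, cons), k) for t in target_seqs]
        predicted_true = [t for t, p in zip(target_seqs, preds) if p == 1]
        if predicted_true:                 # update the consensus for the next pass
            cons = consensus(predicted_true)
    return preds

# Toy data: "true" sites (label 1) share the GGT motif; decoys do not.
src = ["AGGT", "CGGT", "TGGT", "AAAA", "CCCC", "TTTT"]
lab = [1, 1, 1, 0, 0, 0]
tgt = ["GGGT", "ACAC"]
print(knn_transfer(src, lab, tgt))  # [1, 0]
```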
30. Proposed Models: kMeans based
• Iterative algorithm.
• Initialize target centroids to be the same as the source centroids and predict on the target domain organism.
• In each iteration, recalculate the features for the target domain based on the predicted true splice sites, and recalculate the target domain centroids.
• The source domain centroids remain fixed and contribute a percentage to the classification.
• Objective: the target centroids are "moved" closer to the target domain data.
31. Proposed Models: kMeans based
Algorithm 2. kMeans-based approach
- Represent all sequences in one of the three representations, based on the source domain data.
- Compute the source domain centers.
- Initialize the target domain centers to be the same as the source domain centers.
- Repeat:
  • Classify the target domain data based on the distance to the centroids.
  • Recalculate the PPM and/or the consensus from the target domain instances that are classified as true splice sites.
  • Represent the target domain sequences based on the new PPM or consensus.
  • Calculate the new target domain centroids.
- Until convergence or a fixed number of iterations.
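Algorithm 2 can be sketched similarly; again a simplified illustration with invented toy vectors, where `source_weight` plays the role of the source data contribution percentage and, for brevity, the representation itself is not recomputed between iterations:

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans_transfer(source_vecs, source_labels, target_vecs,
                    iterations=3, source_weight=0.0):
    """Sketch of Algorithm 2: target centroids start at the source centroids
    and are pulled toward the target data each iteration; the frozen source
    centroids contribute a fixed fraction `source_weight` to classification."""
    src_cent = {y: centroid([v for v, l in zip(source_vecs, source_labels) if l == y])
                for y in (0, 1)}
    tgt_cent = dict(src_cent)   # initialize target centroids at the source ones
    preds = []
    for _ in range(iterations):
        preds = []
        for v in target_vecs:
            # Blended distance: source and target centroids both weigh in.
            score = {y: source_weight * distance(v, src_cent[y])
                        + (1 - source_weight) * distance(v, tgt_cent[y])
                     for y in (0, 1)}
            preds.append(min(score, key=score.get))
        for y in (0, 1):        # recompute target centroids from the predictions
            members = [v for v, p in zip(target_vecs, preds) if p == y]
            if members:
                tgt_cent[y] = centroid(members)
    return preds

# Toy vectors (e.g. binary-encoded sequences).
src_X = [[1, 1, 1, 1], [0, 1, 1, 1], [1, 0, 0, 0], [0, 0, 0, 0]]
src_y = [1, 1, 0, 0]
tgt_X = [[1, 1, 1, 0], [0, 0, 1, 0]]
print(kmeans_transfer(src_X, src_y, tgt_X))  # [1, 0]
```

With `source_weight=0.0` the source centroids drop out after initialization, which is the setting the deck reports as working best.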
32. Evaluation on Proposed Approaches
• In cases where the consensus sequences and the PPMs of the source and target domain data are similar, we don't gain much from the transfer learning algorithms.
• In cases where the consensus sequences differ a lot, both approaches manage to substantially increase the auROC and accuracy percentages.
• The kMeans-based algorithm performs better than the kNN-based algorithm.
• In particular, the best results are obtained when the source domain centroids don't contribute at all after the first iteration.
33. Evaluation: Binary Sequence Representation
• Accurate and stable representation.
• The consensus sequence extracted from the classified data converges to the target data consensus.
• Example (trained with C. Elegans data, tested on A. Thaliana data):
C. Elegans consensus:
AT AT AT AT AT AT AT AT AT AT AT AT AT T T T T C A G A
A. Thaliana consensus:
T T T T T T T T T T T T T T T T G C A G G
Extracted consensus:
T T T T T T T T T T T T T T T T T C A G G
34. Evaluation: Binary Sequence Representation (Acceptor Splice Site, Train Organism: C. Elegans)
[Line charts of auROC over iterations 1-4 for A. Thaliana, D. Melanogaster, D. Rerio and H. Sapiens: kNN-based classification, plus kMeans-based classification with 0%, 40% and 80% source data contribution.]
35. Evaluation: Score Matrix Sequence Representation
• Accurate and stable representation as well.
• Performs better than the binary representation.
• The consensus sequence extracted from the classified data converges to the target data consensus.
• Example (trained with C. Elegans data, tested on A. Thaliana data):
C. Elegans consensus:
AT AT AT AT AT AT AT AT AT AT AT AT AT T T T T C A G A
A. Thaliana consensus:
T T T T T T T T T T T T T T T T G C A G G
Extracted consensus:
T T T T T T T T T T T T T T T T T C A G G
36. Evaluation: Score Matrix Sequence Representation (Acceptor Splice Site, Train Organism: C. Elegans)
[Line charts of auROC over iterations 1-4 for A. Thaliana, D. Melanogaster, D. Rerio and H. Sapiens: kNN-based classification, plus kMeans-based classification with 0%, 40% and 80% source data contribution.]
37. Evaluation: Weights Sequence Representation
• Although it seemed promising in the first set of experiments, the weighted representation doesn't perform well with the transfer learning methods.
• The PPM extracted from the classified data does not converge to the target data PPM.
• This was expected, as the extracted PPM was constructed using a subset of the available data.
[Bar charts of nucleotide probabilities (A, T, G, C) at positions 1-21, comparing the C. Elegans - A. Thaliana acceptor extracted PPM with the A. Thaliana acceptor target PPM.]
38. Evaluation: Weights Sequence Representation (Acceptor Splice Site, Train Organism: C. Elegans)
[Line charts of auROC over iterations 1-4 for A. Thaliana, D. Melanogaster, D. Rerio and H. Sapiens: kNN-based classification, plus kMeans-based classification with 0%, 40% and 80% source data contribution.]
39. Summary
• The common patterns in the sequences of the five studied organisms were extracted using bioinformatics analysis.
• For the classification task, only the "important" positions of the neighborhood were used.
• Four DNA sequence representations were proposed, namely:
  • Sequence per se.
  • Binary representation.
  • Score Matrix representation.
  • Weighted representation.
• Binary, Score Matrix and Weighted representations perform well even when using traditional machine learning.
40. Summary
• Two transfer learning algorithms are proposed:
  • kNN based.
  • kMeans based.
• When the patterns in the sequences are similar between organisms, transfer learning doesn't contribute much, as the results are already good.
• When the patterns differ a lot, the kMeans-based algorithm with no source data contribution after the first iteration helps reduce the gap.
• Best performance: Score Matrix representation.
41. Future Steps
• More experiments using all the available data.
• Study more organisms.
• Perform a detailed comparison with other approaches.
• A variation of the proposed transfer learning models, using in the training set in each iteration the most confidently classified target data from the previous iteration.