This document summarizes various computational methods for analyzing high-dimensional phenotypic data from screens of perturbed genes, including enrichment analysis, gene set enrichment analysis (GSEA), mapping phenotypes to networks, hierarchical clustering, and ranking genes by phenotypic similarity to a query. Key methods covered are enrichment analysis using hypergeometric tests to assess overrepresentation of hits in gene sets, hierarchical clustering to build clusters based on phenotypic distance metrics, and ranking genes based on similarity to a query phenotype profile.
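As a concrete sketch of the hypergeometric enrichment test mentioned above, the snippet below computes the probability of observing at least k hits inside a gene set by chance; all counts are invented for the example, not taken from the document.

```python
from scipy.stats import hypergeom

# Hypothetical counts: N genes screened, K genes in the annotated set,
# n genes called as hits, k hits falling inside the set.
N, K, n, k = 20000, 300, 150, 12

# P(X >= k) when drawing n hits without replacement from N genes,
# K of which belong to the set: the survival function at k - 1.
p_value = hypergeom.sf(k - 1, N, K, n)
print(f"enrichment p-value: {p_value:.3g}")
```

With only about 2.25 overlapping genes expected by chance, twelve observed hits give a very small p-value, which is what the test flags as overrepresentation.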
O.M.GSEA - An in-depth introduction to gene-set enrichment analysis (Shana White)
The document provides an introduction to gene set enrichment analysis (GSEA) methodology. It describes how GSEA analyzes gene expression data to determine whether a particular set of genes, defined a priori, shows statistically significant differences between biological conditions (e.g. case vs. control groups). The key steps are: ranking genes based on their correlation with the conditions, calculating a running enrichment score to quantify overrepresentation of the gene set at the top or bottom of the ranking, and assessing significance through permutation testing. An example analysis compares gene expression from ulcerated vs. uninjured mouse stomachs to test for enrichment of genes related to stomach epithelial metaplasia.
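The running-score step can be sketched in a simplified, unweighted form (equal steps up at gene-set members and down at non-members; the actual GSEA statistic weights hits by their correlation with the phenotype):

```python
import numpy as np

def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score over a ranked gene list."""
    in_set = np.array([g in gene_set for g in ranked_genes])
    n_total, n_hit = len(ranked_genes), int(in_set.sum())
    hit_step = 1.0 / n_hit                # step up at each gene-set member
    miss_step = 1.0 / (n_total - n_hit)   # step down at each non-member
    running = np.cumsum(np.where(in_set, hit_step, -miss_step))
    # ES is the maximum deviation of the running sum from zero
    return running[np.abs(running).argmax()]

ranked = [f"g{i}" for i in range(100)]    # g0 = most up-regulated
es = enrichment_score(ranked, {"g1", "g2", "g5", "g8"})
print(f"ES = {es:.3f}")
```

Because all four set members sit near the top of the ranking, the running sum climbs quickly and the resulting ES is large and positive; significance would then be assessed by recomputing the score over permuted phenotype labels.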
This document discusses mining complex relationships between microRNAs, transcription factors, and genes from heterogeneous data sources using causal inference approaches. Specifically, it describes a project that aims to infer regulatory relationships between microRNAs and mRNAs from multiple data sources including DNA sequences, gene expression data, and domain knowledge. It also discusses using causal inference methods like IDA to detect condition-specific regulatory relationships by analyzing samples split according to normal or cancer conditions.
IRJET- Disease Identification using Proteins Values and Regulatory Modules (IRJET Journal)
1) The document proposes developing a common knowledge base for genomic and proteomic analysis to identify genetic disorders using regulatory modules.
2) It involves using collaborative filtering and depth first search to cluster gene ontology terms and regulatory modules for each gene expression.
3) Finally, a Bayesian rose tree is used to represent the taxonomy for a particular gene ID and identify associated diseases.
This document discusses techniques for privacy-preserving data mining through random data perturbation. It argues that while adding random noise to data is intended to preserve privacy, the noise can often be filtered out through spectral analysis of the data's eigenvalues, breaking the privacy protections. The document presents an algorithm that exploits the differences between the eigenvalues of the true data and those of the noise to separate the two, effectively reversing the random perturbation. Experimental results demonstrate that the algorithm can accurately recover not only the original distributions but also individual records, challenging the assumption that random perturbation reveals only distribution-level information while keeping individual records private.
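A toy version of such a spectral filtering attack can be sketched as follows. The rank-1 synthetic data, the known noise level, and the Marchenko-Pastur cutoff are illustrative assumptions, not details taken from the document:

```python
import numpy as np

def spectral_filter(perturbed, sigma):
    """Estimate the signal part of noise-perturbed data by keeping only the
    eigen-components whose eigenvalues exceed the largest eigenvalue expected
    from pure i.i.d. noise of standard deviation sigma (Marchenko-Pastur edge)."""
    m, n = perturbed.shape
    centered = perturbed - perturbed.mean(axis=0)
    cov = centered.T @ centered / m
    eigvals, eigvecs = np.linalg.eigh(cov)
    noise_edge = sigma**2 * (1 + np.sqrt(n / m))**2   # noise-only eigenvalue bound
    keep = eigvals > noise_edge
    proj = eigvecs[:, keep] @ eigvecs[:, keep].T      # projector onto signal subspace
    return centered @ proj + perturbed.mean(axis=0)

rng = np.random.default_rng(0)
signal = rng.normal(size=(2000, 1)) @ rng.normal(size=(1, 10))  # rank-1 "true" data
noisy = signal + rng.normal(scale=0.5, size=signal.shape)       # perturbed release
recovered = spectral_filter(noisy, sigma=0.5)
```

Projecting onto the signal subspace discards the noise energy in the remaining directions, so the filtered estimate lands much closer to the true records than the released perturbed copy does — the essence of the privacy breach described.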
Course: Bioinformatics for Biomedical Research (2014).
Session: 3.2- Basic Aspects of Microarray Technology and Data Analysis.
Statistics and Bioinformatics Unit (UEB) & High Technology Unit (UAT) from Vall d'Hebron Research Institute (www.vhir.org), Barcelona.
Large scale machine learning challenges for systems biology (Maté Ongenaert)
Large scale machine learning challenges for systems biology
by Dr. Yvan Saeys - Machine Learning and Data Mining group, Bioinformatics and Systems Biology Division, VIB-UGent Department of Plant Systems Biology
Due to technological advances, the amount of biological data and the pace at which it is generated have increased dramatically during the past decade. To extract new knowledge from these ever-increasing data sets, automated techniques such as data mining and machine learning have become standard practice.
In this talk, I will give an overview of large scale machine learning challenges in bioinformatics and systems biology, highlighting the importance of using scalable and robust techniques such as ensemble learning methods implemented on large computing grids.
I will present some of our state-of-the-art tools to solve problems such as biomarker discovery, large scale network inference, and biomedical text mining at PubMed scale.
This document discusses techniques for privacy-preserving data mining by adding random perturbations to sensitive data. It summarizes that while adding noise aims to protect privacy, the underlying distributions can still be recovered using spectral filtering methods. The document outlines an algorithm that uses eigendecomposition to separate data distributions from noise distributions, enabling recovery of approximately correct anonymous features and breaking privacy protections.
This document provides an overview of genome-wide association studies (GWAS). It discusses the basic concept of GWAS, how to run and analyze one, and how to interpret the results. Key points include: GWAS genotype individuals at hundreds of thousands to millions of SNPs to look for associations with traits; extensive quality control is required; imputation can increase SNP coverage; statistical analysis involves computing p-values and correcting for multiple testing; and significant findings still require replication in independent samples.
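The multiple-testing step can be illustrated with a small simulation; the SNP count and the 1-degree-of-freedom chi-squared association statistics are assumptions made for the example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_snps = 100_000
# Simulated 1-df chi-squared association statistics under the null
chi2_stats = rng.chisquare(df=1, size=n_snps)
pvals = stats.chi2.sf(chi2_stats, df=1)

alpha = 0.05
threshold = alpha / n_snps      # Bonferroni correction: 5e-7 per SNP
hits = int(np.sum(pvals < threshold))
print(f"per-SNP threshold: {threshold:.1e}, SNPs passing: {hits}")
```

Under the null, essentially nothing survives the corrected threshold, which is why genome-wide significance cutoffs (commonly 5e-8 for a million tests) are so stringent and why surviving hits still need replication.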
This document discusses issues with reproducibility in EEG research and proposes solutions. It notes that flexible choices in EEG methodology and exploratory analyses can lead to false positives. Simulations demonstrate how double dipping, multiple comparisons, and lack of independent replication can produce significant effects from noise alone. The document advocates for preregistering analysis plans, including dummy effects in studies, subdividing data for exploration and replication, and using registered reports to improve reproducibility in EEG research.
The document discusses classifying brain cancer subtypes using statistical methods. It compares using gene expression data alone, copy number data alone, and both combined. It evaluates Naive Bayes, k-Nearest Neighbors, Support Vector Machine, and Random Forest classifiers. Random Forest achieved the highest average accuracy at 85.09%. Using both gene expression and copy number data together yielded slightly higher accuracy than gene expression alone. The highest individual accuracies were Random Forest on the Mesenchymal-Proneural gene expression dataset at 94.69% and Naive Bayes on the same datasets combined at 93.72%.
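A sketch of this kind of classifier comparison, using synthetic data in place of the study's expression and copy-number matrices (the accuracies it prints are therefore not the paper's figures):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for a gene-expression matrix: 200 samples, 50 features
X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           random_state=0)
models = {
    "Naive Bayes": GaussianNB(),
    "k-NN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
}
results = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy for each classifier
    results[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {results[name]:.3f}")
```

Combining data types, as the study did, amounts to concatenating feature matrices before this comparison loop.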
Basics of Data Analysis in Bioinformatics (Elena Sügis)
The presentation gives an introduction to the basics of data analysis in bioinformatics.
The following topics are covered:
Data acquisition
Data summary (selecting the needed columns/rows from the file and showing basic descriptive statistics)
Preprocessing (missing values imputation, data normalization, etc.)
Principal Component Analysis
Data Clustering and cluster annotation (k-means, hierarchical)
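The listed steps can be strung together in a minimal sketch, using a synthetic expression-like matrix as a stand-in for real data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 60 samples from two shifted groups, 100 features each
data = np.vstack([rng.normal(0, 1, (30, 100)),
                  rng.normal(1, 1, (30, 100))])

scaled = StandardScaler().fit_transform(data)         # normalization step
pcs = PCA(n_components=2).fit_transform(scaled)       # principal component analysis
labels = KMeans(n_clusters=2, n_init=10,
                random_state=0).fit_predict(pcs)      # k-means clustering
print(labels)
```

On real data, the cluster-annotation step would follow: inspecting which samples (or genes) fall into each cluster and testing them for shared biology.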
[DSC Europe 23][DigiHealth] Vesna Pajic - Machine Learning Techniques for omi... (DataScienceConferenc1)
The document discusses machine learning techniques for analyzing omics data. It introduces Velsera, a bioinformatics company, and describes how they used machine learning to predict cancer cell line responses to drugs based on gene expression data. Specifically, they cleaned the data, performed feature selection, and tested models like elastic net, GAMs, and XGBoost (which performed best). The final model identified 20 important genes, including one the client was interested in and another potential biomarker the client was unaware of.
Challenges and opportunities for machine learning in biomedical research (FranciscoJAzuajeG)
1. Machine learning faces challenges in biomedical research due to data heterogeneity, lack of labeled data, and complexity in biological patterns and networks.
2. Combining machine learning and biological network models can help address these challenges by encoding data in biologically meaningful networks and extracting network-based features for prediction.
3. Examples applying this approach to cancer datasets showed that models based on network centrality features outperformed other methods, and deep learning using these features achieved the best prediction performance across multiple neuroblastoma datasets.
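As a minimal illustration of network-based features, the sketch below computes a few centrality measures for genes in a tiny hypothetical interaction network; the gene names and edges are invented for the example, and the original work of course used real cancer datasets:

```python
import networkx as nx

# Toy interaction network (illustrative edges only)
edges = [("TP53", "MDM2"), ("TP53", "EP300"), ("MDM2", "UBE2D1"),
         ("EP300", "CREBBP"), ("TP53", "CREBBP")]
g = nx.Graph(edges)

degree = nx.degree_centrality(g)
betweenness = nx.betweenness_centrality(g)
closeness = nx.closeness_centrality(g)

# One feature vector per gene, ready to feed into a downstream predictor
features = {node: (degree[node], betweenness[node], closeness[node])
            for node in g.nodes}
print(features["TP53"])
```

Feature vectors like these, computed over a real interaction network, are what a downstream classifier or deep model would consume in the approach described.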
This document provides an overview of a course on systems biology. It begins with definitions of systems biology from various experts that emphasize examining biological systems as a whole through interactions rather than isolated parts. The course will cover basic analysis tools for large datasets like clustering and correlations. It will also discuss advanced modular analysis and modeling small networks. Standard analysis techniques like clustering gene expression data are demonstrated.
This document provides an overview of a course on systems biology. It defines systems biology as examining biological systems as a whole through the interactions of all components, rather than in isolation. The course covers basic analysis tools for large datasets, such as examining distributions, correlations, and clustering. It also discusses more advanced modular analysis tools like biclustering that decompose data into transcription modules. An example iterative algorithm called the Signature Algorithm is described for finding related genes based on expression profiles and score thresholds.
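The Signature Algorithm described above can be sketched in a simplified single-iteration form: score conditions by the seed genes, threshold, then score all genes over the selected conditions. The thresholds and synthetic data below are made up for the example, not the published implementation:

```python
import numpy as np

def signature_step(expr, seed_idx, cond_thresh=1.0, gene_thresh=1.0):
    """One iteration of a Signature-Algorithm-style module search (simplified).
    expr: genes x conditions matrix of expression values."""
    # Score conditions by the mean expression of the seed genes
    cond_scores = expr[seed_idx].mean(axis=0)
    cond_z = (cond_scores - cond_scores.mean()) / cond_scores.std()
    conditions = np.where(cond_z > cond_thresh)[0]
    # Score every gene over the selected conditions only
    gene_scores = expr[:, conditions].mean(axis=1)
    gene_z = (gene_scores - gene_scores.mean()) / gene_scores.std()
    genes = np.where(gene_z > gene_thresh)[0]
    return genes, conditions

rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 50))
expr[:10, :5] += 4.0                    # planted transcription module
genes, conds = signature_step(expr, seed_idx=np.arange(5))
print(sorted(genes.tolist()), sorted(conds.tolist()))
```

Starting from five seed genes inside the planted module, the step recovers both the module's conditions and its full gene membership; the full algorithm iterates this until the gene set stabilizes.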
A microarray is a laboratory tool used to detect the expression of thousands of genes at the same time. DNA microarrays are microscope slides that are printed with thousands of tiny spots in defined positions, with each spot containing a known DNA sequence or gene.
Avoid overfitting in precision medicine: How to use cross-validation to relia... (Nicole Krämer)
The identification of patient subgroups who may derive benefit from a treatment is of crucial importance in precision medicine. Many different algorithms have been proposed and studied in the literature.
We illustrate that many of these algorithms overfit in the sense that the treatment benefit for the identified patients is substantially overestimated. Further, we show that with cross-validation, it is possible to obtain more realistic estimates.
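The point can be demonstrated with a generic sketch (synthetic data and a deliberately flexible model, not the subgroup-identification algorithms from the talk): the resubstitution estimate is optimistic, while cross-validation gives a more realistic figure.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=150, n_features=30, n_informative=5,
                           random_state=0)
model = DecisionTreeClassifier(random_state=0)   # unpruned: very flexible

train_acc = model.fit(X, y).score(X, y)              # resubstitution estimate
cv_acc = cross_val_score(model, X, y, cv=5).mean()   # cross-validated estimate
print(f"training accuracy: {train_acc:.2f}, cross-validated: {cv_acc:.2f}")
```

The fully grown tree scores perfectly on the data it was fit to, while cross-validation reveals a substantially lower out-of-sample accuracy — the same inflation mechanism that overstates treatment benefit in identified patient subgroups.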
This document summarizes a research paper about random data perturbation techniques used to preserve privacy while still allowing useful data mining. It discusses how simply anonymizing records is not truly private, as the data can be re-identified by linking anonymous records to other publicly available information. The paper presents techniques such as adding small random perturbations to the data or distributing the data across parties, and evaluates methods for recovering the original distribution from perturbed data while preventing identification of individual records. Spectral filtering is proposed to separate meaningful data patterns from noise by exploiting the eigen-properties of random matrices, and the trade-off between privacy and the accuracy of data mining models built on perturbed data is discussed.
Microarrays allow researchers to analyze gene expression levels across thousands of genes simultaneously. DNA microarrays work by hybridizing fluorescently-labeled cDNA or cRNA to complementary DNA probes attached to a solid surface. This technology has applications in gene expression profiling, disease diagnosis, drug discovery, and toxicology research. While microarrays provide high-throughput analysis, their limitations include not reflecting true protein levels, complex data analysis, expense, and short shelf life of DNA chips.
Microarray data analysis involves several key steps:
1) Feature extraction converts the scanned microarray image into quantifiable gene expression values.
2) Quality control assesses the microarray for errors through diagnostic plots of intensities and distributions.
3) Normalization controls for technical variations between assays while preserving biological variations.
4) Differential expression analysis identifies genes with different expression levels between conditions, while correcting for multiple testing.
5) Biological interpretation and public database submission provide meaning and accessibility of the results.
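Step 3 above can be illustrated with one common technique; quantile normalization is a standard choice for forcing every array onto the same intensity distribution (the document does not specify which normalization method was actually used):

```python
import numpy as np

def quantile_normalize(matrix):
    """matrix: genes x arrays. Returns a copy in which every column (array)
    shares the same distribution: the mean of the per-array sorted values."""
    order = np.argsort(matrix, axis=0)           # per-array intensity ranks
    sorted_vals = np.sort(matrix, axis=0)
    reference = sorted_vals.mean(axis=1)         # common reference distribution
    normalized = np.empty_like(matrix, dtype=float)
    for j in range(matrix.shape[1]):
        # Put the reference values back in each array's original rank order
        normalized[order[:, j], j] = reference
    return normalized

arrays = np.array([[5.0, 2.0, 3.0],
                   [2.0, 1.0, 4.0],
                   [3.0, 4.0, 6.0],
                   [4.0, 2.0, 8.0]])   # 4 genes x 3 arrays
print(quantile_normalize(arrays))
```

After normalization every array has identical quantiles, so the remaining between-array differences in gene rank reflect biology rather than technical variation — which is the goal stated in step 3.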
diagnosis of cancer, bioluminescent detection, haplotype mapping, imaging gene expression in vivo, types of cancer diagnosis method, ultrasound imaging
The document provides an introduction to epistasis detection in genome-wide association studies (GWAS). Epistasis refers to causal SNPs influencing a disease through their interactions rather than through their individual effects. The document frames epistasis detection as the problem of analyzing large genotype datasets to find combinations of SNPs that maximize an association measure with binary disease status; popular measures include the chi-squared and mutual information statistics. It reviews computational methods for epistasis detection, including Multifactor Dimensionality Reduction, SNPHarvester, and SNPRuler, and notes the challenges of reducing the computational burden and detecting higher-order epistatic interactions.
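The pairwise association measure can be sketched as a chi-squared test over the nine joint genotype classes of a SNP pair versus disease status; the planted interaction and all counts below are simulated, not drawn from any real cohort:

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 2000
snp_a = rng.integers(0, 3, size=n)   # genotypes coded 0/1/2
snp_b = rng.integers(0, 3, size=n)
# Planted epistatic effect: risk is elevated only when BOTH SNPs carry
# at least one minor allele (neither SNP has a marginal effect this strong)
risk = 0.2 + 0.4 * ((snp_a > 0) & (snp_b > 0))
status = rng.random(n) < risk

combo = snp_a * 3 + snp_b            # 9 joint genotype classes
table = np.array([[np.sum((combo == c) & (status == s)) for s in (0, 1)]
                  for c in range(9)])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.2g}")
```

An exhaustive pairwise scan would repeat this test for every SNP pair, which is exactly the computational burden the reviewed methods try to reduce.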
Pathomics Based Biomarkers, Tools, and Methods (imgcommcall)
This document discusses pathomics-based biomarkers, tools, and methods for multi-scale integrative analysis in biomedical informatics. It summarizes several projects involving extracting quantitative features from pathology and radiology images using image segmentation and analysis techniques. These features are then linked to molecular data and clinical outcomes using statistical and machine learning methods to develop biomarkers. The tools and methods described aim to standardize and optimize feature extraction while accounting for uncertainties.
The Ipsos - AI - Monitor 2024 Report.pdf (Social Samosa)
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily lives in the past 3-5 years.
Large scale machine learning challenges for systems biologyMaté Ongenaert
Large scale machine learning challenges for systems biology
by dr. Yvan Saeys - Machine Learning and Data Mining group, Bioinformatics and Systems Biology Division, VIB-UGent Department of Plant Systems Biology
Due to technological advances, the amount of biological data, and the pace at which it is generated has increased dramatically during the past decade. To extract new knowledge from these ever increasing data sets, automated techniques such as data mining and machine learning techniques have become standard practice.
In this talk, I will give an overview of large scale machine learning challenges in bioinformatics and systems biology, highlighting the importance of using scalable and robust techniques such as ensemble learning methods implemented on large computing grids.
I will present some of our state-of-the-art tools to solve problems such as biomarker discovery, large scale network inference, and biomedical text mining at PubMed scale.
This document discusses techniques for privacy-preserving data mining by adding random perturbations to sensitive data. It summarizes that while adding noise aims to protect privacy, the underlying distributions can still be recovered using spectral filtering methods. The document outlines an algorithm that uses eigendecomposition to separate data distributions from noise distributions, enabling recovery of approximately correct anonymous features and breaking privacy protections.
This document provides an overview of genome-wide association studies (GWAS). It discusses the basic concept of GWAS, running and analyzing a GWAS, and interpreting the results. Key points include: GWAS genotype individuals for hundreds of thousands to millions of SNPs to look for associations with traits; extensive quality control is required; imputation can increase SNP coverage; statistical analysis includes computing p-values and correcting for multiple testing; significant findings still require replication in independent samples.
This document discusses issues with reproducibility in EEG research and proposes solutions. It notes that flexible choices in EEG methodology and exploratory analyses can lead to false positives. Simulations demonstrate how double dipping, multiple comparisons, and lack of independent replication can produce significant effects from noise alone. The document advocates for preregistering analysis plans, including dummy effects in studies, subdividing data for exploration and replication, and using registered reports to improve reproducibility in EEG research.
The document discusses classifying brain cancer subtypes using statistical methods. It compares using gene expression data alone, copy number data alone, and both combined. It evaluates Naive Bayes, k-Nearest Neighbors, Support Vector Machine, and Random Forest classifiers. Random Forest achieved the highest average accuracy at 85.09%. Using both gene expression and copy number data together yielded slightly higher accuracy than gene expression alone. The highest individual accuracies were Random Forest on the Mesenchymal-Proneural gene expression dataset at 94.69% and Naive Bayes on the same datasets combined at 93.72%.
Basics of Data Analysis in BioinformaticsElena Sügis
Presentation gives introduction to the Basics of Data Analysis in Bioinformatics.
The following topics are covered:
Data acquisition
Data summary(selecting the needed column/rows from the file and showing basic descriptive statistics)
Preprocessing (missing values imputation, data normalization, etc.)
Principal Component Analysis
Data Clustering and cluster annotation (k-means, hierarchical)
Cluster annotations
[DSC Europe 23][DigiHealth] Vesna Pajic - Machine Learning Techniques for omi...DataScienceConferenc1
The document discusses machine learning techniques for analyzing omics data. It introduces Velsera, a bioinformatics company, and describes how they used machine learning to predict cancer cell line responses to drugs based on gene expression data. Specifically, they cleaned the data, performed feature selection, and tested models like elastic net, GAMs, and XGBoost (which performed best). The final model identified 20 important genes, including one the client was interested in and another potential biomarker the client was unaware of.
Challenges and opportunities for machine learning in biomedical researchFranciscoJAzuajeG
1. Machine learning faces challenges in biomedical research due to data heterogeneity, lack of labeled data, and complexity in biological patterns and networks.
2. Combining machine learning and biological network models can help address these challenges by encoding data in biologically meaningful networks and extracting network-based features for prediction.
3. Examples applying this approach to cancer datasets showed that models based on network centrality features outperformed other methods, and deep learning using these features achieved the best prediction performance across multiple neuroblastoma datasets.
This document provides an overview of a course on systems biology. It begins with definitions of systems biology from various experts that emphasize examining biological systems as a whole through interactions rather than isolated parts. The course will cover basic analysis tools for large datasets like clustering and correlations. It will also discuss advanced modular analysis and modeling small networks. Standard analysis techniques like clustering gene expression data are demonstrated.
This document provides an overview of a course on systems biology. It defines systems biology as examining biological systems as a whole through the interactions of all components, rather than in isolation. The course covers basic analysis tools for large datasets, such as examining distributions, correlations, and clustering. It also discusses more advanced modular analysis tools like biclustering that decompose data into transcription modules. An example iterative algorithm called the Signature Algorithm is described for finding related genes based on expression profiles and score thresholds.
A microarray is a laboratory tool used to detect the expression of thousands of genes at the same time. DNA microarrays are microscope slides that are printed with thousands of tiny spots in defined positions, with each spot containing a known DNA sequence or gene.
Avoid overfitting in precision medicine: How to use cross-validation to relia...Nicole Krämer
The identification of patient subgroups who may derive benefit from a treatment is of crucial importance in precision medicine. Many different algorithms have been proposed and studied in the literature.
We illustrate that many of these algorithms overfit in the sense that the treatment benefit for the identified patients is substantially overestimated. Further, we show that with cross-validation, it is possible to obtain more realistic estimates.
This document summarizes a research paper about random data perturbation techniques used to preserve privacy while still allowing useful data mining. It discusses how simply anonymizing records is not truly private, as the data can be re-identified by linking anonymous records to other publicly available information. The paper presents techniques like adding small random perturbations to the data or distributing the data across parties. It evaluates methods to recover distributions from perturbed data while preventing identification of individual records. Spectral filtering is proposed to separate meaningful data patterns from predictable noise properties, potentially breaking privacy protections.
This document summarizes a research paper about random data perturbation techniques used to preserve privacy while still allowing useful data mining. It discusses how simply anonymizing records is not truly private, as the data can be re-identified by linking anonymous records to other publicly available information. The paper presents techniques like adding small random perturbations to the data or distributing the data across parties. It evaluates methods to recover the original distribution from perturbed data while preserving individual privacy, such as using the eigen-properties of noise to filter it out. The discussion covers the trade-off between privacy and accuracy of data mining models on perturbed data.
This document summarizes a research paper about random data perturbation techniques used to preserve privacy while still allowing useful data mining. It discusses how simply anonymizing records is not truly private, as the data can be re-identified by linking anonymous records to other publicly available information. The paper presents techniques like adding small random perturbations to the data or distributing the data across parties. It evaluates methods to recover distributions from perturbed data while preventing identification of individual records. Spectral filtering is proposed to separate noise from original data by exploiting properties of random matrices.
Microarrays allow researchers to analyze gene expression levels across thousands of genes simultaneously. DNA microarrays work by hybridizing fluorescently-labeled cDNA or cRNA to complementary DNA probes attached to a solid surface. This technology has applications in gene expression profiling, disease diagnosis, drug discovery, and toxicology research. While microarrays provide high-throughput analysis, their limitations include not reflecting true protein levels, complex data analysis, expense, and short shelf life of DNA chips.
Microarray data analysis involves several key steps:
1) Feature extraction converts the scanned microarray image into quantifiable gene expression values.
2) Quality control assesses the microarray for errors through diagnostic plots of intensities and distributions.
3) Normalization controls for technical variations between assays while preserving biological variations.
4) Differential expression analysis identifies genes with different expression levels between conditions, while correcting for multiple testing.
5) Biological interpretation and public database submission provide meaning and accessibility of the results.
diagnosis of cancer, bioluminescent detection, diagnosis of cancer, haplotype mapping, imaging gene expression in vivo, types of cancer diagnosis method, ultrasound imaging
The document provides an introduction to epistasis detection in genome-wide association studies (GWAS). It defines epistasis as the detection of causal SNPs for a disease through their interactions, rather than their individual effects. It outlines the problem of epistasis detection as analyzing large genotype datasets to find combinations of SNPs that maximize an association measure with binary disease status. Popular measures discussed are chi-squared and mutual information statistics. The document reviews computational methods for epistasis detection, including Multifactor Dimensionality Reduction, SNPHarvester, and SNPRuler. It notes the challenges of reducing computational burden and detecting higher-order epistatic interactions.
Pathomics Based Biomarkers, Tools, and Methods (imgcommcall)
This document discusses pathomics-based biomarkers, tools, and methods for multi-scale integrative analysis in biomedical informatics. It summarizes several projects involving extracting quantitative features from pathology and radiology images using image segmentation and analysis techniques. These features are then linked to molecular data and clinical outcomes using statistical and machine learning methods to develop biomarkers. The tools and methods described aim to standardize and optimize feature extraction while accounting for uncertainties.
Microarray data noise simulation
1. Microarray data noise simulation
Despoina I. Kalfakakou
Interinstitutional postgraduate program
“Information Technologies in Medicine and Biology”
Course: Simulation methods in medicine and biology
Instructor: Dr G. Spyrou
2. Microarray data
• DNA microarray: a collection of tiny DNA spots attached to a surface.
• Used to estimate the expression of a large number of genes at the same time.
• The expression measurements are saved in a TSV file, with rows representing the genes and columns representing the samples.
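The genes-by-samples TSV layout described above can be parsed in a few lines of Python; this is a generic sketch (the toy table and the `read_expression` helper are illustrative, not from the deck):

```python
import io
import numpy as np

# Toy expression table in the layout the deck describes:
# rows are genes, columns are samples, tab-separated.
TSV = (
    "gene\tsample1\tsample2\tsample3\n"
    "g1\t1.2\t0.9\t1.1\n"
    "g2\t5.4\t5.1\t4.8\n"
)

def read_expression(handle):
    """Parse a TSV expression table into gene names, sample names,
    and a genes x samples matrix."""
    samples = handle.readline().rstrip("\n").split("\t")[1:]
    genes, values = [], []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        genes.append(fields[0])
        values.append([float(v) for v in fields[1:]])
    return genes, samples, np.array(values)

genes, samples, expr = read_expression(io.StringIO(TSV))
```

For a real file, the `io.StringIO` wrapper would simply be replaced by `open(path)`.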
3. Microarray data noise
• Biological noise:
 – Gene expression is a random and noisy process.
 – “Inner” noise: the result of the inherent stochasticity of biochemical processes such as transcription and translation.
 – “Outer” noise: variations in the quantities or conditions of other cellular components (e.g., proteins) that indirectly change the expression of a particular gene.
• Technical noise: artefacts.
5. Information extraction from real data {1/2}
• Real data consist of 30 breast tissue samples in 2 states: 20 healthy tissue and 10 tumour tissue.
• 20368 gene expression measurements per sample.
• Data are already normalized.
6. Information extraction from real data {2/2}
1. Per-gene study of the mean values and standard deviations of gene expression for each state.
2. Significance Analysis for the discovery of differentially expressed genes (non-parametric t-test).
3. Construction of the covariance matrix of the significant genes.
4. SVM training using the significant gene expressions.
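Steps 1 and 3 of this pipeline can be sketched as follows; the toy matrix, the sample split, and the choice of "significant" genes are stand-in assumptions, not the deck's actual data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for the deck's data: 200 genes x 30 samples,
# first 20 samples "healthy", last 10 "tumour".
expr = rng.normal(0.0, 1.0, size=(200, 30))
healthy, tumour = expr[:, :20], expr[:, 20:]

# Step 1: per-gene mean and standard deviation for each state.
stats_per_state = {
    "healthy": (healthy.mean(axis=1), healthy.std(axis=1, ddof=1)),
    "tumour": (tumour.mean(axis=1), tumour.std(axis=1, ddof=1)),
}

# Step 3: covariance matrix of a set of "significant" genes
# (here simply the first 10 rows, standing in for the SAM hits).
sig_genes = expr[:10]
cov = np.cov(sig_genes)      # 10 x 10 covariance matrix, rows = genes
```

`np.cov` treats each row as one variable, matching the genes-as-rows convention of the expression matrix.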
7. Simulation Model
• Idea: simulate an “ideal”, noiseless distribution, then apply different noise models to it.
• Final distribution for gene i:
 xi = ai + ni , where ai is the noiseless component and ni is the noise.
8. Ideal Distribution
• Gene i not significant: normal distribution whose mean and standard deviation equal the corresponding real-data values.
• Gene i significant: multivariate normal distribution with parameters: the vector of real-data mean values for the given state and the covariance matrix Σ of the correlated significant genes, where:
 – Σ[i,j] equals the covariance of genes i and j, if correlated,
 – Σ[i,j] = 0, if not correlated, and
 – Σ[i,j] = var(i), if i = j.
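Sampling from the ideal distribution described above might look like this; the dimensions, means, and covariance values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy parameters: 3 correlated "significant" genes and
# 2 independent non-significant genes, 1000 simulated samples.
mu_sig = np.array([2.0, 3.0, 1.0])
sigma = np.array([[1.0, 0.6, 0.0],      # Sigma as defined on the slide:
                  [0.6, 1.0, 0.0],      # covariances off-diagonal,
                  [0.0, 0.0, 0.5]])     # variances on the diagonal
mu_ns = np.array([0.5, 1.5])
sd_ns = np.array([0.2, 0.4])

# Significant genes: one multivariate normal draw per sample.
sig = rng.multivariate_normal(mu_sig, sigma, size=1000)
# Non-significant genes: independent univariate normal draws.
nonsig = rng.normal(mu_ns, sd_ns, size=(1000, 2))
ideal = np.hstack([sig, nonsig])        # samples x genes
```

With enough samples the empirical covariance of the correlated genes recovers the entries of Σ.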
9. Noise {1/3}
• The behavior of the data is studied by adding known noise models:
 – Uniform noise
 – Gaussian noise
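Applying the additive model xi = ai + ni with these two noise models could look like the following sketch; the noise amplitudes are assumptions, since the slide's exact parameters are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
# Noiseless stand-in for the ideal distribution a_i: 100 genes x 30 samples.
ideal = rng.normal(5.0, 1.0, size=(100, 30))

# Uniform noise on [-0.5, 0.5) and zero-mean Gaussian noise (sd = 0.3);
# both amplitudes are illustrative choices.
uniform_noise = rng.uniform(-0.5, 0.5, size=ideal.shape)
gaussian_noise = rng.normal(0.0, 0.3, size=ideal.shape)

x_uniform = ideal + uniform_noise      # x = a + n, uniform model
x_gaussian = ideal + gaussian_noise    # x = a + n, Gaussian model
```

Because the noise is additive and independent of the signal, the two noisy matrices keep the shape and mean structure of the ideal one.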
12. Evaluation
– Significance Analysis for the discovery of differentially expressed genes.
– Use of the differentially expressed genes as test data for the SVM classifier trained on the real data.
13. Real Data Significance Analysis
SAM tool (Significance Analysis of Microarrays).
Upregulated: 70
Downregulated: 236
14. SVM training using real data
• Linear kernel.
• 10-fold cross validation.
• Confusion matrix (rows: true class, columns: predicted class):

              Predicted
  True        Normal   Diseased
  Normal          19          1
  Diseased         2          8

• Accuracy: 90% (27 of 30 samples classified correctly).
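A setup analogous to the one above (linear-kernel SVM, 10-fold cross-validation, confusion matrix) can be reproduced on synthetic data with scikit-learn; the class sizes mirror the deck's 20 normal vs 10 diseased samples, but the features and the degree of separation are assumptions:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, accuracy_score

rng = np.random.default_rng(0)
# Toy stand-in: 30 samples x 10 "significant gene" features,
# with a mean shift so the diseased class is separable.
X = rng.normal(0.0, 1.0, size=(30, 10))
y = np.array([0] * 20 + [1] * 10)        # 0 = normal, 1 = diseased
X[y == 1] += 2.0

clf = SVC(kernel="linear")
pred = cross_val_predict(clf, X, y, cv=10)   # 10-fold cross-validation
cm = confusion_matrix(y, pred)               # rows: true, cols: predicted
acc = accuracy_score(y, pred)
```

`cross_val_predict` with integer labels uses stratified folds, so each fold keeps the 2:1 normal-to-diseased ratio.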
23. Future applications {2/2}
• From ConsensusClusterPlus: the real data can be divided into 4 categories.
• Noise simulation taking these 4 categories into account.
24. References {1/2}
• “Novel markers for differentiation of lobular and ductal invasive breast carcinomas by laser microdissection and microarray analysis”, Turashvili et al., BMC Cancer, 2007.
• “Using Gene Expression Noise to Understand Gene Regulation”, Munsky et al., Science, Vol. 336.
• “Simulating Correlated Multivariate Normal Data”, Alison Kosel, 2009.
• “Interplay between gene expression noise and regulatory network architecture”, Chalancon et al., Trends in Genetics, Vol. 28.
• “Models of stochastic gene expression”, Paulsson et al., Physics of Life Reviews 2 (2005).
• “Intrinsic and extrinsic contributions to stochasticity in gene expression”, Swain et al., PNAS, Vol. 99.
25. References {2/2}
• “Intrinsic noise in gene regulatory networks”, Mukund Thattai and Alexander van Oudenaarden, PNAS, Vol. 98.
• “Making sense of microarray data distributions”, Hoyle et al., Bioinformatics, Vol. 18.
• “A Flexible Microarray Data Simulation Model”, Doulaye Dembele, Microarrays, Vol. 2.
• “Simulation of microarray data with realistic characteristics”, Nykter et al., BMC Bioinformatics, 2006.
• http://statweb.stanford.edu/~tibs/SAM/
• “ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking”, Wilkerson et al., Bioinformatics, 2010.