This document summarizes key topics in developing and validating predictive classifiers based on gene expression profiling. It covers the importance of clear study objectives, feature selection methods, choice of model type, and proper evaluation of classifiers by cross-validation to estimate prediction accuracy rather than relying on fit to the training data. Complex feature selection and model fitting are unlikely to help with high-dimensional genomic data; simple classification methods such as linear discriminant analysis often perform best.
Genome-wide association studies (GWAS) have been providing valuable insight into the genetics of common and complex diseases for many years. In this webcast we will walk through one possible workflow for completing GWAS in Golden Helix SNP & Variation Suite (SVS), with special attention paid to adjusting the analysis for population stratification.
Decision Support for Environmental Management of a Chromium Plume at Los Alamos (Velimir "Monty" Vesselinov)
Vesselinov, V.V., Katzman, D., Broxton, D., Birdsell, K., Reneau, S., Vaniman, D., Longmire, P., Fabryka-Martin, J., Heikoop, J., Ding, M., Hickmott, D., Jacobs, E., Goering, T., Harp, D., Mishra, P., Data and Model-Driven Decision Support for Environmental Management of a Chromium Plume at Los Alamos National Laboratory (LANL), Waste Management Symposium 2013, Session 109: ER Challenges: Alternative Approaches for Achieving End State, Phoenix, AZ, February 28, 2013.
A survey of random forest based methods for intrusion detection systems (Nikhil Sharma)
This document summarizes a research paper that surveys random forest based methods for intrusion detection systems. It begins with an introduction describing the increasing threats to information security with growing network and data usage. It then reviews 35 papers applying random forest techniques to intrusion detection and compares their approaches. These include using random forest for classification, feature selection, and clustering. The document concludes that while random forest methods generally perform well on imbalanced data like intrusion detection, open challenges remain around high data throughput, unlabeled data, and limited benchmark datasets.
Data Mining - Classification of Breast Cancer Dataset using Decision Tree Ind... (Sunil Nair)
The document summarizes research on classifying breast cancer datasets using decision trees. The researchers used the Wisconsin breast cancer dataset, containing 699 instances with 10 attributes plus a class attribute. They preprocessed the data to handle missing values, compared various classification methods, and achieved the best accuracy, 97%, using decision trees with attribute selection. Issues addressed included unbalanced classes, and future work proposed methods such as clustering and multiple classifiers to further improve accuracy.
The document discusses research methodology and sampling techniques. It begins by outlining the sampling design process, which includes defining the target population, determining the sampling frame, selecting a sampling technique, determining sample size, and executing the sampling. It then covers various probability and non-probability sampling techniques such as simple random sampling, stratified sampling, and convenience sampling. It provides examples and definitions for each technique. The document concludes by discussing considerations for choosing between probability and non-probability sampling.
Gene expression in eukaryotes is regulated through various mechanisms including histone modification, DNA methylation, enhancers, and combinations of regulatory proteins. Gene regulation occurs through positive and negative elements that increase or decrease expression. Key examples are the lac and tryptophan operons in prokaryotes, which are regulated by repressors and inducers to control metabolic pathways in response to environmental conditions.
This document discusses statistical tools used in quality control laboratories and validation studies, including normal distributions, variance, ranges, coefficients of variation, F-tests, Student's t-tests, and paired t-tests. It provides the formulas and procedures for calculating and applying these statistical concepts to analyze laboratory data and test for significant differences between samples. Examples are given to demonstrate how to perform t-tests to compare averages from independent and paired samples with both known and unknown variances.
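To make the comparisons above concrete, here is a minimal sketch (not taken from the document) of the F-test and the two t-test variants it describes, using SciPy; the lab names and measurement values are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
lab_a = rng.normal(10.0, 0.5, size=12)  # hypothetical replicates from lab A
lab_b = rng.normal(10.3, 0.5, size=12)  # hypothetical replicates from lab B

# F-test: ratio of sample variances, two-sided p-value.
f = np.var(lab_a, ddof=1) / np.var(lab_b, ddof=1)
p_f = 2 * min(stats.f.cdf(f, 11, 11), stats.f.sf(f, 11, 11))

# Independent two-sample t-test (equal variances assumed).
t_ind, p_ind = stats.ttest_ind(lab_a, lab_b)

# Paired t-test, e.g. the same samples measured by two methods.
t_pair, p_pair = stats.ttest_rel(lab_a, lab_b)

print(f"F: p={p_f:.3f}  independent t: p={p_ind:.3f}  paired t: p={p_pair:.3f}")
```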
1) The document discusses high throughput data analysis techniques including microarrays and next generation sequencing. It provides an overview of microarray experiments, data structure, and analysis methods such as clustering, classification, and gene selection.
2) Specific applications discussed include using penalized logistic regression to classify malaria subtypes and discovering subtype-specific transcripts in breast cancer subtypes from RNA-seq data.
3) The document emphasizes that statistics and bioinformatics play important roles in developing personalized medicine and that big data in healthcare provides many opportunities for new discoveries.
Journal club slides for "Detection of structural DNA variation from next generation sequencing data: a review of informatic approaches" and a description of the software pipeline digit
Predictive Analytics of Cell Types Using Single Cell Gene Expression Profiles (Ali Al Hamadani)
Built a domain-independent predictive analysis pipeline in R for cell type prediction, applying a range of predictive analytics models and machine learning techniques.
This document discusses the Genome in a Bottle Consortium's efforts to develop reference materials and standards to validate next generation sequencing assays. It provides an overview of the consortium's goals to generate reference genomes with highly confident variant calls and accompanying data to allow labs to compare results and assess false positives and false negatives. The document describes some examples of how labs are using the consortium's data on the NA12878 genome to benchmark sequencing platforms and bioinformatics workflows.
This document discusses several common problems with data handling and quality including building and testing models with the same data, confusion between biological and technical replicates, and identification and handling of outliers. It provides examples and explanations of key concepts such as experimental and sampling units, pseudo-replication, outliers versus high influence points, and leverage plots. The importance of proper data handling techniques like dividing data into training, test, and confirmation sets and using cross-validation is emphasized to avoid overfitting models and generating spurious findings.
This document evaluates several supervised machine learning algorithms for classifying gene expression data from microarray experiments. It describes analyzing two gene expression datasets, the leukemia and DLBCL datasets, using k-nearest neighbors, naive Bayes, decision trees, and support vector machines with and without feature selection. The results show that support vector machines achieved the best performance overall, and that feature selection improved the accuracy of all the algorithms.
Gene Expression - Microarrays discusses analyzing gene expression data from microarray experiments. It describes the basic workflow including experimental design, sample preparation, hybridization, image analysis, preprocessing, normalization, and statistical analysis. Key points are that microarrays allow measuring expression of thousands of genes simultaneously, and proper experimental design and data analysis are important to draw meaningful biological conclusions from microarray data.
Enhance Genomic Research with Polygenic Risk Score Calculations in SVS (Golden Helix)
Golden Helix’s SNP & Variation Suite (SVS) has been used by researchers around the world to do trait analysis and association testing on large cohorts of samples in both humans and other species. The latest SVS release introduces a significant leap in capabilities, with a focus on advanced Polygenic Risk Score (PRS) calculations. PRS has become a fundamental tool in genomic research, enabling the identification of correlations between genotypic variants and phenotypes across large populations.
This enhancement is particularly relevant for researchers working on large cohorts and meta-analyses. Please join us as we explore:
-SVS Workflow Review: A review of the extensive capabilities of SVS for deriving meaningful insights from large cohorts and association test result datasets
-Computing Polygenic Risk Scores: An overview of the PRS capabilities in SVS, including Clumping and Thresholding and creation of multiple PRS models
-Evaluating and Applying PRS: Evaluating PRS models in-sample and out-of-sample and applying PRS models to perform trait prediction
-Future Implications: Brief exploration of how these advancements in SVS could influence future genomic research.
This webcast will explore how SVS facilitates the creation of multiple PRS models from large-scale genomic data, such as those obtained from extensive cohort studies or comprehensive meta-analyses. Join us to discover how these latest updates in SVS are supporting large-scale genomic research.
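Clumping and thresholding, mentioned in the agenda above, can be sketched generically. The following is a minimal illustration of the idea, not SVS code; the function names, the dosage-matrix layout, and the use of a sample correlation matrix as a stand-in for an LD reference panel are all assumptions.

```python
import numpy as np

def clump_and_threshold(genos, betas, pvals, r2_max=0.1, p_max=1e-3):
    """genos: (n_samples, n_snps) dosages; betas/pvals: GWAS summary stats."""
    order = np.argsort(pvals)                # most significant variants first
    corr = np.corrcoef(genos, rowvar=False)  # stand-in for an LD panel
    kept, removed = [], set()
    for j in order:
        if pvals[j] > p_max or j in removed:
            continue                         # thresholding / already clumped
        kept.append(j)                       # index variant of a new clump
        removed |= {k for k in range(genos.shape[1])
                    if k != j and corr[j, k] ** 2 > r2_max}
    return kept

def polygenic_risk_score(genos, betas, kept):
    # PRS = weighted sum of allele dosages over the surviving variants
    return genos[:, kept] @ np.asarray(betas)[kept]
```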
Design of an Intelligent System for Improving Classification of Cancer Diseases (Mohamed Loey)
Methodologies based on gene expression profiles have been able to detect cancer since the technique's inception, and previous work has devoted great effort to reaching the best possible results. Some researchers have achieved excellent results in classifying cancer from gene expression profiles using different gene selection approaches and different classifiers.
Early detection of cancer increases the probability of recovery. This thesis presents an intelligent decision support system (IDSS) for early diagnosis of cancer based on microarray gene expression profiles. The difficulty with such datasets is the small number of examples (at most a few hundred) relative to the large number of genes (in the thousands), so it is necessary to remove features (genes) that are not relevant to the investigated disease in order to avoid overfitting. The proposed methodology uses information gain (IG) to select the most important features from the input patterns; the selected features (genes) are then reduced by applying the Gray Wolf Optimization algorithm (GWO), and finally a support vector machine (SVM) classifies the cancer type. The methodology was applied to three datasets (breast, colon, and CNS) and evaluated by classification accuracy, the performance measure that matters most in disease diagnosis. The best results were obtained by integrating IG with GWO and SVM: accuracy improved to 96.67% and the number of features was reduced to 32 on the CNS dataset.
This thesis investigates several classification algorithms and their suitability for the biological domain. For applications that suffer from high dimensionality, different feature selection methods are considered for illustration and analysis, and an effective system is proposed. Experiments were conducted on three benchmark gene expression datasets, and the proposed system is assessed and compared with the performance reported in related work.
The document discusses various feature subset selection methods for gene expression datasets, which have a large number of attributes and small number of samples. It describes filter methods like rank-based and space search-based approaches, as well as wrapper and embedded methods. Rank-based filters calculate correlations like Pearson and mutual information scores to select relevant features. Space search filters evaluate feature subsets for relevancy and redundancy. The document also discusses unsupervised feature selection using maximal information coefficient and affinity propagation clustering. It provides an example of applying feature selection to breast cancer subtyping using consensus clustering across multiple datasets.
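As an illustration of the rank-based filters described above, here is a minimal sketch that scores each gene by Pearson correlation and by mutual information with the class label and keeps the top k. The function name and data shapes are assumptions; scikit-learn supplies the mutual-information estimator.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_filter(X, y, k=50):
    """X: (samples, genes) expression matrix; y: 0/1 class labels."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson correlation of each gene with the label.
    pearson = (Xc * yc[:, None]).sum(0) / (np.sqrt(
        (Xc ** 2).sum(0) * (yc ** 2).sum()) + 1e-12)
    # Mutual information between each gene and the label.
    mi = mutual_info_classif(X, y, random_state=0)
    return np.argsort(-np.abs(pearson))[:k], np.argsort(-mi)[:k]
```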
Metabolomics and Beyond: Challenges and Strategies for Next-gen Omic Analyses (Dmitry Grapov)
Dr. Dmitry Grapov gave a webinar on challenges and strategies for next-generation omics analyses. He discussed how large, longitudinal studies integrating multiple omics domains are needed to identify small biological effects. Data normalization strategies must be considered during experimental design to remove analytical batch effects. Quality control-based normalization using analytical replicates can estimate and remove analytical variance from large datasets. Integrating multiple measurement platforms is often required to identify systems of biological changes. Network-based analysis of omics data can help explain more phenotypic variance than single omics approaches alone. Dr. Grapov demonstrated software tools he developed for network analysis, visualization, and integration of multi-omics datasets.
Golden Helix’s SNP & Variation Suite (SVS) has been used by researchers around the world to do association testing and trait analysis on large cohorts of samples in both humans and other species. As sample sizes increase toward population-scale genomics, analysis methods must adapt to remain computable on an analysis workstation.
One of the most popular methods for determining population structure in SVS is Principal Component Analysis. In this webcast, we review the fundamentals of this methodology, as well as how we have advanced the state of the art by implementing a new “Large Data PCA” capability in SVS, handling over 10 times as many samples as previously possible at a fraction of the time. Join us as we cover:
A review of SVS association testing and trait analysis capabilities
Use of Principal Component Analysis to discern population structure
Scaling PCA beyond the limitations of computer hardware
Other SVS improvements based on ongoing feedback from the user community
SVS continues to move forward as a flexible and powerful tool to perform genotype and Large-N variant analysis. We hope you enjoy this webcast highlighting the exciting new features and select enhancements we have made.
This document provides an overview of a project to build a machine learning model to predict Parkinson's disease. It discusses the process of data cleaning, feature engineering, model building and evaluation using different classification techniques. Random forest was found to perform best with an accuracy of 97.2% at predicting Parkinson's disease status based on speech attributes. Key features identified were Delta3, MFCC3, MFCC9, MFCC8 and HNR05. Further improvements could include additional data and techniques like XGBoost.
Presentation at Advanced Intelligent Systems for Sustainable Development (AISSD 2021), 20-22 August 2021, organized by the Scientific Research Group in Egypt in collaboration with the Faculty of Computers and AI, Cairo University, and the Chinese University in Egypt
The document summarizes a machine learning project to predict Parkinson's disease. It discusses cleaning and exploring the data, which includes speech attribute data from 240 subjects. Feature importance analysis found attributes like Delta3 and MFCCs to be important. Various machine learning models were tested, with random forest performing best at 97.2% accuracy after cross-validation. The conclusion discusses further optimizing models and collecting more data. Lessons learned note challenges of limited labeled data and importance of domain knowledge.
The DREAM Challenge aims to catalyze interactions between experiment and theory in cellular network inference and quantitative modeling in systems biology. This document describes several DREAM projects and challenges, including the Network Topology and Parameter Inference Challenge, the DREAM-Phil Bowen ALS Prediction Prize4Life, the NCI-DREAM Drug Sensitivity Prediction Challenge, and the Sage Bionetworks - DREAM Breast Cancer Prognosis Challenge. The challenges involve using genomic and other biological data to build computational models that can infer networks, predict disease progression, predict drug responses, and predict breast cancer patient survival outcomes.
GIAB Benchmarks for SVs and Repeats for Stanford Genetics SV 200511 (GenomeInABottle)
This document provides an overview of the Genome in a Bottle (GIAB) Consortium's efforts to develop human genome reference materials and benchmarks for evaluating genome sequencing and variant calling. It summarizes the characterization of 7 human genomes, including developing variant calls, regions, and reference values. It also describes new efforts using linked and long reads to characterize structural variants and difficult genomic regions. The goal is to provide reference materials and benchmarks to help evaluate sequencing performance and accuracy across different technologies and algorithms.
This document summarizes the process used to benchmark large deletion calls from multiple sequencing technologies and bioinformatics pipelines. Researchers merged deletion calls from 14 datasets into regions and evaluated call size accuracy. Calls supported by two or more technologies were identified as draft benchmark calls. Sensitivity to these calls was calculated for each method. The results provide insight into strengths and weaknesses of different approaches to structural variant detection.
1. Topics in the Development and Validation of Gene Expression Profiling Based Predictive Classifiers
Richard Simon, D.Sc.
Chief, Biometric Research Branch
National Cancer Institute
linus.nci.nih.gov/brb
2. BRB Website
http://linus.nci.nih.gov/brb
• Powerpoint presentations and audio files
• Reprints & Technical Reports
• BRB-ArrayTools software
• BRB-ArrayTools Data Archive
• Sample Size Planning for Targeted Clinical Trials
3. Simplified Description of Microarray Assay
• Extract mRNA from cells of interest
– Each mRNA molecule was transcribed from a single gene and has a linear structure complementary to that gene
• Convert mRNA to cDNA, introducing a fluorescently labeled dye to each molecule
• Distribute the cDNA sample to a solid surface containing “probes” of DNA representing all “genes”; the probes are in known locations on the surface
• Let the molecules from the sample hybridize with the probes for the corresponding genes
• Remove excess sample and illuminate the surface with a laser with frequency corresponding to the dye
• Measure intensity of fluorescence over each probe
4. Resulting Data
• Intensity over a probe is approximately proportional to the abundance of mRNA molecules in the sample for the gene corresponding to the probe
• 40,000 variables measured for each case
– Excessive hype
– Excessive skepticism
– Some familiar statistical paradigms don’t work well
5. Good Microarray Studies Have Clear Objectives
• Class Comparison (Gene Finding)
– Find genes whose expression differs among predetermined classes, e.g. tissue or experimental condition
• Class Prediction
– Prediction of a predetermined class (e.g. treatment outcome) using information from the gene expression profile
– Survival risk-group prediction
• Class Discovery
– Discover clusters of specimens having similar expression profiles
6. Class Comparison and Class Prediction
• Not clustering problems
• Supervised methods
7. Class Prediction ≠ Class Comparison
• A set of genes is not a predictive model
• Emphasis in class comparison is often on understanding biological mechanisms
– More difficult than accurate prediction and usually requires a different experiment
• Demonstrating statistical significance of prognostic factors is not the same as demonstrating predictive accuracy
8. Components of Class Prediction
• Feature (gene) selection
– Which genes will be included in the model
• Select model type
– E.g. diagonal linear discriminant analysis, nearest-neighbor, …
• Fitting parameters (regression coefficients) for the model
– Selecting values of tuning parameters
9. Feature Selection
• Genes that are differentially expressed among the classes at a significance level α (e.g. 0.01)
– The α level is a tuning parameter
– The number of false discoveries is not of direct relevance for prediction
• For prediction it is usually more serious to exclude an informative variable than to include some noise variables
11. Optimal significance level cutoffs for gene selection (50 differentially expressed genes out of 22,000 genes on the microarrays):

2δ/σ    n=10    n=30    n=50
1       0.167   0.003   0.00068
1.25    0.085   0.0011  0.00035
1.5     0.045   0.00063 0.00016
1.75    0.026   0.00036 0.00006
2       0.015   0.0002  0.00002
12. Complex Gene Selection
• Small subset of genes which together give most accurate predictions
– Genetic algorithms
• Little evidence that complex feature selection is useful in microarray problems
13. Linear Classifiers for Two Classes

$$l(x) = \sum_{i \in F} w_i x_i$$

where
• $x$ = vector of log ratios or log signals
• $F$ = features (genes) included in the model
• $w_i$ = weight for the i-th feature
• decision boundary: $l(x) > d$ or $l(x) < d$
14. Linear Classifiers for Two Classes
• Fisher linear discriminant analysis
• Diagonal linear discriminant analysis (DLDA)
– Ignores correlations among genes
• Compound covariate predictor
• Golub’s weighted voting method
• Support vector machines with inner product kernel
• Perceptrons
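Diagonal linear discriminant analysis is simple enough to sketch in full. This is a generic illustration of the form l(x) = Σ w_i x_i under a diagonal covariance assumption, not the BRB-ArrayTools implementation; equal class priors are assumed.

```python
import numpy as np

class DLDA:
    """Diagonal LDA: a linear classifier that ignores gene-gene correlations."""
    def fit(self, X, y):
        m0, m1 = X[y == 0].mean(0), X[y == 1].mean(0)
        v = (X[y == 0].var(0, ddof=1) + X[y == 1].var(0, ddof=1)) / 2
        self.w = (m1 - m0) / v            # per-gene weights w_i
        self.d = self.w @ (m0 + m1) / 2   # threshold d at the class midpoint
        return self

    def predict(self, X):
        return (X @ self.w > self.d).astype(int)  # 1 if l(x) > d, else 0
```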
15. When p >> n
• It is always possible to find a set of features and a weight vector for which the classification error on the training set is zero.
• There is generally not sufficient information in p >> n training sets to effectively use more complex methods
16. Myth
• Complex classification algorithms such as neural networks perform better than simpler methods for class prediction.
17. • Comparative studies have shown that simpler methods work as well or better for microarray problems because they avoid overfitting the data.
19. Evaluating a Classifier
• Most statistical methods were not developed for p >> n prediction problems
• Fit of a model to the same data used to develop it is no evidence of prediction accuracy for independent data
• Demonstrating statistical significance of prognostic factors is not the same as demonstrating predictive accuracy
• Testing whether analysis of independent data results in selection of the same set of genes is not an appropriate test of predictive accuracy of a classifier
22. Internal Validation of a Classifier
• Re-substitution estimate
– Develop classifier on a dataset, test predictions on the same data
– Very biased for p >> n
• Split-sample validation
• Cross-validation
23. Split-Sample Evaluation
• Training set
– Used to select features, select model type, and determine parameters and cut-off thresholds
• Test set
– Withheld until a single model is fully specified using the training set.
– The fully specified model is applied to the expression profiles in the test set to predict class labels.
– The number of errors is counted
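A minimal sketch of split-sample evaluation (an illustration, with a hypothetical `build` callback standing in for the entire training procedure): every modeling choice happens on the training half, and the held-out half is used exactly once.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_sample_error(X, y, build):
    """`build(X, y)` runs feature selection + fitting, returns a predict fn."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.33, stratify=y, random_state=0)
    predict = build(X_tr, y_tr)            # fully specify model on training set
    return np.mean(predict(X_te) != y_te)  # count errors on the test set once
```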
24. Leave-One-Out Cross-Validation
• Omit sample 1
– Develop a multivariate classifier from scratch on the training set with sample 1 omitted
– Predict the class for sample 1 and record whether the prediction is correct
25. Leave-One-Out Cross-Validation
• Repeat the analysis for training sets with each single sample omitted one at a time
• e = number of misclassifications determined by cross-validation
• Subdivide e for estimation of sensitivity and specificity
26. • With proper cross-validation, the model must be developed from scratch for each leave-one-out training set. This means that feature selection must be repeated for each leave-one-out training set.
– Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the analysis of DNA microarray data. Journal of the National Cancer Institute 95:14-18, 2003.
• The cross-validated estimate of misclassification error is an estimate of the prediction error for the model obtained by applying the specified algorithm to the full dataset
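Putting the rule above into code, here is a minimal sketch of proper LOOCV that reuses the `select_genes` and `DLDA` sketches from earlier; it assumes at least one gene passes the filter in every leave-one-out training set.

```python
from sklearn.model_selection import LeaveOneOut

def loocv_error(X, y, alpha=0.01):
    errors = 0
    for train, test in LeaveOneOut().split(X):
        genes = select_genes(X[train], y[train], alpha)  # selection redone per fold
        clf = DLDA().fit(X[train][:, genes], y[train])   # model refit per fold
        errors += int(clf.predict(X[test][:, genes])[0] != y[test][0])
    return errors / len(y)  # cross-validated misclassification rate e/n
```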
27. Prediction on Simulated Null Data
Generation of gene expression profiles
• 14 specimens (Pi is the expression profile for specimen i)
• Log-ratio measurements on 6000 genes
• Pi ~ MVN(0, I6000)
• Can we distinguish between the first 7 specimens (Class 1) and the last 7 (Class 2)?
Prediction method
• Compound covariate prediction
• Compound covariate built from the log-ratios of the 10 most differentially expressed genes.
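The same experiment is easy to reproduce in miniature with the sketches above (using the α-threshold selector and DLDA rather than the slide's 10-gene compound covariate): resubstitution error on pure noise comes out near zero, while proper LOOCV error hovers near 50%, as it should.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((14, 6000))  # 14 specimens, 6000 null "genes"
y = np.array([0] * 7 + [1] * 7)      # arbitrary 7-vs-7 class labels

genes = select_genes(X, y)           # selected on the full dataset
clf = DLDA().fit(X[:, genes], y)
resub = np.mean(clf.predict(X[:, genes]) != y)  # misleadingly small

print(f"resubstitution error = {resub:.2f}, LOOCV error = {loocv_error(X, y):.2f}")
```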
30. Major Flaws Found in 40 Studies Published in 2004
• Inadequate control of multiple comparisons in gene finding
– 9/23 studies had unclear or inadequate methods to deal with false positives
• 10,000 genes x .05 significance level = 500 false positives
• Misleading report of prediction accuracy
– 12/28 reports based on incomplete cross-validation
• Misleading use of cluster analysis
– 13/28 studies invalidly claimed that expression clusters based on differentially expressed genes could help distinguish clinical outcomes
• 50% of studies contained one or more major flaws
31. Myth
• Split-sample validation is superior to LOOCV or 10-fold CV for estimating prediction error
33. Comparison of Internal Validation Methods (Molinaro, Pfeiffer & Simon)
• For small sample sizes, LOOCV is much less biased than split-sample validation
• For small sample sizes, LOOCV is preferable to 10-fold, 5-fold cross-validation or repeated k-fold versions
• For moderate sample sizes, 10-fold is preferable to LOOCV
• Some claims for bootstrap resampling for estimating prediction error are not valid for p >> n problems
39. • Ordinary bootstrap
– Training and test sets overlap
• Bootstrap cross-validation (Fu, Carroll, Wang)
– Perform LOOCV on bootstrap samples
– Training and test sets overlap
• Leave-one-out bootstrap
– Predict for cases not in the bootstrap sample
– Training sets are too small
• Out-of-bag bootstrap (Breiman)
– Predict for case i by majority rule over predictions from bootstrap samples not containing case i
• .632+ bootstrap
– w·LOOBS + (1−w)·RSB, a weighted combination of the leave-one-out bootstrap and re-substitution estimates
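As one concrete instance, here is a minimal sketch of the out-of-bag idea for binary labels, with the same hypothetical `build` callback as before: each case is predicted by majority rule over the bootstrap models whose samples excluded it.

```python
import numpy as np

def oob_error(X, y, build, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    votes = np.zeros((n, 2))                   # per-case votes for class 0/1
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)       # bootstrap sample (with repeats)
        oob = np.setdiff1d(np.arange(n), idx)  # cases left out of the sample
        predict = build(X[idx], y[idx])
        for i, label in zip(oob, predict(X[oob])):
            votes[i, label] += 1
    covered = votes.sum(1) > 0
    return np.mean(votes[covered].argmax(1) != y[covered])  # majority-rule error
```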
43. Permutation Distribution of Cross-Validated Misclassification Rate of a Multivariate Classifier
• Randomly permute the class labels and repeat the entire cross-validation
• Re-do for all (or 1000) random permutations of the class labels
• The permutation p-value is the fraction of random permutations that gave as few misclassifications as e in the real data
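A minimal sketch of this permutation test, built on the `loocv_error` sketch above (slow, since the whole cross-validation is repeated for every permutation):

```python
import numpy as np

def permutation_pvalue(X, y, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    e_obs = loocv_error(X, y)  # CV error on the real labels
    e_null = [loocv_error(X, rng.permutation(y)) for _ in range(n_perm)]
    # fraction of permutations with as few misclassifications as the real data
    return float(np.mean([e <= e_obs for e in e_null]))
```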
46. Does an Expression Profile Classifier Predict More Accurately Than Standard Prognostic Variables?
• Not an issue of which variables are significant after adjusting for which others or which are independent predictors
– Predictive accuracy, not significance
• The two classifiers can be compared by ROC analysis as functions of the threshold for classification
• The predictiveness of the expression profile classifier can be evaluated within levels of the classifier based on standard prognostic variables
47. Does an Expression Profile Classifier Predict More Accurately Than Standard Prognostic Variables?
• Some publications fit a logistic model to the standard covariates and the cross-validated predictions of the expression profile classifier:

$$\operatorname{logit} p(y_i \mid \hat{x}_{(-i)}, z_i) = \alpha + \beta\,\hat{x}_{(-i)} + \gamma z_i$$

where $\hat{x}_{(-i)}$ is the cross-validated prediction for case i and $z_i$ the standard covariates
• This is valid only with split-sample analysis, because the cross-validated predictions are not independent
48. Survival Risk Group Prediction
• For analyzing right-censored data to develop predictive classifiers it is not necessary to make the data binary
• Can do cross-validation to predict a high- or low-risk group for each case
• Compute Kaplan-Meier curves of the predicted risk groups
• Permutation significance of the log-rank statistic
• Implemented in BRB-ArrayTools
• BRB-ArrayTools also provides for comparing the risk group classifier based on expression profiles to one based on standard covariates and one based on a combination of both types of variables
49. Myth
• Huge sample sizes are needed to develop effective predictive classifiers
50. Sample Size Planning References
• K Dobbin, R Simon. Sample size determination in microarray experiments for class comparison and prognostic classification. Biostatistics 6:27-38, 2005
• K Dobbin, R Simon. Sample size planning for developing classifiers using high dimensional DNA microarray data. Biostatistics, 2007
51. Sample Size Planning for Classifier Development
• The expected value (over training sets) of the probability of correct classification PCC(n) should be within γ of the maximum achievable PCC(∞)
52. Probability Model
• Two classes
• Log expression or log ratio MVN in each class with common covariance matrix
• m differentially expressed genes
• p − m noise genes
• Expression of differentially expressed genes is independent of expression for noise genes
• All differentially expressed genes have the same inter-class mean difference 2δ
• Common variance for differentially expressed genes and for noise genes
53. Classifier
• Feature selection based on univariate t-tests for differential expression at significance level α
• Simple linear classifier with equal weights (except for sign) for all selected genes. Power for selecting each of the informative genes that are differentially expressed by mean difference 2δ is 1 − β(n)
54. • For 2 classes of equal prevalence, let $\lambda_1$ denote the largest eigenvalue of the covariance matrix of the informative genes. Then

$$\mathrm{PCC}(\infty) \le \Phi\!\left(\frac{\sqrt{m}\,\delta}{\sigma\sqrt{\lambda_1}}\right)$$
55. With feature selection at level α and power 1 − β for the informative genes,

$$\mathrm{PCC}(n) \ge \Phi\!\left(\frac{m(1-\beta)\,\delta}{\sigma\sqrt{\lambda_1}\,\sqrt{m(1-\beta)+(p-m)\alpha}}\right)$$
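Evaluating the bound numerically is straightforward; the following sketch uses SciPy's normal CDF for Φ, and the parameter values are purely illustrative.

```python
from math import sqrt
from scipy.stats import norm

def pcc_lower_bound(m, p, delta, sigma, lam1, alpha, beta):
    """m informative genes among p; mean difference 2*delta; power 1 - beta."""
    selected = m * (1 - beta) + (p - m) * alpha  # expected genes selected
    return norm.cdf(m * (1 - beta) * delta / (sigma * sqrt(lam1 * selected)))

print(pcc_lower_bound(m=50, p=22000, delta=0.5, sigma=1.0,
                      lam1=1.0, alpha=0.001, beta=0.1))
```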
56. 1.0 1.2 1.4 1.6 1.8 2.0
406080100
2 delta/sigma
Samplesize
gamma=0.05
gamma=0.10
Sample size as a function of effect size (log-base 2 fold-change between classes divided by
standard deviation). Two different tolerances shown, . Each class is equally represented in the
population. 22000 genes on an array.
58. [Figure: PCC(60) versus the proportion in the under-represented class; x-axis: 0.1 to 0.5; y-axis: PCC(60) from 0.75 to 0.85]
b) PCC(60) as a function of the proportion in the under-represented class. Parameter settings are the same as in a), with 10 differentially expressed genes among 22,000 total genes. If the proportion in the under-represented class is small (e.g., < 20%), PCC(60) can decline significantly.
61. Acknowledgements
• Kevin Dobbin
• Alain Dupuy
• Wenyu Jiang
• Annette Molinaro
• Ruth Pfeiffer
• Michael Radmacher
• Joanna Shih
• Yingdong Zhao
• BRB-ArrayTools Development Team
62. BRB-ArrayTools
• Contains analysis tools that I have selected as valid and useful
• Analysis wizard and multiple help screens for biomedical scientists
• Imports data from all platforms and major databases
• Automated import of data from NCBI Gene Expression Omnibus
63. Predictive Classifiers in BRB-ArrayTools
• Classifiers
– Diagonal linear discriminant
– Compound covariate
– Bayesian compound covariate
– Support vector machine with inner product kernel
– K-nearest neighbor
– Nearest centroid
– Shrunken centroid (PAM)
– Random forest
– Tree of binary classifiers for k classes
• Survival risk-group
– Supervised PCs
• Feature selection options
– Univariate t/F statistic
– Hierarchical variance option
– Restricted by fold effect
– Univariate classification power
– Recursive feature elimination
– Top-scoring pairs
• Validation methods
– Split-sample
– LOOCV
– Repeated k-fold CV
– .632+ bootstrap
64. Selected Features of BRB-ArrayTools
• Multivariate permutation tests for class comparison to control the number and proportion of false discoveries with a specified confidence level
– Permits blocking by another variable, pairing of data, averaging of technical replicates
• SAM
– Fortran implementation 7X faster than R versions
• Extensive annotation for identified genes
– Internal annotation of NetAffx, Source, Gene Ontology, Pathway information
– Links to annotations in genomic databases
• Find genes correlated with a quantitative factor while controlling the number or proportion of false discoveries
• Find genes correlated with censored survival while controlling the number or proportion of false discoveries
• Analysis of variance
65. Selected Features of BRB-ArrayTools
• Gene set enrichment analysis
– Gene Ontology groups, signaling pathways, transcription factor targets, micro-RNA putative targets
– Automatic data download from the Broad Institute
– KS & LS test statistics for the null hypothesis that the gene set is not enriched
– Hotelling’s and Goeman’s global tests of the null hypothesis that no genes in the set are differentially expressed
– Goeman’s global test for survival data
• Class prediction
– Multiple classifiers
– Complete LOOCV, k-fold CV, repeated k-fold, .632 bootstrap
– Permutation significance of cross-validated error rate
66. Selected Features of BRB-ArrayTools
• Survival risk-group prediction
– Supervised principal components with and without clinical covariates
– Cross-validated Kaplan-Meier curves
– Permutation test of cross-validated KM curves
• Clustering tools for class discovery with reproducibility statistics on clusters
– Internal access to Eisen’s Cluster and TreeView
• Visualization tools including a rotating 3D principal components plot exportable to Powerpoint with rotation controls
• Extensible via the R plug-in feature
• Tutorials and datasets
67. BRB-ArrayTools
• Extensive built-in gene annotation and linkage to gene annotation websites
• Publicly available for non-commercial use
– http://linus.nci.nih.gov/brb