Deep learning based multi-omics integration, a survey - SOYEON KIM
This document summarizes three papers on deep learning approaches for analyzing omics data:
1) A study that used denoising autoencoders to extract features from breast cancer gene expression data and found the features were linked to clinical characteristics and survival outcomes.
2) A study that used stacked denoising autoencoders to classify breast cancer samples and identify predictive genes related to diagnosis.
3) A study that integrated gene expression and miRNA data with autoencoders to identify survival subgroups in liver cancer patients, which were validated in additional cohorts and found to activate distinct pathways.
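To make the third paper's integrate-then-cluster idea concrete, here is a minimal sketch in PyTorch, assuming invented matrix sizes and a generic two-layer autoencoder rather than the authors' actual architecture: gene expression and miRNA matrices are concatenated into one input, compressed to a low-dimensional embedding, and the embedding is clustered into candidate survival subgroups.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
expr = rng.normal(size=(360, 2000)).astype("float32")   # gene expression (invented sizes)
mirna = rng.normal(size=(360, 300)).astype("float32")   # miRNA expression
x = torch.from_numpy(np.hstack([expr, mirna]))          # early integration: concatenate omics

enc = nn.Sequential(nn.Linear(2300, 256), nn.ReLU(), nn.Linear(256, 32))
dec = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 2300))
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(50):                 # train to reconstruct the joint profile
    opt.zero_grad()
    z = enc(x)
    loss = loss_fn(dec(z), x)
    loss.backward()
    opt.step()

# cluster the learned embedding into candidate survival subgroups
groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(z.detach().numpy())
print(np.bincount(groups))
```

In the actual study the subgroup labels would then be tested against survival data (e.g. a log-rank test) and validated in the additional cohorts.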
Revealing disease-associated pathways by network integration of untargeted me... - SOYEON KIM
Soyeon Kim presented on a metabolomics study analyzing metabolites and correlating changes with disease states. Metabolomics can identify and quantify metabolites through targeted or untargeted methods. PIUMet is a network-based algorithm that integrates untargeted metabolomics data to identify novel disease-associated metabolites and proteins by linking disease features to metabolites and inferring subnetworks. Experiments show PIUMet can detect many disease features and identify dysregulated pathways in Huntington's disease.
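The core idea the summary describes, linking ambiguous m/z features to candidate metabolites and extracting a connected subnetwork, can be caricatured with networkx; the toy graph, the feature-to-candidate mapping, and the neighborhood heuristic below are all invented stand-ins for PIUMet's actual network optimization over a large protein-metabolite database.

```python
import networkx as nx

# toy protein-metabolite interaction network (edges are assumed, not real)
g = nx.Graph()
g.add_edges_from([
    ("MET_A", "PROT_1"), ("MET_B", "PROT_1"),
    ("PROT_1", "PROT_2"), ("MET_C", "PROT_3"),
])

# each untargeted m/z feature may match several candidate metabolites
feature_to_candidates = {551.4: ["MET_A", "MET_B"], 180.06: ["MET_C"]}

# keep candidates present in the network and take their immediate neighborhood,
# a crude stand-in for PIUMet's network-based disambiguation and subnetwork inference
prized = {m for cands in feature_to_candidates.values() for m in cands if m in g}
nodes = set().union(*[set(g.neighbors(m)) | {m} for m in prized]) if prized else set()
sub = g.subgraph(nodes)
print(sorted(sub.nodes()), sorted(sub.edges()))
```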
The Role of The Statisticians in Personalized Medicine: An Overview of Stati... - Setia Pramana
This document discusses the role of statisticians in personalized medicine and provides an overview of statistical methods used in bioinformatics. It describes how statisticians are involved in all stages of drug development from discovery through clinical trials. Personalized medicine aims to determine an individual's unique characteristics to select the best treatment. Advanced technologies like microarrays and next-generation sequencing generate large genomic datasets that require statistical analysis for applications like disease classification, biomarker discovery, and identifying disease subtypes and targeted therapies. The document outlines statistical methods used for tasks like microarray data analysis, RNA sequencing, and finding subtype-specific genes and transcripts.
Robust Pathway-based Multi-Omics Data Integration using Directed Random Walk ... - SOYEON KIM
17th Annual International Conference on Critical Assessment of Massive Data Analysis (CAMDA 2018)
Cancer Data Integration Challenge (http://camda.info/)
Molecular dynamics synchronised Manipulator system to repair Biomolecules - jayakarj
Because this system of instrumentation is designed for wave propagation in the helicity of eigen-rotational string-segments of T-branes in Hilbert space, its proposed detector units differ from those of conventional instruments with similar objectives: the detector observes eigen-rotational energy with or without a string-matter segment.
In this paradigm, bi-pyramidal brane (TBP-brane) structures are the constructs for atoms and molecules, analogous to fermions, while being eigen-rotational string-segments. TBP-branes are units of the string-matter continuum, and their dynamics can be observed precisely because the observer and the source TBP-branes are connected by string-matter. The observer thus induces parameterized signals to inject an appropriate impulse into the target source via the devices, producing 'environmentally guided molecular self-assembly' at the target. This is the core scientific principle of the instrumentation system. Here, holarchical discrete cyclic-time governs the synchronization of target-source manipulation, for which a knowledge-based (KB) fuzzy controller is proposed.
Operating this system of instrumentation, especially the Level-1 and Level-2 instruments, opens up a new line of R&D for analyzing the properties of existing chemical elements and molecules and discovering new ones, with the expectation of resolving the SAR and Levinthal paradoxes in biochemistry. It may therefore also have a substantial impact on new drug discovery and other therapeutic and investigative developments in healthcare.
This system of instrumentation provides three levels of applications:
Level - 1
In vitro study of tissue and cell specimens, manipulating the specimens' biomolecules with Level-1 instrumentation, provides data for establishing new principles in molecular biophysics. This will support restructuring the atomic model of the chemical elements, such that existing properties of the elements are preserved while new properties may be added.
Level - 2
At this level, in vivo animal trials are conducted to confirm the molecular-biophysics principles developed with the Level-1 instrumentation. Therapeutic modalities intended for humans are specified and tested in animals using the resulting molecular models and molecular-biophysics principles.
Level - 3
At this level, manipulation specifications fully and successfully tested in animals are adapted for therapeutic applications in humans.
For details, please visit: http://www.clustermatteruniverse.net
Systems genetics approaches to understand complex traits - SOYEON KIM
Systems genetics aims to understand complex traits by considering genetic variation, intermediate phenotypes like gene expression and metabolites, and their interactions across individuals. It links variations in molecules to clinical traits through correlation analysis and statistical modeling of interaction networks. While challenging, integrating multi-omics data through network approaches can provide a more comprehensive view of the molecular architecture underlying common diseases.
This document discusses mining complex relationships between microRNAs, transcription factors, and genes from heterogeneous data sources using causal inference approaches. Specifically, it describes a project that aims to infer regulatory relationships between microRNAs and mRNAs from multiple data sources including DNA sequences, gene expression data, and domain knowledge. It also discusses using causal inference methods like IDA to detect condition-specific regulatory relationships by analyzing samples split according to normal or cancer conditions.
The Expectation Maximization (EM) algorithm enables parameter estimation for probabilistic models with incomplete data. Probabilistic models like hidden Markov models and Bayesian networks are commonly used to model biological data, and their popularity stems from efficient procedures for learning parameters from observations. However, the data available for training these probabilistic models is often incomplete, with missing values occurring in areas like medical diagnosis with limited test results or gene expression clustering with intentionally omitted gene assignments.
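As a concrete instance of EM on incomplete data, here is a minimal sketch for a two-component univariate Gaussian mixture, where the unobserved component assignments play the role of the missing values; all data and starting parameters are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])  # observed data

# initial guesses for means, standard deviations, and mixing weights
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])
w = np.array([0.5, 0.5])

for _ in range(100):
    # E-step: posterior responsibility of each component for each point
    dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibility-weighted data
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    w = nk / len(x)

print(mu, sigma, w)   # converges near the true (-2, 3), (1, 1), (0.5, 0.5)
```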
This document evaluates several supervised machine learning algorithms for classifying gene expression data from microarray experiments. It describes analyzing two gene expression datasets, the leukemia and DLBCL datasets, using k-nearest neighbors, naive Bayes, decision trees, and support vector machines with and without feature selection. The results show that support vector machines achieved the best performance overall, and that feature selection improved the accuracy of all the algorithms.
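A hedged scikit-learn sketch of that comparison, with synthetic data standing in for the leukemia and DLBCL expression matrices and arbitrary choices of k and feature count:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# synthetic stand-in: 72 samples x 2000 genes, as in small expression studies
X, y = make_classification(n_samples=72, n_features=2000, n_informative=20, random_state=0)

models = {
    "kNN": KNeighborsClassifier(3),
    "NB": GaussianNB(),
    "tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(kernel="linear"),
}
for name, clf in models.items():
    for fs in (False, True):   # with and without a univariate feature-selection step
        steps = ([SelectKBest(f_classif, k=50)] if fs else []) + [clf]
        acc = cross_val_score(make_pipeline(*steps), X, y, cv=5).mean()
        print(f"{name:4s} feature_selection={fs}: {acc:.3f}")
```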
The DREAM Challenge aims to catalyze interactions between experiment and theory in cellular network inference and quantitative modeling in systems biology. This document describes several DREAM projects and challenges, including the Network Topology and Parameter Inference Challenge, the DREAM-Phil Bowen ALS Prediction Prize4Life, the NCI-DREAM Drug Sensitivity Prediction Challenge, and the Sage Bionetworks - DREAM Breast Cancer Prognosis Challenge. The challenges involve using genomic and other biological data to build computational models that can infer networks, predict disease progression, predict drug responses, and predict breast cancer patient survival outcomes.
This document discusses moving from structure-based drug discovery approaches to network-based approaches that consider biological complexity. It argues that current reductionist models cannot predict drug efficacy and effects since biological effects result from multiple perturbations of cellular networks, not just primary target interactions. The document then outlines key components of a network-based drug discovery approach, including constructing protein-protein interaction networks, identifying network-reachable proteins, and assessing information flow between proteins to better understand disease mechanisms and drug actions.
The document discusses computational methods for predicting protein structure, specifically homology modeling and threading/fold recognition. Homology modeling constructs a target protein structure using the amino acid sequence and experimental structure of a homologous protein as a template. Threading/fold recognition predicts a protein's structural fold by fitting its sequence to structures in a database and selecting the best fitting fold, either through an energy-based method or profile-based method. Both methods are limited as homology modeling relies on a template structure and threading/fold recognition may not find a match if the correct fold does not exist in the database.
Comparative modeling predicts the 3D structure of a target protein sequence based on its alignment to one or more template proteins of known structure. It consists of four main steps: fold assignment, alignment of the target and template sequences, building a model based on the alignment, and predicting errors in the model. Comparative modeling is often used to facilitate functional characterization of a protein when its experimental structure is unknown, as it can provide a useful 3D structural model for proteins related to known templates.
A Classification of Cancer Diagnostics based on Microarray Gene Expression Pr... - IJTET Journal
Abstract— Pattern recognition (PR) plays an important role in bioinformatics. PR is concerned with processing raw measurement data by computer to arrive at a prediction that can inform a decision. The important problems to which pattern recognition is applied have in common that they are too complex to model explicitly. Diverse PR methods are used to analyze, segment, and manage high-dimensional microarray gene data for classification. PR develops systems that learn to solve a given problem from a set of instances, each represented by a number of features. Microarray expression technologies make it possible to monitor the expression levels of thousands of genes simultaneously, and the large amount of data they generate has stimulated the development of computational methods for studying biological processes through gene expression profiling. Microarray gene expression profiling (MGEP) is important in bioinformatics, yielding high-dimensional data used in clinical applications such as cancer diagnostics and drug design. In this work, a new scheme is developed for classifying unknown malignant tumors into known classes. The scheme transforms the very high-dimensional microarray data into Mahalanobis space before classification. Its suitability is demonstrated on 10 commonly available cancer gene datasets, comprising both binary and multiclass datasets. To improve classification performance, a gene selection method is applied to the datasets as a preprocessing and data extraction step.
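The "transformation into Mahalanobis space" can be read as whitening the (gene-selected) data with the inverse covariance, so that Euclidean distance in the new space equals Mahalanobis distance in the old one; a minimal numpy sketch on invented data, with a small ridge term because the covariance of many genes over few samples is near-singular:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))   # samples x (already selected) genes, invented

mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])   # regularized covariance
L = np.linalg.cholesky(np.linalg.inv(cov))                  # inv(cov) = L @ L.T
X_maha = (X - mu) @ L    # whitened data: sample covariance becomes the identity

print(np.allclose(np.cov(X_maha, rowvar=False), np.eye(20), atol=1e-4))
```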
Diagnosis of Cancer using Fuzzy Rough Set Theory - IRJET Journal
This document presents a study that uses fuzzy rough set theory to diagnose cancer using medical data. It has four main modules: 1) feature selection to identify relevant features using fuzzy rough subset evaluation and particle swarm optimization, 2) instance selection to remove missing/noisy data, 3) classification of the data using fuzzy rough nearest neighbor algorithm, and 4) performance analysis using metrics like accuracy, sensitivity and AUC. The study aims to classify different cancer types by reducing noise and selecting optimal features/instances to improve classifier performance. It is found that fuzzy rough set approaches help preserve meaning during reduction and improve classification compared to other methods.
Reconstruction and analysis of cancer-specific Gene regulatory networks from G... - ijbbjournal
This document summarizes a study that reconstructed a cancer-specific gene regulatory network for prostate cancer from gene expression profiles. The researchers identified differentially expressed genes between cancer and normal tissue samples using statistical tests. They then computed correlations between gene pairs to identify regulatory relationships, focusing on highly correlated pairs. This resulted in a network of 29 genes and 55 regulatory relationships. The network was validated against biological databases and literature, and topological analysis identified some highly connected "hub" genes that may be potential drug targets.
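A toy version of that pipeline, differential expression by t-test, thresholded pairwise correlation for edges, and degree-based hub detection, might look as follows; the data, the 0.01 and 0.6 cutoffs, and the hub count are all invented.

```python
import numpy as np
import networkx as nx
from scipy import stats

rng = np.random.default_rng(0)
genes = [f"g{i}" for i in range(50)]
normal = rng.normal(size=(20, 50))
base = rng.normal(size=(20, 1))                 # shared latent factor
tumor = rng.normal(size=(20, 50)) * 0.5
tumor[:, :10] += base + 2.0                     # 10 co-regulated, up-shifted genes

# 1) differentially expressed genes by two-sample t-test
t, p = stats.ttest_ind(tumor, normal, axis=0)
de = [i for i in range(50) if p[i] < 0.01]

# 2) edges between highly correlated DE gene pairs (computed in tumor samples)
corr = np.corrcoef(tumor[:, de], rowvar=False)
net = nx.Graph()
for a in range(len(de)):
    for b in range(a + 1, len(de)):
        if abs(corr[a, b]) > 0.6:
            net.add_edge(genes[de[a]], genes[de[b]])

# 3) hubs = most connected genes, candidate drug targets
hubs = sorted(net.degree, key=lambda kv: kv[1], reverse=True)[:5]
print(hubs)
```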
A Network View on Parkinson’s Disease, Elsevier webinar 15 Jan 2015 - Ann-Marie Roche
Professor D. Bonchev gives an in-depth look at how a systems biology approach was used to identify some critical aspects of Parkinson's disease: molecular players, drug targets, and underlying biological processes.
The Role of Statistician in Personalized Medicine: An Overview of Statistical... - Setia Pramana
This document discusses the role of statisticians in personalized medicine and provides an overview of statistical methods used in bioinformatics. It begins with an introduction to the speaker's educational background and current positions. The rest of the document is outlined as follows: an introduction to personalized medicine and patients' heterogeneity; applications of microarray and next-generation sequencing technologies; statistical methods for microarray data analysis including gene selection, classification, clustering, and dose-response studies; and RNA-seq analysis from sequencing to identifying subtype-specific transcripts. Statistics plays an important role in developing personalized medicine through multidisciplinary collaboration and exploring big data in healthcare.
The analysis of proteins and messenger RNA is commonly used to compare gene expression patterns in tissues or cells of different types and under distinct conditions. In gene expression analysis, preprocessing and normalization are critical steps, as they guarantee the validity of downstream analyses and ensure accurate inferences; a number of normalization methods are employed in high-throughput studies. Preprocessing begins with a careful analysis of the gene expression data and usually involves summarizing many raw signal intensities into one expression value. The Robust Multiarray Average (RMA) is a normalization approach for microarrays that involves background correction, normalization, and summarization of probe-level information without using MM probes (Lim et al., 2007). It is commonly used to create an expression matrix for Affymetrix data and is one of the most widely used preprocessing methods for normalizing gene expression data. Raw intensity values are first background corrected and log2 transformed before being normalized; to generate an expression measure for the probe sets on each array, a linear model is then fitted to the normalized data.
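RMA itself is implemented in Bioconductor's affy package for R; as a rough Python illustration of two of its stages, the sketch below applies a log2 transform followed by quantile normalization across arrays (background correction and median-polish summarization are omitted).

```python
import numpy as np

def quantile_normalize(mat):
    """Force every column (array) to share the same value distribution."""
    order = np.argsort(mat, axis=0)
    ranks = np.argsort(order, axis=0)            # rank of each probe within its array
    mean_of_sorted = np.sort(mat, axis=0).mean(axis=1)  # reference distribution
    return mean_of_sorted[ranks]                 # replace each value by the reference

rng = np.random.default_rng(0)
raw = rng.lognormal(mean=6, sigma=1, size=(1000, 4))   # probes x arrays, invented
norm = quantile_normalize(np.log2(raw))
print(norm.mean(axis=0))   # all arrays now share an identical distribution
```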
An Ensemble of Filters and Wrappers for Microarray Data Classification - mlaij
The development of microarray technology has supplied a large volume of data to many fields. Gene microarray analysis and classification have demonstrated an effective way to diagnose diseases and cancers. Because the data obtained from microarray technology are very noisy and have thousands of features, feature selection plays an important role in removing irrelevant and redundant features and in reducing computational complexity. There are two important approaches to gene selection in microarray data analysis: filters and wrappers. To select a concise subset of informative genes, we introduce a hybrid feature selection that combines the two approaches: candidate features are first selected from the original set via several effective filters, and the candidate feature set is then refined by more accurate wrappers. Thus we can take advantage of both the filters and the wrappers. Experimental results on 11 microarray datasets show that our mechanism is effective with a smaller feature set. Moreover, these feature subsets can be obtained in a reasonable time.
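A compact scikit-learn analogue of that filter-then-wrapper idea, using a univariate F-test as the cheap filter and recursive feature elimination as the wrapper; the dataset and all subset sizes are invented.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=2000, n_informative=15, random_state=0)

pipe = make_pipeline(
    SelectKBest(f_classif, k=200),                       # cheap filter pass
    RFE(SVC(kernel="linear"), n_features_to_select=20),  # more accurate wrapper pass
    SVC(kernel="linear"),                                # final classifier
)
print(cross_val_score(pipe, X, y, cv=5).mean())
```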
This document provides an overview of computer-aided drug design (CADD). It discusses how CADD uses computer modeling techniques to accelerate the drug design and development process. The key aspects covered include the modern drug design lifecycle, types of drug design approaches like ligand-based and structure-based methods, and how techniques like docking and scoring are used. The document also notes the significance of CADD in filtering large compound libraries, optimizing lead compounds, and designing novel drug candidates in a more efficient manner compared to traditional drug discovery approaches.
Integrative Genomics of Non-Small Cell Lung Cancer by Peter McLoughlin - Cirdan
This document summarizes an integrative genomics study of non-small cell lung cancer (NSCLC) using The Cancer Genome Atlas (TCGA) data. The study investigated differential gene expression and DNA methylation patterns between NSCLC tumours at different stages of maturation. An innovative method was developed to integrate methylation with gene expression. Several key findings were identified, including significant numbers of genes differentially expressed across tumour (T)-stages and some genes where differential expression could be attributed to differential methylation. Further research on the identified genes is needed to potentially improve NSCLC survival outcomes. The study was limited by not accounting for batch effects but future improvements could include validating gene findings in external databases and analyzing microRNA and promoter sequences.
EFFICACY OF NON-NEGATIVE MATRIX FACTORIZATION FOR FEATURE SELECTION IN CANCER... - IJDKP
Over the past few years, there has been a considerable spread of microarray technology across many biological domains, particularly those pertaining to cancers such as leukemia, prostate, and colon cancer. The primary bottleneck in properly understanding such datasets lies in their dimensionality, so an efficient and effective means of studying them requires reducing their dimension to a large extent. This study suggests different algorithms and approaches for reducing the dimensionality of such microarray datasets. It exploits the matrix-like structure of microarray data and uses a popular technique called Non-Negative Matrix Factorization (NMF) to reduce the dimensionality, primarily for biological data. Classification accuracies are then compared across these algorithms; the technique gives an accuracy of 98%.
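A minimal sketch of NMF-based dimension reduction ahead of classification; the synthetic data, shift to non-negativity, and component count are invented, and the 98% figure above is the paper's result, not this toy's.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import NMF
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=62, n_features=2000, random_state=0)
X = X - X.min()   # NMF requires non-negative input (global shift; fine for a toy)

pipe = make_pipeline(
    NMF(n_components=10, init="nndsvda", max_iter=500, random_state=0),
    SVC(kernel="linear"),
)
print(cross_val_score(pipe, X, y, cv=5).mean())
```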
This document describes a study that uses machine learning algorithms to efficiently predict DNA-binding proteins. Support vector machines and cascade correlation neural networks are optimized and compared to determine the best performing model. The SVM model achieves 86.7% accuracy at predicting DNA-binding proteins using features like overall charge, patch size, and amino acid composition of proteins. The CCNN model achieves lower accuracy of 75.4%. The study aims to improve on previous work by using the standard jack-knife validation technique to evaluate model performance on unseen data.
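The jack-knife validation mentioned here is leave-one-out cross-validation; a minimal scikit-learn sketch, with synthetic features standing in for charge, patch size, and amino acid composition:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

# stand-in features: e.g. overall charge, patch size, amino acid composition
X, y = make_classification(n_samples=120, n_features=22, random_state=0)

# jack-knife: train on n-1 proteins, test on the held-out one, repeat for all n
acc = cross_val_score(SVC(kernel="rbf", gamma="scale"), X, y, cv=LeaveOneOut()).mean()
print(f"leave-one-out accuracy: {acc:.3f}")
```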
Machine Learning Based Approaches for Cancer Classification Using Gene Expres... - mlaij
The classification of different types of tumor is of great importance in cancer diagnosis and drug discovery. Earlier studies on cancer classification had limited diagnostic ability. The recent development of DNA microarray technology has made it possible to monitor thousands of gene expressions simultaneously, and with this abundance of gene expression data researchers are exploring the possibilities of cancer classification. A number of methods have been proposed with good results, but many issues still need to be addressed. This paper presents an overview of various cancer classification methods and evaluates them based on classification accuracy, computational time, and the ability to reveal gene information. We have also evaluated and introduced various proposed gene selection methods, and we discuss several issues related to cancer classification.
Identification of novel potential anti cancer agents using network pharmacolo... - Cresset
This document discusses using network pharmacology and computational modeling to identify novel potential anti-cancer agents. It describes how E-Therapeutics constructs disease networks and then uses its proprietary chemoinformatics tools to identify compounds that could impact those networks. One lead anti-cancer candidate, Dexanabinol, which has passed Phase 1 trials, is highlighted. Experimental validation of compounds predicted to impact glioma networks identified over 50% as weakly active potential leads and 14 as highly active candidates, demonstrating the potential of this network pharmacology approach.
This document discusses various data mining techniques for cancer diagnosis and prognosis, including decision trees, association rule mining, neural networks, naive Bayes classification, support vector machines, logistic regression, and Bayesian networks. It explains how data mining draws on databases, statistics, machine learning, and pattern recognition, and analyzes several case studies applying these techniques to breast cancer data: classifying mammography images, predicting patient survival rates, identifying treatment effectiveness, and differentiating poor from good prognosis. Decision trees were found to achieve the best prediction performance, with over 93% accuracy. Proposed future work involves developing web applications based on these models.
Mining of Important Informative Genes and Classifier Construction for Cancer ... - ijsc
Microarray is a useful technique for simultaneously measuring the expression of thousands of genes. One of the challenges in cancer classification using high-dimensional gene expression data is selecting a minimal number of relevant genes that maximizes classification accuracy. Because of the distinct characteristics inherent to specific cancerous gene expression profiles, developing flexible and robust gene identification methods is fundamental. Many gene selection methods, along with corresponding classifiers, have been proposed. In the proposed method, single genes with high class-discrimination capability are selected, and classification rules for cancer are generated from gene expression profiles. The method first computes an importance factor for each gene in the experimental cancer dataset by counting the number of linguistic terms (defined over different discrete quantities) with high class-discrimination capability, according to their degree of dependence on the classes. Initial important genes are selected according to high importance factors and form an initial reduct. A traditional k-means clustering algorithm is then applied to each selected gene of the initial reduct, and the misclassification error of each individual gene is computed; the final reduct is formed by selecting the most important genes with respect to the lowest misclassification errors. Finally, a classifier is constructed from decision rules induced by the selected important (single) genes on the training dataset, to classify cancerous and non-cancerous samples in the experimental test dataset. The proposed method is tested on four publicly available cancer gene expression test datasets. In most cases, accurate classification outcomes are obtained using just the important (single) genes, which are highly correlated with cancer pathogenesis. To demonstrate the robustness of the proposed method, its outcomes (correctly classified instances) are compared with several existing well-known classifiers.
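One step of that pipeline, scoring each candidate gene by the misclassification error of a per-gene two-cluster k-means, can be sketched as follows; the data and the reduct size are invented, and the linguistic-term importance factor is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
y = np.array([0] * 30 + [1] * 30)               # class labels
X = rng.normal(size=(60, 100))
X[y == 1, :5] += 3.0                            # 5 informative genes, invented

def misclassification(expr, labels):
    """Cluster one gene's expression into 2 groups and align to the labels."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    pred = km.fit_predict(expr.reshape(-1, 1))
    err = np.mean(pred != labels)
    return min(err, 1 - err)    # cluster ids are arbitrary, so take the better match

scores = [misclassification(X[:, j], y) for j in range(X.shape[1])]
best = np.argsort(scores)[:5]   # genes with lowest error form the final reduct
print(best, [round(scores[j], 3) for j in best])
```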
The document presents a proposal for developing a hybrid machine learning model for prostate cancer detection. It aims to combine deep convolutional neural networks (DCNN) and fuzzy support vector machines (SVMs) to overcome limitations of individual models. The methodology section outlines the steps as: pre-processing data, extracting features using DCNN, training and testing the fuzzy SVM classifier, and evaluating performance. Key aspects of the DCNN and fuzzy SVM approaches are also summarized, such as convolutional and pooling layers, fully connected layers, and the SVM classification technique. The proposal seeks to improve prostate cancer detection accuracy through this hybrid modeling approach.
The document summarizes a machine learning project to predict Parkinson's disease. It discusses cleaning and exploring the data, which includes speech attribute data from 240 subjects. Feature importance analysis found attributes like Delta3 and MFCCs to be important. Various machine learning models were tested, with random forest performing best at 97.2% accuracy after cross-validation. The conclusion discusses further optimizing models and collecting more data. Lessons learned note challenges of limited labeled data and importance of domain knowledge.
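A hedged outline of that workflow in scikit-learn, with synthetic stand-ins for the 240 subjects' speech attributes; the named features (Delta3, MFCCs, HNR05) come from the summary and are not computed here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the speech-attribute table (240 subjects)
X, y = make_classification(n_samples=240, n_features=40, n_informative=8, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0)
print("cross-validated accuracy:", cross_val_score(rf, X, y, cv=5).mean())

rf.fit(X, y)
top = rf.feature_importances_.argsort()[::-1][:5]   # analogue of Delta3, MFCC3, etc.
print("most important feature indices:", top)
```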
A CLASSIFICATION MODEL ON TUMOR CANCER DISEASE BASED MUTUAL INFORMATION AND FIREFLY ALGORITHM - Kiogyf
ABSTRACT
Cancer is a globally recognized cause of death, and proper cancer analysis demands the classification of several types of tumor. Investigations into microarray gene expression are a successful platform for studying genetic diseases. Although standard machine learning (ML) approaches have been efficient at identifying significant genes and classifying new cancer cases, their medical and practical application has faced several drawbacks, notably the limitations of DNA microarray data analysis, which involves an enormous number of features and a relatively small number of instances. To make reasonable and efficient use of DNA microarray dataset information, the interpretability of the prediction approach must be extended while maintaining a high level of precision. In this work, a novel approach to cancer classification based on gene expression profiles is presented, combining the Firefly algorithm with the mutual information method: mutual information is first used to select features, and the Firefly algorithm is then used for feature reduction. Finally, a support vector machine classifies the cancer into types. The performance of the proposed system was evaluated by classifying colon cancer datasets, and the results were compared with some recent approaches.
Keywords: Feature Selection, Firefly Algorithm, Cancer Disease, Mutual Information
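The mutual-information filtering step, without the Firefly reduction stage, might look like the following scikit-learn sketch; the synthetic matrix stands in for a colon-cancer expression dataset and the value of k is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# synthetic stand-in for a colon-cancer expression matrix (62 samples x 2000 genes)
X, y = make_classification(n_samples=62, n_features=2000, n_informative=10, random_state=0)

pipe = make_pipeline(
    SelectKBest(mutual_info_classif, k=30),   # mutual-information filter
    SVC(kernel="linear"),                     # final classifier
)
print(cross_val_score(pipe, X, y, cv=5).mean())
```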
Design of an Intelligent System for Improving Classification of Cancer Diseases - Mohamed Loey
Methodologies based on gene expression profiles have been able to detect cancer since their inception, and previous works have spent great effort to reach the best results. Some researchers have achieved excellent results in classifying cancer from gene expression profiles using different gene selection approaches and different classifiers.
Early detection of cancer increases the probability of recovery. This thesis presents an intelligent decision support system (IDSS) for early diagnosis of cancer based on microarray gene expression profiles. The problem with such datasets is the small number of examples (not exceeding hundreds) compared to the large number of genes (in the thousands), so it is necessary to find a method for removing the features (genes) that are not relevant to the investigated disease, in order to avoid overfitting. The proposed methodology uses information gain (IG) to select the most important features from the input patterns; the selected features (genes) are then reduced by applying the Gray Wolf Optimization algorithm (GWO); finally, a support vector machine (SVM) classifies the cancer type. The methodology was applied to three datasets (breast, colon, and CNS) and evaluated by classification accuracy, the performance measure most important in disease diagnosis. The best results were obtained when integrating IG with GWO and SVM: classification accuracy improved to 96.67%, and the number of features was reduced to 32 for the CNS dataset.
This thesis also investigates several classification algorithms and their suitability to the biological domain. For applications that suffer from high dimensionality, different feature selection methods are considered for illustration and analysis, and an effective system is proposed. Experiments were conducted on three benchmark gene expression datasets, and the proposed system is assessed and compared with the performance of related work.
T-BioInfo is a platform for processing, analyzing, and integrating multi-omics data. It is used by multiple research groups to extract meaningful insights from large multi-omics datasets. The platform is expanding its educational capabilities to enable more people to extract meaningful, data-driven insights from omics datasets with biomedical applications. The document provides links to learn more about the platform's research and educational features.
This document provides an overview of a project to build a machine learning model to predict Parkinson's disease. It discusses the process of data cleaning, feature engineering, model building and evaluation using different classification techniques. Random forest was found to perform best with an accuracy of 97.2% at predicting Parkinson's disease status based on speech attributes. Key features identified were Delta3, MFCC3, MFCC9, MFCC8 and HNR05. Further improvements could include additional data and techniques like XGBoost.
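As a rough illustration of the modeling step described above, the following sketch trains a random forest on placeholder speech-attribute data and reads off feature importances; the shapes and column meanings are assumptions, not the project's actual dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder: 240 subjects x 40 speech attributes (MFCCs, deltas, HNR, ...)
rng = np.random.default_rng(0)
X = rng.normal(size=(240, 40))
y = rng.integers(0, 2, size=240)     # Parkinson's status labels

rf = RandomForestClassifier(n_estimators=300, random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())   # cross-validated accuracy

rf.fit(X, y)
print(rf.feature_importances_[:5])   # ranks attributes like Delta3 or MFCC3
```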
An Efficient PSO Based Ensemble Classification Model on High Dimensional Data...ijsc
The document proposes a Particle Swarm Optimization (PSO) based ensemble classification model to improve classification of high-dimensional biomedical datasets. It develops an optimized PSO technique to select optimal features and initialize weights for base classifiers in the ensemble model. Experimental results on microarray datasets show the proposed model achieves higher accuracy, true positive rate, and lower error rate compared to traditional feature selection based classification models.
AN EFFICIENT PSO BASED ENSEMBLE CLASSIFICATION MODEL ON HIGH DIMENSIONAL DATA...ijsc
As the size of biomedical databases grows day by day, finding the essential features for disease prediction has become more complex due to high dimensionality and sparsity problems. Also, with the large number of microarray datasets available in biomedical repositories, it is difficult to analyze, predict, and interpret feature information using traditional feature-selection-based classification models. Most traditional feature-selection-based classification algorithms have computational issues such as dimension reduction, uncertainty, and class imbalance on microarray datasets. An ensemble classifier is one of the scalable models for extreme learning machines due to its high efficiency and fast processing speed for real-time applications. The main objective of feature-selection-based ensemble learning models is to classify high dimensional data with high computational efficiency and a high true positive rate. In this proposed model, an optimized Particle Swarm Optimization (PSO) based ensemble classification model was developed on high dimensional microarray datasets. Experimental results proved that the proposed model has high computational efficiency compared to traditional feature-selection-based classification models as far as accuracy, true positive rate, and error rate are concerned.
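A minimal sketch of the ensemble side of this idea, assuming scikit-learn: in the paper, PSO searches the feature subset and the base-classifier weights jointly, while here fixed weights and placeholder data stand in.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder microarray-like data (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))
y = rng.integers(0, 2, size=100)

# Soft-voting ensemble; PSO would optimize these weights (and the feature subset).
ens = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=500)),
        ("svc", SVC(probability=True)),
        ("dt", DecisionTreeClassifier(max_depth=5)),
    ],
    voting="soft",
    weights=[1.0, 1.5, 0.5],
)
ens.fit(X, y)
```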
This document proposes using a knowledge graph and machine learning to better target patients and physicians for cardiovascular drug products. It involves building a knowledge graph from various data sources to create a comprehensive view. A graph neural network model is then used to predict the likelihood of a patient developing serious heart failure and the probability of a physician being receptive. The approach achieved higher accuracy than other models. Expanding this to personalized medicine could improve disease diagnosis and treatment selection.
SEMI SUPERVISED BASED SPATIAL EM FRAMEWORK FOR MICROARRAY ANALYSISIRJET Journal
This document presents a semi-supervised spatial EM framework for microarray analysis to efficiently classify and predict diseases based on gene expression data. It uses a spatial EM algorithm to cluster gene expression data, followed by an SVM classifier to predict diseases and their severity levels. The proposed approach is evaluated based on classification accuracy, computation time, and ability to identify biologically significant genes. Experimental results on disease datasets show improved accuracy compared to other supervised and unsupervised methods. The authors conclude that using the same classifier for gene selection and classification enhances predictive performance, and future work will focus on partitioning genes into clusters correlated with sample categories to further improve accuracy.
A Novel Approach for Cancer Detection in MRI Mammogram Using Decision Tree In...CSCJournals
An intelligent computer-aided diagnosis system can be very helpful for radiologists in detecting and diagnosing microcalcification patterns earlier and faster than typical screening programs. In this paper, we present a system based on Fuzzy C-Means clustering and feature extraction techniques using texture-based segmentation and a genetic algorithm for detecting and diagnosing microcalcification patterns in digital mammograms. We have investigated and analyzed a number of feature extraction techniques and found that a combination of three features (entropy, standard deviation, and number of pixels) is the best combination to distinguish a benign microcalcification pattern from a malignant one. A Fuzzy C-Means technique in conjunction with the three features was used to detect a microcalcification pattern, and a neural network to classify it as benign or malignant. The system was developed on a Windows platform. It is an easy-to-use intelligent system that gives the user options to diagnose, detect, enlarge, zoom, and measure distances of areas in digital mammograms. The present study focused on the application of artificial intelligence and data mining techniques to prediction models for breast cancer. The artificial neural network, decision tree, Fuzzy C-Means, and genetic algorithm were used for the comparative studies, with the accuracy and positive predictive value of each algorithm as the evaluation indicators. 699 records acquired from breast cancer patients in the MIAS database, 9 predictor variables, and 1 outcome variable were incorporated for the data analysis, followed by 10-fold cross-validation. The results revealed that the accuracy of Fuzzy C-Means was 0.9534 (sensitivity 0.98716, specificity 0.9582), the decision tree model 0.9634 (sensitivity 0.98615, specificity 0.9305), the neural network model 0.96502 (sensitivity 0.98628, specificity 0.9473), and the genetic algorithm model 0.9878 (sensitivity 1, specificity 0.9802). The accuracy of the genetic algorithm was significantly higher than the average predicted accuracy of 0.9612. The predicted outcome of the Fuzzy C-Means model was higher than that of the neural network model, but no significant difference was observed. The average predicted accuracy of the decision tree model was 0.9635, which was the lowest of all 4 predictive models. The standard deviation of the 10-fold cross-validation was rather unreliable. The results showed that the genetic algorithm described in the present study was able to produce accurate results in the classification of breast cancer data, and the classification rule identified was more acceptable and comprehensible.
Keywords: Fuzzy C-Means, Decision Tree Induction, Genetic Algorithm, Data Mining, Breast Cancer, Rule Discovery.
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTIONijscai
This document summarizes a research paper that proposed a hybrid genetic algorithm and support vector machine (SVM) approach for breast cancer detection. The approach uses genetic algorithms to select the optimal features for input into an SVM classifier. It evaluated the approach on a breast cancer dataset containing 569 cases. The results found that the sequential minimal optimization (SMO) SVM algorithm with genetic algorithm feature selection achieved very high accuracy, recall, and F-measure - outperforming other classification algorithms. It detected breast cancer with 97.71% accuracy, demonstrating the robustness of the proposed hybrid method.
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTIONijscai
Breast cancer is the leading cause of mortality among women in developed countries and women's second most prominent cause of cancer mortality worldwide. In recent decades, the prevalence of breast cancer among women has risen dramatically. This paper discusses several data analysis methods used to detect breast cancer early. Breast cancer diagnosis distinguishes benign from malignant breast lumps, and we tackled this disease analysis using data processing tools. Data mining is an important step of knowledge discovery in which intelligent methods are used to detect patterns. Several clinical breast cancer studies have been conducted using soft computing and machine learning techniques; some algorithms are simpler, faster, or more comprehensive than others. This research focuses on genetic programming and machine learning algorithms to reliably identify benign and malignant breast cancer, with the aim of optimizing the testing algorithm. We used genetic programming methods to choose the classifiers' best features and parameter values. We analyze the Wisconsin breast cancer data set available from the U.C.I. machine learning repository. In this experiment, we compare four Weka clustering strategies with genetic clustering. A comparison of results reveals that sequential minimal optimization (S.M.O.) performs better than the I.B.K. and B.F. Tree methods, at 97.71% accuracy.
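The Wisconsin data set is bundled with scikit-learn, so the feature-selection-plus-SVM loop can be sketched directly; a random mask search stands in for the genetic algorithm here, purely to show the shape of the problem, not the paper's actual optimizer.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # 569 cases, 30 features

# A GA would evolve feature masks via selection/crossover/mutation;
# random masks stand in for that search loop in this sketch.
rng = np.random.default_rng(0)
best_score, best_mask = 0.0, None
for _ in range(20):
    mask = rng.random(X.shape[1]) < 0.5
    if not mask.any():
        continue
    score = cross_val_score(SVC(), X[:, mask], y, cv=5).mean()
    if score > best_score:
        best_score, best_mask = score, mask

print(f"best CV accuracy {best_score:.4f} with {best_mask.sum()} features")
```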
Similar to A Method to facilitate cancer detection and type classification from gene expression using a deep auto-encoder and neural network
SIAM CSE21 Broader Engagement Program FlyerXi Chen
The document advertises the SIAM CSE21 Broader Engagement program which aims to provide professional development opportunities in computational science and engineering to students from underrepresented backgrounds. The program will take place alongside the SIAM CSE21 conference in Fort Worth, Texas from March 1-5, 2021. It will offer travel support, technical programming, tutorials, mentoring, and career development activities to encourage diversity and inclusion in research. Interested applicants should apply by June 26, 2020 through the Sustainable Horizons Institute website listed.
Pan-Cancer Epigenetic Biomarker Selection from Blood Sample Using SAS®Xi Chen
A key focus in current cancer research is the discovery of cancer biomarkers that allow earlier detection with high accuracy and lower costs for both patients and hospitals. Blood samples have long been used as a health status indicator, but DNA methylation signatures in blood have not been fully appreciated in cancer research. Historically, analysis of cancer has been conducted directly with the patient’s tumor or related tissues. Such analyses allow physicians to diagnose a patient’s health and cancer status; however, physicians must observe certain symptoms that prompt them to use biopsies or imaging to verify the diagnosis. This is a post-hoc approach. Our study will focus on epigenetic information for cancer detection, specifically information about DNA methylation in human peripheral blood samples in cancer discordant monozygotic twin-pairs. This information might be able to help us detect cancer much earlier, before the first symptom appears. Several other types of epigenetic data can also be used, but here we demonstrate the potential of blood DNA methylation data as a biomarker for pan-cancer using SAS® 9.3 and SAS® EM. We report that 55 methylation CpG sites measurable in blood samples can be used as biomarkers for early cancer detection and classification.
This certificate certifies that Xi Chen scored 80% or higher on the quiz for the SAS online lesson "Introduction to Using SAS Enterprise Miner" on July 11, 2016.
Xi Chen successfully completed the SAS e-course "SAS Programming 2: Data Manipulation Techniques" on March 19, 2016, as evidenced by this certificate.
Xi Chen successfully completed the SAS e-course "Statistics I: Introduction to ANOVA, Regression, and Logistic Regression" on January 23, 2016. The course covered introductory topics in analysis of variance, regression, and logistic regression statistical techniques.
Xi Chen successfully completed the SAS SQL 1: Essentials e-course on January 23, 2016, as confirmed by this certificate.
Xi Chen successfully completed the SAS Macro Language 1: Essentials e-course on January 23, 2016, achieving essential skills in SAS macros.
From Natural Language to Structured Solr Queries using LLMsSease
This talk draws on experimentation to enable AI applications with Solr. One important use case is to use AI for better accessibility and discoverability of the data: while User eXperience techniques, lexical search improvements, and data harmonization can take organizations to a good level of accessibility, a structural (or “cognitive” gap) remains between the data user needs and the data producer constraints.
That is where AI – and most importantly, Natural Language Processing and Large Language Model techniques – could make a difference. This natural language, conversational engine could facilitate access and usage of the data leveraging the semantics of any data source.
The objective of the presentation is to propose a technical approach and a way forward to achieve this goal.
The key concept is to enable users to express their search queries in natural language, which the LLM then enriches, interprets, and translates into structured queries based on the Solr index’s metadata.
This approach leverages the LLM’s ability to understand the nuances of natural language and the structure of documents within Apache Solr.
The LLM acts as an intermediary agent, offering a transparent experience to users automatically and potentially uncovering relevant documents that conventional search methods might overlook. The presentation will include the results of this experimental work, lessons learned, best practices, and the scope of future work that should improve the approach and make it production-ready.
Dandelion Hashtable: beyond billion requests per second on a commodity serverAntonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, which go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state of the art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. On a commodity server with a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsDianaGray10
Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more.
The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications.
We’ll discuss and demo the benefits of UiPath Apps and connectors including:
Creating a compelling user experience for any software, without the limitations of APIs.
Accelerating the app creation process, saving time and effort
Enjoying high-performance CRUD (create, read, update, delete) operations, for seamless data management.
Speakers:
Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP
Charlie Greenberg, host
What is an RPA CoE? Session 1 – CoE VisionDianaGray10
In the first session, we will review the organization's vision and how it impacts the CoE structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...Fwdays
Direct losses from one minute of downtime are $5-$10 thousand. Reputation is priceless.
As part of the talk, we will consider the architectural strategies necessary for the development of highly loaded fintech solutions. We will focus on using queues and streaming to efficiently work and manage large amounts of data in real-time and to minimize latency.
We will focus special attention on the architectural patterns used in the design of the fintech system, microservices and event-driven architecture, which ensure scalability, fault tolerance, and consistency of the entire system.
MySQL InnoDB Storage Engine: Deep Dive - MydbopsMydbops
This presentation, titled "MySQL - InnoDB" and delivered by Mayank Prasad at the Mydbops Open Source Database Meetup 16 on June 8th, 2024, covers dynamic configuration of REDO logs and instant ADD/DROP columns in InnoDB.
This presentation dives deep into the world of InnoDB, exploring two ground-breaking features introduced in MySQL 8.0:
• Dynamic Configuration of REDO Logs: Enhance your database's performance and flexibility with on-the-fly adjustments to REDO log capacity. Unleash the power of the snake metaphor to visualize how InnoDB manages REDO log files.
• Instant ADD/DROP Columns: Say goodbye to costly table rebuilds! This presentation unveils how InnoDB now enables seamless addition and removal of columns without compromising data integrity or incurring downtime.
Key Learnings:
• Grasp the concept of REDO logs and their significance in InnoDB's transaction management.
• Discover the advantages of dynamic REDO log configuration and how to leverage it for optimal performance.
• Understand the inner workings of instant ADD/DROP columns and their impact on database operations.
• Gain valuable insights into the row versioning mechanism that empowers instant column modifications.
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillLizaNolte
HERE IS YOUR WEBINAR CONTENT! 'Mastering Customer Journey Management with Dr. Graham Hill'. We hope you find the webinar recording both insightful and enjoyable.
In this webinar, we explored essential aspects of Customer Journey Management and personalization. Here’s a summary of the key insights and topics discussed:
Key Takeaways:
Understanding the Customer Journey: Dr. Hill emphasized the importance of mapping and understanding the complete customer journey to identify touchpoints and opportunities for improvement.
Personalization Strategies: We discussed how to leverage data and insights to create personalized experiences that resonate with customers.
Technology Integration: Insights were shared on how inQuba’s advanced technology can streamline customer interactions and drive operational efficiency.
QA or the Highway - Component Testing: Bridging the gap between frontend appl...zjhamm304
These are the slides for the presentation, "Component Testing: Bridging the gap between frontend applications" that was presented at QA or the Highway 2024 in Columbus, OH by Zachary Hamm.
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...GlobalLogic Ukraine
During the talk, we will answer why application performance needs to be improved and what the most effective ways of doing so are. We will also discuss what a cache is, what types of caches exist, and, most importantly, how to find a performance bottleneck.
Video and event details: https://bit.ly/45tILxj
In our second session, we shall learn all about the main features and fundamentals of UiPath Studio that enable us to use the building blocks for any automation project.
📕 Detailed agenda:
Variables and Datatypes
Workflow Layouts
Arguments
Control Flows and Loops
Conditional Statements
💻 Extra training through UiPath Academy:
Variables, Constants, and Arguments in Studio
Control Flow in Studio
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...DanBrown980551
This LF Energy webinar took place June 20, 2024. It featured:
-Alex Thornton, LF Energy
-Hallie Cramer, Google
-Daniel Roesler, UtilityAPI
-Henry Richardson, WattTime
In response to the urgency and scale required to effectively address climate change, open source solutions offer significant potential for driving innovation and progress. Currently, there is a growing demand for standardization and interoperability in energy data and modeling. Open source standards and specifications within the energy sector can also alleviate challenges associated with data fragmentation, transparency, and accessibility. At the same time, it is crucial to consider privacy and security concerns throughout the development of open source platforms.
This webinar will delve into the motivations behind establishing LF Energy’s Carbon Data Specification Consortium. It will provide an overview of the draft specifications and the ongoing progress made by the respective working groups.
Three primary specifications will be discussed:
-Discovery and client registration, emphasizing transparent processes and secure and private access
-Customer data, centering around customer tariffs, bills, energy usage, and full consumption disclosure
-Power systems data, focusing on grid data, inclusive of transmission and distribution networks, generation, intergrid power flows, and market settlement data
Session 1 - Intro to Robotic Process Automation.pdfUiPathCommunity
👉 Check out our full 'Africa Series - Automation Student Developers (EN)' page to register for the full program:
https://bit.ly/Automation_Student_Kickstart
In this session, we shall introduce you to the world of automation, the UiPath Platform, and guide you on how to install and setup UiPath Studio on your Windows PC.
📕 Detailed agenda:
What is RPA? Benefits of RPA?
RPA Applications
The UiPath End-to-End Automation Platform
UiPath Studio CE Installation and Setup
💻 Extra training through UiPath Academy:
Introduction to Automation
UiPath Business Automation Platform
Explore automation development with UiPath Studio
👉 Register here for our upcoming Session 2 on June 20: Introduction to UiPath Studio Fundamentals: https://community.uipath.com/events/details/uipath-lagos-presents-session-2-introduction-to-uipath-studio-fundamentals/
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...AlexanderRichford
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation Functions to Prevent Interaction with Malicious QR Codes.
Aim of the Study: The goal of this research was to develop a robust hybrid approach for identifying malicious and insecure URLs derived from QR codes, ensuring safe interactions.
This is achieved through:
Machine Learning Model: Predicts the likelihood of a URL being malicious.
Security Validation Functions: Ensures the derived URL has a valid certificate and proper URL format.
This innovative blend of technology aims to enhance cybersecurity measures and protect users from potential threats hidden within QR codes 🖥 🔒
This study was my first introduction to using ML which has shown me the immense potential of ML in creating more secure digital environments!
AppSec PNW: Android and iOS Application Security with MobSFAjin Abraham
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
"Scaling RAG Applications to serve millions of users", Kevin GoedeckeFwdays
How we managed to grow and scale a RAG application from zero to thousands of users in 7 months. Lessons from technical challenges around managing high load for LLMs, RAGs and Vector databases.
Getting the Most Out of ScyllaDB Monitoring: ShareChat's TipsScyllaDB
ScyllaDB monitoring provides a lot of useful information. But sometimes it’s not easy to find the root of the problem if something is wrong or even estimate the remaining capacity by the load on the cluster. This talk shares our team's practical tips on: 1) How to find the root of the problem by metrics if ScyllaDB is slow 2) How to interpret the load and plan capacity for the future 3) Compaction strategies and how to choose the right one 4) Important metrics which aren’t available in the default monitoring setup.
A Method to facilitate cancer detection and type classification from gene expression using a deep auto-encoder and neural network
1. A Method to Facilitate Cancer Detection and Type Classification from Gene Expression Data Using a Deep Autoencoder and Neural Network
By Xi Chen
March 27, 2019
2. Gene Expression Data Properties
• Genes express differently depending on various factors such as cell type, environment, and disease conditions.
• Gene expression data are widely available due to the increased affordability of sequencing technology.
• Gene expression data are multimodal and high dimensional, with a small number of observations (#rows << #columns).
• Gene expression data can be used for disease detection, disease classification, and drug suggestion.
3. Gene Expression Data with Dimension Reduction
• Use dimension reduction methods, such as PCA, for feature selection, since gene expression data are high dimensional.
• Apply traditional statistical and machine learning methods for applications such as disease detection or classification.
• Problem: how to explain the selected features. E.g., each principal component is a linear combination of all the gene expression features.
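To make the interpretability problem concrete, a short sketch (with random placeholder data) shows that every principal component carries one weight per gene:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Placeholder expression matrix: few samples, many genes (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))
y = rng.integers(0, 2, size=200)

pca = PCA(n_components=50)
Z = pca.fit_transform(X)

# Each PC mixes ALL 5000 genes, so a "selected feature" has no single-gene meaning.
print(pca.components_.shape)   # (50, 5000): one loading per gene per component

clf = LogisticRegression(max_iter=1000).fit(Z, y)   # downstream classifier
```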
4. Proposed Drug Suggestion Scheme
[Figure: 2-D gene expression representation (Feature 1 vs. Feature 2), with samples grouped by drug sensitivity for Drugs A-D]
Cluster approaches:
• K-means
• Gaussian Mixture Models
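A minimal sketch of the two clustering approaches named on the slide, run on a placeholder 2-D representation; mapping clusters to drugs is the part that needs the (missing) drug-response labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Placeholder 2-D representation of expression profiles (assumption)
rng = np.random.default_rng(0)
Z = rng.normal(size=(300, 2))

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(Z)
gm = GaussianMixture(n_components=4, random_state=0).fit(Z)

print(km.labels_[:10])       # hard cluster assignments
print(gm.predict(Z)[:10])    # GMM assignments (soft posteriors also available)
# Each cluster would then be mapped to the drug its members respond to best.
```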
5. Problem: Current Gene Expression Data Don't Include Drug Results
• Most gene expression data aren't associated with well-documented medical records.
• Available records often lack drug information and patient disease outcomes.
6. Solving the Harder Classification Problem First, Then Inferring That the Cluster Approach Works
• In general, a classification problem is similar to a clustering problem; e.g., the k-Nearest Neighbors algorithm.
• If we can achieve highly accurate classification results using gene expression data, we might be able to suggest clustering gene expression data for drug suggestion.
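The classification-to-clustering link can be illustrated with k-NN, whose decisions depend only on local distance structure, the same geometry clustering exploits; the data here are random placeholders.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Placeholder learned features and cancer-type labels (assumption)
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 64))
y = rng.integers(0, 4, size=200)

# If a purely local classifier like k-NN is accurate, nearby points share
# labels -- which is exactly what a clustering of the same space relies on.
knn = KNeighborsClassifier(n_neighbors=5)
print(cross_val_score(knn, Z, y, cv=5).mean())
```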
12. Why Not PCA?
• PCA is a descriptive model.
• Each component is a linear combination of all the features.
• Hard to explain.
13. Multiple Type Classification

Acronym   Full Name
LGG       Lower Grade Glioma
UVM       Uveal Melanoma
LUSC      Lung squamous cell carcinoma
GBM       Glioblastoma Multiforme

• Misclassifications are due to small sample size.
• Misclassifications are sparse, suggesting clustering potential.
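The "sparse misclassifications" observation is easiest to see in a confusion matrix; this sketch fabricates predictions with a 5% error rate purely to show the shape of the computation, not the talk's results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Simulated multi-class predictions (assumption)
rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, size=500)           # 4 cancer types
y_pred = y_true.copy()
flip = rng.random(500) < 0.05                   # a few sparse errors
y_pred[flip] = rng.integers(0, 4, size=flip.sum())

# Off-diagonal cells show which (typically small) types get confused.
print(confusion_matrix(y_true, y_pred))
```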
14. Conclusion
• An autoencoder automatically generates feature representations, addressing the very high dimensionality of gene expression data.
• The extracted feature vector captures the non-linearity of the data.
• The approach is scalable to new data after training, and it generalizes to multi-class classification of different cancer types.
• We have demonstrated the high accuracy and low FNR/FPR of this method for the majority of the abundant cancer types, and its potential for sub-classification within certain cancers and for identifying metastatic cancers.
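A minimal sketch of the autoencoder-plus-classifier idea, assuming Keras; the layer widths, code size, and training setup are illustrative choices, not the architecture reported in the talk.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_genes = 20000                      # illustrative input dimensionality

# Encoder -> low-dimensional "code" -> decoder back to gene space
inputs = keras.Input(shape=(n_genes,))
h = layers.Dense(1024, activation="relu")(inputs)
code = layers.Dense(64, activation="relu", name="code")(h)
h = layers.Dense(1024, activation="relu")(code)
outputs = layers.Dense(n_genes)(h)   # linear reconstruction of expression values

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# X: (n_samples, n_genes) scaled expression matrix (assumed available)
# autoencoder.fit(X, X, epochs=50, batch_size=32, validation_split=0.1)

# The encoder alone yields the non-linear low-dimensional features;
# a small softmax network on these codes then predicts the cancer type.
encoder = keras.Model(inputs, code)
# codes = encoder.predict(X)
```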
15. Other Projects: Deep Learning Behind the Scenes
• Almost all machine learning applications use similar approaches: feature engineering + deep learning.
• E.g., self-driving cars = CNN + DNN
• Feature engineering → CNN
• Deep learning training → DNN
• Deployment