This project produces a repository database system of drugs, drug features (properties), and drug targets from which data can be mined and analyzed. Drug targets are proteins that drugs bind to in order to stop the protein's activity. Users can mine the database to predict which chemical properties determine efficacy against a specific target and to estimate a coefficient for each chemical property. The system can be equipped with different data mining approaches and algorithms, including linear, non-linear, and classification models, and the data models have been enhanced with Genetic Evolution (GE) algorithms. This paper discusses an implementation with linear data models: Multiple Linear Regression (MLR), Partial Least Squares Regression (PLSR), and Support Vector Machine (SVM).
Classification and Clustering Analysis using Weka - Ishan Awadhesh
This term paper demonstrates classification and clustering analysis on bank data using Weka. Classification analysis is used to determine whether a particular customer would purchase a Personal Equity Plan, while clustering analysis is used to analyze the behavior of various customer segments.
Data Mining Techniques using WEKA (Ankit Pandey-10BM60012) - Ankit Pandey
This term paper contains a brief introduction of a powerful data mining tool WEKA along with a hands-on guide to two data mining techniques namely Clustering (k-means) and Linear Regression using WEKA.
Growth in medical informatics can be observed nowadays. Advances in different medical fields reveal various critical diseases and provide guidelines for their cure. This has been possible only because of well-developed medical databases and the automation of the data analysis process. This analysis requires considerable learning and intelligence, for which data mining techniques provide the basis. Various techniques are available, such as decision tree induction, rule-based classification and mining, support vector machines, stochastic classification, logistic regression, naïve Bayes, artificial neural networks, fuzzy logic, and genetic algorithms. This paper provides the basics of data mining and the techniques available in the medical sciences, and reviews the work done on medical databases using data mining techniques for human disease diagnosis.
Weka project - Classification & Association Rule Generation - rsathishwaran
The document discusses using the Weka data mining tool to analyze a US Congressional voting records dataset. It performs classification using 10-fold cross-validation, achieving 83.45% accuracy. It also generates association rules using the Apriori algorithm, setting a minimum support of 0.45 (196 instances) and minimum confidence of 0.9, resulting in 20 itemsets of size 1.
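The minimum-support and minimum-confidence thresholds mentioned above can be made concrete with a minimal Apriori-style sketch in Python. The toy transactions below are invented for illustration, not the actual Congressional voting data:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return all itemsets whose support (fraction of transactions
    containing them) meets min_support, by level-wise enumeration."""
    n = len(transactions)
    frequent = {}
    k = 1
    candidates = [frozenset([i])
                  for i in sorted({i for t in transactions for i in t})]
    while candidates:
        survivors = []
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                frequent[c] = support
                survivors.append(c)
        # join step: build (k+1)-item candidates from frequent k-itemsets
        candidates = list({a | b for a, b in combinations(survivors, 2)
                           if len(a | b) == k + 1})
        k += 1
    return frequent

# Toy voting-style transactions (invented for illustration)
txns = [frozenset(t) for t in (
    {"yea_budget", "yea_aid"}, {"yea_budget", "nay_aid"},
    {"yea_budget", "yea_aid"}, {"nay_budget", "yea_aid"},
)]
freq = frequent_itemsets(txns, min_support=0.45)
# confidence of the rule {yea_budget} -> {yea_aid}
conf = freq[frozenset({"yea_budget", "yea_aid"})] / freq[frozenset({"yea_budget"})]
```

Weka's Apriori implementation additionally prunes rules below the confidence threshold; the ratio computed in the last line is exactly that confidence value.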
This document provides a literature review on data mining with Oracle 10g using clustering and classification algorithms. It discusses the data mining process and common algorithms used, including Naive Bayes, decision trees, k-means clustering, and neural networks. The review categorizes data mining techniques into supervised learning (classification, prediction) and unsupervised learning (clustering, association rule mining). It also outlines the typical 4-step data mining process of problem definition, data preparation, model building and evaluation, and knowledge deployment.
Classification of Paddy Types using Naïve Bayesian Classifiers - ijtsrd
Classification is a form of data analysis that can be used to extract models describing important data classes or to predict future data trends. It is the process of finding a set of models that describe and distinguish data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown. Among classification techniques, the Naïve Bayesian classifier is one of the simplest probabilistic classifiers. This paper studies the Naïve Bayesian classifier and uses it to classify the class labels of paddy type data. The paper predicts four class labels and displays the selected impact attributes of each class label using the Naïve Bayesian classifier. The types of paddy in the dataset could also be predicted using other classification methods such as Decision Tree and Artificial Neural Network, and the system could be extended to predict production rates and display the selected impact attributes of other crops such as soybean, corn, and cotton. This paper focuses on a paddy dataset and decides whether the paddy type is Lasbar, Yar Sabar, Yenat Khan Sabar, or Sar Ngan Khan Sabar. Mie Mie Aung | Su Mon Ko | Win Myat Thuzar | Su Pan Thaw, "Classification of Paddy Types using Naïve Bayesian Classifiers", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-5, August 2019. URL: https://www.ijtsrd.com/papers/ijtsrd26585.pdf Paper URL: https://www.ijtsrd.com/computer-science/other/26585/classification-of-paddy-types-using-na%C3%AFve-bayesian-classifiers/mie-mie-aung
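As a sketch of the technique the paper applies, a categorical Naïve Bayes classifier with Laplace smoothing can be written in a few lines. The toy attributes (soil, season) and labels below are made up for illustration, not the actual paddy dataset:

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(lambda: defaultdict(Counter))
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            feat_counts[y][i][v] += 1
    return class_counts, feat_counts, len(labels)

def predict_nb(model, row):
    """Pick the class maximizing log P(class) + sum of log P(value | class),
    with Laplace smoothing so unseen feature values get nonzero probability."""
    class_counts, feat_counts, n = model
    best, best_score = None, float("-inf")
    for y, cy in class_counts.items():
        score = math.log(cy / n)
        for i, v in enumerate(row):
            seen = feat_counts[y][i]
            score += math.log((seen[v] + 1) / (cy + len(seen) + 1))
        if score > best_score:
            best, best_score = y, score
    return best

# Hypothetical toy attributes (soil, season) -> paddy-type label
rows = [("clay", "wet"), ("clay", "wet"), ("sandy", "dry"), ("sandy", "wet")]
labels = ["A", "A", "B", "B"]
model = train_nb(rows, labels)
```

Working in log space avoids numeric underflow when many attributes are multiplied together, which is why the scores are summed rather than the probabilities multiplied.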
Weka is a popular open-source machine learning software that contains tools for data analysis, classification, regression, and clustering. The document demonstrates how to use Weka to perform simple linear regression with a single predictor variable and multiple linear regression with several predictors. It also shows how to use Weka for classification by training a model on demographic data to predict contraceptive method choice. Weka builds models that can make predictions on new test data, classifying instances or regressing targets based on patterns learned from training data.
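The simple linear regression described here has a closed-form least-squares solution; a minimal sketch on toy data (invented for illustration):

```python
def simple_linear_regression(xs, ys):
    """Closed-form least-squares fit of y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Toy data lying exactly on y = 2x + 1
a, b = simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # exact fit: intercept 1.0, slope 2.0
```

Weka's LinearRegression classifier generalizes this to several predictors by solving the corresponding normal equations.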
This document discusses various privacy preservation techniques in data mining. It summarizes classification, clustering, and association rule learning as common privacy preservation approaches. For classification, it describes decision trees, k-nearest neighbors, artificial neural networks, support vector machines, and naive Bayes models. It provides advantages and disadvantages of these techniques. The document concludes that privacy preservation techniques have emerged to allow for efficient and effective data mining while protecting sensitive data.
International Journal of Computational Engineering Research (IJCER) - ijceronline
International Journal of Computational Engineering Research (IJCER) is an international, monthly, online journal published in English. The journal publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
Weka is a collection of machine learning algorithms for data mining tasks. It contains tools for data preprocessing, classification, regression, clustering, association rule mining, attribute selection, and visualization. It allows loading data from files, databases, or URLs, exploring and visualizing data, and comparing machine learning models.
This document summarizes and compares different classification algorithms that can be used for disease prediction in data mining. It first introduces disease prediction and classification processes. It then reviews related works that have used various classification algorithms like random forest, support vector machine, and naive Bayes for tasks like disease diagnosis, text classification, and rainfall forecasting. The document also discusses supervised, unsupervised, and semi-supervised machine learning. It provides details on support vector machine and random forest algorithms, describing how each works and is used for classification. Finally, it analyzes the random forest algorithm construction process.
An Efficient Approach for Asymmetric Data Classification - AM Publications
In many classification problems, the number of targets (e.g. intruders) present is very small compared with the number of clutter objects. Traditional classification approaches usually ignore this class imbalance, and performance suffers accordingly. In contrast, the imbalanced logistic regression (IILR) algorithm explicitly addresses class imbalance in its formulation. We propose this algorithm and give the details necessary to employ it for intrusion detection data sets characterized by class imbalance.
WEKA is an open source data mining and machine learning software written in Java. It contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. WEKA was developed at the University of Waikato and contains algorithms for classification like decision trees, clustering like k-means, and preprocessing tools. The document provides examples of using WEKA's clustering and decision tree classification algorithms on sample investment data to segment investors and predict investment choices.
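The k-means clustering WEKA provides (SimpleKMeans) follows Lloyd's algorithm: assign each point to its nearest center, then move each center to its cluster's mean. A minimal sketch on toy 1-D data, not the investment dataset from the document:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on 1-D points: assign each point to the
    nearest center, then move each center to its cluster's mean."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[nearest].append(p)
        # keep a center unchanged if its cluster emptied out
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Toy 1-D "spending" values with two obvious groups
data = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
centers = kmeans(data, 2)
```

With two clearly separated groups the centers converge to the group means regardless of the random initialization.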
A random decision tree framework for privacy preserving data mining - Venkat Projects
This document summarizes a framework for privacy-preserving data mining using a random decision tree algorithm. Multiple parties like banks, insurance companies, and credit card companies share data but need to keep certain attributes private. The random decision tree algorithm partitions data based on each party's needs, encrypts the data using homomorphic encryption, builds a decision tree model on the encrypted data, and allows parties to classify new instances while preserving privacy. It compares the accuracy of random decision trees to traditional ID3 decision trees.
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE - ijcsit
The huge amount of healthcare data, coupled with the need for data analysis tools, has made data mining an interesting research area. Data mining tools and techniques help to discover and understand hidden patterns in a dataset that may not be visible from inspection of the data alone. Selecting an appropriate clustering method and an optimal number of clusters for healthcare data can often be confusing and difficult. A large number of clustering algorithms are available for clustering healthcare data, but it is very difficult for people with little knowledge of data mining to choose a suitable one. This paper analyzes clustering techniques on a healthcare dataset in order to determine which algorithms yield the best-optimized clusters. The performance of two clustering algorithms (K-means and DBSCAN) was compared using Silhouette score values. Firstly, we analyzed the K-means algorithm using different numbers of clusters (K) and different distance metrics. Secondly, we analyzed the DBSCAN algorithm using different minimum numbers of points required to form a cluster (minPts) and different distance metrics. The experimental results indicate that both K-means and DBSCAN produce clusters with strong intra-cluster cohesion and inter-cluster separation. Based on the analysis, the K-means algorithm performed better than DBSCAN in terms of clustering accuracy and execution time.
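The Silhouette score used in that comparison can be sketched directly from its definition, s(i) = (b - a) / max(a, b), where a is the mean distance from a point to its own cluster and b the mean distance to the nearest other cluster. The 1-D toy points below are invented for illustration:

```python
def silhouette(points, labels):
    """Mean silhouette coefficient s(i) = (b - a) / max(a, b), where a is
    the mean distance to points in the same cluster and b the mean
    distance to the nearest other cluster (1-D, absolute distance)."""
    n = len(points)
    scores = []
    for i in range(n):
        same = [abs(points[i] - points[j]) for j in range(n)
                if labels[j] == labels[i] and j != i]
        a = sum(same) / len(same) if same else 0.0
        b = min(
            sum(abs(points[i] - points[j]) for j in range(n) if labels[j] == c) /
            sum(1 for j in range(n) if labels[j] == c)
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Two well-separated toy clusters should score close to 1
score = silhouette([1.0, 1.2, 9.8, 10.0], [0, 0, 1, 1])
```

Scores near +1 indicate tight, well-separated clusters; scores near 0 indicate overlapping clusters, which is how the paper ranks the K-means and DBSCAN configurations.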
Evaluation Mechanism for Similarity-Based Ranked Search Over Scientific Data - AM Publications
The aim of this paper is to provide an essential and efficient method to retrieve data profiles stored in a particular database, such as a scientific database. Our country succeeded in its Mars mission on the first attempt, and information about such an important mission should be retrieved safely and as fast as possible. With this in mind, we have tried to implement the fastest possible information retrieval technique, which can improve retrieval speed for future missions. Here we use Information Retrieval (IR)-style ranked search. We contemplate that IR-style ranked search can help an expert capture the relationships between the numerous records in large collections of templates, much as content-based ranked retrieval helps users make sense of the large body of web content. To test this supposition, we introduce ranked search over a current multi-terabyte experimental dataset as our test case, and assess the effectiveness of different similarity measures, and hence of ranked search, on the data.
A Codon Frequency Obfuscation Heuristic for Raw Genomic Data Privacy - Kato Mivule
Genomic data provides clinical researchers with vast opportunities to study various patient ailments. Yet the same data contains revealing information, some of which a patient might want to remain concealed. The question then arises: how can an entity transact in full DNA data while concealing certain sensitive pieces of information in the genome sequence, and maintain DNA data utility? As a response to this question, we propose a codon frequency obfuscation heuristic, in which a redistribution of codon frequency values with highly expressed genes is done in the same amino acid group, generating an obfuscated DNA sequence. Our preliminary results show that it might be possible to publish an obfuscated DNA sequence with a desired level of similarity (utility) to the original DNA sequence. http://arxiv.org/abs/1405.5410
Hybrid prediction model with missing value imputation for medical data 2015-g... - Jitender Grover
The document presents a novel hybrid prediction model called HPM-MI that uses K-means clustering and multilayer perceptron (MLP) to improve predictive classification for medical data with missing values. The model first analyzes 11 different imputation techniques using K-means clustering to select the best one for filling missing values in the data. It then uses K-means clustering again to validate class labels and remove incorrectly classified instances before applying the MLP classifier. The model is tested on three medical datasets from the UCI repository and shows improved accuracy, sensitivity, specificity and other metrics compared to existing methods, particularly when datasets have large numbers of missing values.
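As an illustration of the simplest kind of imputation technique such a model compares, column-mean imputation can be sketched as follows. This is a generic stand-in for exposition, not the HPM-MI method itself:

```python
def impute_mean(rows):
    """Replace None entries in each column with that column's mean
    over the observed values."""
    cols = list(zip(*rows))
    means = [sum(v for v in col if v is not None) /
             sum(1 for v in col if v is not None) for col in cols]
    # rebuild each row, substituting the column mean for missing cells
    return [[m if v is None else v for v, m in zip(row, means)]
            for row in rows]

# Toy medical-style records with missing cells
data = [[1.0, 4.0], [None, 6.0], [3.0, None]]
filled = impute_mean(data)
```

The paper's contribution is choosing among eleven such schemes by checking which one best preserves the K-means cluster structure of the data.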
Data mining involves multiple steps in the knowledge discovery process including data cleaning, integration, selection, transformation, mining, and pattern evaluation. It has various functionalities including descriptive mining to characterize data, predictive mining for inference, and different mining techniques like classification, association analysis, clustering, and outlier analysis.
Classification on multi label dataset using rule mining technique - eSAT Publishing House
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of engineering and technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching, and research in the fields of engineering and technology. We bring together scientists, academicians, field engineers, scholars, and students of related fields of engineering and technology.
CONFIGURING ASSOCIATIONS TO INCREASE TRUST IN PRODUCT PURCHASE - IJwest
Clustering is categorizing data into groups of similar objects. Data mining adds to the complexity of clustering a large dataset with many features. Among these datasets are electronic business stores which offer their products through the web. These stores require recommendation systems which can offer the user products they are likely to want. In this study, users' previous purchases are used to present a sorted list of products to the user. Identifying associations related to users and finding centers increases the precision of the recommended list. Configuring associations and creating user profiles are important topics in current studies. In the proposed method, association rules are used to model user interactions on the web: the time a page is visited and the frequency of visits weight the pages and describe the user's interest in page groups. The weight of each transaction item therefore describes the user's interest in that item. Analysis of the results shows that the proposed method presents a more complete model of user behavior because it combines the weight and membership degree of pages simultaneously when ranking candidate pages. This method obtained higher accuracy than other methods, even with a higher number of pages.
This document discusses classification and clustering techniques using the Weka data mining tool. It begins with an introduction to Weka and its capabilities for classification, clustering, and other data mining functions. It then provides an example of using Weka's J48 decision tree algorithm to classify iris flower samples based on sepal and petal attributes. Finally, it demonstrates k-means clustering on customer purchase data from a BMW dealership to group customers into five clusters based on their buying behaviors.
How PROC SQL and SAS® Macro Programming Made My Statistical Analysis Easy? A ... - Venu Perla
Life scientists collect similar types of data on a daily basis, and statistical analysis of this data is often performed using SAS programming techniques. Programming each dataset individually is a time-consuming job. The objective of this paper is to show how SAS programs are created for systematic analysis of raw data to develop a linear regression model for prediction, how PROC SQL can be used to replace several data steps in the code, and how SAS macros built on these programs can be used for routine analysis of similar data in a short period of time without further hard coding.
April 2020 top read articles in data mining & knowledge management proces... - IJDKP
Scope & Topics
Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. There is an urgent need for a new generation of computational theories and tools to assist researchers in extracting useful information from the rapidly growing volumes of digital data.
Performance Analysis of Selected Classifiers in User Profiling - ijdmtaiir
User profiles can serve as indicators of personal preferences which can be effectively used while providing personalized services. Building user profiles which capture accurate information about individuals has been a daunting task. Several attempts have been made by researchers to extract information from different data sources to build user profiles for different application domains. Towards this end, in this paper we employ different classification algorithms to create accurate user profiles based on information gathered from demographic data. The aim of this work is to analyze the performance of five of the most effective classification methods, namely Bayesian Network (BN), Naïve Bayes (NB), Naïve Bayes Updateable (NBU), J48, and Decision Table (DT). Our simulation results show that, in general, J48 has the highest classification accuracy with the lowest error rate. On the other hand, the Naïve Bayes and Naïve Bayes Updateable classifiers require the least time to build the classification model.
HEALTH PREDICTION ANALYSIS USING DATA MINING - Ashish Salve
The health care industry largely relies on assumptions that are later tested and verified via various tests, and patients have to depend on the doctor's knowledge of the topic. We therefore built a system that uses data mining techniques to predict a person's health based on various medical test results. The system is currently designed only for heart conditions; for training it we used the Statlog (Heart) Data Set from the UCI Machine Learning Repository, which includes attributes such as age, sex, chest pain type, cholesterol, blood sugar, and outcome. Only a few general inputs need to be supplied to generate a prediction. The predictions from all algorithms are merged by taking their mean, and that value gives the final outcome of the prediction process, all of which runs in the background.
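The merging step described above, averaging per-model predictions into one outcome, can be sketched as follows. The model names and probability values are hypothetical, not taken from the document:

```python
def merge_predictions(prob_by_model):
    """Average the per-model probabilities of heart disease and threshold
    the mean at 0.5 to get the final binary outcome."""
    mean_p = sum(prob_by_model.values()) / len(prob_by_model)
    return mean_p, mean_p >= 0.5

# Hypothetical per-model probabilities for one patient
mean_p, at_risk = merge_predictions(
    {"decision_tree": 0.8, "naive_bayes": 0.6, "knn": 0.4})
```

Averaging smooths out disagreement between the individual models, so a single over-confident classifier cannot dominate the final outcome.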
Can data analysis help predict the future of your heart health?
The Boston Institute of Analytics (BIA) presents a collection of student presentations on data analysis projects tackling the critical topic of heart attack prediction.
Join us as we delve into the world of healthcare analytics and explore how data can be harnessed to identify individuals at risk of heart attack. These presentations offer valuable insights for:
Medical professionals seeking to develop preventative healthcare strategies
Individuals interested in understanding their own heart health risks
Data analysts passionate about applying data analysis for social good
Here's what you'll learn by watching these presentations:
The power of data analysis in predicting heart attacks
Various data analysis techniques used for risk assessment
Real-world examples of heart attack prediction models
Insights and findings from the research of dedicated BIA students
Empower yourself and others with the knowledge of heart health prediction. Watch these presentations and unlock the potential of data analysis in saving lives!
visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
An efficient feature selection algorithm for health care data analysis - journalBEEI
Diabetes is a silent killer that will slowly kill a person if it goes undetected. Existing systems, which use the F-score method and K-means clustering to check whether a person has diabetes, are not 100% accurate, and anything short of 100% accuracy is hard to accept in the medical field, as it could cost lives. Our proposed system aims to combine some of the best features of existing algorithms for predicting diabetes into a novel algorithm intended to be substantially more accurate in its prediction. With the surge in technological advancements, data mining can be used to predict when a person will be diagnosed with diabetes. Specifically, we analyze the best features of the chi-square algorithm and the advanced clustering algorithm (ACA). This work uses the Pima Indians Diabetes dataset provided by the National Institute of Diabetes and Digestive and Kidney Diseases. Using classification methods, we consider factors such as age, BMI, and blood pressure and the overall importance given to these attributes, single out the most important ones, and use them for the prediction of diabetes.
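The chi-square feature scoring this abstract draws on can be sketched from the definition of the chi-square statistic for a 2x2 contingency table: features whose distribution depends strongly on the class label score high and get selected. The binarized attribute values below are hypothetical, not from the Pima dataset:

```python
def chi_square(feature, labels):
    """Chi-square statistic of a binary feature against a binary label:
    sum over the 2x2 table of (observed - expected)^2 / expected."""
    n = len(feature)
    chi2 = 0.0
    for f in (0, 1):
        for y in (0, 1):
            observed = sum(1 for a, b in zip(feature, labels)
                           if a == f and b == y)
            # expected count under independence of feature and label
            expected = (sum(1 for a in feature if a == f) *
                        sum(1 for b in labels if b == y)) / n
            if expected:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical binarized attributes versus a diabetes label
labels = [1, 1, 1, 0, 0, 0]
bmi_high = [1, 1, 1, 0, 0, 0]   # perfectly aligned with the label
age_high = [1, 0, 1, 0, 1, 0]   # roughly independent of the label
```

Here `chi_square(bmi_high, labels)` is far larger than `chi_square(age_high, labels)`, so a chi-square filter would retain the BMI attribute and discard the uninformative one.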
Software Defect Prediction Using Radial Basis and Probabilistic Neural Networks - Editor IJCATR
This document discusses using neural networks for software defect prediction. It examines the effectiveness of using a radial basis function neural network and a probabilistic neural network on prediction accuracy and defect prediction compared to other techniques. The key findings are that neural networks provide an acceptable level of accuracy for defect prediction but perform poorly at actual defect prediction. Probabilistic neural networks performed consistently better than other techniques across different datasets in terms of prediction accuracy and defect prediction ability. The document recommends using an ensemble of different software defect prediction models rather than relying on a single technique.
This document discusses various privacy preservation techniques in data mining. It summarizes classification, clustering, and association rule learning as common privacy preservation approaches. For classification, it describes decision trees, k-nearest neighbors, artificial neural networks, support vector machines, and naive Bayes models. It provides advantages and disadvantages of these techniques. The document concludes that privacy preservation techniques have emerged to allow for efficient and effective data mining while protecting sensitive data.
International Journal of Computational Engineering Research(IJCER)ijceronline
International Journal of Computational Engineering Research(IJCER) is an intentional online Journal in English monthly publishing journal. This Journal publish original research work that contributes significantly to further the scientific knowledge in engineering and Technology.
Weka is a collection of machine learning algorithms for data mining tasks. It contains tools for data preprocessing, classification, regression, clustering, association rule mining and visualization. Weka has algorithms for classification, clustering, finding associations, attribute selection, and data visualization. It allows loading data from files, databases or URLs, exploring and visualizing data, and comparing machine learning models.
This document summarizes and compares different classification algorithms that can be used for disease prediction in data mining. It first introduces disease prediction and classification processes. It then reviews related works that have used various classification algorithms like random forest, support vector machine, and naive Bayes for tasks like disease diagnosis, text classification, and rainfall forecasting. The document also discusses supervised, unsupervised, and semi-supervised machine learning. It provides details on support vector machine and random forest algorithms, describing how each works and is used for classification. Finally, it analyzes the random forest algorithm construction process.
An Efficient Approach for Asymmetric Data ClassificationAM Publications
In many classification problems, the number of targets (eg. intruders) present is very small compared with
the number of clutter objects. Traditional classification approaches usually ignore this class imbalance, causing
performance to experience low accordingly. In contrary, the algorithm considerably imbalanced logistic regression
(IILR) algorithm explicitly addresses class imbalance in its formulation. I am proposing this algorithm and give the
details necessary to employ it for intrusion detection data sets characterized by class imbalance.
WEKA is an open source data mining and machine learning software written in Java. It contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. WEKA was developed at the University of Waikato and contains algorithms for classification like decision trees, clustering like k-means, and preprocessing tools. The document provides examples of using WEKA's clustering and decision tree classification algorithms on sample investment data to segment investors and predict investment choices.
A random decision tree frameworkfor privacy preserving data miningVenkat Projects
This document summarizes a framework for privacy-preserving data mining using a random decision tree algorithm. Multiple parties like banks, insurance companies, and credit card companies share data but need to keep certain attributes private. The random decision tree algorithm partitions data based on each party's needs, encrypts the data using homomorphic encryption, builds a decision tree model on the encrypted data, and allows parties to classify new instances while preserving privacy. It compares the accuracy of random decision trees to traditional ID3 decision trees.
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUEijcsit
The huge amount of healthcare data, coupled with the need for data analysis tools, has made data mining
an interesting research area. Data mining tools and techniques help to discover and understand hidden
patterns in a dataset that may not be apparent from visualization of the data alone. Selecting an appropriate
clustering method and the optimal number of clusters for healthcare data can be confusing and difficult.
Presently, a large number of clustering algorithms are available for clustering healthcare data, but
it is very difficult for people with little knowledge of data mining to choose suitable clustering algorithms.
This paper aims to analyze clustering techniques on a healthcare dataset in order to determine suitable
algorithms which can produce well-optimized clusters. The performances of two clustering algorithms (K-means
and DBSCAN) were compared using Silhouette score values. Firstly, we analyzed the K-means
algorithm using different numbers of clusters (K) and different distance metrics. Secondly, we analyzed the
DBSCAN algorithm using different minimum numbers of points required to form a cluster (minPts) and
different distance metrics. The experimental results indicate that both the K-means and DBSCAN algorithms
have strong intra-cluster cohesion and inter-cluster separation. Based on the analysis, the K-means algorithm
performed better compared to the DBSCAN algorithm in terms of clustering accuracy and execution time.
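The comparison described above can be sketched in a few lines. This is a minimal illustration (not the paper's code), assuming scikit-learn and synthetic blob data in place of the healthcare dataset; the parameter grids mirror the abstract's setup of varying K for K-means and minPts for DBSCAN, scoring each run with the silhouette value.

```python
# Hedged sketch: compare K-means and DBSCAN via silhouette scores on
# synthetic data standing in for the healthcare dataset.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# K-means: try several values of K and report the silhouette for each.
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"K-means  k={k}: silhouette = {silhouette_score(X, labels):.3f}")

# DBSCAN: vary minPts (min_samples); silhouette needs at least 2 clusters.
for min_pts in (3, 5, 10):
    labels = DBSCAN(eps=0.9, min_samples=min_pts).fit_predict(X)
    if len(set(labels)) > 1:
        print(f"DBSCAN minPts={min_pts}: silhouette = {silhouette_score(X, labels):.3f}")
```

Scores near 1 indicate strong cohesion and separation; the same loop structure extends to different distance metrics via the `metric` argument of `silhouette_score`.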
Evaluation Mechanism for Similarity-Based Ranked Search Over Scientific DataAM Publications
The aim of this paper is to provide an essential and efficient method to retrieve data profiles stored in a particular storage database, such as a scientific database. Our country succeeded in its Mars mission on the first attempt; for information about such an important mission, the data should be retrieved safely and as quickly as possible. Keeping this in mind, we have tried to implement and provide the fastest possible information retrieval technique, which can lead to ever better retrieval speed for future missions. Here, we use Information Retrieval (IR)-style ranked search. We contend that IR-style ranked search can be applied to keyword queries to help an expert uncover the connections between the many keyword matches in large collections of records, much as content-based ranked search helps users make sense of the large body of web content. To test this hypothesis, we introduce ranked search over a current multi-terabyte experimental scientific dataset as our test bed, and we assess how different similarity measures, and hence ranked search, perform on differing data.
A Codon Frequency Obfuscation Heuristic for Raw Genomic Data PrivacyKato Mivule
Genomic data provides clinical researchers with vast opportunities to study various patient ailments. Yet the same data contains revealing information, some of which a patient might want to remain concealed. The question then arises: how can an entity transact in full DNA data while concealing certain sensitive pieces of information in the genome sequence, and maintain DNA data utility? As a response to this question, we propose a codon frequency obfuscation heuristic, in which a redistribution of codon frequency values with highly expressed genes is done in the same amino acid group, generating an obfuscated DNA sequence. Our preliminary results show that it might be possible to publish an obfuscated DNA sequence with a desired level of similarity (utility) to the original DNA sequence. http://arxiv.org/abs/1405.5410
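The core of the heuristic is that synonymous codons encode the same amino acid, so frequencies can be redistributed without changing the protein. The following is a hypothetical illustration of that idea only, not the authors' heuristic: the `SYNONYMS` table is a tiny slice of the standard codon table, and the random-swap policy is a deliberate simplification of their frequency-redistribution step.

```python
# Hedged sketch: replace codons with synonyms from the same amino-acid
# group, so the encoded protein is unchanged while codon usage shifts.
import random

# A tiny slice of the standard codon table: serine and leucine groups.
SER = ["TCT", "TCC", "TCA", "TCG"]
LEU = ["CTT", "CTC", "CTA", "CTG"]
SYNONYMS = {c: SER for c in SER} | {c: LEU for c in LEU}

def obfuscate(seq: str, rng: random.Random) -> str:
    """Swap each codon for a random synonym; unknown codons pass through."""
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return "".join(rng.choice(SYNONYMS.get(c, [c])) for c in codons)

original = "TCTCTTTCATCG"                    # Ser-Leu-Ser-Ser
print(obfuscate(original, random.Random(0)))  # same amino acids, codons may differ
```

Utility here corresponds to how similar the obfuscated sequence remains to the original; the paper's heuristic constrains the redistribution to highly expressed genes rather than swapping uniformly at random.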
Hybrid prediction model with missing value imputation for medical data 2015-g...Jitender Grover
The document presents a novel hybrid prediction model called HPM-MI that uses K-means clustering and multilayer perceptron (MLP) to improve predictive classification for medical data with missing values. The model first analyzes 11 different imputation techniques using K-means clustering to select the best one for filling missing values in the data. It then uses K-means clustering again to validate class labels and remove incorrectly classified instances before applying the MLP classifier. The model is tested on three medical datasets from the UCI repository and shows improved accuracy, sensitivity, specificity and other metrics compared to existing methods, particularly when datasets have large numbers of missing values.
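The pipeline above can be sketched compactly. This is a simplified stand-in for HPM-MI under stated assumptions (scikit-learn, synthetic data, three imputation strategies instead of eleven, adjusted Rand index as the cluster-agreement score); all names and thresholds are illustrative, not the paper's.

```python
# Hedged sketch of the HPM-MI idea: score competing imputations by how well
# K-means clusters agree with the labels, then train an MLP on the winner.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_missing = X.copy()
X_missing[np.random.default_rng(0).random(X.shape) < 0.1] = np.nan  # ~10% missing

best = None
for strategy in ("mean", "median", "most_frequent"):
    X_imp = SimpleImputer(strategy=strategy).fit_transform(X_missing)
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_imp)
    score = adjusted_rand_score(y, clusters)  # cluster/label agreement
    if best is None or score > best[0]:
        best = (score, strategy, X_imp)

print("best imputation:", best[1])
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                    random_state=0).fit(best[2], y)
print("train accuracy:", round(clf.score(best[2], y), 3))
```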
Data mining involves multiple steps in the knowledge discovery process including data cleaning, integration, selection, transformation, mining, and pattern evaluation. It has various functionalities including descriptive mining to characterize data, predictive mining for inference, and different mining techniques like classification, association analysis, clustering, and outlier analysis.
Classification on multi label dataset using rule mining techniqueeSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
CONFIGURING ASSOCIATIONS TO INCREASE TRUST IN PRODUCT PURCHASEIJwest
Clustering is categorizing data into groups with similar objects. Data mining adds to complexities of clustering a large dataset with various features. Among these datasets, there are electronic business stores which offer their products through web. These stores require recommendation systems which can offer products to the user which the user might require them with higher probability. In this study, previous purchases of users are used to present a sorted list of products to the user. Identifying associations related to users and finding centers increases precision of the recommended list. Configuration of associations and creating a profile for users is important in current studies. In the proposed method, association rules are presented to model user interactions in the web which use time that a page is visited and frequency of visiting a page to weight pages and describes users’ interest to page groups. Therefore, weight of each transaction item describes user’s interest in that item. Analyzing results show that the proposed method presents a more complete model of users’ behavior because it combines weight and membership degree of pages simultaneously for ranking candidate pages. This method has obtained higher accuracy compared to other methods even in higher number of pages.
This document discusses classification and clustering techniques using the Weka data mining tool. It begins with an introduction to Weka and its capabilities for classification, clustering, and other data mining functions. It then provides an example of using Weka's J48 decision tree algorithm to classify iris flower samples based on sepal and petal attributes. Finally, it demonstrates k-means clustering on customer purchase data from a BMW dealership to group customers into five clusters based on their buying behaviors.
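Weka's J48 is an implementation of C4.5; as a rough Python analogue (an assumption on our part, not the document's Weka workflow), scikit-learn's CART decision tree can classify the same iris data from sepal and petal measurements.

```python
# Hedged sketch: decision-tree classification of iris, mirroring the
# J48-on-iris example described above but using scikit-learn's CART.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", round(tree.score(X_test, y_test), 3))
```

The learned tree can be inspected with `sklearn.tree.export_text(tree)`, which plays a similar role to Weka's tree visualizer.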
How PROC SQL and SAS® Macro Programming Made My Statistical Analysis Easy? A ...Venu Perla
Life scientists collect similar types of data on a daily basis, and statistical analysis of this data is often performed with SAS programming techniques. Writing a program for each dataset is a time-consuming job. The objective of this paper is to show how SAS programs are created for systematic analysis of raw data to develop a linear regression model for prediction, how PROC SQL can replace several data steps in the code, and finally how SAS macros built on these programs can be used for routine analysis of similar data in a short period of time without any further hard coding.
April 2020 top read articles in data mining & knowledge management proces...IJDKP
Scope & Topics
Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. There is an urgent need for a new generation of computational theories and tools to assist researchers in extracting useful information from the rapidly growing volumes of digital data.
Performance Analysis of Selected Classifiers in User Profilingijdmtaiir
User profiles can serve as indicators of personal
preferences which can be effectively used when providing
personalized services. Building user profiles which capture
accurate information about individuals has been a daunting task.
Several attempts have been made by researchers to extract
information from different data sources to build user profiles
in different application domains. Towards this end, in this
paper we employ different classification algorithms to create
accurate user profiles based on information gathered from
demographic data. The aim of this work is to analyze the
performance of five of the most effective classification methods,
namely Bayesian Network (BN), Naive Bayes (NB), Naive
Bayes Updateable (NBU), J48, and Decision Table (DT). Our
simulation results show that, in general, J48 has the highest
classification accuracy with the lowest error rate.
On the other hand, it is found that the Naive Bayes and Naive
Bayes Updateable classifiers require the least time
to build the classification model.
HEALTH PREDICTION ANALYSIS USING DATA MININGAshish Salve
As we know, the health care industry relies on hypotheses that are tested and verified via various medical tests, with the patient depending on the doctor's knowledge of the topic. We therefore built a system that uses data mining techniques to predict the health of a person based on various medical test results. The system is currently designed only for heart conditions; for training it uses the Statlog (Heart) Data Set from the UCI Machine Learning Repository, which includes attributes such as age, sex, chest pain type, cholesterol, blood sugar, and outcomes. Only a few general inputs are needed to generate a prediction. The predictions from all algorithms are then merged by calculating their mean value, and that value gives the actual outcome of the prediction process, which runs entirely in the background.
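The merging-by-mean step described above can be sketched as follows. This is a hedged illustration, not the system's code: scikit-learn classifiers and synthetic data with 13 heart-like attributes stand in for the actual algorithms and the Statlog dataset.

```python
# Hedged sketch: merge several classifiers' predicted probabilities by
# their mean, then threshold the mean to get the final prediction.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=13, random_state=1)
models = [LogisticRegression(max_iter=1000), GaussianNB(),
          DecisionTreeClassifier(random_state=1)]
probs = [m.fit(X, y).predict_proba(X)[:, 1] for m in models]

mean_prob = np.mean(probs, axis=0)          # merge by mean, as described
prediction = (mean_prob >= 0.5).astype(int)
print("agreement with labels:", round((prediction == y).mean(), 3))
```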
Can data analysis help predict the future of your heart health?
The Boston Institute of Analytics (BIA) presents a collection of student presentations on data analysis projects tackling the critical topic of heart attack prediction.
Join us as we delve into the world of healthcare analytics and explore how data can be harnessed to identify individuals at risk of heart attack. These presentations offer valuable insights for:
Medical professionals seeking to develop preventative healthcare strategies
Individuals interested in understanding their own heart health risks
Data analysts passionate about applying data analysis for social good
Here's what you'll learn by watching these presentations:
The power of data analysis in predicting heart attacks
Various data analysis techniques used for risk assessment
Real-world examples of heart attack prediction models
Insights and findings from the research of dedicated BIA students
Empower yourself and others with the knowledge of heart health prediction. Watch these presentations and unlock the potential of data analysis in saving lives!
visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
An efficient feature selection algorithm for health care data analysisjournalBEEI
Diabetes is a silent killer which will slowly kill a person if it goes undetected. The existing systems, which use the F-score method and K-means clustering to check whether a person has diabetes, are not 100% accurate, and anything less than 100% is not acceptable in the medical field, as it could cost many lives. Our proposed system aims to take some of the best features of the existing algorithms for predicting diabetes and combine them into a novel algorithm, which aims to be 100% accurate in its prediction. With the surge in technological advancements, we can use data mining to predict when a person is likely to be diagnosed with diabetes. Specifically, we analyze the best features of the chi-square algorithm and the advanced clustering algorithm (ACA). This research uses the Pima Indian Diabetes dataset provided by the National Institute of Diabetes and Digestive and Kidney Diseases. Using classification methods, we consider factors such as age, BMI, and blood pressure, weigh the overall importance of these attributes, single out the most informative ones, and use them for the prediction of diabetes.
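The chi-square feature-selection step mentioned above can be sketched briefly. This is an assumed illustration (not the paper's ACA implementation), with synthetic data standing in for the Pima attributes; chi-square scoring requires non-negative features, hence the shift.

```python
# Hedged sketch: rank Pima-style attributes by chi-square score and keep
# the top k before any downstream prediction step.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

X, y = make_classification(n_samples=200, n_features=8, random_state=3)
X = X - X.min(axis=0)  # shift to non-negative, as chi2 requires

selector = SelectKBest(chi2, k=4).fit(X, y)
print("selected feature indices:", np.flatnonzero(selector.get_support()))
```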
Software Defect Prediction Using Radial Basis and Probabilistic Neural NetworksEditor IJCATR
This document discusses using neural networks for software defect prediction. It examines the effectiveness of using a radial basis function neural network and a probabilistic neural network on prediction accuracy and defect prediction compared to other techniques. The key findings are that neural networks provide an acceptable level of accuracy for defect prediction but perform poorly at actual defect prediction. Probabilistic neural networks performed consistently better than other techniques across different datasets in terms of prediction accuracy and defect prediction ability. The document recommends using an ensemble of different software defect prediction models rather than relying on a single technique.
MULTI MODEL DATA MINING APPROACH FOR HEART FAILURE PREDICTIONIJDKP
Developing predictive modelling solutions for risk estimation is extremely challenging in health-care
informatics. Risk estimation involves the integration of heterogeneous clinical sources with different
representations from different health-care providers, making the task increasingly complex. Such sources are
typically voluminous and diverse, and change significantly over time. Therefore, distributed and parallel
computing tools, collectively termed big data tools, are needed to synthesize the data and assist the physician
in making the right clinical decisions. In this work we propose a multi-model predictive architecture, a novel
approach that combines the predictive ability of multiple models for better prediction accuracy. We
demonstrate the effectiveness and efficiency of the proposed work on data from the Framingham Heart Study.
Results show that the proposed multi-model predictive architecture provides better accuracy than
the best-model approach. By modelling the error of the predictive models we are able to choose a subset of
models which yields accurate results. More information was modelled into the system by multi-level mining,
which further enhanced predictive accuracy.
This document describes a disease prediction system that uses machine learning algorithms like decision trees, random forests and naive Bayes to predict a disease based on symptoms provided by a patient. The researchers developed a logistic regression model to take in symptoms and predict the likely disease. It was created using Python and aims to help busy professionals more easily identify health issues before they become serious. The system was built using techniques like data collection, preprocessing, model training/evaluation and aims to improve performance over iterations. It was found to provide time savings and early disease warnings compared to traditional diagnosis methods.
Controlling informative features for improved accuracy and faster predictions...Damian R. Mingle, MBA
Identification of suitable biomarkers for accurate prediction of phenotypic outcomes is a goal for personalized medicine. However, current machine learning approaches are either too complex or perform poorly.
For more information:
http://societyofdatascientists.com/controlling-informative-features-for-improved-accuracy-and-faster-predictions-in-omentum-cancer-models/?src=slideshare
This document discusses approaches to avoid potential biases in data mining. It proposes two post-processing methods to prevent biased rules from being generated during data mining. The first method uses post-processing to generate a small set of high-quality predictive rules directly from the dataset to prevent biased rules. The second method uses categorization with the least amount of bias to also prevent biased rules. Previous work on detecting, measuring, and preventing biases during data mining is also reviewed, including direct and indirect bias prevention techniques as well as legal issues around identifying biases in datasets.
Data mining involves identifying patterns and relationships within large datasets. Techniques like association rule mining, feature selection, dimensionality reduction, and classification algorithms can be applied to datasets like gene expression levels from microarray experiments or health insurance records. The results of clustering and classifying patient data based on gene expression or clinical diagnoses can provide useful medical insights. Machine learning is a subfield of artificial intelligence concerned with algorithms that allow computers to learn from data without being explicitly programmed. It is closely related to data mining and statistics and has many applications including natural language processing, search engines, and medical diagnosis.
MLTDD : USE OF MACHINE LEARNING TECHNIQUES FOR DIAGNOSIS OF THYROID GLAND DIS...cscpconf
Machine learning algorithms are now used to diagnose many diseases, thanks to important improvements in classification algorithms together with large data sets and high-performing computational units, all of which have increased the accuracy of these methods. The diagnosis of thyroid gland disorders is one important application of classification. This study focuses on thyroid gland diseases caused by underactive or overactive thyroid glands. The dataset used for the study was taken from the UCI repository. Classifying this thyroid disease dataset with a decision tree algorithm was a considerable task. The overall
prediction accuracy is 100% for training and between 98.7% and 99.8% for testing. In this study, we developed the Machine Learning tool for Thyroid Disease Diagnosis (MLTDD), an intelligent thyroid gland disease prediction tool written in Python that can effectively help to make the right decision; it was designed using PyDev, a Python IDE for Eclipse.
We developed a real-time, visual analytics tool for clinical decision support. The system expands the “recall of past experience” approach that a provider (physician) uses to formulate a course of action for a given patient. By utilizing Big-Data techniques, we enable the provider to recall all similar patients from an institution’s electronic medical record (EMR) repository, to explore “what-if” scenarios, and to collect these evidence-based cohorts for future statistical validation and pattern mining.
The document describes a project that aims to develop a smart health prediction web application using data mining concepts. It takes user inputs like age, gender, blood pressure, height, weight and exercise habits and matches them to data in a MySQL database to predict potential diseases and recommend diet and exercise plans. A rule-based methodology is used where users are categorized into different sets based on their inputs and predefined rules. The project uses HTML, CSS, JavaScript for the front-end and the MySQL database to store and retrieve user data.
The document describes a smart health prediction system that allows users to input symptoms and personal health details. It then uses data mining and rule-based methodology to analyze the inputs, check values like BMI and blood pressure, and predict potential diseases and recommend treatments. The system was developed using technologies like HTML, CSS, JavaScript, MySQL database to store input data and classify users into categories for analysis and outputting health predictions and advice.
The document describes a smart health prediction system that allows users to input symptoms and personal health details. It then uses data mining and rule-based methodology to analyze the inputs, check values like BMI and blood pressure, and predict potential diseases or illnesses based on matches with conditions in the system's database. The system is intended to provide online health guidance and diagnosis when a doctor may not be immediately available. It was developed using technologies like HTML, CSS, JavaScript, MySQL, and follows a knowledge discovery process to predict diseases from user-provided data.
Dr. Kamran Sartipi has extensive experience in research and innovation across several fields including software engineering, data analytics, information security, and healthcare informatics. He has published over 100 papers and books on topics such as software system analysis, architecture recovery, decision support systems, and security and privacy in distributed systems. Currently, he is leading two large research projects involving intelligent middleware security, user behavior pattern discovery, and knowledge extraction from medical data across multiple data centers.
This document introduces digital biomarkers and their use in image classification algorithms. It discusses how digital biomarkers are extracted from images as quantifiable features and optimized to develop multivariate classifiers. The document outlines Contiguity's approach, which extracts obvious and non-obvious features to generate digital biomarkers from histology images. These biomarkers are optimized and combined in classification algorithms. Contiguity applied this method to the CAMELYON16 Grand Challenge dataset, analyzing lymph node images to detect cancer metastases through sampling, filtering, and decision tree classification.
A NOVEL APPROACH TO ERROR DETECTION AND CORRECTION OF C PROGRAMS USING MACHIN...IJCI JOURNAL
There has always been a struggle for programmers to identify errors while executing a program, be they
syntactical or logical. This struggle has led to research into the identification of syntactical and logical
errors. This paper surveys those research works which can be used to identify errors
and proposes a new model, based on machine learning and data mining, which can detect logical and
syntactical errors by correcting them or providing suggestions. The proposed work is based on the use of
hashtags to identify each correct program uniquely, which in turn can be compared with a logically
incorrect program in order to identify errors.
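One plausible reading of the tagging step is hashing a normalized form of each correct program so that logically identical programs share a tag. The sketch below is our assumption, not the paper's design; the normalization (stripping `//` comments, collapsing whitespace) and the tag length are illustrative choices.

```python
# Hedged sketch: tag programs by hashing a comment- and
# whitespace-normalized form of the source text.
import hashlib
import re

def program_tag(source: str) -> str:
    """Strip // comments and collapse whitespace, then hash the result."""
    no_comments = re.sub(r"//[^\n]*", "", source)
    normalized = " ".join(no_comments.split())
    return hashlib.sha256(normalized.encode()).hexdigest()[:12]

correct = "int main() { return 0; }  // reference"
student = "int main()  {\n  return 0;\n}"
print(program_tag(correct) == program_tag(student))  # True: same normalized form
```

A mismatching tag would flag the student program as a candidate for the error-localization step the paper proposes.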
This document summarizes a project that uses K-means clustering to analyze Twitter data and predict people's reactions to COVID-19 vaccines. The project uses a dataset of COVID-19 vaccine tweets from Kaggle and applies natural language processing and machine learning techniques like preprocessing, sentiment analysis, and unsupervised clustering to classify tweets as expressing positive, negative or neutral sentiment. It then evaluates the model's accuracy in predicting sentiment on test tweet data.
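The clustering stage of that pipeline can be sketched in a few lines. This is a hedged stand-in for the project's code, assuming TF-IDF features and scikit-learn's KMeans; the sample tweets are invented for illustration.

```python
# Hedged sketch: vectorize tweets with TF-IDF, then group them with
# unsupervised K-means into three clusters (e.g. positive/negative/neutral).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [
    "got my covid vaccine today, feeling great",
    "so thankful for the vaccine rollout",
    "worried about vaccine side effects",
    "side effects after the shot were awful",
    "vaccine appointment booked for next week",
    "not sure how I feel about the new vaccine",
]

X = TfidfVectorizer(stop_words="english").fit_transform(tweets)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for tweet, label in zip(tweets, labels):
    print(label, tweet)
```

In the project described above, the cluster assignments would then be compared against sentiment-analysis labels to evaluate accuracy on held-out tweets.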
Similar to A WEB REPOSITORY SYSTEM FOR DATA MINING IN DRUG DISCOVERY (20)
Rainfall intensity duration frequency curve statistical analysis and modeling...bijceesjournal
Using 41 years of data (1981−2020) from Patna, India, the study's goal is to analyze trends in how often it rains on a weekly, seasonal, and annual basis. First, the historical rainfall dataset for Patna, India, over the 41-year period (1981−2020) was evaluated for quality by statistically analyzing rainfall using the intensity-duration-frequency (IDF) curve and its relationships. Changes in the hydrologic cycle as a result of increased greenhouse gas emissions are expected to induce variations in the intensity, length, and frequency of precipitation events. One strategy to lessen vulnerability is to quantify probable changes and adapt to them. Techniques such as the log-normal, normal, and Gumbel (EV-I) distributions are used. Distributions were created with durations of 1, 2, 3, 6, and 24 h and return periods of 2, 5, 10, 25, and 100 years. Mathematical correlations between rainfall and recurrence interval were also derived.
Findings: The Gumbel approach produced the highest intensity values, whereas the other approaches produced values close to one another. The data indicate that 461.9 mm of rain fell during the monsoon season's 301st week. However, it was found that the 29th week had the greatest average rainfall, 92.6 mm. With 952.6 mm on average, the monsoon season saw the highest rainfall. Calculations revealed that annual rainfall averaged 1171.1 mm. Using Weibull's method, the study was then expanded to examine rainfall distribution at recurrence intervals of 2, 5, 10, and 25 years. Mathematical correlations between rainfall and recurrence interval were also developed. Further regression analysis revealed that short-wave irradiation, wind direction, wind speed, pressure, relative humidity, and temperature all had a substantial influence on rainfall.
Originality and value: The results of the rainfall IDF curves can provide useful information to policymakers in making appropriate decisions in managing and minimizing floods in the study area.
Introduction: e-waste definition, sources of e-waste, hazardous substances in e-waste, effects of e-waste on environment and human health, need for e-waste management, e-waste handling rules, waste minimization techniques for managing e-waste, recycling of e-waste, disposal and treatment methods of e-waste, mechanism of extraction of precious metals from leaching solution, global scenario of e-waste, e-waste in India, case studies.
Build the Next Generation of Apps with the Einstein 1 Platform.
Join Philippe Ozil for a workshop session that will guide you through the details of the Einstein 1 platform, the importance of data for building artificial intelligence applications, and the various tools and technologies Salesforce offers to bring you the full benefits of AI.
Advanced control scheme of doubly fed induction generator for wind turbine us...IJECEIAES
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
AI for Legal Research with applications, toolsmahaffeycheryld
AI applications in legal research include rapid document analysis, case law review, and statute interpretation. AI-powered tools can sift through vast legal databases to find relevant precedents and citations, enhancing research accuracy and speed. They assist in legal writing by drafting and proofreading documents. Predictive analytics help foresee case outcomes based on historical data, aiding in strategic decision-making. AI also automates routine tasks like contract review and due diligence, freeing up lawyers to focus on complex legal issues. These applications make legal research more efficient, cost-effective, and accessible.
Software Engineering and Project Management - Introduction, Modeling Concepts...Prakhyath Rai
Introduction, Modeling Concepts and Class Modeling: What is Object orientation? What is OO development? OO Themes; Evidence for usefulness of OO development; OO modeling history. Modeling
as Design technique: Modeling, abstraction, The Three models. Class Modeling: Object and Class Concept, Link and associations concepts, Generalization and Inheritance, A sample class model, Navigation of class models, and UML diagrams
Building the Analysis Models: Requirement Analysis, Analysis Model Approaches, Data modeling Concepts, Object Oriented Analysis, Scenario-Based Modeling, Flow-Oriented Modeling, class Based Modeling, Creating a Behavioral Model.
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Sinan KOZAK
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build cache optimizations. Sinan shares their journey into solving complex build-cache problems that affect Gradle builds. By understanding the challenges and solutions found in our journey, we aim to demonstrate the possibilities for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...IJECEIAES
Climate change's impact on the planet has forced the United Nations and governments to promote green energy and electric transportation. The deployment of photovoltaic (PV) and electric vehicle (EV) systems has gained strong momentum due to their numerous advantages over fossil fuels, advantages that go beyond sustainability to include financial support and stability. The work in this paper introduces a hybrid PV and EV system to support industrial and commercial plants. The paper covers the theoretical framework of the proposed hybrid system, including the equations required to complete the cost analysis when PV and EV are present. In addition, the proposed design diagram, which sets the priorities and requirements of the system, is presented. The proposed approach allows plants to improve their power stability, especially during power outages. The information presented supports researchers and plant owners in completing the necessary analysis while promoting the deployment of clean energy. The results of a case study of a dairy farm support the theoretical work and highlight its benefits for existing plants. The short return on investment of the proposed approach underlines the paper's novelty for sustainable electrical systems. In addition, the proposed system allows for an isolated power setup without the need for a transmission line, which enhances the safety of the electrical network.
A WEB REPOSITORY SYSTEM FOR DATA MINING IN DRUG DISCOVERY
International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.10, No.1, January 2020
DOI:10.5121/ijdkp.2020.10101
A WEB REPOSITORY SYSTEM FOR DATA MINING IN
DRUG DISCOVERY
Jiali Tang, Jack Wang and Ahmad Reza Hadaegh
Department of Computer Science and Information System, California State University
San Marcos, San Marcos, USA
ABSTRACT
This project produces a repository database system of drugs, drug features (properties), and drug
targets in which data can be mined and analyzed. Drug targets are the proteins that drugs bind to in
order to stop the proteins' activities. Users can mine the database to predict which chemical
properties determine the relative efficacy of a drug against a specific target, along with the
coefficient for each chemical property. The database system can be equipped with different data
mining approaches/algorithms, such as linear, non-linear, and classification types of data
modelling, and the data models have been enhanced with Genetic Evolution (GE) algorithms. This
paper discusses the implementation of the linear data models: Multiple Linear Regression (MLR),
Partial Least Squares Regression (PLSR), and Support Vector Machine (SVM).
KEYWORDS
Data Mining, Drug Discovery, Drug Description, Chemoinformatics, and Web Application
1. INTRODUCTION
Data mining is the process of extracting data, analyzing it from many dimensions and
perspectives, and then producing a summary of the information in a useful form that identifies
relationships within the data. There are two types of data mining: descriptive and predictive.
Descriptive data mining gives information about existing data, while predictive data mining
makes forecasts based on the data [15]. This project takes the predictive approach by training
and testing a series of predictive models on a provided matrix of descriptor values, which
describe the chemical properties of a list of drug compounds. The rows in the matrix represent the
data associated with each specific compound and the columns represent the descriptor values
associated with each common property of all the compounds. The prediction criteria are pIC50
values, the negative logarithms of the compounds’ IC50 values, which represent the
compound/substance concentration required for 50% inhibition of the compounds’ intended
targets. Mathematically, we can view this as follows:
Y = βX + c, which is equal to Y = β1X1 + β2X2 + β3X3 + β4X4 + …… + βnXn + c
where Y refers to the prediction criterion, β refers to the model's coefficients, X refers to the
values of the selected descriptors, and "c" refers to the prediction error between βX and Y. pIC50
is the negative log(IC50). Thus, the larger the value of the pIC50, and by extension the lower the
value of the IC50, the more potent the compound is. In this project, the predictive models were
generated using genetic evolution algorithms: Genetic Algorithm (GA), Differential Evolution (DE),
Binary Particle Swarm Optimization (BPSO), and a hybrid form of DE with BPSO (DE-BPSO) [1-14].
The model coefficients β are calculated by correlating the compounds' descriptor values X with
their pIC50 values, yielding a selected property set. To train the models, we utilized linear machine
learning algorithms such as Partial Least-Squares Regression [14], Support Vector Machines
Regression, and Multiple Linear Regression [13], along with Multi-Layer Perceptron neural
networks.
This step will identify the models that most accurately predict both high- and low-efficacy
compounds and display the properties of a compound that are most useful in prediction. Using
Data Mining with Genetic Evolutionary algorithms will allow users to build high-quality
predictive models for use in drug discovery.
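A deliberately simplified sketch of how a genetic algorithm can drive descriptor (feature) selection follows. The fitness function here is just the negative training error of a least-squares fit; the paper's actual GA, DE, BPSO, and DE-BPSO implementations [1-14] are more elaborate.

```python
import random
import numpy as np

def fitness(mask, X, y):
    """Score a descriptor subset (binary mask) by the error of a least-squares fit."""
    cols = [i for i, bit in enumerate(mask) if bit]
    if not cols:
        return float("-inf")  # an empty subset cannot predict anything
    A = np.hstack([X[:, cols], np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return -float(np.sum((A @ coef - y) ** 2))

def ga_select(X, y, pop_size=20, generations=30, mut_rate=0.1, seed=0):
    """Evolve a population of binary descriptor masks toward better fitness."""
    rng = random.Random(seed)
    n = X.shape[1]
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(m, X, y), reverse=True)
        survivors = pop[: pop_size // 2]              # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n)                 # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(n):                        # bit-flip mutation
                if rng.random() < mut_rate:
                    child[i] = 1 - child[i]
            children.append(child)
        pop = survivors + children
    return max(pop, key=lambda m: fitness(m, X, y))
```

In practice the fitness would be measured on held-out data to reward models that generalize, rather than on training error as in this sketch.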
2. RELATED WORKS
The work of Zhong et al. [16] concentrates on the role of Artificial Intelligence (AI) on Drug
Discovery. Drug discovery processes have successfully applied Computer-Aided Drug Design
(CADD) techniques at certain stages to reduce development costs and risks for preclinical and
clinical trials. However, according to Zhong, the decision logic of AI-based models is still
difficult to explain. Our modelling techniques do not have any AI flavour; they are simply based on
Quantitative Structure-Activity Relationship (QSAR) modelling.
Another project by Varsou et al. [17] used several representative case studies from drug discovery
and computational toxicology to develop a cheminformatics platform, Enalos Suite, using open
source software. Enalos Suite (http://enalossuite.novamechanics.com/) was designed and
developed as a tool to address a variety of cheminformatics problems; expedite tasks performed
in predictive modelling; and allow access, data mining and manipulation for multiple chemical
databases.
Enalos Suite allows for user extension and customization to better tailor its functionality for the
user's particular field of interest: nanoinformatics, biomedical applications, etc. One of the
major differences between our work and Varsou's is that we have also used Genetic Evolutionary
techniques to enhance the training of the models.
Other works include "Data Mining and Computational Modelling of High-Throughput Screening
Datasets" by Ekins [18], "Web-based Drug Repurposing Tools" by Sam E and Athri P [19],
"DRUG Discovery Using Data Mining" by Charanpreet Kaur and Shweta Bhardwaj [20], and
"Using Genetic Algorithms for Data Mining Optimization in an Educational Web-Based System"
by Behrouz Minaei-Bidgoli et al. [21].
Some of these papers may provide part of the Data Mining functionalities we present in our work,
but none are equipped with a strong backend database as we offer in our work.
Figure 1: Interaction of the user with data in the database and the programs
3. SYSTEM REQUIREMENTS
Users can utilize this database system by simply uploading their data files and creating data
mining tasks. The users can be university professors, students, scientists or researchers.
Users who create accounts can access more extensive data management applications such as
editing and deleting their data from the database. They can also modify and cancel their data
mining tests.
● Manage Account Information: This includes creating an account, recovering a
forgotten password, and editing account information (e.g. name, email, password).
● Manage Datasets: This includes uploading/editing/deleting datasets, listing all the
datasets uploaded by the user, setting the configuration for the type, disease, and
target of the data, checking the information about datasets, and
searching/downloading datasets.
● Manage Data Mining Tasks: This includes creating/editing/deleting data mining
tasks; setting the configuration for the disease, target, model, and algorithm of
the tasks; listing all the data mining tasks owned by a user; checking the data
mining tasks' progress information; searching/sorting tasks by date, name,
disease, target, model, algorithm, or dataset; and downloading the results of tasks.
4. SYSTEM OVERVIEW AND DESIGN
Figure 2: Entity Relationship Diagram
4.1 OVERVIEW
The “CSUSM Chemoinfo Drug Discovery Database System” was developed in PHP and is
hosted by an AWS LightSail server that mounts Apache Web Service, MySQL and PHP. User
information is stored in a MySQL database. All the data files uploaded by users and the results
from data mining are stored in the AWS LightSail [23] server for now. This database system
communicates via custom API with an external data mining program to execute drug discovery.
This data mining program was developed in Python by CSUSM’s Computer Science Department
and is currently hosted in a high-performance AWS EC2 server [24] with two CPUs, 4 GiB
memory, and up to 10 Gigabit network performance.
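The custom API between the PHP front end and the Python data mining program is not documented in the paper. Purely as a hypothetical sketch, a job submission might carry a JSON payload like the following, where every field name is an assumption mirroring the JOBS table columns described later:

```python
import json

def build_job_payload(job_id, model, algorithm, data_links):
    """Assemble a hypothetical job-submission payload for the mining API.
    Field names are illustrative only; the real API is not published."""
    return json.dumps({
        "job_id": job_id,
        "model": model,            # e.g. "MLR", "PLSR", "SVM"
        "algorithm": algorithm,    # e.g. "GA", "DE", "BPSO", "DE-BPSO"
        "data_links": data_links,  # server paths of descriptor/target/label files
    })

payload = build_job_payload(42, "MLR", "DE-BPSO",
                            ["/data/descriptors.csv", "/data/targets.csv"])
```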
4.2 DESIGN
4.2.1 DATABASE DESIGN
USERS: (user_id, user_password, user_firstname, user_lastname, user_email, user_salt,
user_validation_code, user_status)
user_password is encrypted with user_salt. When a user registers an account, the system sends
the user a link to activate the account. This link is created with the user_validation_code, and
user_status indicates whether the account is active.
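The paper does not state which hash function combines user_password with user_salt. A common salted-hash scheme of the kind described, sketched here in Python rather than the system's PHP, looks like this:

```python
import hashlib
import hmac
import secrets

def make_salt():
    """Generate a random per-user salt, stored as user_salt."""
    return secrets.token_hex(16)

def hash_password(password, salt):
    """Derive the stored user_password value from the plaintext and salt.
    PBKDF2-SHA256 is one reasonable choice; the system's actual scheme is unknown."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode(), salt.encode(), 100_000
    ).hex()

def verify_password(password, salt, stored_hash):
    """Constant-time comparison against the stored hash at sign-in."""
    return hmac.compare_digest(hash_password(password, salt), stored_hash)
```

Storing only the salt and the derived hash means a database leak does not directly expose user passwords.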
DATA: (data_id, data_date_insert, data_time_insert, data_upload_by_user, data_user_id,
data_name_by_user, data_link, data_target_id, data_disease_id, data_type)
The data file uploaded by the user is saved on the server, and the address of the file is saved as
data_link. data_type records what kind of data the file contains; there are three types:
descriptors, target, and labels.
JOBS: (job_id, job_updated_at, job_created_at, job_start_date, job_start_time, job_status,
job_user_id, job_name_by_user, job_model_id, job_algorithm_id, job_reason, job_des, data_link,
job_attempts, job_queue, job_payload)
job_status represents the state of the job, job_reason explains the reason or a comment if the
job fails to execute or validate, and data_link is the path to the result.
DATAPACKAGE (package_id, job_id, data_id)
This table shows the relationship between data and jobs. A job has multiple data files and a data
file can be used for multiple jobs.
OUTPUT: (output_id, output_date, output_time, output_job_id, output_user_id, output_link)
output_link saves the address where the results are stored on the server; this record is updated
after a job finishes executing.
TYPE: (type_id, type_name, type_description)
This is the data type. There are three data types: descriptor value, target value, and label.
ALGORITHM: (algorithm_id, algorithm_name, algorithm_description)
An algorithm refers to a data mining algorithm that a user chooses for the execution.
MODEL: (model_id, model_name, model_description)
A model refers to a data mining model that a user chooses for the execution. For example, MLR
refers to Multiple Linear Regression.
DISEASE: (disease_id, disease_name, disease_description)
An example of the disease is Alzheimer’s.
TARGET: (target_id, target_name, target_description, target_disease_id)
A target refers to the pIC50 values that we explained above.
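The core of the schema above, in particular the many-to-many DATAPACKAGE junction between JOBS and DATA, can be expressed as follows. SQLite is used here only to keep the sketch self-contained; the production system uses MySQL, and only a subset of the columns is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DATA (
    data_id INTEGER PRIMARY KEY,
    data_user_id INTEGER,
    data_name_by_user TEXT,
    data_link TEXT,          -- server path of the uploaded file
    data_type TEXT           -- descriptors, target, or labels
);
CREATE TABLE JOBS (
    job_id INTEGER PRIMARY KEY,
    job_user_id INTEGER,
    job_status TEXT,
    job_model_id INTEGER,
    job_algorithm_id INTEGER
);
-- Junction table: a job uses multiple data files, and a data file
-- can be used by multiple jobs.
CREATE TABLE DATAPACKAGE (
    package_id INTEGER PRIMARY KEY,
    job_id INTEGER REFERENCES JOBS(job_id),
    data_id INTEGER REFERENCES DATA(data_id)
);
""")
```

A query joining through DATAPACKAGE then lists every file attached to a given job.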
4.2.2 PROCESS FLOWCHART
The project’s process flow is shown in Figure 3. It shows the relationship between the data, user
and the system.
Figure 3: The workflow of the program
5. IMPLEMENTATION AND VALIDATION
5.1 REGISTER AN ACCOUNT/SIGN IN
In order to use this database system, a user first needs to register an account, and thereafter can
simply sign in. To register a new account, a user provides his/her email and sets a password. If
the user forgets the password, the system allows the user to receive a new password through
his/her email.
5.1.1 UPLOAD DATA FILES
Users can upload their data files and manage them in the database system. This system only
accepts CSV (Comma-Separated Value) files. After uploading a file, the user can then designate
that file for a data mining test. Each test requires three separate file inputs: one containing a
descriptor value matrix, another containing the prediction target values, and the last and optional
one containing the names of each property for which a descriptor value was calculated.
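The actual PHP validation code is not shown in the paper, but the three required inputs imply consistency checks of roughly this shape: one target value per compound row of the descriptor matrix, and (if labels are supplied) one label per descriptor column.

```python
import csv
import io

def load_csv(text):
    """Parse CSV text into a list of non-empty rows."""
    return [row for row in csv.reader(io.StringIO(text)) if row]

def validate_test_inputs(descriptors_csv, targets_csv, labels_csv=None):
    """Check that the uploaded files for one data mining test agree in shape.
    This is an illustrative sketch, not the system's actual validation logic."""
    desc = load_csv(descriptors_csv)
    targets = load_csv(targets_csv)
    if len(targets) != len(desc):
        raise ValueError("one target value is required per compound row")
    if labels_csv is not None:
        labels = load_csv(labels_csv)
        if len(labels[0]) != len(desc[0]):
            raise ValueError("one label is required per descriptor column")
    return True
```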
There are multiple applications, such as E-Dragon [22], that can accept raw data from a user, filter
that data, calculate the descriptor values, and output the three required files. When a user
uploads a file to the database, the user can give the file a name and associate it with a particular
disease or target compound.
5.1.2 VALIDATION OF THE REGISTRATION - CREATING A TEST
After uploading the three necessary files, a user can create a data mining test. This is done in the
“My Test” page by clicking the “Create a new Test” button. The user can give a name and a
description to this test. Users are required to designate the evolutionary algorithm and machine
learning model that the data mining process will execute. They can additionally specify
population sizes, cutoff conditions, number of populations generated, etc. The algorithms and
machine learning models will be further detailed in later sections.
Figure 4: Sign in page
5.2 USER INSTRUCTION AND EXAMPLES
5.2.1 LOGIN AND ACCOUNT REGISTRATION SYSTEM
The login page is the first page of the application. If a user has an account already, then the user
can sign in with their email address and password (Figure 4). If a user enters incorrect login
information, the system displays an error message (Figure 5 and Figure 6).
If the user has forgotten the password, then the user can access the “I forgot my password” link
and request the system to email a password reset link to the user. If the user doesn’t have an
account yet, then the user can click the link “Register a new membership” to register a new
account.
Figure 5: Sign in email error Figure 6: Sign in password error
5.2.2 REGISTER A NEW ACCOUNT
Figure 7 and 8 display the account registration page. The user enters their preferred first name,
last name, email address, and password, and then re-enters the password to confirm. Next the user
will be shown the terms of service and must agree to them before he or she can confirm his or her
registration. Once the user confirms his or her registration, the system displays a message
validating whether the registration succeeded or not (Figure 9).
If the registration succeeds, the user will receive an email from CSUSMChemoinfo with a link
(Figure 10). The user clicks this link to activate the account. The link is unique for every user,
automatically logs the user in, and leads to the dashboard page. If the email that the user enters
already exists, an error message is shown, as seen in Figure 11.
Figure 7: Register new account page Figure 8: Register new account example
Figure 9: Account created notification
Figure 10: Email sent to user to activate their account
5.2.3 FORGET/RESET PASSWORD
If the user clicks “I forgot my password”, the user is redirected to the Find Your Password page,
shown in Figure 12 and Figure 13. To begin the password recovery process, the user enters
his/her email address and clicks on the submit button. Figures 14-17 illustrate the password
recovery process.
Figure 11: Register error: email already exists Figure 12: Find password page
Figure 15: Email sent to user for reset password
Figure 16: Reset password Figure 17: Reset password example
5.2.4 DASHBOARD
The dashboard page serves as the main menu for a user. It summarizes the user’s test statuses and
displays the current progress of tests being run. The main navigation bar is on the left side of the
dashboard (Figure 18) and contains the following links: the “My Data” link leads to the uploaded
data management page, the “My Tests” link leads to the test management page, the “Profile” link
leads to the user’s personal information page, and the “Logout” link logs out the current user’s
account and redirects the user to the main login page. At the top of the navigation bar, there is a
button that allows the user to hide the navigation bar. Figure 18 shows the unhidden mode and
Figure 19 shows the hidden mode.
Case Status: As shown in Figure 20, the Case status has two parts: a circle graph and a progress
line table. In the circle graph, the colors represent the statuses of all of a user's tests. The
progress lines show the number of jobs in each status.
Figure 18: Dashboard
Figure 19: Navigation bar is hidden
Figure 20: Dashboard - Case status
Figure 21: Data List fields
Figure 22: Data List - Data names are clickable
5.2.5 DATA MANAGEMENT
Data List: The Data List page displays all the files uploaded by a user along with each file's
associated information, such as data name, data type, disease, target, and upload date
(Figure 21). A user can download a data file (Figure 23) or delete a data file by clicking the
related icons in each row (Figure 24). The user can also click the data name field to view more
detailed information about the data (Figure 22).
To upload a new data file, the user can click the “Upload Data” button, which sends the user to
the file upload page.
Users can also find data files using the keyword search function. For example, in Figure 25, the
keyword is “descriptor”. The list shows all the data files with the word “descriptor” in their
associated information, in this case data type.
Figure 23: Data List – download icon
Figure 24: Data List – Delete icon and warning notification
Figure 25: Data List – search example
Figure 26: Data Detail – not editable
Show Data Detail / Update Data: Each field in the Data Name column is a link to the associated
file’s Data Detail page, which allows the user to update the file information or delete the file.
If this data file is being used by a test, then it cannot be updated or deleted. The delete and update
buttons are disabled (Figure 26). Otherwise, the data file can be updated or deleted (Figure 27).
Figure 27: Data Detail – editable
For example, in the “Data Detail Page”, the user can replace this file with a new file, give the file
a new name or description, and/or associate it with a particular disease or target compounds. Once
the changes are confirmed by clicking the “Update” button, the user will be redirected to the
“Data List” page (Figure 28), which shows the updated file information as shown in Figure 29.
Upload Data: In the upload data page, users can configure the metadata for the data files that
they upload. As seen in Figure 29, the user can name the data file, and choose the related disease
and the target. The user can also choose the type of this data file (target, descriptor, label). The
user can also associate the data file with particular diseases or target compounds listed in
dropdown menus, as shown in Figure 30. Clicking the “Update” button will save this information
to the database.
Figure 28: Data Detail – update example
Figure 29: Data List is updated after Data information is changed
Figure 30: Upload Data
Right now, this system allows data files to be associated with Alzheimer's disease and HIV. We
have done some work on HIV-protease and HIV-integrase for HIV, and on gamma-secretase for
Alzheimer's. The data mining process requires files for three types of data: calculated descriptor
values, target/experimental values, and compound property labels. Users can acquire these data
files from E-Dragon [22].
A user can upload a new file by clicking the “Choose File” button, as shown in Figure 31. Only
.csv (Comma-Separated Values) files are currently accepted. Once the file has been uploaded, the
user can access the file in the system at any time. Should the upload fail, the system will display
an error message that may help the user diagnose the cause of the error (Figure 32).
Figure 31: Upload Data – choose file
Figure 32: Upload Data – Error message
6. TEST MANAGEMENT
Test List: The Test List page will display all of a user's tests and each test's metadata, such as
name, description, algorithm, model, date of creation, and date of completion (Figure 33). Users
can also download the data mining results. Embedded in the test’s name is a link that can show
more detailed information about the test. If the user wants to create a new test, then clicking the
“Create new test” button will redirect the user to the “Create Test” page.
Figure 33: Test List
Users can also retrieve tests by keyword search. For example, in Figure 34, entering the keyword
“completed” will return a list of all of a user’s completed tests.
Figure 34: Test List – search function
Create Test: In the Create Test page, users can configure their test and submit the tests to be run.
As Figure 35 shows, the user can give a name and description to the test and choose a particular
algorithm and machine learning model that will be executed for the test. All available algorithms
and machine learning models can be selected from the dropdown menus (Figure 35). There are
also dropdown menus that list the user's previously uploaded data. The user must also select three
files from these menus: one with the calculated descriptor values, the second with the
target/experimental values, and the third with the compound property labels.
After setting up those configurations, the user can click the button “Submit” to submit a request to
execute the test. When the test completes, its “Status” will change to “Complete”.
Figure 35: Creating a new Test
Figure 36: Dropdown menus for test options
Test Details: By clicking the name of a test, a user will be redirected to the Test Details page,
which displays more detailed information and additional options for that particular test. If the test
is currently being run, then it cannot be updated or deleted. As shown in Figure 37, the delete
button is disabled. Once the test is completed, then it can be deleted (Figure 38). If the test ends
prematurely because of an error, which will be displayed in the “Status” column, the test’s
configuration can be updated or the test can simply be deleted.
Aside from the information about the test, the Test Details page also shows the data files that are
being used for the test. The user can access additional information about the data by clicking the
“More detail” button, which will lead to the Data Details page. The Data Details page also
provides a “Download” button to allow users to directly download a particular file. (Figure 37).
Figure 37: Test Detail – not editable
The Comment/Error column displays any error messages returned by test failures along with any
comments about the failed tests. The Download Results column displays download links for the
results from successfully completed tests.
Figure 38: Test Detail – delete
Update Test: A data mining test may fail as a result of incorrect input in the data files or a
network connection error. Upon a test failure, users can choose to either delete the test or reset the
configuration of the test and execute it again. A user can first go to the Test Details page to check
the reasons for the failure. Based on this information the user can then alter the test configuration
to remove the failure conditions, such as selecting a different algorithm or machine-learning
model or updating to the correct data files. After the user confirms the updated test configuration
by clicking the "Update" button, the data-mining test will be sent to the data mining program
for another execution.
7. CONCLUSION AND FUTURE WORK
The paper described a database service and web application to allow any researcher, especially
anyone without computer science experience, to utilize a data mining application for drug
discovery. This service, hosted on Amazon Web Services, allows users to upload experimental
input, run tests, and download test results.
There exist other high-quality data mining services, but this work’s specific combination of
features, such as machine learning models augmented by evolutionary algorithms and an
accessible database to store input and output data, has not been implemented in a public
application as far as the authors are aware.
The next phase of implementation can involve adding nonlinear models such as nonlinear SVM,
Artificial Neural Network (ANN), and classification models such as Random Forest (RF) to this
project’s list of machine-learning models.
If the new version of E-Dragon contains an API, another possible expansion could be
development of a native API for the project that connects to the hypothetical E-Dragon API. This
would allow users to directly submit raw data, such as compound SMILES files, and have E-
Dragon automatically filter the raw data and calculate descriptor values without needing a third-
party service. This would also improve data security for the application.
The project’s cloud infrastructure could also be improved in speed, scalability, and reliability
using various tools and systems. The data-mining application is currently hosted on a single
instance of AWS Elastic Compute Cloud service; to improve the three characteristics listed
before, there are several services that could be added to the stack.
The first step is to create the launch template, which contains configuration data such as network
requirements, instance types, and disk images that will be mounted on the instances. With a
template, a scaling group can then be created with dynamic parameters, such as CPU load or
memory usage, which trigger the creation of new instances as well as the termination of those
instances when demand declines, not only improving scalability but also
reducing costs. Finally, the implementation of an Elastic Load Balancer will allow the incoming
traffic and task requests to be distributed among the available servers (instances) by allocating
them to the highest availability server, which can significantly improve the speed and reliability
of our system.
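The improvement steps above can be sketched as the request structures an AWS deployment would use. All names, AMI and security-group IDs, and sizes below are placeholders, not values from the deployed system; the boto3 calls are shown commented out because they require live AWS credentials.

```python
# Sketch only: every identifier below is a placeholder, not a real resource.
launch_template = {
    "LaunchTemplateName": "drugdb-mining-template",
    "LaunchTemplateData": {
        "ImageId": "ami-xxxxxxxx",           # disk image carrying the mining program
        "InstanceType": "t3.medium",         # instance type for worker nodes
        "SecurityGroupIds": ["sg-xxxxxxxx"], # network requirements
    },
}
scaling_group = {
    "AutoScalingGroupName": "drugdb-mining-asg",
    "MinSize": 1,
    "MaxSize": 4,   # scale out when CPU/memory thresholds are crossed
    "LaunchTemplate": {"LaunchTemplateName": "drugdb-mining-template"},
}
# import boto3
# boto3.client("ec2").create_launch_template(**launch_template)
# boto3.client("autoscaling").create_auto_scaling_group(**scaling_group)
```

An Elastic Load Balancer would then sit in front of the group and spread incoming test requests across the running instances.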
REFERENCES
[1] Ko, Gene, Reddy, Srinivas, Garg, Rajni, Kumar, Sunil, & Hadaegh, Ahmad, (2012) “Computational
Modelling Methods for QSAR Studies on HIV-1 Integrase Inhibitors (2005-2010),”. Curr Comput
Aided Drug Des. Vol. 8, No 4, pp 255-270.
[2] Thakor, Falguni, Hadaegh, Ahmad, & Zhang, Xiaoyu, (2017), ” Comparative study of Differential
Evolutionary-Binary Particle Swarm Optimization (DE-BPSO) algorithm as a feature selection
technique with different linear regression models for analysis of HIV-1 Integrase Inhibition features
of Aryl β-Diketo Acids”, Proceedings of 9th International Conference on Bioinformatics and
Computational Biology, Honolulu, Hawaii, USA, ISBN: 978–1–943436–07–1, pp 179-184.
[3] Kane Ian, & Hadaegh Ahmad, “Non-linear Quantitative Structure-Activity Relationship (QSAR)
Models for the Prediction of HIV Drug Performance”, (2015), 24th International Conference on
Software Engineering and Data Engineering, pp 63-68. Vol 1, ISBN: 9781510812277, San Diego,
CA.
[4] Galvan Richard, Kashani, Maninatalsadat, & Hadaegh, Ahmad, “Improving Pharmacological
Research of HIV-1 Integrase Inhibition Using Differential Evolution-Binary Particle Swarm
Optimization and Non-Linear Adaptive Boosting Random Forest Regression”,(2015), IEEE
International Workshop on Data Integration and Mining San Francisco, Information Reuse and
Integration (IRI), IEEE International Conference, pp 485-490, DOI: 10.1109/IRI.2015.80. INSPEC
Accession Number: 15556631. San Francisco, CA.
[5] Kashani, Maninatalsadat, Galvan Richard, & Hadaegh Ahmad, “Improving the Feature Selection for
the Development of Linear Model for Discovery of HIV-1 Integrase Inhibitors”, (2015) ABDA'15
International Conference on Advances in Big Data Analytics. In Proceeding of the 2015 International
Conferences on Advances on Big Data Analyses, pp 150-154. ISBN: 1-60132-411-1, Las Vegas,
Nevada.
[6] Ko, Gene, Garg, Rajni, Kumar, Sunil, Kumar, Bailey, Barbara, & Hadaegh Ahmad, “A Hybridized
Evolutionary Algorithm for Feature Selection of Chemical Descriptors for Computational QSAR
Modeling of HIV-1 Integrase Inhibitors”, (2013), Computational Science Curriculum Development
Forum and Applied Computational Science and Engineering Student Support for Industry, San Diego
State University.
[7] Ko, Gene, Garg, Rajni, Kumar, Sunil, Bailey, Barbara, & Hadaegh Ahmad, "Differential Evolution-
Binary Particle Swarm Optimization for the Analysis of Aryl β-Diketo Acids for HIV-1 Integrase
Inhibition", (2012), WCCI 2012 IEEE World Congress on Computational Intelligence, Brisbane,
Australia, pp 1849-1855.
[8] Ko, Gene, Reddy, Srinivas, Kumar, Kumar, Bailey, Barbara, Garg, Rajni, & Hadaegh, Ahmad,
“Evolutionary Computational Modelling of β-Diketo Acids for Virtual Screening of HIV-1 Integrase
Inhibitors”, (2012), IEEE World Congress on Computational Intelligence, Brisbane, Australia.
[9] Ko, Gene, Reddy, Srinivas, Kumar, Kumar, Garg, Rajni, & Hadaegh, Ahmad “Evolutionary
Computational Modelling of β-Diketo Acids for Virtual Screening of HIV-1 Integrase Inhibitors”,
(2012), 243rd National Meeting of the American Chemical Society, San Diego, CA.
[10] Gonzales, Miguel, Turner, Chris, Ko, Gene, & Hadaegh, Ahmad, “Binary Particle Swarm
Optimization Model of Dimeric Aryl Diketo Acid Inhibitors for HIV-1 Integrase” (2012), 243rd
National Meeting of the American Chemical Society, San Diego, CA.
[11] Ko, Gene, Reddy, Srinivas, Kumar, Sunil, Garg, Rajni, & Hadaegh, Ahmad, “Analysis of HIV-1
Integrase Inhibitors Using Computational QSAR Modelling”, (2012), Computational Science
Curriculum Development Forum and Applied Computational Science and Engineering Student
Support for Industry, San Diego State University.
[12] Garg Rajni, Reddy Srinivas, Zhang Xiaoyu, & Hadaegh Ahmad, “MUT-HIV: Mutation database of
HIV proteases”, (2007), American Chemical Society (ACS) 234th National Meeting & Exposition,
Boston, MA USA CINF 42.
[13] MLR: http://www.stat.yale.edu/Courses/1997-98/101/linmult.htm
[14] PLSR: https://www.mathworks.com/help/stats/plsregress.html
[15] https://techdifferences.com/difference-between-descriptive-and-predictive-data-mining.html
[16] Zhong et al. Artificial intelligence in drug design. Sci China Life Sci. 2018 Jul 18. doi:
10.1007/s11427-018-9342-2. [Epub ahead of print]
[17] Varsou Dimitra-Danai, Nikolakopoulos, Spyridon, Tsoumanis Andreas, Melagraki Georgia, &
Afantitis, Antreas, “New Cheminformatics Platform for Drug Discovery and Computational
Toxicology”, (2018), Methods Mol Biol. 2018; 1800:287-311. doi: 10.1007/978-1-4939-7899-1_14