This document presents a study that compares different machine learning techniques for predicting tooth decay (dental caries) in children under 5 years old. The study uses data from interviews of nearly 4,000 Brazilian children. Four machine learning methods are evaluated on their ability to predict caries: decision trees, neural networks, k-nearest neighbors (kNN), and support vector machines (SVM). The neural network technique achieved the best performance, followed by SVM, in terms of classification accuracy and area under the ROC curve. Decision trees were also able to extract useful rules for understanding factors that influence caries occurrence.
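To make the comparison concrete, here is a minimal sketch of one of the four methods, a k-nearest-neighbors classifier, on hypothetical two-feature data (the feature values and labels below are illustrative stand-ins, not the study's interview variables):

```python
import math
from collections import Counter

def knn_predict(train, labels, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], x))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors (e.g., sugar intake, brushing frequency)
# with binary caries labels -- purely illustrative.
train = [(0.9, 0.1), (0.8, 0.2), (0.2, 0.9), (0.1, 0.8)]
labels = ["caries", "caries", "healthy", "healthy"]

print(knn_predict(train, labels, (0.85, 0.15)))  # -> caries
print(knn_predict(train, labels, (0.15, 0.85)))  # -> healthy
```

In the study, each such classifier would be scored by accuracy and area under the ROC curve on held-out data.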
Effect of Data Size on Feature Set Using Classification in Health Domain (dbpublications)
In the health domain, a critical issue is the prediction of disease at an early stage. Disease prediction has traditionally relied on the experience of the physician, so many machine learning approaches have been applied to the problem. Existing approaches concentrate on either prediction or feature selection, but rarely both. The aim of this paper is to present the effect of data size and feature set on disease prediction in the health domain using Naïve Bayes, showing how each attribute, or combination of attributes, behaves on datasets of different sizes.
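As a sketch of the idea, the following minimal Bernoulli Naïve Bayes (with Laplace smoothing) can be retrained on feature subsets and datasets of different sizes; the binary "symptom" features and labels are hypothetical:

```python
from collections import defaultdict

def train_nb(X, y, n_features, alpha=1.0):
    """Bernoulli Naive Bayes with Laplace smoothing (alpha)."""
    counts = defaultdict(lambda: [0.0] * n_features)
    class_n = defaultdict(int)
    for xi, yi in zip(X, y):
        class_n[yi] += 1
        for j, v in enumerate(xi):
            counts[yi][j] += v
    model = {}
    for c, n in class_n.items():
        prior = n / len(y)
        probs = [(counts[c][j] + alpha) / (n + 2 * alpha)
                 for j in range(n_features)]
        model[c] = (prior, probs)
    return model

def predict_nb(model, x):
    """Pick the class with the highest posterior (proportional) score."""
    best, best_p = None, -1.0
    for c, (prior, probs) in model.items():
        p = prior
        for j, v in enumerate(x):
            p *= probs[j] if v else (1 - probs[j])
        if p > best_p:
            best, best_p = c, p
    return best

# Hypothetical binary symptom indicators; growing the dataset
# sharpens the smoothed probability estimates.
X = [(1, 1), (1, 0), (0, 0), (0, 1)]
y = ["sick", "sick", "well", "well"]
model = train_nb(X, y, n_features=2)
print(predict_nb(model, (1, 1)))  # -> sick
```

Rerunning the training on larger slices of the data, or on different attribute subsets, is how the paper's size-vs-feature-set effect would be measured.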
Paper Annotated: SinGAN-Seg: Synthetic Training Data Generation for Medical Image Segmentation (Devansh16)
YouTube video: https://www.youtube.com/watch?v=Ao-19L0sLOI
SinGAN-Seg: Synthetic Training Data Generation for Medical Image Segmentation
Vajira Thambawita, Pegah Salehi, Sajad Amouei Sheshkal, Steven A. Hicks, Hugo L. Hammer, Sravanthi Parasa, Thomas de Lange, Pål Halvorsen, Michael A. Riegler
Processing medical data to find abnormalities is a time-consuming and costly task, requiring tremendous effort from medical experts. AI has therefore become a popular tool for the automatic processing of medical data, acting as a supportive tool for doctors. AI tools depend heavily on data for training the models. However, there are several constraints on access to large amounts of medical data for training machine learning algorithms in the medical domain, e.g., privacy concerns and the costly, time-consuming medical data annotation process. To address this, in this paper we present a novel synthetic data generation pipeline called SinGAN-Seg that produces synthetic medical data with corresponding annotated ground truth masks. We show that this pipeline can be used both to bypass privacy concerns and to produce artificial segmentation datasets with corresponding ground truth masks, avoiding the tedious medical data annotation process. As a proof of concept, we used an open polyp segmentation dataset. By training UNet++ on both the real polyp segmentation dataset and the corresponding synthetic dataset generated by the SinGAN-Seg pipeline, we show that the synthetic data can achieve performance very close to that of the real data when the real segmentation dataset is large enough. In addition, we show that synthetic data generated by the SinGAN-Seg pipeline improves the performance of segmentation algorithms when the training dataset is very small. Since the SinGAN-Seg pipeline is applicable to any medical dataset, it can be used with any other segmentation dataset.
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as: arXiv:2107.00471 [eess.IV]
(or arXiv:2107.00471v1 [eess.IV] for this version)
Accounting for Variance in Machine Learning Benchmarks (Devansh16)
Accounting for Variance in Machine Learning Benchmarks
Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Naz Sepah, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Dmitriy Serdyuk, Tal Arbel, Chris Pal, Gaël Varoquaux, Pascal Vincent
Strong empirical evidence that one machine-learning algorithm A outperforms another algorithm B ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameter choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, revealing that variance due to data sampling, parameter initialization and hyperparameter choice markedly impacts the results. We analyze the predominant comparison methods used today in light of this variance. We show the counter-intuitive result that adding more sources of variation to an imperfect estimator better approximates the ideal estimator, at a 51-times reduction in compute cost. Building on these results, we study the error rate of detecting improvements on five different deep-learning tasks/architectures. This study leads us to propose recommendations for performance comparisons.
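The paper's point can be illustrated with a toy benchmark in which each trial's score depends on a random seed; comparing two mean scores without looking at the spread across seeds is exactly the trap the authors warn about (the score function below is a hypothetical stand-in for a full training run):

```python
import random
import statistics

def run_trial(seed):
    """Stand-in for one training run: the reported accuracy varies with
    the seed (data split, initialization, etc.), as in the paper's model."""
    rng = random.Random(seed)
    return 0.80 + rng.gauss(0, 0.02)  # hypothetical accuracy

def benchmark(seeds):
    scores = [run_trial(s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)

mean_a, sd_a = benchmark(range(10))        # "algorithm A"
mean_b, sd_b = benchmark(range(100, 110))  # "algorithm B" (same process!)
# A difference between mean_a and mean_b smaller than the spread across
# seeds is not evidence that one algorithm beats the other.
print(f"A: {mean_a:.3f} +/- {sd_a:.3f}, B: {mean_b:.3f} +/- {sd_b:.3f}")
```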
The increasing need for data-driven decision making has resulted in the application of data mining in various fields, including the educational sector, where it is referred to as educational data mining. The need to improve the performance of data mining models has also been identified as a gap for future researchers. In Nigeria, higher educational institutions collect various student data, but these data are rarely used in decision or policy making to improve the academic performance of students. This research work attempts to improve the performance of data mining models for predicting students' academic performance using a stacking classifier ensemble and the synthetic minority over-sampling technique. The research was conducted by adopting and evaluating the performance of the J48, IBK and SMO classifiers. The individual classifier models, a standard stacking classifier ensemble model and a modified stacking classifier ensemble model were trained and tested on a 206-student dataset from the Faculty of Science, Federal University Dutse. Students' previous academic performance records, the Unified Tertiary Matriculation Examination, the Senior Secondary Certificate Examination and first-year Cumulative Grade Point Average, are used as data inputs in the WEKA 3.9.1 data mining tool to predict students' graduating classes of degree at undergraduate level. The results show that applying the synthetic minority over-sampling technique for class balancing improves the performance of all the models, with the proposed modified stacking classifier ensemble model outperforming the individual classifier models in both accuracy and RMSE, making it the best model.
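The class-balancing step can be sketched as SMOTE-style interpolation between minority-class samples and their nearest neighbours (a pure-Python sketch with made-up points, not the WEKA implementation used in the paper):

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Generate synthetic minority samples by interpolating between a
    sample and one of its k nearest neighbours (SMOTE-style sketch)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        a = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p != a),
            key=lambda p: sum((pi - ai) ** 2 for pi, ai in zip(p, a)))[:k]
        b = rng.choice(neighbours)
        t = rng.random()  # random point on the segment a -> b
        out.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return out

# Hypothetical minority-class feature vectors.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3)]
synthetic = smote_like(minority, n_new=5)
print(len(synthetic))  # -> 5
```

The synthetic points all lie between existing minority samples, so the minority region is densified rather than simply duplicated.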
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews across the whole field of engineering, science and technology, new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
The main objective of this paper is to develop a basic prototype model which can determine and extract unknown knowledge (patterns, concepts and relations) related to multiple factors from the past database records of specific students. Data mining is the science and engineering of extracting previously undiscovered patterns from a huge set of data. Data mining techniques are helpful for decision making as well as for discovering patterns in data. In this paper a students' eligibility prediction system using rule-based classification is proposed to predict the eligibility of students based on their details with high prediction accuracy. In educational institutes, a tremendous amount of data is generated. This paper outlines the idea of predicting a particular student's placement eligibility by performing operations on the stored data, and proposes an efficient fuzzy-based algorithm for the prediction.
Genome-wide transcription profiling is a powerful technique for studying disease-susceptibility footprints. Moreover, when applied to diseased tissue it may reveal quantitative and qualitative alterations in gene expression that give information on the context or underlying basis of the disease, and may provide a new diagnostic approach. However, the data obtained from high-density microarrays are highly complex and pose considerable challenges in data mining. Past research shows that neurological diseases damage brain network interactions, protein-protein interactions and gene-gene interactions, and a number of neurological research papers analyze the relationships among the damaged parts. Analysis of a gene-gene interaction network drawn from a state-of-the-art gene database of Alzheimer's patients can yield a great deal of information. In this paper we used gene datasets from Alzheimer's disease patients and from normal patients, taken from the NCBI databank. After properly processing the Affymetrix .CEL data using RMA, we use the processed data to find gene interaction outputs. We then filter the output files on the probe-set attributes p-value and fold change, draw a gene-gene interaction network, and analyze the interaction network using the GeneMANIA software.
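The filtering-and-network step can be sketched as follows; the gene statistics and correlations below are invented for illustration and are not drawn from the NCBI data used in the paper:

```python
# Hypothetical per-gene statistics (p-value, log2 fold change) and
# correlations between expression profiles.
genes = {
    "APP":   {"p": 0.001, "log2fc": 2.1},
    "APOE":  {"p": 0.004, "log2fc": 1.6},
    "GAPDH": {"p": 0.800, "log2fc": 0.1},  # housekeeping, filtered out
}
correl = {("APP", "APOE"): 0.92, ("APP", "GAPDH"): 0.10}

def significant(genes, p_max=0.05, fc_min=1.0):
    """Probe-set filtering on p-value and fold change, as in the paper."""
    return {g for g, s in genes.items()
            if s["p"] <= p_max and abs(s["log2fc"]) >= fc_min}

def build_edges(sig, correl, r_min=0.8):
    """Keep an interaction edge only between significant, strongly
    correlated genes -- a crude stand-in for a gene-gene network."""
    return [(a, b) for (a, b), r in correl.items()
            if a in sig and b in sig and abs(r) >= r_min]

sig = significant(genes)
edges = build_edges(sig, correl)
print(sorted(sig), edges)  # -> ['APOE', 'APP'] [('APP', 'APOE')]
```

The resulting edge list is the kind of network that would then be handed to a tool such as GeneMANIA for analysis.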
AN EFFICIENT PSO BASED ENSEMBLE CLASSIFICATION MODEL ON HIGH DIMENSIONAL DATA... (ijsc)
As the size of biomedical databases grows day by day, finding the essential features for disease prediction has become more complex due to high dimensionality and sparsity problems. Also, owing to the availability of a large number of microarray datasets in biomedical repositories, it is difficult to analyze, predict and interpret feature information using traditional feature-selection-based classification models. Most traditional feature-selection-based classification algorithms have computational issues such as dimension reduction, uncertainty and class imbalance on microarray datasets. The ensemble classifier is one of the scalable models for extreme learning machines due to its high efficiency and fast processing speed for real-time applications. The main objective of feature-selection-based ensemble learning models is to classify high dimensional data with high computational efficiency and a high true positive rate. In the proposed model, an optimized particle swarm optimization (PSO) based ensemble classification model is developed on high dimensional microarray datasets. Experimental results show that the proposed model has high computational efficiency compared to traditional feature-selection-based classification models as far as accuracy, true positive rate and error rate are concerned.
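A minimal sketch of the binary-PSO feature-search component (the fitness function here is a made-up stand-in for the ensemble's cross-validated accuracy):

```python
import math
import random

def pso_feature_select(n_feats, fitness, n_particles=8, iters=30, seed=1):
    """Binary PSO sketch: each particle is a 0/1 feature mask, and the
    velocity is squashed through a sigmoid to give the probability of
    setting each bit, as in Kennedy & Eberhart's binary PSO."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_feats)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_feats for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    gbest = max(pbest, key=fitness)[:]
    for _ in range(iters):
        for i in range(n_particles):
            for j in range(n_feats):
                vel[i][j] = (0.7 * vel[i][j]
                             + 1.5 * rng.random() * (pbest[i][j] - pos[i][j])
                             + 1.5 * rng.random() * (gbest[j] - pos[i][j]))
                prob = 1.0 / (1.0 + math.exp(-vel[i][j]))
                pos[i][j] = 1 if rng.random() < prob else 0
            if fitness(pos[i]) > fitness(pbest[i]):
                pbest[i] = pos[i][:]
        gbest = max(pbest, key=fitness)[:]
    return gbest

# Toy fitness standing in for ensemble accuracy: features 0 and 3 are
# "informative", every other selected feature costs a small penalty.
def toy_fitness(mask):
    return 2 * (mask[0] + mask[3]) - sum(mask[1:3]) - sum(mask[4:])

best = pso_feature_select(6, toy_fitness)
print(best)
```

On a real microarray dataset the mask would select thousands of probes and the fitness would come from training the ensemble on the selected columns.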
Fuzzy Association Rule Mining based Model to Predict Students' Performance (IJECEIAES)
The major intention of higher education institutions is to supply quality education to their students. One approach to achieving the maximum level of quality in a higher education system is to discover knowledge that predicts performance in internal assessments and end-of-semester examinations. The proposed work approaches this objective by taking advantage of a fuzzy inference technique to classify student score data according to performance level. In this paper, students' performance is evaluated using fuzzy association rule mining, which predicts performance at the end of the semester on the basis of records such as attendance, mid-semester marks, previous semester marks and previous academic records collected from the students' database, in order to identify students who need individual attention to decrease the failure ratio, so that suitable action can be taken before the next semester examination.
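The fuzzification step can be sketched with triangular membership functions over a 0-100 mark; the set boundaries below are hypothetical, not the paper's calibrated ones:

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b,
    falling to zero at c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def performance_level(marks):
    """Fuzzify a 0-100 mark into overlapping Low/Medium/High sets and
    return the label with the strongest membership (a Mamdani-style sketch)."""
    mu = {
        "Low": tri(marks, -1, 0, 50),
        "Medium": tri(marks, 30, 50, 70),
        "High": tri(marks, 50, 100, 101),
    }
    return max(mu, key=mu.get), mu

level, mu = performance_level(62)
print(level)  # -> Medium
```

Association rules would then be mined over these fuzzy labels (e.g., "Attendance is Low AND Midsem is Medium => Result is Low") rather than over raw marks.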
Correlation based feature selection (CFS) technique to predict student performance (IJCNCJournal)
Education data mining is an emerging stream which helps in mining academic data for solving various types of problems. One of these problems is the selection of a proper academic track, and the admission of a student to an engineering college depends on many factors. In this paper we have tried to implement a classification technique to assist students in predicting their success in admission to an engineering stream. We have analyzed a data set containing information about students' academic as well as socio-demographic variables, with attributes such as family pressure, interest, gender, XII marks and CET rank in entrance examinations, together with historical data on previous batches of students. Feature selection is a process for removing irrelevant and redundant features, which helps improve the predictive accuracy of classifiers. In this paper we first use the feature selection attribute algorithms Chi-square, InfoGain and GainRatio to identify the relevant features. Then we apply a fast correlation-based filter to the given features. Finally, classification is done using NBTree, MultilayerPerceptron, NaiveBayes and instance-based k-nearest neighbour. Results showed a reduction in computational cost and time and an increase in predictive accuracy for the student model.
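One of the filter scores used above, chi-square, can be sketched for a single binary attribute against a binary class (toy data, not the admission dataset):

```python
def chi2_score(feature, labels):
    """Chi-square statistic for one binary feature vs a binary class,
    computed from the 2x2 observed/expected contingency table."""
    n = len(labels)
    obs = {(f, c): 0 for f in (0, 1) for c in (0, 1)}
    for f, c in zip(feature, labels):
        obs[(f, c)] += 1
    f_tot = {f: obs[(f, 0)] + obs[(f, 1)] for f in (0, 1)}
    c_tot = {c: obs[(0, c)] + obs[(1, c)] for c in (0, 1)}
    score = 0.0
    for f in (0, 1):
        for c in (0, 1):
            exp = f_tot[f] * c_tot[c] / n
            if exp:
                score += (obs[(f, c)] - exp) ** 2 / exp
    return score

# Hypothetical binary attributes vs admission outcome.
labels   = [1, 1, 1, 0, 0, 0]
relevant = [1, 1, 1, 0, 0, 0]   # tracks the class perfectly
noise    = [1, 0, 1, 0, 1, 0]   # independent of the class
print(chi2_score(relevant, labels) > chi2_score(noise, labels))  # -> True
```

Ranking attributes by this score and keeping the top ones is the filter step; the fast correlation-based filter then prunes features that are redundant with each other.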
Natural language processing through the subtractive mountain clustering algorithm (ijnlc)
In this work, the subtractive mountain clustering algorithm has been adapted to the problem of natural language processing with a view to constructing a chatbot that answers questions posed by the user. The implemented version of the algorithm allows for the association of a set of words into clusters. After finding the centre of every cluster, the most relevant word, all the others are aggregated according to a defined metric adapted to the language processing realm. All the relevant stored information (necessary to answer the questions) is processed by the algorithm, as are the questions themselves. The correct processing of the text enables the chatbot to produce answers that relate to the posed queries. Since we have in view a chatbot to help elderly people with medication, to validate the method we use the package insert of a drug as the available information and formulate associated questions. Errors in medication intake among elderly people are very common; one of the main causes is their declining ability to retain information, and the high amount of medicine intake required in advanced age is another limiting factor. Hence, an interactive aid system, preferably using natural language, to help the older population with medication is in demand. A chatbot based on a subtractive clustering algorithm is the chosen solution.
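The clustering core can be sketched on numeric points (word-vector representations would play this role in the chatbot); the radius and the beta = alpha choice below are simplifying assumptions:

```python
import math

def subtractive_centres(points, ra=1.0, n_centres=2):
    """Subtractive (mountain) clustering sketch: each point's potential is
    a sum of Gaussian kernels over all data; after picking the highest
    peak, its mountain is subtracted before choosing the next centre."""
    alpha = 4.0 / ra ** 2
    pot = [sum(math.exp(-alpha * math.dist(p, q) ** 2) for q in points)
           for p in points]
    centres = []
    for _ in range(n_centres):
        i = max(range(len(points)), key=lambda k: pot[k])
        centres.append(points[i])
        peak = pot[i]
        beta = alpha  # simplification; often beta = 4 / rb**2 with rb ~ 1.5 * ra
        pot = [pot[k] - peak * math.exp(-beta * math.dist(points[k], points[i]) ** 2)
               for k in range(len(points))]
    return centres

# Two obvious groups of points; the algorithm should pick one centre in each.
pts = [(0, 0), (0.1, 0.1), (0.2, 0.0), (5, 5), (5.1, 4.9)]
print(subtractive_centres(pts))
```

Unlike k-means, the number of clusters need not be fixed in advance: one can keep extracting centres until the residual potential falls below a threshold.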
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODEL (ijcsit)
Predicting student performance is a great concern to higher education managements. This prediction helps to identify and improve students' performance, and several factors may affect it. In the present study, we employ data mining processes, particularly classification, to enhance the quality of the higher educational system. Recently, a new direction has been taken toward improving classification accuracy by combining classifiers. In this paper, we design and evaluate a fast learning algorithm using an AdaBoost ensemble with a simple genetic algorithm, called "Ada-GA", where the genetic algorithm is demonstrated to successfully improve the accuracy of the combined classifier. The Ada-GA algorithm proved to be of considerable usefulness in identifying at-risk students early, especially in very large classes. This early prediction allows the instructor to provide appropriate advising to those students. The Ada-GA algorithm was implemented and tested on the ASSISTments dataset; the results showed that the algorithm successfully improved detection accuracy while reducing the computational complexity.
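Setting the genetic search aside, the AdaBoost half of Ada-GA can be sketched with decision stumps on a one-feature toy dataset (hypothetical "at-risk" scores, not the ASSISTments data):

```python
import math

def stump_predict(s, x):
    feat, thr, sign = s
    return sign if x[feat] <= thr else -sign

def best_stump(X, y, w):
    """Pick the stump (feature, threshold, sign) with least weighted error."""
    best, best_err = None, float("inf")
    for feat in range(len(X[0])):
        for thr in sorted({x[feat] for x in X}):
            for sign in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict((feat, thr, sign), xi) != yi)
                if err < best_err:
                    best, best_err = (feat, thr, sign), err
    return best, best_err

def adaboost(X, y, rounds=5):
    """Classic AdaBoost: reweight samples toward the current mistakes."""
    n = len(X)
    w = [1 / n] * n
    ensemble = []
    for _ in range(rounds):
        s, err = best_stump(X, y, w)
        err = max(err, 1e-10)  # avoid division by zero on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, s))
        w = [wi * math.exp(-alpha * yi * stump_predict(s, xi))
             for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump_predict(s, x) for a, s in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical one-feature data: label -1 = at risk, 1 = on track.
X = [(1,), (2,), (3,), (7,), (8,), (9,)]
y = [-1, -1, -1, 1, 1, 1]
model = adaboost(X, y)
print([predict(model, xi) for xi in X])  # -> [-1, -1, -1, 1, 1, 1]
```

In Ada-GA, the genetic algorithm would sit around this loop, searching over the boosting configuration; that search is omitted here.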
With the growth of voluminous amounts of data in educational institutes, there is a need to mine these large datasets to produce useful information. In this research we focus on building a decision support system for educational institutes that can help them assess the placement possibility of students. Our research is not limited to finding placement possibility: we performed a multi-level analysis on a student performance dataset to predict what level of the interview process a student is likely to pass. For this we applied Naïve Bayes and an improved Naïve Bayes integrated with the Relief feature selection technique. Data analysis was done using NetBeans and WEKA. Our proposed technique gave better accuracy (84.7%) than the existing Naïve Bayes (80.96%).
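The Relief feature-weighting idea used here can be sketched as follows: each feature is rewarded when it differs from the sample's nearest miss (other class) and penalized when it differs from its nearest hit (same class). This is toy data and a simplified Relief, not the WEKA implementation:

```python
import math

def relief_weights(X, y):
    """Simplified Relief for binary classes over all samples."""
    n_feats = len(X[0])
    w = [0.0] * n_feats
    for i in range(len(X)):
        xi, yi = X[i], y[i]
        others = [j for j in range(len(X)) if j != i]
        hit = min((j for j in others if y[j] == yi),
                  key=lambda j: math.dist(X[j], xi))
        miss = min((j for j in others if y[j] != yi),
                   key=lambda j: math.dist(X[j], xi))
        for f in range(n_feats):
            w[f] += abs(xi[f] - X[miss][f]) - abs(xi[f] - X[hit][f])
    return w

# Feature 0 separates the classes; feature 1 is noise (hypothetical data).
X = [(0.0, 0.3), (0.1, 0.9), (1.0, 0.2), (0.9, 0.8)]
y = [0, 0, 1, 1]
w = relief_weights(X, y)
print(w[0] > w[1])  # -> True
```

Features with high Relief weight are kept as inputs to the Naïve Bayes classifier; low-weight features are dropped.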
Controlling informative features for improved accuracy and faster predictions in omentum cancer models (Damian R. Mingle, MBA)
Identification of suitable biomarkers for accurate prediction of phenotypic outcomes is a goal for personalized medicine. However, current machine learning approaches are either too complex or perform poorly.
For more information:
http://societyofdatascientists.com/controlling-informative-features-for-improved-accuracy-and-faster-predictions-in-omentum-cancer-models/?src=slideshare
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techniques (theijes)
Feature selection is considered a problem of global combinatorial optimization in machine learning, which reduces the number of features and removes irrelevant, noisy and redundant data. However, identifying useful features from hundreds or even thousands of related features is not an easy task. Selecting relevant genes from microarray data becomes even more challenging owing to the high dimensionality of the features, the multiple class categories involved and the usually small sample size. To improve prediction accuracy and to avoid incomprehensibility due to the number of features, different feature selection techniques can be implemented. This survey classifies and analyzes different approaches, aiming not only to provide a comprehensive presentation but also to discuss challenges and various performance parameters. The techniques are generally classified into three categories: filter, wrapper and hybrid.
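The wrapper family can be sketched as greedy forward selection driven by a model-based score; the evaluator below is a made-up stand-in for cross-validated classifier accuracy (a filter would instead score each feature in isolation, without the model):

```python
def greedy_forward_select(features, evaluate, max_feats=3):
    """Wrapper sketch: greedily add the feature that most improves the
    model-based evaluation score, stopping when nothing helps."""
    selected, score = [], evaluate([])
    while len(selected) < max_feats:
        candidates = [f for f in features if f not in selected]
        best = max(candidates, key=lambda f: evaluate(selected + [f]))
        new_score = evaluate(selected + [best])
        if new_score <= score:
            break
        selected, score = selected + [best], new_score
    return selected, score

# Toy evaluator standing in for cross-validated accuracy: 'g1' helps,
# 'g7' helps only in combination with 'g1', others add nothing.
def evaluate(subset):
    s = 0.5
    if "g1" in subset:
        s += 0.2
        if "g7" in subset:
            s += 0.1
    return s - 0.01 * len(subset)  # small penalty per extra feature

print(greedy_forward_select(["g1", "g3", "g7"], evaluate))
```

Note that the wrapper discovers the g1+g7 interaction, which a univariate filter score would miss; the price is one model evaluation per candidate subset.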
Machine Learning and the Value of Health Technologies (Covance)
Machine learning can be applied through the development of algorithms that can unravel or "learn" complex associations in large datasets with limited human input. These algorithms are capable of making predictions that go beyond our capabilities as humans and they can process and analyze more possibilities. Machine learning may help us find answers to questions that we didn't even think of in the past, revealing evidence previously hidden among the data. We can use these methods to dig up imperceptible patterns and allow health technologies to be used at the right time and for the right patient population. (A4 Version)
A New Active Learning Technique Using Furthest Nearest Neighbour Criterion for Facial Age Classification (ijcsa)
Active learning is a supervised learning method based on the idea that a machine learning algorithm can achieve greater accuracy with fewer labeled training images if it is allowed to choose the images from which it learns. Facial age classification is a technique for classifying face images into one of several predefined age groups. The proposed study applies an active learning approach to facial age classification which allows a classifier to select the data from which it learns. The classifier is initially trained using a small pool of labeled training images; this is achieved using bilateral two-dimensional linear discriminant analysis. Then the most informative unlabeled image is found in the unlabeled pool using the furthest nearest neighbour criterion, labeled by the user, and added to the appropriate class in the training set. Incremental learning is performed using an incremental version of bilateral two-dimensional linear discriminant analysis. This active learning paradigm is applied to the k-nearest-neighbour classifier and the support vector machine classifier, and the performance of the two classifiers is compared.
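The furthest nearest neighbour query criterion itself is compact enough to sketch directly (toy 2-D points standing in for face-image feature vectors):

```python
import math

def furthest_nearest_neighbour(labeled, unlabeled):
    """Active-learning query sketch: pick the unlabeled point whose nearest
    labeled point is furthest away, i.e. the sample least represented by
    the current training set."""
    def nearest_dist(u):
        return min(math.dist(u, x) for x in labeled)
    return max(unlabeled, key=nearest_dist)

labeled = [(0.0, 0.0), (1.0, 0.0)]
unlabeled = [(0.5, 0.1), (0.9, 0.1), (4.0, 4.0)]
print(furthest_nearest_neighbour(labeled, unlabeled))  # -> (4.0, 4.0)
```

The selected point is then shown to the human annotator, labeled, and folded into the training set before the next query.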
Drug discovery and development is a long and expensive process that has so notoriously bucked Moore's law over time that the opposite trend has its own name: Eroom's Law. It is estimated that the attrition rate of drug candidates is up to 96%, and the average cost to develop a new drug has reached almost $2.5 billion in recent years. One of the major causes of the high attrition rate is drug safety, which accounts for 30% of the failures.
Even if a drug is approved for market, it can still be withdrawn due to safety problems. Therefore, evaluating drug safety extensively and as early as possible is paramount to accelerating drug discovery and development. This talk provides a high-level overview of the current process of rational drug design that has been in place for many decades and covers some of the major areas where the application of AI, deep learning and ML based techniques has produced the most gains.
Specifically, this talk covers a variety of drug-safety-related AI and ML techniques currently in use, which can generally be divided into 3 main categories:
1. Discovery,
2. Toxicity and Safety, and
3. Post-Market Monitoring.
We will address the recent progress in predictive models and techniques built for various toxicities. It will also cover some publicly available databases, tools and platforms available to easily leverage them.
We will also compare and contrast various modeling techniques including deep learning techniques and their accuracy using recent research. Finally, the talk will address some of the remaining challenges and limitations yet to be addressed in the area of drug discovery and safety assessment.
Fuzzy Association Rule Mining based Model to Predict Students’ Performance IJECEIAES
The major intention of higher education institutions is to supply quality education to their students. One approach to reaching the maximum level of quality in a higher education system is to discover knowledge that can predict performance in internal assessments and end-semester examinations. The projected work approaches this objective by taking advantage of fuzzy inference to classify student score data according to performance level. In this paper, students' performance is evaluated using fuzzy association rule mining, which predicts each student's performance at the end of the semester on the basis of previous database records such as attendance, mid-semester marks, previous semester marks, and previous academic records. The goal is to identify students who need individual attention, to decrease the fail ratio, and to take suitable action before the next semester examination.
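The fuzzy classification step described above can be sketched with triangular membership functions that map a numeric mark onto linguistic performance levels; the level names and breakpoints below are hypothetical, not taken from the paper.

```python
def tri(x, a, b, c):
    """Triangular membership: rises from a to a peak at b, falls to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def performance_level(score):
    # Hypothetical fuzzy sets over a 0-100 mark scale.
    levels = {
        "Low":    tri(score, -1, 0, 50),
        "Medium": tri(score, 30, 55, 80),
        "High":   tri(score, 60, 100, 101),
    }
    # Defuzzify by picking the level with the highest membership.
    return max(levels, key=levels.get)
```

In a full system the fired fuzzy rules, not just the maximum membership, would feed the association-rule mining step.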
Correlation based feature selection (cfs) technique to predict student perfro...IJCNCJournal
Education data mining is an emerging stream which helps in mining academic data for solving various types of problems. One of these problems is the selection of a proper academic track. The admission of a student to an engineering college depends on many factors. In this paper we have tried to implement a classification technique to assist students in predicting their success in admission to an engineering stream. We have analyzed a data set containing information about students' academic as well as socio-demographic variables, with attributes such as family pressure, interest, gender, XII marks, and CET rank in entrance examinations, together with historical data of previous batches of students. Feature selection is a process for removing irrelevant and redundant features, which helps improve the predictive accuracy of classifiers. In this paper we first used the feature selection attribute algorithms Chi-square, InfoGain, and GainRatio to identify the relevant features. Then we applied a fast correlation-based filter on the given features. Classification was then done using NBTree, MultilayerPerceptron, NaiveBayes, and instance-based K-nearest neighbor. Results showed a reduction in computational cost and time and an increase in predictive accuracy for the student model.
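The chi-square filter mentioned above can be sketched from scratch for a single binary feature against binary class labels (the four-cell contingency version; library implementations generalize this to many features and values):

```python
def chi2_score(feature, labels):
    """Chi-square statistic for a binary feature vs. binary class labels."""
    # Observed counts for each (feature value, class) cell.
    obs = {(f, c): 0 for f in (0, 1) for c in (0, 1)}
    for f, c in zip(feature, labels):
        obs[(f, c)] += 1
    n = len(labels)
    chi2 = 0.0
    for f in (0, 1):
        for c in (0, 1):
            row = obs[(f, 0)] + obs[(f, 1)]   # marginal for this feature value
            col = obs[(0, c)] + obs[(1, c)]   # marginal for this class
            exp = row * col / n               # expected count under independence
            if exp:
                chi2 += (obs[(f, c)] - exp) ** 2 / exp
    return chi2
```

Higher scores indicate stronger feature-class association; a perfectly predictive feature on four samples scores 4.0, an independent one 0.0.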
Natural language processing through the subtractive mountain clustering algor...ijnlc
In this work, the subtractive mountain clustering algorithm has been adapted to the problem of natural language processing with a view to constructing a chatbot that answers questions posed by the user. The implemented version of the algorithm allows for the association of a set of words into clusters. After finding the centre of every cluster (the most relevant word), all the others are aggregated according to a defined metric adapted to the language-processing realm. All the relevant stored information (necessary to answer the questions) is processed by the algorithm, as are the questions themselves. The correct processing of the text enables the chatbot to produce answers that relate to the posed queries. Since we have in view a chatbot to help elderly people with medication, to validate the method we use the package insert of a drug as the available information and formulate associated questions. Errors in medication intake among elderly people are very common. One of the main causes for this is their loss of ability to retain information. The high amount of medicine intake required at an advanced age is another limiting factor. Hence, the design of an interactive aid system, preferably using natural language, to help the older population with medication is in demand. A chatbot based on a subtractive clustering algorithm is the chosen solution.
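The subtractive (mountain) clustering loop described above, in which every point's potential is scored, the peak is picked as a centre, and its mountain is subtracted before repeating, can be sketched on one-dimensional data; the radius and sample points below are illustrative.

```python
import math

def subtractive_centres(points, ra=1.0, n_centres=2):
    """Pick cluster centres by subtractive (mountain) clustering."""
    alpha = 4 / ra ** 2
    beta = alpha / 2.25  # standard choice: rb = 1.5 * ra
    # Initial potential: sum of Gaussian influence from every other point.
    pot = [sum(math.exp(-alpha * (p - q) ** 2) for q in points) for p in points]
    centres = []
    for _ in range(n_centres):
        i = max(range(len(points)), key=lambda k: pot[k])
        c, pc = points[i], pot[i]
        centres.append(c)
        # Subtract the winner's mountain so nearby points lose potential.
        pot = [pot[k] - pc * math.exp(-beta * (points[k] - c) ** 2)
               for k in range(len(points))]
    return centres
```

In the chatbot setting the "points" would be word vectors under the paper's language-adapted metric rather than raw numbers.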
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODELijcsit
Predicting student performance is a great concern to higher education managements. This prediction helps to identify and improve students' performance. Several factors may improve this performance. In the present study, we employ data mining processes, particularly classification, to enhance the quality of the higher educational system. Recently, a new direction has been taken toward improving classification accuracy by combining classifiers. In this paper, we design and evaluate a fast learning algorithm using an AdaBoost ensemble with a simple genetic algorithm, called "Ada-GA", where the genetic algorithm is demonstrated to successfully improve the accuracy of the combined classifier's performance. The Ada-GA algorithm proved to be of considerable usefulness in identifying students at risk early, especially in very large classes. This early prediction allows the instructor to provide appropriate advising to those students. The Ada-GA algorithm was implemented and tested on the ASSISTments dataset; the results showed that this algorithm successfully improved detection accuracy and reduced the complexity of computation.
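The AdaBoost half of Ada-GA can be sketched in a few lines over one-dimensional threshold stumps (the GA-based tuning is omitted); the data and round count here are illustrative, not from the paper.

```python
import math

def stump_predict(x, thr, sign):
    """A one-feature decision stump: +sign at or above the threshold, -sign below."""
    return sign if x >= thr else -sign

def adaboost_fit(xs, ys, rounds=3):
    """Minimal AdaBoost over threshold stumps; labels must be in {-1, +1}."""
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        # Pick the stump with the lowest weighted error.
        best = min(((sum(wi for wi, x, y in zip(w, xs, ys)
                         if stump_predict(x, thr, sign) != y), thr, sign)
                    for thr in xs for sign in (-1, 1)), key=lambda t: t[0])
        err, thr, sign = best
        alpha = 0.5 * math.log((1 - err + 1e-10) / (err + 1e-10))
        ensemble.append((alpha, thr, sign))
        # Re-weight: misclassified points gain weight, correct ones lose it.
        w = [wi * math.exp(-alpha * y * stump_predict(x, thr, sign))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]

    def predict(x):
        score = sum(a * stump_predict(x, t, s) for a, t, s in ensemble)
        return 1 if score >= 0 else -1
    return predict
```

A genetic algorithm, as in Ada-GA, would then search over the ensemble's hyperparameters rather than accept these defaults.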
With the growth of voluminous amounts of data in educational institutes, the need is to mine these large datasets to produce useful information. In this research we focus on forming a decision support system for educational institutes which can help them know the placement possibility of students. Our research is not limited to finding placement possibility; we also performed a multi-level analysis on the student performance dataset to predict what level of the interview process a student is likely to pass. For this we applied Naïve Bayes and an improved Naïve Bayes integrated with the Relief feature selection technique to obtain the prediction. Data analysis was done using NetBeans and WEKA. Our proposed technique gave better accuracy (84.7%) than existing Naïve Bayes, which gave 80.96% accuracy.
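The Naïve Bayes classifier used above can be sketched from scratch for categorical attributes with add-one smoothing; the attribute values and class labels below are hypothetical placement-style data, not the paper's dataset.

```python
from collections import Counter

def train_nb(rows, labels):
    """Categorical Naive Bayes with add-one (Laplace) smoothing."""
    class_counts = Counter(labels)
    n_feats = len(rows[0])
    # Per-class, per-feature value counts.
    counts = {c: [Counter() for _ in range(n_feats)] for c in class_counts}
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            counts[c][j][v] += 1
    vocab = [len({r[j] for r in rows}) for j in range(n_feats)]
    total = len(labels)

    def predict(row):
        best, best_p = None, -1.0
        for c, nc in class_counts.items():
            p = nc / total  # class prior
            for j, v in enumerate(row):
                p *= (counts[c][j][v] + 1) / (nc + vocab[j])
            if p > best_p:
                best, best_p = c, p
        return best
    return predict
```

Relief-style feature selection would run before training, pruning the attribute columns passed to `train_nb`.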
Controlling informative features for improved accuracy and faster predictions...Damian R. Mingle, MBA
Identification of suitable biomarkers for accurate prediction of phenotypic outcomes is a goal for personalized medicine. However, current machine learning approaches are either too complex or perform poorly.
For more information:
http://societyofdatascientists.com/controlling-informative-features-for-improved-accuracy-and-faster-predictions-in-omentum-cancer-models/?src=slideshare
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...theijes
Feature selection is considered a problem of global combinatorial optimization in machine learning: it reduces the number of features and removes irrelevant, noisy, and redundant data. However, identifying useful features from hundreds or even thousands of related features is not an easy task. Selecting relevant genes from microarray data becomes even more challenging owing to the high dimensionality of the features, the multiple class categories involved, and the usually small sample size. In order to improve prediction accuracy and to avoid incomprehensibility due to the number of features, different feature selection techniques can be implemented. This survey classifies and analyzes different approaches, aiming not only to provide a comprehensive presentation but also to discuss challenges and various performance parameters. The techniques are generally classified into three categories: filter, wrapper, and hybrid.
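Filter methods score features independently, while a wrapper evaluates candidate subsets with the classifier itself. The sketch below is a greedy forward wrapper in which `score` stands in for cross-validated classifier accuracy (a hypothetical scoring callback, not any specific library API):

```python
def forward_select(features, score, k=2):
    """Greedy wrapper feature selection: repeatedly add the feature whose
    addition maximizes score(selected_subset) until k features are chosen."""
    chosen = []
    while len(chosen) < k:
        candidates = [f for f in features if f not in chosen]
        best = max(candidates, key=lambda f: score(chosen + [f]))
        chosen.append(best)
    return chosen
```

The wrapper's cost is the number of `score` calls, which is why filters or hybrid schemes are preferred on microarray-scale feature counts.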
Machine Learning and the Value of Health TechnologiesCovance
Machine learning can be applied through the development of algorithms that can unravel or "learn" complex associations in large datasets with limited human input. These algorithms are capable of making predictions that go beyond our capabilities as humans and they can process and analyze more possibilities. Machine learning may help us find answers to questions that we didn't even think of in the past, revealing evidence previously hidden among the data. We can use these methods to dig up imperceptible patterns and allow health technologies to be used at the right time and for the right patient population. (A4 Version)
A New Active Learning Technique Using Furthest Nearest Neighbour Criterion fo...ijcsa
Active learning is a supervised learning method based on the idea that a machine learning algorithm can achieve greater accuracy with fewer labelled training images if it is allowed to choose the images from which it learns. Facial age classification is a technique to classify face images into one of several predefined age groups. The proposed study applies an active learning approach to facial age classification which allows a classifier to select the data from which it learns. The classifier is initially trained using a small pool of labeled training images; this is achieved using bilateral two-dimensional linear discriminant analysis. Then the most informative unlabeled image is selected from the unlabeled pool using the furthest-nearest-neighbour criterion, labeled by the user, and added to the appropriate class in the training set. Incremental learning is performed using an incremental version of bilateral two-dimensional linear discriminant analysis. This active learning paradigm is applied to the k-nearest-neighbour classifier and the support vector machine classifier in order to compare the performance of the two.
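The furthest-nearest-neighbour criterion above reduces to a max-min search over the unlabelled pool; a minimal sketch, with an illustrative distance function and points:

```python
def most_informative(unlabelled, labelled, dist):
    """Furthest-nearest-neighbour criterion: choose the unlabelled example
    whose closest labelled example is furthest away."""
    return max(unlabelled, key=lambda u: min(dist(u, l) for l in labelled))
```

The selected example is then shown to the user for labelling and moved into the training pool, as the abstract describes.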
Drug discovery and development is a long and expensive process that has so notoriously bucked Moore's Law over time that the reverse trend now has its own name: Eroom's Law (Moore's spelled backwards). It is estimated that the attrition rate of drug candidates is up to 96%, and the average cost to develop a new drug has reached almost $2.5 billion in recent years. One of the major causes of the high attrition rate is drug safety, which accounts for 30% of the failures.
Even if a drug is approved in market, it could be withdrawn due to safety problems. Therefore, evaluating drug safety extensively as early as possible is paramount in accelerating drug discovery and development. This talk provides a high-level overview of the current process of rational drug design that has been in place for many decades and covers some of the major areas where the application of AI, Deep learning and ML based techniques have had the most gains.
Specifically, this talk covers a variety of drug-safety-related AI and ML based techniques currently in use, which can generally be divided into 3 main categories:
1. Discovery,
2. Toxicity and Safety, and
3. Post-Market Monitoring.
We will address recent progress in predictive models and techniques built for various toxicities, and cover some publicly available databases, tools, and platforms that make them easy to leverage.
We will also compare and contrast various modeling techniques including deep learning techniques and their accuracy using recent research. Finally, the talk will address some of the remaining challenges and limitations yet to be addressed in the area of drug discovery and safety assessment.
A travel time model for estimating the water budget of complex catchmentsRiccardo Rigon
This is the presentation given by Marialaura Bancheri for her admission to the final exam to achieve a Ph.D. in Environmental Engineering. It contains a synthesis of her studies on spatially integrated models of the water budget and on travel time theory. A preliminary model structure containing five reservoirs is also presented.
Breast cancer diagnosis via data mining performance analysis of seven differe...cseij
According to the World Health Organization (WHO), breast cancer is the top cancer in women in both the developed and the developing world. Increased life expectancy, urbanization, and adoption of western lifestyles trigger the occurrence of breast cancer in the developing world. Most cancer cases are diagnosed in the late phases of the illness, so early detection is crucial to improving breast cancer outcomes and survival.
In this study, we intend to contribute to the early diagnosis of breast cancer, and an analysis of breast cancer diagnoses for patients is given. For this purpose, data about patients whose cancers have already been diagnosed were first gathered and arranged, and those data were then used to predict whether other patients have breast cancer. Predictions for the other patients were produced by seven different algorithms, and their accuracies are reported. The patient data were taken from the UCI Machine Learning Repository, thanks to Dr. William H. Wolberg of the University of Wisconsin Hospitals, Madison. During the prediction process, the RapidMiner 5.0 data mining tool was used to apply the desired algorithms.
An overview on data mining designed for imbalanced datasetseSAT Journals
Abstract: Imbalanced datasets are those in which the classification categories are not approximately equally represented. The imbalance problem occurs in classification when the number of examples of one class is much smaller than the number of examples of the other classes. Recent years have brought increased awareness of this issue as machine learning methods are applied to complex real-world tasks, many of which are characterized by imbalanced data. Imbalanced datasets have become a critical problem in machine learning and are commonly found in applications such as detection of fraudulent calls, biomedicine, engineering, remote sensing, computer security, and manufacturing industries. Several approaches have been proposed to overcome these problems. This paper studies the imbalanced dataset problem, examines various sampling methods used in the evaluation of such datasets, and discusses which interpretation methods are most suitable for imbalanced dataset mining. Keywords: Imbalance Problems, Imbalanced datasets, sampling strategies, Machine Learning.
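The simplest of the sampling methods surveyed above is random oversampling, which duplicates minority-class examples until the classes balance; a minimal sketch with a fixed seed for reproducibility:

```python
import random

def oversample(rows, labels, seed=0):
    """Random oversampling: duplicate minority examples until classes balance."""
    random.seed(seed)
    by_class = {}
    for r, y in zip(rows, labels):
        by_class.setdefault(y, []).append(r)
    target = max(len(v) for v in by_class.values())
    out_rows, out_labels = [], []
    for y, rs in by_class.items():
        # Draw extra copies at random from this class until it reaches target size.
        extra = [random.choice(rs) for _ in range(target - len(rs))]
        for r in rs + extra:
            out_rows.append(r)
            out_labels.append(y)
    return out_rows, out_labels
```

Undersampling works symmetrically by discarding majority examples; both trade bias for balance, which is the tension the surveyed papers explore.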
Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm O...ijtsrd
Conventional classification algorithms can be limited in their performance on highly imbalanced datasets. A popular line of work for countering class imbalance has been the use of a variety of sampling strategies. In this work, we focus on adapting neural networks to properly handle the problem of class imbalance. We combine different rebalancing heuristics in ANN modelling, including cost-sensitive learning and over- and under-sampling. These ANN-based methods are compared with other state-of-the-art approaches on a variety of datasets using several metrics, including G-mean, area under the ROC curve, F-measure, and area under the precision-recall curve. Many common strategies, which can be categorized as sampling, cost-sensitive, or ensemble methods, involve heuristic and task-dependent procedures. In order to achieve better classification performance through a formulation free of heuristics and task dependence, we propose an RBF-based network (RBF-NN). Its objective function is the harmonic mean of several evaluation criteria derived from a confusion matrix, such as sensitivity, positive predictive value, and the corresponding criteria for negatives. This objective function and its optimization are formulated within the framework of CM-KLOGR, based on minimum classification error and generalized probabilistic descent (MCE/GPD) learning. Owing to the benefits of the harmonic mean, CM-KLOGR, and MCE/GPD, RBF-NN improves these multiple performance measures in a well-balanced way. We present the formulation of RBF-NN and demonstrate its effectiveness through experiments that evaluate it on benchmark imbalanced datasets. Nitesh Kumar | Dr. Shailja Sharma, "Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm Optimization", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-5, August 2019, URL: https://www.ijtsrd.com/papers/ijtsrd25255.pdf Paper URL: https://www.ijtsrd.com/computer-science/other/25255/adaptive-classification-of-imbalanced-data-using-ann-with-particle-of-swarm-optimization/nitesh-kumar
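Of the evaluation metrics listed above, the G-mean is the geometric mean of sensitivity and specificity, which penalizes a classifier that sacrifices the minority class; a small sketch for binary 0/1 labels:

```python
import math

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity (binary labels 0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    sens = tp / pos if pos else 0.0  # recall on the positive (often minority) class
    spec = tn / neg if neg else 0.0  # recall on the negative class
    return math.sqrt(sens * spec)
```

A classifier that predicts only the majority class scores 0.0 regardless of raw accuracy, which is why G-mean is favoured in imbalanced settings.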
Cancer prognosis prediction using balanced stratified samplingijscai
High accuracy in cancer prediction is important to improve the quality of treatment and the survival rate of patients. As data volume increases rapidly in healthcare research, the analytical challenge grows in step. Using an effective sampling technique in classification algorithms always yields good prediction accuracy. The SEER public-use cancer database provides various prominent class labels for prognosis prediction. The main objective of this paper is to find the effect of sampling techniques in classifying the prognosis variable and to propose an ideal sampling method based on the outcome of the experimentation. In the first phase of this work the traditional random sampling and stratified sampling techniques were used. At the next level, balanced stratified sampling with variations according to the choice of the prognosis class labels was tested. Much of the initial time was focused on pre-processing the SEER data set. The classification model for experimentation was built using the breast cancer, respiratory cancer, and mixed cancer data sets with three traditional classifiers, namely Decision Tree, Naïve Bayes, and K-Nearest Neighbour. The three prognosis factors survival, stage, and metastasis were used as class labels for experimental comparisons. The results show a steady increase in the prediction accuracy of the balanced stratified model as the sample size increases, whereas the traditional approach fluctuates before reaching optimal results.
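Stratified sampling as used above keeps each class's share of the sample equal to its share of the population; a deterministic sketch (a real study would shuffle within each stratum first):

```python
import math

def stratified_sample(rows, labels, frac):
    """Deterministic stratified sampling sketch: take the first
    ceil(frac * class_size) rows of each class, preserving proportions."""
    by_class = {}
    for r, y in zip(rows, labels):
        by_class.setdefault(y, []).append(r)
    sample = []
    for y, rs in by_class.items():
        k = math.ceil(frac * len(rs))
        sample.extend((r, y) for r in rs[:k])
    return sample
```

The "balanced" variant in the paper goes one step further and equalizes the per-class counts instead of preserving the original proportions.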
Evaluation of Student's Perception in Using Electronic Dental Records at Riya...Dr. Faris Al-Masaari
Dentoplus is custom-made software used by Riyadh Colleges of Dentistry and Pharmacy (RCsDP) that has been in place since January 2013. The current study was initiated to evaluate the electronic dental record system Dentoplus installed in the Colleges of Dentistry. The focus of this study was on students' performance, system efficiency, satisfaction with the system, and students' perception of how the system has impacted patient care.
Walden University
NURS 6050 Policy and Advocacy for Improving Population Health
Module 3
Module 3: Regulation (Weeks 5-6)
Laureate Education (Producer). (2018). Regulation [Video file]. Baltimore, MD: Author.
Rubric Print Format
Course Code: LDR-463 | Class Code: LDR-463-O501 | Assignment Title: Topic 5 Journal Entry | Total Points: 30.0
Content (100% of grade)
Response to Journal Entry Prompt (80%):
· Unsatisfactory (0%): Response to the journal entry prompt is not present.
· Less Than Satisfactory (65%): Response to the journal entry prompt is incomplete or incorrect.
· Satisfactory (75%): Response to the journal entry prompt is complete but lacks relevant detail.
· Good (85%): Response to the journal entry prompt is thorough and contains substantial supporting details.
· Excellent (100%): Response to the journal entry prompt is complete and contains relevant supporting details.
Mechanics of Writing, including spelling, punctuation, grammar, and language use (20%):
· Unsatisfactory (0%): Frequent and repetitive mechanical errors distract the reader. Inconsistencies in language choice (register) or word choice are present. Sentence structure is correct but not varied.
· Less Than Satisfactory (65%): Surface errors are pervasive enough that they impede communication of meaning. Inappropriate word choice or sentence construction is used.
· Satisfactory (75%): Some mechanical errors or typos are present, but they are not overly distracting to the reader. Correct and varied sentence structure and audience-appropriate language are employed.
· Good (85%): Prose is largely free of mechanical errors, although a few may be present. The writer uses a variety of effective sentence structures and figures of speech.
· Excellent (100%): Writer is clearly in command of standard, written, academic English.
Total Weightage: 100%
Improved vision-based diagnosis of multi-plant disease using an ensemble of d...IJECEIAES
Farming and plants are crucial parts of the internal economy of a nation and significantly boost a country's economic growth. Preserving plants from disease infections at an early stage is cumbersome due to the absence of efficient diagnosis tools, and diverse difficulties lie in existing methods of plant disease recognition. As a result, developing a rapid and efficient multi-plant disease diagnosis system is a challenging task. At present, deep learning-based methods are frequently utilized for diagnosing plant diseases and have outperformed existing methods with higher efficiency. In order to investigate plant diseases more accurately, this article presents an efficient hybrid approach using deep learning-based methods. Xception and ResNet50 models were applied for the classification of plant diseases, and these models were merged using the stacking ensemble learning technique to generate a hybrid model. A multi-plant dataset was created using leaf images of four plants: black gram, betel, Malabar spinach, and litchi, containing nine classes and 44,972 images. Compared to existing individual convolutional neural network (CNN) models, the proposed hybrid model is more feasible and effective, acquiring 99.20% accuracy. The outcomes and comparison with existing methods show that the designed method can achieve competitive performance on multi-plant disease diagnosis tasks.
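Stacking as used for the hybrid model above feeds the base models' outputs into a meta-model that makes the final call; a toy sketch with hypothetical scorer functions (in practice the meta-model is trained on held-out base-model predictions, and the base models would be CNNs like Xception and ResNet50):

```python
def stacked_predict(base_models, meta_model, x):
    """Stacking: each base model scores the input, and the meta-model
    combines those scores into the final prediction."""
    return meta_model([m(x) for m in base_models])

# Toy usage: two hypothetical threshold scorers and a vote-counting meta-model.
base = [lambda x: x > 0.5, lambda x: x > 0.7]
meta = lambda votes: sum(votes) >= 1  # positive if any base model agrees
```

The advantage over simple averaging is that the meta-model can learn which base model to trust on which kinds of input.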
Machine Learning Based Approaches for Cancer Classification Using Gene Expres...mlaij
The classification of different types of tumor is of great importance in cancer diagnosis and drug discovery. Earlier studies on cancer classification had limited diagnostic ability. The recent development of DNA microarray technology has made it possible to monitor thousands of gene expressions simultaneously. Using this abundance of gene expression data, researchers are exploring the possibilities of cancer classification. A number of methods have been proposed with good results, but many issues still need to be addressed. This paper presents an overview of various cancer classification methods and evaluates these proposed methods based on their classification accuracy, computational time, and ability to reveal gene information. We have also evaluated and introduced various proposed gene selection methods. Several issues related to cancer classification are also discussed.
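A common first pass in the gene selection methods surveyed above is a simple variance filter over the expression matrix, keeping only the most variable genes before heavier methods run; a minimal sketch with hypothetical gene names:

```python
def top_variance_genes(matrix, names, k):
    """Rank genes (columns of an expression matrix) by variance and keep
    the k most variable ones, a common first-pass gene selection filter."""
    n = len(matrix)
    scores = []
    for j, name in enumerate(names):
        col = [row[j] for row in matrix]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n  # population variance
        scores.append((var, name))
    scores.sort(reverse=True)
    return [name for _, name in scores[:k]]
```

Genes that barely vary across samples carry little class information, so removing them cheaply shrinks the search space for the wrapper or hybrid selectors that follow.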
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTIONijscai
Breast cancer is the leading cause of mortality among women in developed countries and women's second most prominent cause of cancer mortality worldwide. In recent decades, the prevalence of breast cancer in women has risen dramatically. This paper discusses several data analysis methods used to detect breast cancer early. Breast cancer diagnosis distinguishes benign and malignant breast lumps, and we tackled this disease analysis using data processing tools. Data mining is an important step of knowledge discovery in which intelligent methods are used to detect patterns. Several clinical breast cancer studies have been conducted using soft computing and machine learning techniques; some of their algorithms are simpler, faster, or more comprehensive than others. This research focuses on genetic programming and machine learning algorithms to reliably identify benign and malignant breast cancer, and this study aimed to optimise the testing algorithm. We used genetic programming methods to choose the classifiers' best features and parameter values. We analyse the Wisconsin data set available from the U.C.I. machine learning repository. In this experiment, we compare four Weka clustering strategies with genetic clustering. A comparison of results reveals that sequential minimal optimization (S.M.O.) performs better than the I.B.K. and B.F. Tree methods, achieving 97.71%.
An approach of cervical cancer diagnosis using class weighting and oversampli...TELKOMNIKA JOURNAL
Globally, cervical cancer caused 604,127 new cases and 341,831 deaths in 2020, according to the Global Cancer Observatory. In addition, the number of cervical cancer patients who have no symptoms has grown recently. Therefore, giving patients early notice of the possibility of cervical cancer is a useful task, since it would enable them to have a clear understanding of their health state. In this work, artificial intelligence (AI), particularly machine learning, is applied to uncovering cervical cancer. With the help of a logit model and a new deep learning technique, we aim to identify cervical cancer using patient-provided data. For better outcomes, we employ Keras deep learning together with class weighting and oversampling. Compared against the actual diagnostic results, the experimental model accuracy is 94.18%, demonstrating successful logit-model cervical cancer prediction.
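Class weighting as used above can be illustrated with a from-scratch logistic (logit) model on a single feature, where positive examples receive a larger gradient weight; the data and weight value below are illustrative, and the paper itself uses Keras, which exposes the same idea via the `class_weight` argument to `fit`.

```python
import math

def weighted_logreg(xs, ys, w_pos=3.0, lr=0.1, epochs=200):
    """Logistic regression by SGD on one feature, with a class weight
    that scales the gradient of positive (label 1) examples."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w * x + b)))
            # Weighted gradient of the log-loss for this example.
            g = (p - y) * (w_pos if y == 1 else 1.0)
            w -= lr * g * x
            b -= lr * g
    return lambda x: 1 / (1 + math.exp(-(w * x + b)))
```

Up-weighting the rare positive class shifts the decision boundary toward the majority class, trading some specificity for higher sensitivity, which is the desired trade-off in cancer screening.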
HLT 362 V GCU Quiz 1 SusanaFurman449
HLT 362 V GCU
Quiz 1
1. When a researcher uses a random sample of 400 to make conclusions about a larger population, this is an example of:
· Descriptive statistics
· Demographics
· Inferential statistics
· Dependent variables
2. If a study is comparing number of falls by age, age is considered what type of variable?
· Interval
· Ordinal
· Ratio
· Nominal
3. Validity is:
· A data item, such as characteristics, numbers, properties, or quantities, that can be measured or counted.
· The extent to which an idea or measurement is well-founded and an accurate representation of the real world.
· A measurement level with equal distances between the points and a zero-starting point.
· Raw unorganized information from which conclusions can be made.
4. Data is defined as:
· A data item, such as characteristics, numbers, properties, or quantities, that can be measured or counted.
· The extent to which an idea or measurement is well-founded and an accurate representation of the real world.
· A measurement level with equal distances between the points and a zero-starting point.
· Raw unorganized information from which conclusions can be made.
5. The average of the collected data is known as:
· Mean
· Median
· Variance
· Range
6. The experimental or predictor variable is an example of:
· Extraneous variable
· Dependent variable
· Independent variable
· Nominal data
7. Level of measurement that defines the relationship between things and assigns an order or ranking to each thing is known as:
· Interval
· Ordinal
· Ratio
· Nominal
8. A variable is considered:
· A data item, such as characteristics, numbers, properties, or quantities, that can be measured or counted.
· A component of mathematics that looks at gathered data.
· Statistics designed to allow the researcher to infer characteristics regarding a population from sample population.
· External and internal influences within a study that can affect the validity and reliability of the outcomes.
9. External and internal influences within a study that can affect the validity and reliability of outcomes is called:
· Continuous variables
· Demographics
· Bias
· Standard deviation
10. The subset of the population to be studied is called:
· Sample
· Variable
· Population
· Demographic
Put the below in your own words into 1-2 paragraphs for the main conclusion and 1-2 paragraphs for the clinical application
Main conclusion:
The following is one example of a main conclusion and clinical applicability to assist you in formulating your take home message for the dissemination assignment. The details in these descriptions are intentionally detailed for your consideration. Do not include this level of detail in the dissemination assignment.
HPV study:
The Healthy People 2020 HPV vaccination goal of 80% of all United States adolescents is not being met with current practices (citation). With insufficient vaccination, reduction in HPV-related disease ...
SVM & GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTION (IJSCAI)
Breast cancer is the leading cause of mortality among women in developed countries and the second most prominent cause of cancer mortality among women worldwide. In recent decades, the prevalence of breast cancer in women has risen dramatically. This paper discusses several data analysis methods used to detect breast cancer early. Breast cancer diagnosis distinguishes benign and malignant breast lumps. Using data processing tools, we tackled the analysis of this disease. Data mining is an important step of knowledge discovery in which intelligent methods are used to detect patterns. Several clinical breast cancer studies were conducted using soft computing and machine learning techniques. Sometimes their algorithms are simpler, easier to use, or more comprehensive than others. This research is focused on genetic programming and machine learning algorithms to reliably identify benign and malignant breast cancer, with the aim of optimising the testing algorithm. We used genetic programming methods to choose the classifiers' best features and parameter values. We analysed the Wisconsin breast cancer data set available from the U.C.I. repository. In this experiment, we compare four Weka clustering strategies with genetic clustering. A comparison of results reveals that sequential minimal optimization (S.M.O.) outperforms the I.B.K. and B.F. Tree methods, reaching 97.71%.
FACIAL AGE ESTIMATION USING TRANSFER LEARNING AND BAYESIAN OPTIMIZATION BASED... (sipij)
Age estimation under unrestricted imaging conditions has attracted increased attention, as it is applicable in several real-world tasks such as surveillance, face recognition, age synthesis, access control, and electronic customer relationship management. Current deep learning-based methods have displayed encouraging performance in the age estimation field. Males and females have different appearance aging patterns and therefore age differently; this suggests that using gender information may improve the age estimator's performance. We have proposed a novel model based on gender classification. A Convolutional Neural Network (CNN) is used to obtain gender information, and Bayesian optimization is then applied to this pre-trained CNN when it is fine-tuned for the age estimation task. Bayesian optimization reduces the classification error on the validation set for the pre-trained model. Extensive experiments were carried out to assess the proposed model on two data sets, FERET and FG-NET. The results indicate that using a pre-trained CNN containing gender information with Bayesian optimization outperforms the state of the art on the FERET and FG-NET data sets, with a Mean Absolute Error (MAE) of 1.2 and 2.67 respectively.
Caries (ICTAI 2008)
A Comparative Study of Machine Learning Techniques for Caries Prediction
Robson D. Montenegro, Adriano L. I. Oliveira, George G. Cabral
Department of Computing and Systems, Polytechnic School, Pernambuco State University
Rua Benfica, 455, Madalena, Recife PE, Brazil, 50.750-410
{adriano,rdm,ggc}@dsc.upe.br
Cintia R. T. Katz, Aronita Rosenblatt
Department of Preventive and Social Dentistry, Faculty of Dentistry, Pernambuco State University
Av. Gal. Newton Cavalcanti, 1.650 - Camaragibe, PE, Brazil, 54.753-220
cintiakatz@uol.com.br, rosen@reitoria.upe.br
Abstract
There are striking disparities in the prevalence of dental disease by income. Poor children suffer twice as much dental caries as their more affluent peers, but are less likely to receive treatment. This paper presents an experimental study of the application of machine learning methods to the problem of caries prediction. For this paper, a data set collected in 2006 from interviews with children under five years of age in Recife, the capital of Pernambuco, a state in northeast Brazil, was built. Four different data mining techniques were applied to this problem and their results were compared in terms of the classification error and area under the ROC curve (AUC). Results showed that the MLP neural network classifier outperformed the other machine learning methods employed in the experiments, followed by the support vector machine (SVM) predictor. In addition, the results also show that some rules (extracted by decision trees) may be useful for understanding the most important factors that influence the occurrence of caries in children.
1 Introduction
Early childhood caries is a disease that occurs in young children and is associated with malnutrition and inadequate eating habits during weaning. Dental caries is the single most common chronic childhood disease, 5 times more common than asthma and 7 times more common than hay fever. This disease is considered a public health problem due to its impact on quality of life; it affects, almost exclusively, children from less privileged socioeconomic groups in developed and developing countries. Preceded by enamel defects, early childhood caries may have its progress limited if it is detected early [22][21].
The increasingly widespread use of information systems in health and the considerable growth of databases require traditional manual data analyses to be adjusted to new, efficient computational models [13]; those manual processes easily break down as the size of the data grows and the number of dimensions increases. Data mining is a research method that has been used to provide benefits to a large number of fields of medicine, including diagnosis, prognosis and the treatment of diseases [2][3][17]. It encompasses techniques such as machine learning and artificial neural networks (ANNs), which have been successfully applied to medical problems to predict clinical results [2][17].
In recent years, there has been a significant increase in the use of technology in medicine and related areas. The complexity and sophistication of the technologies often require the solution of decision problems using combinatorics and optimization methods [3]. Despite the importance of data mining and machine learning techniques, there remains little application of these techniques to the field of dentistry. Recently, Oliveira et al. applied machine learning techniques in this field, aiming to predict the success of dental implants [6][18].
The purpose of this paper is to build robust models to solve the problem of predicting the presence of caries in preschool children under five years of age in state schools (attended by the low-income population) in Recife, the capital of Pernambuco, in the northeastern region of Brazil. This paper also aims to extract and display, in a more friendly form, the rules, or factors, associated with caries prediction in this specific case.
2 Data Set Characteristics
A databank was constructed with information collected from 3864 Brazilian preschool children under five years of age. A cross-sectional study was conducted in state schools (attended by the low-income population) in Recife, the capital of Pernambuco, in the northeastern region of Brazil.
Recife is one of the three most important urban centers of the northeastern region of Brazil. The population of the city and its surrounding area is over 3 million people. The city is divided into six administrative regions and has 153 schools run by the municipality, attended by 4,787 four-year-old children.
The questionnaires were completed during personal interviews with each child's mother. In every case, the examiner was blind to the child's questionnaire data. Examinations were performed under natural light, in the classroom environment, using tongue blades, gloves and masks, in compliance with the infection control protocol (Ministry of Health, Brazil).
For each child, 193 (one hundred and ninety-three) features were collected in the questionnaire. From this total, only sixteen features were considered significant to the problem of caries prediction.
As shown in Table 1, there is a significantly greater occurrence of healthy samples, thereby making the data set unbalanced [14]. For this reason, in the experiments, only 998 samples were considered for caries prediction. These 998 samples are equally divided into caries and healthy samples.
Table 1. Distribution of caries in the whole dataset.
Class      Number of samples
Caries     499
Healthy    3365
Total      3864
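The balanced 998-sample subset described above can be obtained by undersampling the majority (healthy) class down to the minority class size. A minimal sketch in Python, with a made-up sampling function and seed (the paper does not specify how the balanced subset was drawn):

```python
import random

def balance_by_undersampling(samples, seed=0):
    """Randomly undersample the majority class down to the size of
    the minority class, yielding a balanced binary dataset."""
    rng = random.Random(seed)
    pos = [s for s in samples if s[1] == "caries"]
    neg = [s for s in samples if s[1] == "healthy"]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced

# Class counts from Table 1: 499 caries vs 3365 healthy samples.
data = [(i, "caries") for i in range(499)] + [(i, "healthy") for i in range(3365)]
balanced = balance_by_undersampling(data)
print(len(balanced))  # 998, equally divided between the two classes
```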
The input variables (attributes) considered in our problem are:
1. Gender: male/female.
2. Age in months.
3. Parent's opinion about the oral health of the child (excellent, good, regular, bad, very bad).
4. Has the child already had a toothache? (yes/no)
5. Family income (1 to 7, or more) in minimum wages.
6. Child has already gone to the dentist and a caries was diagnosed (yes/no).
7. Child has never gone to the dentist for another reason (yes/no).
8. Child has already gone to the dentist (yes/no).
9. Child has already visited the dentist for having a toothache (yes/no).
10. Presence of failure in the enamel (yes/no).
11. Presence of fistula (yes/no).
12. Political-administrative region (from 1 to 6).
13. Child has never gone to the dentist for access reasons (yes/no).
14. Child has already gone to the dentist for prevention reasons (yes/no).
15. Child has never gone to the dentist for financial reasons (yes/no).
The output variable is:
1. Presence of caries (yes/no).
3 The Classifiers Evaluated
In this section we briefly review the four classification techniques used in this work, namely, (1) decision trees, (2) MLP neural networks, (3) kNN, and (4) support vector machines.
Decision trees are statistical models for classification and data prediction. These models take a "divide-and-conquer" approach: a complex problem is decomposed into simpler sub-problems and, recursively, this technique is applied to each sub-problem [10].
For this work we have chosen one of the most popular algorithms for building decision trees, C4.5 [20]. C4.5 is a software extension of the basic ID3 algorithm designed by Quinlan to address some issues not dealt with by ID3, such as avoiding overfitting the data, determining how deeply to grow a decision tree, improving computational efficiency, etc. Quinlan's C4.5 has a parameter named the confidence factor, denoted by C, that is used for pruning. In general, smaller values of C yield more pruning. For the experiments we varied the value of the confidence factor to obtain a more accurate classification model.
The MLP (Multi-Layer Perceptron) neural network derives from the perceptron model of neural networks. Unlike the basic perceptron, MLPs are able to solve non-linearly separable problems. For this work we have chosen the backpropagation learning algorithm for training MLP neural networks.
The MLP network is trained by adapting the weights. During training, the network output is compared with a desired output. The error, that is, the difference between these two signals, is used to adapt the weights. The rate of adaptation is controlled by the learning rate. A high learning rate will make the network adapt its weights quickly, but will make it potentially unstable. Therefore it is recommended to use small learning rates in practical applications.
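The effect of the learning rate can be illustrated on a single linear neuron trained with the delta rule, a simplification of backpropagation. This is only a sketch of the general principle described above; the inputs, target, and learning rates are made up, not the paper's settings:

```python
def output(w, x):
    """Linear neuron: weighted sum of the inputs."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_step(w, x, d, lr):
    """Delta-rule update: w <- w + lr * (desired - actual) * x."""
    err = d - output(w, x)
    return [wi + lr * err * xi for wi, xi in zip(w, x)]

x, d = [1.0, 2.0], 5.0
w_small, w_large = [0.0, 0.0], [0.0, 0.0]
for _ in range(50):
    w_small = train_step(w_small, x, d, lr=0.05)  # small rate: stable
    w_large = train_step(w_large, x, d, lr=0.50)  # large rate: overshoots
print(abs(d - output(w_small, x)) < 1e-3)  # True: converged to the target
print(abs(d - output(w_large, x)) > 1e6)   # True: the error blew up
```

With lr = 0.05 the residual error shrinks by a constant factor each step, while with lr = 0.50 each update overshoots the target and the error grows without bound, which is exactly the instability the text warns about.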
kNN is a classical prototype-based (or memory-based) classifier, which is often used in real-world applications due to its simplicity [24]. Despite its simplicity, it has achieved considerable classification accuracy on a number of tasks and is therefore quite often used as a basis for comparison with novel classifiers.
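A minimal kNN classifier, to make the idea concrete: classify a query point by majority vote among its k nearest training points. The Euclidean distance and the toy points are illustrative assumptions, not the paper's Weka configuration:

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest
    training points under Euclidean distance."""
    by_dist = sorted(train, key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

# toy 2-D points with made-up labels
train = [((0.0, 0.0), "healthy"), ((0.1, 0.2), "healthy"),
         ((1.0, 1.0), "caries"), ((0.9, 1.1), "caries")]
print(knn_predict(train, (0.95, 1.0), k=3))  # caries
```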
Support vector machine (SVM) is a recent technique for classification and regression which has achieved remarkable accuracy in a number of important problems [4], [23], [5], [1]. SVM is based on the principle of structural risk minimization (SRM), which states that, in order to achieve good generalization performance, a machine learning algorithm should attempt to minimize the structural risk instead of the empirical risk [9], [1]. The empirical risk is the error on the training set, whereas the structural risk considers both the error on the training set and the complexity of the class of functions used to fit the data. Despite its popularity in the machine learning and pattern recognition communities, a recent study has shown that simpler methods, such as kNN and neural networks, can achieve performance comparable to or even better than SVMs in some classification and regression problems [16].
4 Experiments
The simulations were carried out using the Weka data mining tool, which includes several pre-processing and classification methods [25].
We have used 10-fold cross-validation to assess the generalization performance as well as to compare the classifiers considered in this article. In 10-fold cross-validation (CV), a given dataset is divided into ten subsets. A classifier is trained using a subset formed by joining nine of these subsets and tested using the one left aside [7]. This is done ten times, each time employing a different subset as the test set and computing the test set error, Ei. Finally, the cross-validation error is computed as the mean over the ten errors Ei, 1 ≤ i ≤ 10. It is important to emphasize that all the simulations reported here used stratified CV, whereby the subsets are formed using the same frequency distribution of patterns as the original dataset [25].
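The stratified fold assignment described above can be sketched as follows: deal the indices of each class round-robin across the folds so that every fold preserves the overall class frequencies. This is an illustrative reimplementation, not Weka's exact algorithm:

```python
from collections import defaultdict

def stratified_folds(labels, n_folds=10):
    """Assign sample indices to n_folds folds so that each fold
    preserves the class frequencies of the full dataset."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):  # deal round-robin over folds
            folds[pos % n_folds].append(idx)
    return folds

# the balanced 998-sample set: 499 caries and 499 healthy
labels = ["caries"] * 499 + ["healthy"] * 499
folds = stratified_folds(labels, n_folds=10)
print(len(folds), sum(len(f) for f in folds))  # 10 998
```

Each fold then serves once as the test set; the CV error is the mean of the ten test-set errors Ei.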
The performance measures used to compare the classifiers are (1) the classification error, and (2) the area under the ROC curve (AUC) [9], [11], [12]. ROC curves originated from signal detection theory and are most frequently used in the case of one-class classification or classification with two classes, which is the case of our problem [18][8]. In the ROC curve, the x-axis represents the PFA (Probability of False Alarm), which identifies normal patterns wrongly classified as novelties; the y-axis represents the PD (Probability of Detection), which identifies the likelihood of patterns of the novelty class being recognized correctly. The area under the ROC curve (AUC) summarizes the ROC curve and is another way to compare classifiers besides accuracy, according to Huang and Ling [11]. In a comparison of classifiers, the best classifier is the one whose AUC is closest to 1.
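The AUC can be computed without tracing the curve at all, via its rank (Mann-Whitney) interpretation: the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, counting ties as one half. A small sketch with made-up scores:

```python
def auc(scores, labels):
    """AUC via the Mann-Whitney formulation: fraction of
    positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # 1.0: perfect ranking
print(auc([0.9, 0.7, 0.6, 0.2], [1, 1, 0, 1]))  # 2/3: one positive below the negative
```

An AUC of 1 means every positive outranks every negative, which is why the classifier with AUC closest to 1 is preferred.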
Aiming to select the attributes from the dataset with the greatest significance to the problem, we have used InfoGainAttributeEval as the attribute evaluator, with the search method Ranker. InfoGainAttributeEval evaluates the worth of an attribute by measuring the information gain with respect to the class. Ranker ranks attributes by their individual evaluations, using a threshold by which attributes can be discarded. For our experiments we varied the threshold by which attributes can be discarded from 10^-4 to 10^-1.
4.1 Results and Discussion
We carried out experiments aiming to analyze the performance for the different selected attributes (see Tables 2 and 3). Table 2 shows the results obtained using the whole input feature vector (15 input variables), that is, without feature selection. In these experiments we achieved the best result with the MLP method, followed by the SVM (in terms of 10-fold cross-validation error). In terms of AUC, MLP achieved the best result, followed by kNN.
For the decision trees, the results demonstrate that the parameter C has a great influence on the performance of the classifier: the error increased by 5.01% from C = 0.25 to C = 0.001. For C = 0.25, the decision tree created 78 nodes, while the decision tree using C = 0.001 created only 5 nodes. Fig. 1 shows the simple model created by the C4.5 algorithm for C = 0.001 without feature selection. These results match the AUC results: for C = 0.25 the AUC is better than the AUC for C = 0.001.
Among all the experiments carried out using feature selection, the best results were found with the InfoGainAttributeEval threshold = 10^-4, which means using only two input attributes. The two attributes selected by InfoGainAttributeEval were the age in months and the opinion of the person responsible about the oral health of the child.
Table 2. Caries prediction results without feature selection (15 input attributes)
Classifier                                                          10-fold cross-validation error    AUC
kNN (k = 19)                                                        26.75%                            0.8178
C4.5 (C = 0.25)                                                     25.95%                            0.7985
C4.5 (C = 0.001)                                                    30.96%                            0.7193
MLP (hidden layer units = 2, learning rate = 0.01, epochs = 500)    22.75%                            0.8452
SVM (C = 1, σ = 0.1)                                                23.65%                            0.7635
Figure 1. Decision Tree for C = 0.001.
Table 3 shows the results obtained using the InfoGainAttributeEval threshold = 10^-4. With only two attributes we improved the results obtained by kNN and the decision trees. Conversely, the results of the MLP and SVM methods were inferior to those with 15 input variables. In these experiments, as in the experiments without feature selection, we achieved the best result with the MLP method, followed by kNN, in terms of both performance criteria, namely the classification error and the AUC value.
Using feature selection, both decision tree models achieved a slight performance improvement. As a multidisciplinary work, this paper chose decision trees as one of the methods to treat this problem because of their ability to extract rules from the problem. For a dentist, it is easier to use the results provided by decision trees than the results of classifiers such as MLPs, which are harder to interpret.
5 Conclusion
Early childhood caries is considered a public health problem which often occurs in children from less privileged socioeconomic groups. In this work we have compared the performance of four different classifiers applied to the problem of caries prediction. For this problem, we also performed feature selection on the dataset, aiming to retrieve the attributes most relevant to the task of caries prediction.
The results have shown that the best model for caries prediction was obtained by the MLP neural network, which achieved a 10-fold cross-validation error rate of 22.75% without feature selection. Using InfoGainAttributeEval as the feature selection method, the MLP and SVM methods had a slight performance loss, whereas the decision trees (C = 0.001 and C = 0.25) and kNN achieved a slight improvement in their performance.
From the results obtained in this work we can see that children aged twenty-three months or more are more caries-prone. The results also show that the family income, whether the child has already had a toothache, and whether the child has already had a caries diagnosis influence the occurrence of the disease. The results also show that children already diagnosed as caries carriers have presented recurrence; this leads us to conclude that the treatment is not achieving the needed efficiency in the reeducation of the child's oral hygiene.
References
[1] V. D. Sánchez A. Advanced support vector machines and kernel methods. Neurocomputing, 55(1-2):5–20, 2003.
[2] S. R. Bhatikar, C. DeGroff, and R. L. Mahajan. A classifier based on the artificial neural network approach for cardiologic auscultation in pediatrics. Artificial Intelligence in Medicine, 33(3):251–260, 2005.
[3] T.-C. Chen and T.-C. Hsu. A GAs based approach for mining breast cancer pattern. Expert Systems with Applications, 30(4):674–681, 2006.
[4] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:1–25, 1995.
[5] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
[6] A. L. I. de Oliveira, C. Baldisserotto, and J. Baldisserotto. A comparative study on machine learning techniques for prediction of success of dental implants. In A. F. Gelbukh, A. de Albornoz, and H. Terashima-Marín, editors, MICAI, volume 3789 of Lecture Notes in Computer Science, pages 939–948. Springer, 2005.
[7] D. Delen, G. Walker, and A. Kadam. Predicting breast cancer survivability: a comparison of three data mining methods. Artificial Intelligence in Medicine, 34(2):113–127, 2005.
[8] N. M. Farsi and F. S. Salama. Sucking habits in Saudi children: prevalence, contributing factors and effects on the primary dentition. Pediatric Dentistry, 19(1):28–33, 1997.
[9] T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, June 2006.
[10] J. Gama. Functional trees. Machine Learning, 55(3):219–250, 2004.
Table 3. Caries prediction results for InfoGainAttributeEval threshold = 10^-4 (2 input attributes).
Classifier                                                          10-fold cross-validation error    AUC
kNN (k = 11)                                                        24.65%                            0.8136
C4.5 (C = 0.25)                                                     25.15%                            0.8011
C4.5 (C = 0.001)                                                    29.76%                            0.7458
MLP (hidden layer units = 2, learning rate = 0.01, epochs = 500)    24.75%                            0.8223
SVM (C = 100, σ = 0.1)                                              25.05%                            0.7495
Figure 2. Decision Tree for C = 0.25 with feature selection and InfoGainAttributeEval threshold = 10^-4.
[11] J. Huang and C. X. Ling. Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 17(3):299–310, 2005.
[12] T. A. Lasko, J. G. Bhagwat, K. H. Zou, and L. Ohno-Machado. The use of receiver operating characteristic curves in biomedical informatics. Journal of Biomedical Informatics, 38(5):404–415, 2005.
[13] N. Lavrač. Machine learning for data mining in medicine. In W. Horn, Y. Shahar, G. Lindberg, S. Andreassen, and J. Wyatt, editors, Proceedings of the Joint European Conference on Artificial Intelligence in Medicine and Medical Decision Making (AIMDM-99), volume 1620 of LNAI, pages 47–64, Berlin, June 1999. Springer.
[14] Y. Lu, H. Guo, and L. Feldkamp. Robust neural learning from unbalanced data samples. In IEEE International Joint Conference on Neural Networks (IJCNN'98), volume III, pages 1816–1821, Anchorage, AK, July 1998. IEEE.
[15] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943.
[16] D. Meyer, F. Leisch, and K. Hornik. The support vector machine under test. Neurocomputing, 55(1-2):169–186, 2003.
[17] B. A. Mobley, E. Schechter, W. E. Moore, P. A. McKee, and J. E. Eichner. Neural network predictions of significant coronary artery stenosis in men. Artificial Intelligence in Medicine, 34(2):151–161, 2005.
[18] A. L. I. Oliveira, C. Baldisserotto, and J. Baldisserotto. A comparative study on support vector machine and constructive RBF neural network for prediction of success of dental implants. In A. Sanfeliu and M. Lazo-Cortés, editors, CIARP, volume 3773 of Lecture Notes in Computer Science, pages 1015–1026. Springer, 2005.
[19] J. R. Quinlan. Induction of decision trees. In J. W. Shavlik and T. G. Dietterich, editors, Readings in Machine Learning. Morgan Kaufmann, 1990. Originally published in Machine Learning, 1:81–106, 1986.
[20] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
[21] S. Reisine and J. M. Douglass. Psychosocial and behavioral issues in early childhood caries, 1998.
[22] A. Rosenblatt and A. Zarzar. Breast feeding and early childhood caries: an assessment among Brazilian infants. International Journal of Paediatric Dentistry, 14:439–450, 2004.
[23] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[24] A. Webb. Statistical Pattern Recognition. Wiley, 2002.
[25] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, 2000.
[26] I. H. Witten and E. Frank. Data mining: practical machine learning tools and techniques with Java implementations. SIGMOD Record, 31(1):76–77, March 2002.