The document summarizes a machine learning project to predict Parkinson's disease. It covers cleaning and exploring the data, which consist of speech attributes from 240 subjects. Feature importance analysis identified attributes such as Delta3 and the MFCCs as most informative. Several machine learning models were tested, with a random forest performing best at 97.2% accuracy under cross-validation. The conclusion discusses further model optimization and collecting more data, and the lessons learned highlight the challenges of limited labeled data and the importance of domain knowledge.
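The 97.2% figure above is a cross-validated score; the mechanics of k-fold evaluation can be sketched in plain Python. The toy threshold "model" below is a hypothetical stand-in, since the project's actual random forest code is not part of this summary:

```python
from statistics import mean

def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_accuracy(X, y, k, fit, predict):
    """Mean accuracy over k folds; `fit` returns a model, `predict(model, x)` a label."""
    accs = []
    for test_idx in kfold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        accs.append(hits / len(test_idx))
    return mean(accs)

# Toy 1-D threshold classifier standing in for the real random forest.
fit = lambda X, y: sum(X) / len(X)        # "model" = mean value as threshold
predict = lambda t, x: int(x >= t)
X = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
y = [0, 0, 0, 1, 1, 1]
acc = cross_val_accuracy(X, y, 3, fit, predict)
```

Each fold is held out exactly once, so the reported accuracy reflects performance on unseen data rather than training fit.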
"Optimization of patient throughput and wait time in emergency departments (ED) is an important task for hospital systems. For that reason, the Emergency Severity Index (ESI) system for patient triage was introduced to help guide manual estimation of acuity levels, which nurses use to rank patients and organize hospital resources. However, despite the improvements it brought to managing medical resources, such a triage system depends heavily on the nurse's subjective judgment and is thus prone to human error. Here, we propose a novel deep model based on a word attention mechanism designed to predict the number of resources an ED patient will need.
Our approach combines routinely available continuous and nominal (structured) data with medical text (unstructured) data, including the patient's chief complaint, past medical history, medication list, and nurse assessment, collected for 338,500 ED visits over three years in a large urban hospital. Using both structured and unstructured data, the proposed approach achieves an AUC of 88% for the task of identifying resource-intensive patients and an accuracy of 44% for predicting the exact resource-count category, an estimated lift of 16% in accuracy over nurses' performance. Furthermore, the attention mechanism of the proposed model provides interpretability by assigning attention scores to the nurses' notes, which is crucial for decision making and for deploying such approaches in real systems working on human health."
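The word attention mechanism the abstract relies on can be illustrated with a minimal sketch: score each token's embedding against a context vector, softmax-normalize the scores, and pool the embeddings by those weights. All names, vectors, and the context query below are illustrative assumptions, not the paper's model:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(word_vecs, context):
    """Score each word embedding against a context vector, normalize the
    scores with softmax, and return the weighted sum plus per-word weights.
    The weights are the 'attention scores' that make the model interpretable."""
    scores = [sum(w * c for w, c in zip(vec, context)) for vec in word_vecs]
    weights = softmax(scores)
    dim = len(word_vecs[0])
    pooled = [sum(weights[i] * word_vecs[i][d] for i in range(len(word_vecs)))
              for d in range(dim)]
    return pooled, weights

# Hypothetical 2-D embeddings for three tokens from a nurse's note.
tokens = ["chest", "pain", "mild"]
vecs = [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0]]
context = [1.0, 0.0]                     # hypothetical learned query vector
pooled, weights = attention_pool(vecs, context)
```

Inspecting `weights` shows which tokens the pooled representation emphasizes, which is the interpretability property the abstract highlights.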
Robust Breast Cancer Diagnosis on Four Different Datasets Using Multi-Classif...ahmad abdelhafeez
Abstract- The goal of this paper is to compare different classifiers and multi-classifier fusion with respect to accuracy in detecting breast cancer on four different datasets. We implement the best-known classification techniques in this field on four breast cancer datasets, two for diagnosis and two for prognosis, and fuse classifiers to find the best multi-classifier fusion approach for each dataset individually. Classification accuracy is obtained from the confusion matrix built with 10-fold cross-validation, and fusion uses majority voting (the mode of the classifier outputs). The experimental results show that no single classification technique is best across all datasets, since the classification task is affected by the type of dataset. With multi-classifier fusion, accuracy improved on three of the four datasets.
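The fusion rule the abstract names, majority voting as the mode of the classifier outputs, is simple to state in code. A minimal sketch with made-up labels (not the paper's datasets or classifiers):

```python
from collections import Counter

def majority_vote(predictions):
    """Fuse per-classifier predictions by taking the mode for each sample.
    `predictions` is a list of per-classifier label lists, one list per model."""
    fused = []
    for labels in zip(*predictions):        # one tuple of votes per sample
        fused.append(Counter(labels).most_common(1)[0][0])
    return fused

# Hypothetical outputs of three classifiers on the same three samples.
clf_a = ["malignant", "benign",    "benign"]
clf_b = ["malignant", "malignant", "benign"]
clf_c = ["benign",    "malignant", "benign"]
fused = majority_vote([clf_a, clf_b, clf_c])
# fused == ['malignant', 'malignant', 'benign']
```

Because the fused label only needs a plurality of the base classifiers to be right, the ensemble can outperform any single member, which matches the paper's finding of improved accuracy on three of four datasets.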
A MODIFIED MAXIMUM RELEVANCE MINIMUM REDUNDANCY FEATURE SELECTION METHOD BASE...gerogepatton
Parkinson’s disease is a complex chronic neurodegenerative disorder of the central nervous system. One of the common symptoms in Parkinson’s disease subjects is vocal performance degradation, and patients are usually advised to follow personalized rehabilitative treatment sessions with speech experts. Recent research investigates the potential of sustained vowel phonations for replicating speech experts’ assessments of Parkinson’s disease subjects’ voices. With the aim of improving the accuracy and efficiency of Parkinson’s disease treatment, this article proposes a two-stage diagnosis model evaluated on an LSVT dataset. First, we propose a modified minimum Redundancy-Maximum Relevance (mRMR) feature selection approach based on Cuckoo Search and Tabu Search to reduce the number of features. Second, we apply a simple random sampling technique to the dataset to increase the number of minority-class samples. Promisingly, the developed approach obtained a classification accuracy of 95% with 24 features under 10-fold cross-validation.
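The second stage, simple random sampling to grow the minority class, amounts to oversampling with replacement until the classes balance. A hedged sketch under that reading (not the paper's code; the data are made up):

```python
import random

def oversample_minority(X, y, seed=0):
    """Balance a binary dataset by drawing minority-class rows uniformly at
    random, with replacement, until both classes have equal counts."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    (maj_label, maj_rows), (min_label, min_rows) = sorted(
        by_class.items(), key=lambda kv: -len(kv[1]))
    extra = [rng.choice(min_rows) for _ in range(len(maj_rows) - len(min_rows))]
    X_bal = maj_rows + min_rows + extra
    y_bal = [maj_label] * len(maj_rows) + [min_label] * (len(min_rows) + len(extra))
    return X_bal, y_bal

# Hypothetical imbalanced data: four majority samples, one minority sample.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = oversample_minority(X, y)
```

Duplicating minority rows gives the classifier equal exposure to both classes, at the cost of repeated samples, which is why it pairs naturally with cross-validation as in the abstract.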
Histogram-weighted cortical thickness networks for the detection of Alzheimer...Pradeep Redddy Raamana
Presentation delivered by Pradeep Reddy Raamana at the 2016 international workshop on Pattern Recognition in Neuroimaging on the topic of histogram-weighted cortical thickness networks for the detection of Alzheimer's disease.
PREDICTION OF MALIGNANCY IN SUSPECTED THYROID TUMOUR PATIENTS BY THREE DIFFER...cscpconf
In the present study, the abilities of three data mining classification methods, namely artificial neural networks with the feed-forward back-propagation algorithm, the J48 decision tree method, and logistic regression analysis, are compared on a real medical dataset. The objective of the study is the prediction of malignancy in suspected thyroid tumour patients. The accuracy of the predictions (the minimum error rate), the time consumed in the modelling process, and the interpretability and simplicity of the results for clinical experts are the factors considered in choosing the best method.
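A comparison like the one above reduces to a small evaluation harness that scores each method on the study's first two criteria, accuracy and modelling time. The sketch below uses toy stand-in classifiers, not the study's ANN, J48, or logistic regression:

```python
import time

def evaluate(name, fit, predict, train, test):
    """Report accuracy and fit time for one method, mirroring two of the
    study's selection criteria (interpretability must be judged by hand)."""
    (X_tr, y_tr), (X_te, y_te) = train, test
    t0 = time.perf_counter()
    model = fit(X_tr, y_tr)
    fit_seconds = time.perf_counter() - t0
    acc = sum(predict(model, x) == y for x, y in zip(X_te, y_te)) / len(y_te)
    return {"method": name, "accuracy": acc, "fit_seconds": fit_seconds}

# Two toy stand-ins: a majority-class baseline and a 1-nearest-neighbour rule.
majority_fit = lambda X, y: max(set(y), key=y.count)
majority_predict = lambda model, x: model
nn1_fit = lambda X, y: (X, y)
nn1_predict = lambda m, x: m[1][min(range(len(m[0])),
                                    key=lambda i: abs(m[0][i] - x))]

train = ([0.1, 0.2, 0.9, 1.0], [0, 0, 1, 1])
test  = ([0.15, 0.95], [0, 1])
results = [evaluate("majority", majority_fit, majority_predict, train, test),
           evaluate("1-NN", nn1_fit, nn1_predict, train, test)]
```

Tabulating these dictionaries per method is exactly the kind of side-by-side comparison the abstract describes, with interpretability added as a qualitative column.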
Paper Annotated: SinGAN-Seg: Synthetic Training Data Generation for Medical I...Devansh16
YouTube video: https://www.youtube.com/watch?v=Ao-19L0sLOI
SinGAN-Seg: Synthetic Training Data Generation for Medical Image Segmentation
Vajira Thambawita, Pegah Salehi, Sajad Amouei Sheshkal, Steven A. Hicks, Hugo L.Hammer, Sravanthi Parasa, Thomas de Lange, Pål Halvorsen, Michael A. Riegler
Processing medical data to find abnormalities is a time-consuming and costly task, requiring tremendous effort from medical experts. Therefore, AI has become a popular tool for the automatic processing of medical data, acting as a supportive tool for doctors. AI tools depend heavily on data for training the models. However, there are several constraints on access to large amounts of medical data for training machine learning algorithms in the medical domain, e.g., due to privacy concerns and the costly, time-consuming medical data annotation process. To address this, in this paper we present a novel synthetic data generation pipeline called SinGAN-Seg to produce synthetic medical data with the corresponding annotated ground truth masks. We show that this pipeline can be used both to bypass privacy concerns and to produce artificial segmentation datasets with corresponding ground truth masks, avoiding the tedious medical data annotation process. As a proof of concept, we used an open polyp segmentation dataset. By training UNet++ on both the real polyp segmentation dataset and the corresponding synthetic dataset generated by the SinGAN-Seg pipeline, we show that the synthetic data can achieve performance very close to the real data when the real segmentation datasets are large enough. In addition, we show that synthetic data generated by the SinGAN-Seg pipeline improves the performance of segmentation algorithms when the training dataset is very small. Since the SinGAN-Seg pipeline is applicable to any medical dataset, it can be used with any other segmentation dataset.
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as: arXiv:2107.00471 [eess.IV]
(or arXiv:2107.00471v1 [eess.IV] for this version)
Propose an Enhanced Framework for Prediction of Heart DiseaseIJERA Editor
Heart disease diagnosis is a complex task that requires experience. Traditionally, heart disease is identified through a number of medical tests prescribed by the doctor, such as a cardiac MRI, an ECG, and a stress test. Today, the health care industry holds a huge amount of health care data containing hidden information, and effective decisions can be made from this hidden information. To obtain suitable results, advanced computer-based data mining techniques are used. Artificial neural networks (ANNs) are mathematical techniques used in the empirical sciences for inference and categorisation, and also for modelling real neural networks. Mental phenomena such as acting, wanting, knowing, remembering, perceiving, thinking, and inferring can be understood using the theory of ANNs, and ANNs are powerful instruments for inference and classification, where problems of probability and induction arise. In this paper, classification techniques, namely the Naive Bayes classification algorithm and artificial neural networks, are used to classify the attributes in the given dataset. Attribute filtering techniques, namely PCA (Principal Component Analysis) and Information Gain attribute subset evaluation, are used for feature selection on the dataset to predict heart disease symptoms. A new framework based on these techniques is proposed: the input dataset is fed into a feature selection block, which selects whichever technique yields the fewest attributes; classification is then performed with the two algorithms, and the attributes selected by both classification tasks are used for the prediction of heart disease.
This framework reduces the time needed to predict heart disease symptoms and lets the user identify the important attributes via the proposed framework.
When deep learners change their mind learning dynamics for active learningDevansh16
Abstract:
Active learning aims to select for annotation the samples that yield the largest performance improvement for the learning algorithm. Many methods approach this problem by measuring the informativeness of samples based on the certainty of the network's predictions. However, it is well known that neural networks are overly confident in their predictions and are therefore an untrustworthy source for assessing sample informativeness. In this paper, we propose a new informativeness-based active learning method whose measure is derived from the learning dynamics of a neural network. More precisely, we track the label assignments of the unlabeled data pool during training. We capture the learning dynamics with a metric called label dispersion, which is low when the network consistently assigns the same label to a sample during training and high when the assigned label changes frequently. We show that label dispersion is a promising predictor of the network's uncertainty, and we show on two benchmark datasets that an active learning algorithm based on label dispersion obtains excellent results.
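The label-dispersion metric can be written down directly. The formula below is one plausible formalization consistent with the abstract's description (low when the assigned label is stable across training checkpoints, high when it flips), not necessarily the paper's exact definition:

```python
from collections import Counter

def label_dispersion(label_history):
    """1 minus the share of training checkpoints that agree with the modal
    predicted label for one sample. 0.0 = perfectly stable predictions;
    values near 1.0 = the network keeps changing its mind."""
    modal_count = Counter(label_history).most_common(1)[0][1]
    return 1 - modal_count / len(label_history)

# Predicted labels for two hypothetical unlabeled samples over five epochs.
stable = label_dispersion([1, 1, 1, 1, 1])   # never changes its mind
flippy = label_dispersion([0, 1, 2, 1, 0])   # keeps flipping
```

Ranking the unlabeled pool by this score and annotating the highest-dispersion samples first is the acquisition strategy the abstract evaluates.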
Anomaly Detection and Spark Implementation - Meetup Presentation.pptxImpetus Technologies
StreamAnalytix sponsored a meetup on “Anomaly Detection Techniques and Implementation using Apache Spark” which took place on Tuesday, December 5, 2017 at the Larkspur Landing Milpitas Hotel, Milpitas, CA. The meetup was led by Maxim Shkarayev, Lead Data Scientist, Impetus Technologies, along with Punit Shah, Solution Architect, StreamAnalytix, and Anand Venugopal, Product Head & AVP, StreamAnalytix, who introduced and summarized the vast field of anomaly detection and its applications to various industry problems. The speakers also offered a structured approach to choosing the right anomaly detection techniques based on specific use cases and data characteristics, followed by a demonstration of some real-world anomaly detection use cases on an Apache Spark-based analytics platform.
neuropredict: a proposal and a tool towards standardized and easy assessment ...Pradeep Redddy Raamana
Properly applying machine learning to evaluate the accuracy of biomarkers is challenging and error-prone for those without expertise in machine learning or programming. We offer an easy-to-use tool that implements best practices and produces a comprehensive yet clinically relevant report when comparing several biomarkers or different methods/studies. The tool, called neuropredict, is open source and applicable to any domain whose biomarkers can be represented by numbers.
Simplified Knowledge Prediction: Application of Machine Learning in Real LifePeea Bal Chakraborty
Machine learning is the scientific study of algorithms and statistical models that machines use to perform a specific task based on patterns and inference rather than explicit instructions. This research and analysis aims to observe how precisely a machine can predict whether a patient suspected of breast cancer has a malignant or benign tumour. In this paper, the classification of cancer type and the prediction of risk levels are performed with various machine learning models and depicted pictorially with visual analytics tools.
Accounting for variance in machine learning benchmarksDevansh16
Accounting for Variance in Machine Learning Benchmarks
Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Naz Sepah, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Dmitriy Serdyuk, Tal Arbel, Chris Pal, Gaël Varoquaux, Pascal Vincent
Strong empirical evidence that one machine-learning algorithm A outperforms another algorithm B ideally calls for multiple trials that optimize the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameter choices. This is prohibitively expensive, so corners are cut to reach conclusions. We model the whole benchmarking process, revealing that variance due to data sampling, parameter initialization, and hyperparameter choice markedly impacts the results. We analyze the comparison methods predominantly used today in light of this variance. We show the counter-intuitive result that adding more sources of variation to an imperfect estimator better approaches the ideal estimator, at a 51-times reduction in compute cost. Building on these results, we study the error rate of detecting improvements on five different deep-learning tasks/architectures. This study leads us to propose recommendations for performance comparisons.
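The paper's central recommendation, estimating the spread of a benchmark over its sources of variation rather than trusting a single run, looks like this in miniature. `noisy_run` is a hypothetical stand-in for a real train/evaluate pipeline whose score depends on the seed:

```python
import random
from statistics import mean, stdev

def benchmark(run, seeds):
    """Repeat one train/evaluate pipeline under several random seeds and
    summarize the spread of scores, the kind of variance accounting the
    paper argues should back any claim that algorithm A beats algorithm B."""
    scores = [run(seed) for seed in seeds]
    return {"mean": mean(scores), "stdev": stdev(scores), "scores": scores}

def noisy_run(seed):
    # Hypothetical pipeline: the seed would control data sampling, parameter
    # initialization, and hyperparameter draws in a real experiment.
    rng = random.Random(seed)
    return 0.85 + rng.gauss(0, 0.01)

summary = benchmark(noisy_run, seeds=range(20))
```

An observed improvement smaller than a couple of `stdev`s here is indistinguishable from seed noise, which is precisely the trap of single-run comparisons.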
Clinical Research Statistics for Non-StatisticiansBrook White, PMP
Through real-world examples, this presentation teaches strategies for choosing appropriate outcome measures, methods for analysis and randomization, and sample sizes as well as tips for collecting the right data to answer your scientific questions.
Presentation at Advanced Intelligent Systems for Sustainable Development (AISSD 2021), 20-22 August 2021, organized by the Scientific Research Group in Egypt in collaboration with the Faculty of Computers and AI, Cairo University, and the Chinese University in Egypt.
Predictive Analysis of Breast Cancer Detection using Classification AlgorithmSushanti Acharya
Dissertation project titled “Predictive analysis of Breast Cancer detection using Classification”. The research used the Breast Cancer Wisconsin Diagnostics dataset for analysis. Machine learning models based on various algorithms were designed in the R language, and the derived results were then visualized to identify the most accurate model of them all (SVM in this case).
Batch -13.pptx lung cancer detection using transfer learninghananth1513
Embedded systems
Embedded systems are special-purpose computing systems embedded in application environments or in other computing systems, where they provide specialized support. The decreasing cost of processing power, combined with the decreasing cost of memory and the ability to design low-cost systems on chip, has led to the development and deployment of embedded computing systems in a wide range of application environments. Examples include network adapters for computing systems and mobile phones, and control systems for air conditioning, industrial systems, and cars.
AN EFFICIENT PSO BASED ENSEMBLE CLASSIFICATION MODEL ON HIGH DIMENSIONAL DATA...ijsc
As the sizes of biomedical databases grow day by day, finding the essential features for disease prediction has become more complex due to high dimensionality and sparsity. Also, given the large number of micro-array datasets available in biomedical repositories, it is difficult to analyze, predict, and interpret the feature information using traditional feature-selection-based classification models. Most traditional feature-selection-based classification algorithms have computational issues such as dimension reduction, uncertainty, and class imbalance on microarray datasets. The ensemble classifier is a scalable model for the extreme learning machine thanks to its high efficiency and fast processing speed in real-time applications. The main objective of feature-selection-based ensemble learning models is to classify high-dimensional data with high computational efficiency and a high true positive rate. In this work, an optimized Particle Swarm Optimization (PSO) based ensemble classification model was developed for high-dimensional microarray datasets. Experimental results show that the proposed model has high computational efficiency compared with traditional feature-selection-based classification models in terms of accuracy, true positive rate, and error rate.
An Efficient PSO Based Ensemble Classification Model on High Dimensional Data...ijsc
As the size of the biomedical databases are growing day by day, finding an essential features in the disease prediction have become more complex due to high dimensionality and sparsity problems. Also, due to the availability of a large number of micro-array datasets in the biomedical repositories, it is difficult to
analyze, predict and interpret the feature information using the traditional feature selection based classification models. Most of the traditional feature selection based classification algorithms have computational issues such as dimension reduction, uncertainty and class imbalance on microarray datasets. Ensemble classifier is one of the scalable models for extreme learning machine due to its high efficiency, the fast processing speed for real-time applications. The main objective of the feature selection based ensemble learning models is to classify the high dimensional data with high computational efficiency and high true positive rate on high dimensional datasets. In this proposed model an optimized Particle swarm optimization (PSO) based Ensemble classification model was developed on high dimensional microarray datasets. Experimental results proved that the proposed model has high computational efficiency compared to the traditional feature selection based classification models in terms of accuracy , true positive rate and error rate are concerned.
The goal of this project is to find the best tool for predicting the life expectancy of people with Hepatitis B. Different Machine Learning methods have been completely studied and various Machine Learning methods have been carried out by different experimenters. Hepatitis B is a worldwide disease with a high mortality rate. Different methods have been used by different researchers to predict the life expectancy of Hepatitis B patients. The Machine Learning models and algorithms such as the Classification model, Logistic Regression model, Recursive Feature Elimination Algorithm, Cirrhosis Mortality model, Extreme Gradient Boosting, Random Forest, Decision Tree have been utilized by different researchers to predict the life expectancy of Hepatitis B patients. Some algorithms and models showed very interesting and proving results whereas some were not that good. Area Under Curve analysis was used to assess the estimation of various models. The AUROC value of the PSO model was minimal, while the ADT model had the highest accuracy. XGBoost showed appropriate predictive performance. All other models showed good calibration.
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...Bigfinite
Maximize Your Understanding of Operational Realities in Manufacturing with Predictive Insights using Big Data, Artificial Intelligence, and Pharma 4.0
by Toni Manzano, PhD, Co-founder and CSO, Bigfinite
PDA Annual Meeting 2020
Discover our students' innovative project on breast cancer prediction using artificial intelligence techniques. This project leverages advanced analytics algorithms to analyze medical data and predict the likelihood of breast cancer in patients. Gain insights into early detection methods, risk factors, and the potential impact on healthcare outcomes. To learn more, do checkout https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/.
2. Agenda
• Executive Summary
• Project Overview
• Process Involved
• Project Summary
• Data Summary Statistics
• Data Visualizations
• Feature Importance
• ML Methodologies
• Conclusion & Future Efforts
• Lessons Learned
3. Executive Summary
OBJECTIVE
• Design a machine learning model that can predict whether a person has Parkinson's disease.
APPROACH
• After data cleaning, use different machine learning classification techniques such as Decision Tree, Naïve Bayes, and Random Forest to develop a classification model.
RECOMMENDATION
• Identify the important features that can help detect Parkinson's disease early, before it worsens.
4. Project Overview & Background
• Issue: Parkinson's disease is the second most prevalent neurodegenerative disorder after Alzheimer's, affecting more than 10 million people worldwide. Symptoms include frozen facial features, slowness of movement, and tremor.
• Current accuracy: A study from the National Institute of Neurological Disorders and Stroke finds that early diagnosis is only 53% accurate.
• Goal: Design a machine learning model that can predict whether a person has Parkinson's disease based on speech attributes such as relative jitter, shimmer, and MFCC coefficients.
5. Process Involved

Data Cleaning
• What: Check for missing values; check for imbalance in the target variable.
• How: Use the fillna method to impute missing values. Check the distribution of the classes; if they are imbalanced, use synthetic oversampling or undersampling techniques (SMOTE, ADASYN) to create a balanced training dataset.

Feature Engineering
• What: Check categorical features, if any, for label encoding; check the correlation between independent features.
• How: Use the LabelEncoder method to encode categorical columns. Use the corr method to flag pairs with a Pearson coefficient greater than 0.8 or less than -0.8.

Data Visualization
• What: Visualize the target variable and its relationship with independent attributes; visualize important features after the feature-importance test.
• How: Use the seaborn and matplotlib libraries; distplots, histograms, and scatterplots help show how important features relate to the target variable.

Model Building
• What: Start with a basic tree-based model, then ensemble techniques, to predict the target variable; run cross validation and check feature importance, accuracy score, confusion matrix, etc.
• How: Use models from the sklearn library such as Decision Tree, Naïve Bayes, and Random Forest. Check the accuracy score, bias-variance trade-off, and confusion matrix to validate that the model is properly fitted.
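The cleaning and encoding steps above can be sketched in a few lines. This is a minimal illustration on a toy frame; the column names mirror the dataset's attributes but the values and the missing entry are invented for the example.

```python
# Minimal sketch of the cleaning/encoding steps: fillna imputation,
# label encoding, and a class-balance check (toy data, not the real set).
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "HNR15": [63.0, None, 65.2, 61.8],           # continuous, one gap
    "Gender": ["Man", "Woman", "Man", "Woman"],  # categorical
    "Status": [0, 1, 0, 1],                      # target: 0=Healthy, 1=PD
})

# Impute the missing continuous value (here: column mean via fillna)
df["HNR15"] = df["HNR15"].fillna(df["HNR15"].mean())

# Label-encode the categorical column
df["Gender"] = LabelEncoder().fit_transform(df["Gender"])

# Check class balance before deciding whether SMOTE/ADASYN is needed
balance = df["Status"].value_counts(normalize=True)
print(balance)
```

If the balance check showed a skew, the SMOTE or ADASYN step from the table would be applied to the training split only.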
6. Project Summary

Overall Results – Before Hyper-parameter Tuning & CV

Decision Tree — Accuracy: 100%
    Predicted:   0    1
    Actual 0:   39    0
    Actual 1:    0   33

Random Forest — Accuracy: 94.4%

Gaussian Naïve Bayes — Accuracy: 87.5%
    Predicted:   0    1
    Actual 0:   38    1
    Actual 1:    8   25

Overall Results – After Hyper-parameter Tuning & CV

Decision Tree — Accuracy: 100% (overfitting)

Gaussian Naïve Bayes — Accuracy: 87.5% (lower than Random Forest)

Random Forest — Accuracy: 97.2%
    Predicted:   0    1
    Actual 0:   39    0
    Actual 1:    2   31
    CV technique: Grid Search; Bootstrap: True; Impurity criterion: Gini
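The grid search described above can be sketched as follows. The data here is synthetic (sklearn's make_classification stands in for the Parkinson's features), and the parameter grid is an illustrative subset covering the bootstrap and impurity-criterion settings the slide reports.

```python
# Sketch of hyper-parameter tuning a Random Forest with Grid Search CV
# on synthetic data; the grid mirrors the settings named on the slide.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=240, n_features=20, random_state=42)

param_grid = {
    "criterion": ["gini", "entropy"],  # impurity criterion
    "bootstrap": [True, False],
    "n_estimators": [50, 100],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```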
7. Understanding the Data
1. ID: Subject's identifier.
2. Recording: Number of the recording.
3. Status: 0 = Healthy; 1 = PD.
4. Gender: 0 = Man; 1 = Woman.
5. Pitch local perturbation measures: relative jitter (Jitter_rel), absolute jitter (Jitter_abs), relative average perturbation (Jitter_RAP), and pitch perturbation quotient (Jitter_PPQ).
6. Amplitude perturbation measures: local shimmer (Shim_loc), shimmer in dB (Shim_dB), 3-point amplitude perturbation quotient (Shim_APQ3), 5-point amplitude perturbation quotient (Shim_APQ5), and 11-point amplitude perturbation quotient (Shim_APQ11).
7. Harmonic-to-noise ratio measures: harmonic-to-noise ratio in the frequency band 0-500 Hz (HNR05), in 0-1500 Hz (HNR15), in 0-2500 Hz (HNR25), in 0-3500 Hz (HNR35), and in 0-3800 Hz (HNR38).
8. Mel frequency cepstral coefficient-based spectral measures of order 0 to 12 (MFCC0, MFCC1, ..., MFCC12) and their derivatives (Delta0, Delta1, ..., Delta12).
9. Recurrence period density entropy (RPDE).
10. Detrended fluctuation analysis (DFA).
11. Pitch period entropy (PPE).
12. Glottal-to-noise excitation ratio (GNE).
8. Data & Summary Statistics
• We used Parkinson's disease data for this project. It has 48 attributes for determining whether a patient has the disease.
• Number of records: 240
• Number of attributes: 48
• Key attributes: Delta3, MFCC4, HNR15

Key Attribute   Mean    Std     Comments
Delta3          1.34    0.19    Derivative of an MFCC; one of the key attributes
MFCC4           1.355   0.21    MFCC of order 4
HNR15           63.67   15.62   Harmonic-to-noise ratio measure (0-1500 Hz band)
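The summary statistics above can be reproduced with pandas' describe. The frame below uses the slide's column names but synthetic values drawn to match the reported mean/std, purely for illustration.

```python
# Computing per-attribute mean/std with pandas.describe on synthetic
# data; column names match the slide, values are generated.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Delta3": rng.normal(1.34, 0.19, 240),
    "MFCC4": rng.normal(1.355, 0.21, 240),
    "HNR15": rng.normal(63.67, 15.62, 240),
})
stats = df.describe().loc[["mean", "std"]]
print(stats.round(2))
```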
9. Data Visualizations
Status vs HNR15: People who have Parkinson's disease have a relatively low harmonic-to-noise ratio measure in the 0-1500 Hz frequency band.
Status vs Shim_loc: People who have Parkinson's disease have relatively high local shimmer.
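A Status-vs-feature plot like the ones above can be sketched with matplotlib alone (the deck also used seaborn). The two groups here are synthetic draws chosen so that the PD group has the lower HNR15, matching the observation on the slide.

```python
# Box plot of a feature split by Status (0=Healthy, 1=PD), headless,
# on synthetic data illustrating the lower-HNR15 pattern for PD.
import matplotlib
matplotlib.use("Agg")  # no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
hnr_healthy = rng.normal(70, 10, 120)  # Status = 0
hnr_pd = rng.normal(55, 10, 120)       # Status = 1: lower HNR15

fig, ax = plt.subplots()
ax.boxplot([hnr_healthy, hnr_pd])
ax.set_xticklabels(["Healthy (0)", "PD (1)"])
ax.set_ylabel("HNR15")
ax.set_title("Status vs HNR15")
fig.savefig("status_vs_hnr15.png")
```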
10. Data Visualizations - Multicollinearity
Checking the collinearity of different features: most of the Mel frequency cepstral coefficients (MFCCs) are highly correlated with the harmonic-to-noise ratio measures. We removed features whose Pearson coefficient is higher than 0.8. No features show a very high negative correlation.
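The |Pearson r| > 0.8 filter above can be sketched with pandas' corr. The three columns here are synthetic: two are constructed to be highly correlated (standing in for an MFCC/HNR pair) and one is independent.

```python
# Dropping one column of each highly correlated pair (|r| > 0.8),
# using the upper triangle of the correlation matrix; toy data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
base = rng.normal(size=200)
df = pd.DataFrame({
    "MFCC3": base,
    "HNR05": base * 0.95 + rng.normal(scale=0.1, size=200),  # ~r=0.99
    "Delta3": rng.normal(size=200),                          # independent
})

corr = df.corr().abs()
# Keep the upper triangle so each pair is inspected exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)
```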
11. Outlier Detection
• Removed outliers using z-scores, with a threshold of 3 standard deviations.
• Three observations had multiple columns with z-scores greater than 5, so those were removed.
• After outlier detection and removal, 237 records remained.
• Re-running the Random Forest model after removing outliers did not give better results in terms of precision and accuracy; the number of false positives increased, which is worse for healthcare problems.

Before outlier removal: Accuracy 94.4%, False Positives: 0
After outlier removal: Accuracy 93.5%, False Positives: 1
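The z-score filter described above can be sketched as follows; the data is a synthetic single-column frame with three planted extreme rows, standing in for the three removed observations.

```python
# Removing rows whose z-score exceeds 3 on any column; toy data with
# three planted extremes that should be filtered out.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
values = rng.normal(0, 1, 240)
values[:3] = [8.0, -7.5, 9.1]          # plant three extreme rows
df = pd.DataFrame({"Delta3": values})

z = (df - df.mean()) / df.std()
mask = (z.abs() < 3).all(axis=1)       # keep rows within 3 std everywhere
clean = df[mask]
print(len(df), "->", len(clean))
```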
12. Feature Importance
Based on sklearn's feature_importances_, the top 5 features were:
• Delta3
• MFCC3
• MFCC9
• MFCC8
• HNR05
The Mel frequency cepstral coefficients of orders 3, 8, and 9 are important in determining whether a person has Parkinson's disease or is healthy.
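The ranking above comes from a fitted forest's feature_importances_ attribute; a sketch on synthetic data (generic feat0..feat7 names, not the real speech attributes):

```python
# Ranking features by RandomForest feature_importances_ on synthetic
# data; importances sum to 1, and the top-5 slice mirrors the slide.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=240, n_features=8, n_informative=3,
                           random_state=0)
names = [f"feat{i}" for i in range(8)]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranking = sorted(zip(names, model.feature_importances_),
                 key=lambda t: t[1], reverse=True)
top5 = [name for name, _ in ranking[:5]]
print(top5)
```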
13. Machine Learning Methodology

Decision Tree
• Why: A Decision Tree classifier splits on feature thresholds to classify the class instances.
• Accuracy: Decision Tree accuracy was 100%; clearly it was overfitted.
• Conclusion: Since the Decision Tree was clearly overfitted, we tried Naïve Bayes, which was not overfitted and gave 87.5%.
• Comments: Pre-pruning and post-pruning could minimize overfitting and create a more accurate model.

Random Forest
• Why: Random Forest uses a bagging (bootstrap aggregation) approach with multiple trees, which improves the model and gives better results.
• Accuracy: Because Random Forest averages many randomized trees, accuracy was 94% without cross validation and 97% after cross validation.
• Conclusion: Random Forest gave 97.2%, with 31 true positives and 39 true negatives.
• Comments: Random Forest was the best model of the three ML algorithms we ran, based on accuracy, TPs, and FPs.
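The three-model comparison above can be sketched end-to-end on synthetic data: fit each classifier, then report accuracy and the confusion matrix, as in the project summary.

```python
# Comparing DecisionTree, GaussianNB, and RandomForest on synthetic
# data with accuracy and confusion matrices (illustrative, not the
# project's actual numbers).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=240, n_features=20, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=7, stratify=y)

results = {}
for name, model in [("DecisionTree", DecisionTreeClassifier(random_state=7)),
                    ("GaussianNB", GaussianNB()),
                    ("RandomForest", RandomForestClassifier(random_state=7))]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = accuracy_score(y_te, pred)
    print(name, round(results[name], 3), confusion_matrix(y_te, pred).tolist())
```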
14. Conclusion and Future Efforts
• Random Forest was the best of the three ML models.
• Cross validation using Grid Search improved the accuracy by finding the best parameters.
• We could further optimize the Decision Tree using pre-pruning or post-pruning techniques to minimize overfitting.
• We could use other ML techniques such as XGBoost and LightGBM, which could give more accurate results because of the learning rate they use when training each tree.
• In further iterations of the model, we could check skewness and re-examine outliers to further filter the dataset and get more accurate results.
15. Lessons Learned
• Labelled data:
Having enough samples is key to any ML algorithm, but the healthcare industry in particular faces challenges around patient populations and patient data. Even when data is available, it is often protected by laws and regulations. In this project, models were built with around 240 records, which is quite small for making predictions and deploying them for practical purposes. To address the patient-data challenge, the National Institutes of Health runs multiple programs via the CTSA grant to build patient population databases and support research through informatics and AI.
• Domain/functional knowledge:
Any data science project depends on the domain knowledge of the data scientist. Even with millions of records and thousands of attributes, domain knowledge is critical for forming hypotheses, for example, that high blood pressure drives the risk of heart attack. Lack of domain knowledge was another challenge in this project, which I realized about 30% of the way into the work.