This document presents a new algorithm called UDT-CDF for building decision trees that classify uncertain numerical data. It improves on previous algorithms such as UDT, which are based on probability density functions (PDFs). The key aspects of the new algorithm are:
1. It uses cumulative distribution functions (CDFs) rather than PDFs to represent uncertain numerical attributes, since CDFs provide more complete probability information.
2. It splits data at decision tree nodes based on the CDF, placing tuples whose value range covers the split point into both branches, weighted by the CDF (a sketch follows this list).
3. Experimental results show the new CDF-based algorithm achieves more accurate classifications and is more computationally efficient than the PDF-based UDT algorithm.
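A minimal sketch of the CDF-weighted split described in point 2, assuming each uncertain value is modeled by a Gaussian; the function name and the Gaussian assumption are illustrative, not the paper's exact formulation:

```python
from scipy.stats import norm

def split_fractions(mean, std, split_point):
    """For an uncertain value modeled as N(mean, std), return the
    probability mass falling left and right of the split point,
    taken directly from the CDF."""
    left = norm.cdf(split_point, loc=mean, scale=std)  # P(x <= split_point)
    return left, 1.0 - left

# Example: a tuple whose attribute is uncertain around 5.0 (std 1.5),
# split at 6.0 -> most of its weight goes to the left branch.
w_left, w_right = split_fractions(5.0, 1.5, 6.0)
print(f"left weight {w_left:.3f}, right weight {w_right:.3f}")
```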
Hypothesis on Different Data Mining Algorithms (IJERA Editor)
In this paper, different classification algorithms for data mining are discussed. Data mining is about explaining the past and predicting the future by means of data analysis. Classification is a data mining task that categorizes data based on numerical or categorical variables. Many algorithms have been proposed for classification; five of them are comparatively studied here. There are four different classification approaches, namely Frequency Table, Covariance Matrix, Similarity Functions, and Others. As part of this research on classification methods, the Naive Bayes, K-Nearest Neighbors, Decision Tree, Artificial Neural Network, and Support Vector Machine algorithms are studied and examined using benchmark datasets such as Iris and Lung Cancer.
Analysis On Classification Techniques In Mammographic Mass Data Set (IJERA Editor)
Data mining, the extraction of hidden information from large databases, is used to predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. Data mining classification techniques deal with determining which group each data instance is associated with. They can handle a wide variety of data, so large amounts of data can be involved in processing. This paper analyzes various data mining classification techniques, such as Decision Tree Induction, Naïve Bayes, and k-Nearest Neighbour (KNN) classifiers, on the mammographic mass dataset.
This document discusses techniques for fast decision tree learning on microarray data. It introduces using attribute histograms to speed up the process of finding the best split points for decision tree learning. It also discusses optimizations for speeding up leave-one-out cross validation by reusing subtrees from previous runs. Experimental results on three microarray datasets show speedups of 150-400% from these techniques. Attribute pruning based on histogram indices is also introduced to further improve speed without loss of accuracy.
The delegation visited the evicted Dale Farm Travellers site one year after the eviction to assess conditions. They found around 20-30 caravans still parked at the road entrance, with residents living in poor conditions, a lack of services, and health and sanitation concerns. Residents expressed worries about another winter in such conditions. Their health needs had not been fully met since the eviction, and midwife and health visitor services were suspended or reluctant to visit the site. The Environment Agency also tested the evicted site for asbestos and other pollutants, indicating a risk to public health from the excavation works.
This document is about educational robotics. It explains that educational robotics uses robots to develop practical and didactic skills in students. It also describes an educational robot called EducaBot, which can be programmed to perform basic movements and routes, and later be programmed to gather information from its environment and respond to its sensors. The final goal is to teach concepts of mechanics, electronics, computing, and control through building and programming robots.
Correct sitting posture: operating instructions for office chairs used together with office desks. For further information, please contact us or visit http://www.wilhelm-schuster.de
Henry Ford introduced the Model T car to make automobiles affordable for everyday Americans. The Model T was produced cheaply using assembly-line manufacturing and the standardization of parts. This reduced costs and allowed Ford to keep selling the cars at low prices between 1909 and 1928. Mass production and standardization stimulated the economy by creating jobs in related industries such as steel, oil, and rubber. As more Americans could now afford cars, this launched an economic cycle of prosperity.
This manual is useful and indispensable for using the CRAN "Package TesSurvRec_1.2.1". It is relevant for statisticians, physicians, pharmacists, insurers, banks, engineers, psychologists, and astronomers, among other professions. These are statistical tests used to measure differences between survival-analysis functions of population groups that exhibit recurrent events.
Large amounts of heterogeneous medical data have become available in various healthcare organizations (payers, providers, pharmaceuticals). Those data could be an enabling resource for deriving insights for improving care delivery and reducing waste. The enormity and complexity of these datasets present great challenges in analyses and subsequent applications to a practical clinical environment. More details are available here http://dmkd.cs.wayne.edu/TUTORIAL/Healthcare/
This chapter discusses different techniques for exploring and visualizing data to better understand its characteristics. It describes the different types of data objects and attributes as well as basic statistical measures like mean, median, and standard deviation that can characterize a dataset's central tendency and dispersion. Visualization techniques covered include histograms, boxplots, scatterplots, parallel coordinates, Chernoff faces, and landscapes that can reveal patterns, relationships, and outliers in the data.
This chapter discusses different techniques for exploring and visualizing data to better understand its characteristics. It describes the different types of data objects and attributes as well as basic statistical measures like mean, median, variance, and standard deviation that can characterize a dataset's central tendency and dispersion. Visualization techniques covered include histograms, boxplots, scatterplots, parallel coordinates, Chernoff faces, and landscapes that can reveal patterns, relationships, and outliers in the data.
Data mining is the process of discovering patterns in large data sets, involving methods at the intersection of machine learning, statistics, and database systems.
This document provides an overview of Bayesian networks through a 3-day tutorial. Day 1 introduces Bayesian networks and provides a medical diagnosis example. It defines key concepts like Bayes' theorem and influence diagrams. Day 2 covers propagation algorithms, demonstrating how evidence is propagated through a sample chain network. Day 3 will cover learning from data and using continuous variables and software. The overview outlines propagation algorithms for singly and multiply connected graphs.
This chapter discusses getting to know data through analysis and visualization. It covers data objects and attribute types, statistical descriptions of data including measures of central tendency and dispersion, visualization techniques like histograms and scatter plots, and measuring similarity between data objects. The goal is to better understand data characteristics before applying more advanced mining techniques.
This chapter discusses getting to know your data through data mining concepts and techniques. It covers data objects and attribute types, basic statistical descriptions of data like mean and standard deviation, visualizing data through histograms and scatter plots, measuring data similarity, and different types of data sets. The goal is to provide qualitative overviews and insights into data to find patterns, trends, relationships and irregularities.
A large data set is not available for some diseases, such as brain tumors. This presentation and part 2 show how to find an actionable solution from a difficult cancer dataset.
Jiawei Han, Micheline Kamber and Jian Pei
Data Mining: Concepts and Techniques, 3rd ed.
The Morgan Kaufmann Series in Data Management Systems
Morgan Kaufmann Publishers, July 2011. ISBN 978-0123814791
Statistics is the science of dealing with numbers and data. It involves collecting, summarizing, presenting, and analyzing data. There are four main steps: data collection, summarization by removing unwanted data and classifying/tabulating, presentation with diagrams/graphs/tables, and analysis using measures like average, dispersion, and correlation. Descriptive statistics summarize and describe data, while inferential statistics allow generalizing from samples to populations. Common descriptive statistics include measures of central tendency (mean, median, mode), variability (range, variance, standard deviation), and distribution properties. Inferential statistics techniques like hypothesis testing and ANOVA are used to make inferences about populations based on samples.
This document discusses data and data preprocessing in data mining. It defines what data is, including data objects and attributes. It describes different attribute types like nominal, binary, ordinal, interval-scaled and ratio-scaled numeric attributes. It also discusses measuring the central tendency of data using the mean, median and mode. Additionally, it covers measuring data distribution through variance, standard deviation and z-scores. Finally, it briefly introduces measuring data similarity and dissimilarity, as well as an overview of data preprocessing.
Data mining techniques in data mining with examples (mqasimsheikh5)
This document provides an overview of data mining concepts and techniques for understanding data. It discusses different types of data sets and attributes, basic statistical descriptions for analyzing data distributions and outliers, various data visualization techniques for exploring patterns and relationships, and measures for determining data similarity and dissimilarity.
This document presents a technique for retrieving contextually relevant prior radiology reports to help radiologists with diagnosis. It uses a semantic vector approach with an ontology to capture relationships between concepts in reports. Explicit feedback from radiologists is also incorporated using an algorithm to personalize relevance. An evaluation with domain experts found the semantic approach improved retrieval over baselines. Analysis of report similarities confirmed the approach increased differences for related reports while decreasing them for unrelated reports. Overall, the technique aims to better identify useful information from prior exams to support radiologists.
Heart Disease Prediction Using Data Mining Techniques (IJRES Journal)
There are huge amounts of data in the medical industry that are not processed properly and hence cannot be used effectively in making decisions. Data mining techniques can be used to mine these patterns and relationships. This research has developed a prototype heart disease prediction system using data mining techniques, namely Neural Networks, K-Means Clustering, and Frequent Item Set Generation. Using medical profiles such as age, sex, blood pressure, and blood sugar, it can predict the likelihood of patients getting heart disease. It enables significant knowledge to be established, e.g. patterns and relationships between medical factors related to heart disease. The performance of these techniques is compared through sensitivity, specificity, and accuracy. It has been observed that Artificial Neural Networks outperform K-Means clustering on all parameters, i.e. sensitivity, specificity, and accuracy.
Identification of Differentially Expressed Genes by Unsupervised Learning Method (praveena06)
Abstract: Microarrays are one of the latest breakthroughs in experimental molecular biology, allowing the expression of tens of thousands of genes to be monitored in parallel. Microarray analysis includes many stages. Extracting samples from the cells, obtaining the gene expression matrix from the raw data, and data normalization are low-level analysis. Cluster analysis of genome-wide expression data from DNA microarrays is described as a high-level analysis that uses standard statistical algorithms to arrange genes according to similarity in their expression patterns. This paper presents a method for determining the number of clusters using divisive hierarchical clustering and k-means clustering of significant genes. The goal of this method is to identify genes that are strongly associated with disease among 12,607 genes. Gene filtering is applied to identify the clusters. k-means shows that about four to seven genes, or less than one percent of the genes, account for the disease group (these are the outliers), while more than seventy percent fall into an undefined group. The hierarchical clustering dendrogram shows clusters at two levels, which again shows that less than one percent of the genes are differentially expressed.
The document describes a lab experiment analyzing gene expression data from human fibroblasts in response to serum using microarray analysis. The aims are to analyze the gene expression data using Excel and the ArrayTrack workbench. Key steps include importing microarray data into Excel and pre-treating the data by centering and scaling. ArrayTrack is then used to analyze the data through descriptive statistics, exploring gene expression profiles of gene lists, and using the significance analysis of microarrays (SAM) tool. Additional online databases like Gene Atlas and ArrayExpress are queried to find expression profiles and experimental data for a specific gene, APT13A2, under different conditions.
Multivariate data analysis and visualization tools for biological data (Dmitry Grapov)
This document discusses various tools for analyzing and visualizing multivariate biological data. It describes univariate, bivariate, and multivariate analysis methods. Univariate analysis examines one variable at a time, bivariate examines two variables jointly, and multivariate examines multiple variables together. Dimensionality reduction techniques like principal component analysis (PCA) and partial least squares (PLS) projection can be used to visualize high-dimensional data. Networks can represent relationships among objects and identify patterns in complex data. Integrative modeling approaches provide a holistic view of biological systems from multivariate data.
Comparative Analysis of Weighted Emphirical Optimization Algorithm and Lazy C... (IIRindia)
Health care produces millions of records, and discovering the essential data in them is important. In data mining, the discovery of hidden information can be innovative and useful for many requirements in forecasting, patient behavior, executive information systems, and e-governance, where data mining tools and techniques play a vital role. In the Parkinson's health care domain, the hidden concepts predict the likelihood of the disease and also identify the important feature attributes. Explicit patterns are converted to implicit ones by applying various algorithms, i.e. association, clustering, and classification, to realize the full potential of the medical data. In this research work, the Parkinson's dataset has been used with different classifiers to estimate accuracy, sensitivity, specificity, kappa, and ROC characteristics. The proposed weighted empirical optimization algorithm is compared with other classifiers and found to be efficient in terms of accuracy and other related measures. The proposed model exhibited a top accuracy of 87.17% with a robust kappa statistic, and the ROC degree indicated the strong stability of the model compared to other classifiers. The total penalty cost generated by the proposed model is also lower than the penalty cost of the other classifiers, in addition to its accuracy and other performance measures.
M Sc Thesis Presentation Eitan Lavi
1. Medical Engineering Data Analysis Framework for Clinical Decision Support for Pediatrics Neuro-Development Disorders. Eitan Lavi. Advisors: Prof. Shmuel Einav, Biomedical Engineering Department, Tel-Aviv University; Prof. Yuval Shahar, Department of Information Systems Engineering, Ben-Gurion University; Dr. Mitchell Schertz, Institute for Child Development, Kupat Holim Meuhedet, Central Region, Herzeliya
3. NDD – Current Clinical Practice. Diagnosis is mainly performed based on an external evaluation of the child. The pediatrician relies only on his or her own (available memory of) past experience. Human ability to retrieve prior experience in an unbiased, complete and objective fashion is inadequate. Hence the need for an experience-based decision support system.
6. Problem Space. Institute for Child Development, Kupat Holim Meuhedet, Central Region. Collaborating physician – Dr. Mitchell Schertz, head of the institute, has been building a case-base since 2000. The case base currently holds 1941 non-active children, 2477 active children, and 8022 cases. Much of the case information is in free-text form => making this also a TCBR project.
7. Building the Data Set. Source data tables (n x m = # observations x # attributes): 465 x 3, 1474 x 60, 4582 x 19, 1143 x 69, 437 x 3, 5107 x 2, 1560 x 43, 13133 x 2, 8022 x 153, 4826 x 2, 5227 x 3, 6861 x 1.
8. Preprocessing and Transformations. X (case-base): 8022 neuro ids x 182 attributes, with attribute types mapped to Date, Numeric, Binary, Textual, Factor, Dirty Factor and Free Text. Y (diagnoses): 8022 neuro ids x 235 diagnoses, i.e. a binary diagnoses vector for each case i.
10. Similarity Metrics. Date distance = month gap. Numeric/Binary distance = normalized Euclidean. NA distance = 0.5 if both fields are NA, -0.5 otherwise. Textual distance = cosine similarity of Latent Semantic Analysis (LSA) derived document vectors.
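A small sketch of these four per-attribute-type distance rules; function names are hypothetical and the thesis's normalization details are not shown here, so the numeric case is reduced to a one-dimensional stand-in:

```python
import math
from datetime import date

def date_distance(d1, d2):
    """Month gap between two dates, as on the slide."""
    return abs((d1.year - d2.year) * 12 + (d1.month - d2.month))

def numeric_distance(x, y):
    """Absolute difference of values already normalized to [0, 1]
    (a 1-D stand-in for the normalized Euclidean distance)."""
    return abs(x - y)

def na_distance(x_is_na, y_is_na):
    """Per the slide: 0.5 if both fields are NA, -0.5 otherwise."""
    return 0.5 if (x_is_na and y_is_na) else -0.5

def cosine_similarity(u, v):
    """Cosine similarity between two LSA document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

print(date_distance(date(2007, 3, 1), date(2005, 11, 1)))  # 16 months
```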
11. LSA – Advantages. Strictly mathematical approach, inherently independent of language. Able to perform cross-linguistic concept searching and example-based categorization. Automatically adapts to new and changing terminology. Has been shown to be very tolerant of noise. Deals effectively with sparse, ambiguous and contradictory data. Text doesn't have to be in sentence form.
13. Weighted Term-Document Matrix A. Local term weight: l_ij – the relative frequency of term i in document j. Global term weight: g_i – the relative frequency of term i within the entire corpus.
30. Example of diagnosis prediction scores for a specific {test case, retrieval method, K value} combination. In actuality, 32 such graphs were generated for each of the 350 test cases. The real diagnoses for this test case were: (1) DELAY IN DEVELOPMENTAL MILESTONES, (2) GROSS MOTOR, (3) NORMAL EARLY INTELLIGENCE. (Slides 31–33 repeat this example.)
34. Prediction evaluation matrix for a specific test case and retrieval method. For each test case, prediction vectors were generated using 8 retrieval & prediction methods, for 8 different K values (total 64 per test case).
36. SAR = 1/3 * (Accuracy + Area under the ROC curve + (1 - Root mean-squared error)) = a score combining performance measures with different characteristics, in an attempt to create a more "robust" measure (cf. Caruana R., ROCAI 2004).
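A one-line sketch of the SAR score as commonly defined in Caruana's work, with RMSE entered as (1 - RMSE) so that higher is better; the function name is hypothetical:

```python
def sar(accuracy, auc, rmse):
    """SAR combines accuracy, ROC area and (1 - RMSE) into one
    robust score (cf. Caruana, ROCAI 2004); all inputs in [0, 1]."""
    return (accuracy + auc + (1.0 - rmse)) / 3.0

print(round(sar(0.82, 0.88, 0.35), 3))  # 0.783
```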
37. F measure – Can help to dynamically choose threshold
45. Thank You. Prof. Shmuel Einav, Prof. Yuval Shahar, Prof. Oded Maimon, Dr. Mitchell Schertz, The Yitzhak and Chaya Weinstein Research Institute.
Editor's Notes
The NDD specialists often don't follow any preset rules or logical algorithms in making their decisions, and thus the field of Machine Learning is a natural realm from which to approach the classification task at hand:
The required task is more complex than the primary classification types discussed above, and can be termed multi-class, multi-label classification (predict one or more classes from a pool of multiple classes). Another important distinction in the NDD domain is that the NDD specialist is the one producing the mapping from features to diagnoses, through his diagnosis decisions, which are imperfect, inaccurate and inconsistent [9]. Since NDD is a domain lacking a deep clinical understanding or a clear knowledge structure, the physician hasn't necessarily labeled the cases in the case-base with the "correct" classes, nor is it guaranteed that highly similar cases will be given similar diagnoses [2],[9]. We are therefore looking to incorporate some aspects of the supervised approach (utilizing the outputs of prior cases in predicting an output for a new case), without needing to fully deduce a general function mapping from the input objects to the output space (which would rely completely on the outputs' integrity). Moreover, we are also looking to incorporate some aspects of the unsupervised approach, primarily the ability to discover patterns and clinically similar groups in the case-base without using any prior knowledge of how the NDD specialists decided to label (diagnose) each case. This allows us to find, for each new case, the cases most clinically similar to it. Basically, we use clustering (unsupervised learning) to find the cases clinically similar to a test case, and then multi-label, multi-class classification (supervised learning) only on the retrieved similar cases, using their physician-given labels to make a prediction.
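A rough sketch of this retrieve-then-reuse idea: retrieve the most similar prior cases, then take a similarity-weighted vote over their physician-given labels. The names and the weighting scheme are illustrative assumptions, not the thesis's exact Reuse & Adapt methods:

```python
def predict_diagnoses(similarities, label_matrix, k=10):
    """Retrieve the k most similar prior cases and score each diagnosis
    by the similarity-weighted fraction of those neighbours that carry it.

    similarities : list of (case_id, similarity to the new case)
    label_matrix : dict case_id -> set of diagnosis codes
    """
    neighbours = sorted(similarities, key=lambda cs: cs[1], reverse=True)[:k]
    total = sum(sim for _, sim in neighbours) or 1.0
    scores = {}
    for case_id, sim in neighbours:
        for dx in label_matrix[case_id]:
            scores[dx] = scores.get(dx, 0.0) + sim / total
    return scores  # per-diagnosis scores in [0, 1]

sims = [("c1", 0.9), ("c2", 0.7), ("c3", 0.2)]
labels = {"c1": {"GROSS MOTOR"}, "c2": {"GROSS MOTOR", "DELAY"}, "c3": {"NORMAL"}}
print(predict_diagnoses(sims, labels, k=2))
```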
Preprocessing the X matrix – attribute types: (a) empty: fewer than 8 non-NA entries → feature removed; (b) date: regex match on more than 90% of the feature column entries (allowing for non-pattern dates) → transformed into 2 new attributes, month and year; (c) numeric: coercion to character and back to numeric; if this produces fewer than 16 NAs, the feature is termed numeric and kept in its numeric coercion; (d) binary: multiple conditions plus fuzzy detection → consolidated to a single form of "true" and a single form of "false"; (e) clean factor: under 25 categories (and no match for the previous feature types) → no action; (f) dirty factor: no match for the previous types and average string length under 20 characters → the 20 most frequent levels remain, the rest are termed "misc levels"; (g) free text: no match for the previous types → no action. Missing data is conformed to NA status. Preprocessing the Y matrix: originally in .mdb format – each row was a general id, one feature (column) gave the respective neuro id (there could be two rows with the same neuro id), and for each type of diagnosis a comment or numeric marker was given in a respective feature → converted to a binary diagnosis vector.
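A sketch of the attribute-type inference rules listed above, written in Python for illustration (the original preprocessing appears to use factor/NA semantics closer to R); the date regex, true/false forms and helper names are assumptions:

```python
import re

DATE_RE = re.compile(r"^\d{1,2}[./-]\d{1,2}[./-]\d{2,4}$")
TRUE_FORMS, FALSE_FORMS = {"true", "yes", "1", "y"}, {"false", "no", "0", "n"}

def _is_number(v):
    try:
        float(v)
        return True
    except (TypeError, ValueError):
        return False

def infer_attribute_type(values):
    """Classify one attribute column using the thresholds from the note."""
    non_na = [v for v in values if v not in (None, "", "NA")]
    if len(non_na) < 8:
        return "empty"
    if sum(bool(DATE_RE.match(str(v))) for v in non_na) > 0.9 * len(non_na):
        return "date"          # later split into month / year attributes
    coercible = sum(_is_number(v) for v in non_na)
    if len(non_na) - coercible < 16:
        return "numeric"
    lowered = {str(v).strip().lower() for v in non_na}
    if lowered <= (TRUE_FORMS | FALSE_FORMS):
        return "binary"        # consolidated to one true / one false form
    if len(set(non_na)) < 25:
        return "clean factor"
    if sum(len(str(v)) for v in non_na) / len(non_na) < 20:
        return "dirty factor"  # keep 20 most frequent levels, rest -> misc
    return "free text"

print(infer_attribute_type(["12/05/2003", "01/06/2004", "NA"] * 4))  # date
```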
* In this study, the weights were automatically calculated using a simple algorithmic approach. * Other studies have used a domain-specific ontology to give different weights to different terms, according to their clinical significance.
* p_ij = relative probability. Each such entropy is further normalized by log(n), n being the length of the corpus (the number of documents). This normalization was originally devised to give equal treatment to corpuses of different sizes, but since in this project all textual attributes contain the same N.cases number of documents, this has little effect. A possible improvement for future versions is to replace this with a local normalization by the length of the document, so that the summed entropies are normalized with respect to document length. The entropy is a measure of how dispersed the use of the term is across the corpus. In the end, the sum of all p_ij equals 1, but the better dispersed they are across the corpus, the higher the term's global score.
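For illustration, a sketch of the common log-entropy global weight, g_i = 1 + sum_j p_ij * log(p_ij) / log(n); the thesis's exact normalization may differ from this standard form:

```python
import math

def global_entropy_weights(counts):
    """counts[i][j] = raw count of term i in document j.
    Returns the common log-entropy global weight
    g_i = 1 + sum_j p_ij * log(p_ij) / log(n), with n documents:
    terms spread evenly over the corpus end up with weights near 0."""
    n = len(counts[0])
    weights = []
    for term_counts in counts:
        total = sum(term_counts) or 1
        entropy = 0.0
        for c in term_counts:
            if c:
                p = c / total
                entropy += p * math.log(p)
        weights.append(1.0 + entropy / math.log(n))
    return weights

# A term concentrated in one document keeps weight 1.0;
# a term spread over all documents drops toward 0.
print(global_entropy_weights([[6, 0, 0], [2, 2, 2]]))  # [1.0, 0.0]
```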
This is used in the VECTOR SPACE MODEL
Empirical studies show that truncating the lower singular values can enact noise reduction, and thus the algorithms transformed all singular values in S below a certain threshold (set at 10^-3) to 0.
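A minimal sketch of that truncation step using a plain SVD; the matrix contents and the rebuild step are illustrative only:

```python
import numpy as np

def truncate_svd(A, threshold=1e-3):
    """Zero out singular values below the threshold (the note uses 10^-3)
    and rebuild the weighted term-document matrix for noise reduction."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_trunc = np.where(s < threshold, 0.0, s)
    return U @ np.diag(s_trunc) @ Vt

A = np.random.default_rng(0).random((5, 4))
print(np.round(truncate_svd(A), 3))
```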
Attribute clinical weights: High similarity – if more than 80% of cases have similarity to the test case above 0.8, divide the weight by 2. Average input length in a textual attribute – if over 30 characters, multiply by 2. Test case value for the attribute is NA – divide by 3.
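These three adjustments can be read as a small rule set; a sketch with a hypothetical signature:

```python
def adjust_attribute_weight(base_weight, share_high_sim, avg_text_len, test_value_is_na):
    """Apply the three heuristics from the note to one attribute's weight:
    halve it when >80% of cases are >0.8 similar to the test case,
    double it for long free-text inputs (>30 characters on average),
    and divide it by 3 when the test case itself has no value."""
    w = base_weight
    if share_high_sim > 0.8:
        w /= 2
    if avg_text_len > 30:
        w *= 2
    if test_value_is_na:
        w /= 3
    return w

print(adjust_attribute_weight(1.0, share_high_sim=0.9, avg_text_len=12, test_value_is_na=True))
```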
In choosing the test cases, however, the distribution of diagnoses in the Y matrix was examined. Inspecting the prior probabilities of diagnoses in the case base shows that there are several diagnoses which occur only once in the entire Y matrix, while others occur in the singles, tens, hundreds and thousands.
350 case indexes were in the final subset. For each test case, a diagnoses probability prediction vector was output for each combination of <Retrieval Method (4 types), Reuse & Adapt Method (2 types), K value (8 values)>. That is, 64 diagnosis prediction probability vectors were generated for each test case.
The above 5 graph types were produced for each combination of K, Retrieval Method and Reuse Scheme (i.e. for the 8 X 4 X 2 = 64 distinct combinations). That is, 320 distinct graphs were produced to graphically assess the aggregated results for all test cases.
RMSE = sqrt(1/(P+N) * sum_i (y_i - ŷ_i)^2) = root-mean-squared error = summing, over all diagnoses (all i values), an aggregated normalized sum of the individual errors between the predictions and the real values of the diagnoses vector. For each diagnosis, the error can be either 0 if the prediction is correct or 1 if the prediction is wrong. Since the output of RMSE is just a cutoff-independent scalar, this measure cannot be combined with other measures into a parametric curve. Accuracy = P(ŷ = Y), estimated as (TP + TN)/(P + N) = the number of correct predictions divided by the total number of diagnoses predicted = the probability of the algorithm predicting correctly = the rate of correct predictions attained by the algorithm.
F measure = weighted harmonic mean of precision (P) and recall (R) = 1 / (alpha * 1/P + (1 - alpha) * 1/R) (van Rijsbergen, 1979); if alpha = 1/2, the mean is balanced. Sensitivity = Recall = TP rate = P(ŷ = + | Y = +), estimated as TP/P = true positive rate = the number of true positives divided by the number of overall positives in the real diagnoses vector from the Y matrix = the algorithm's probability of predicting correctly which diagnoses the patient does have. Precision = PPV = P(Y = + | ŷ = +), estimated as TP/(TP + FP) = positive predictive value = the number of true positives divided by the total number of diagnoses predicted by the algorithm as positive = the probability of a positive "1" prediction being correct.
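A compact sketch computing these per-case measures from binary predicted and real diagnoses vectors, following the definitions above (balanced F with alpha = 1/2); the function name is illustrative:

```python
import math

def evaluate(predicted, actual):
    """Per-case evaluation of a binary diagnoses prediction vector against
    the real diagnoses vector, using the definitions in the notes."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
    fp = sum(p and (not a) for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    n = len(actual)
    accuracy = (tp + tn) / n
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity / TP rate
    precision = tp / (tp + fp) if tp + fp else 0.0     # PPV
    f_balanced = (2 * precision * recall / (precision + recall)
                  if precision + recall else 0.0)      # alpha = 1/2
    rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)
    return accuracy, recall, precision, f_balanced, rmse

print(evaluate([1, 0, 1, 0, 0], [1, 1, 0, 0, 0]))  # (0.6, 0.5, 0.5, 0.5, ~0.632)
```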
P value of the AUC ROC: tests the null hypothesis that the area under the curve really equals 0.50. In other words, the P value answers this question: what is the probability of obtaining the observed AUC ROC (or higher) if the diagnosis algorithm were no better than flipping a coin?
Another reason for choosing ML as an approach for developing a CDSS in NDD is the need for future scalability – no need for per-clinic rule modifications.