NDD specialists often do not follow preset rules or logical algorithms when making their decisions, so Machine Learning is a natural field from which to approach the classification task at hand. The required task is more complex than the primary classification types discussed above, and can be termed Multi-Class, Multi-Label Classification (predict 1 or more classes from a pool of multiple classes).

Another important distinction in the NDD domain is that the NDD specialist is the one producing the function that maps features to diagnoses, through his or her diagnostic decisions, which are imperfect, inaccurate and inconsistent. Since NDD is a domain lacking a deep clinical understanding or a clear knowledge structure, the physician has not necessarily labeled the cases in the case-base with the "correct" classes, nor is it guaranteed that highly similar cases will be given similar diagnoses. We are therefore looking to incorporate some aspects of the supervised approach (using the outputs of prior cases to predict an output for a new case) without having to fully deduce a general function mapping the input objects to the output space (which would rely completely on the integrity of the outputs).

Moreover, we are also looking to incorporate into our methodology some aspects of the unsupervised approach, primarily the ability to discover patterns and clinically similar groups in the case-base without using any prior knowledge of how the NDD specialists decided to label (diagnose) each case. This would allow us to find, for each new case, the cases most clinically similar to it. In short, we use clustering (unsupervised learning) to find the cases clinically similar to a test case, and then apply multilabel, multiclass classification (supervised learning) only to the retrieved similar cases, using their physician-given labels to make a prediction.
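The retrieve-then-predict idea described above can be illustrated with a minimal sketch. It is not the study's implementation: cosine similarity as the retrieval measure, a plain average of the neighbours' label vectors as the reuse step, and the names `cosine` and `predict_diagnoses` are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two numeric feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict_diagnoses(features, labels, test_case, k):
    """Retrieve the k most similar cases, then average their binary
    diagnosis vectors into a per-diagnosis probability estimate.
    (Averaging is an assumed reuse scheme, not the one in the source.)"""
    ranked = sorted(range(len(features)),
                    key=lambda i: cosine(features[i], test_case),
                    reverse=True)
    nearest = ranked[:k]
    n_diagnoses = len(labels[0])
    return [sum(labels[i][d] for i in nearest) / len(nearest)
            for d in range(n_diagnoses)]

# Toy case-base: 4 cases, 2 features, 3 possible diagnoses.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1]]
prediction = predict_diagnoses(features, labels, [1.0, 0.05], k=2)
```

Only the physician-given labels of the retrieved cases enter the prediction, so a single inconsistent label elsewhere in the case-base does not affect it.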
Preprocessing X Matrix:

Attribute types:
(a) Empty: fewer than 8 non-NA values. Action: removed.
(b) Date: a date regex matches more than 90% of the feature column entries (allowing for non-pattern dates). Action: transformed into 2 new attributes, month and year.
(c) Numeric: coercion to character and back to numeric produces fewer than 16 NAs. Action: kept in its numeric coercion.
(d) Binary: multiple conditions plus fuzzy detection. Action: consolidated to a single form of "true" and a single form of "false".
(e) Clean factor: under 25 categories (and no match for the previous feature types). Action: none.
(f) Dirty factor: no match for the previous types and average string length under 20 characters. Action: the 20 most frequent levels were kept; the rest were termed "misc levels".
(g) Free text: no match for the previous types. Action: none.

Missing data: conformed to NA status.

Preprocessing Y Matrix:

Originally in the .mdb format: each row was a general ID, a feature (column) gave the respective neuro ID (there could be two rows with the same neuro ID), and for each type of diagnosis a comment or numeric marker was given in a respective feature. Result: a binary diagnosis vector.
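The attribute-type rules above can be sketched as a cascade of checks. This is a rough illustration only: the date regex and the true/false token sets are hypothetical stand-ins (the source gives neither), and `attribute_type` is an invented name.

```python
import re

# Hypothetical date pattern; the actual regex is not given in the source.
DATE_RE = re.compile(r"^\d{1,2}[./-]\d{1,2}[./-]\d{2,4}$")
# Hypothetical token sets standing in for the "fuzzy" binary detection.
TRUE_TOKENS = {"true", "t", "yes", "y"}
FALSE_TOKENS = {"false", "f", "no", "n"}

def is_number(text):
    """True if the string survives coercion to a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def attribute_type(column):
    """Classify one feature column (a list of raw values, None = NA)."""
    values = [str(v).strip() for v in column
              if v is not None and str(v).strip() != ""]
    if len(values) < 8:                                   # (a) empty
        return "empty"
    date_share = sum(bool(DATE_RE.match(v)) for v in values) / len(values)
    if date_share > 0.9:                                  # (b) date
        return "date"
    if sum(not is_number(v) for v in values) < 16:        # (c) numeric
        return "numeric"
    if all(v.lower() in TRUE_TOKENS | FALSE_TOKENS for v in values):
        return "binary"                                   # (d) binary
    if len(set(values)) < 25:                             # (e) clean factor
        return "clean factor"
    if sum(len(v) for v in values) / len(values) < 20:    # (f) dirty factor
        return "dirty factor"
    return "free text"                                    # (g) free text
```

The sketch applies each threshold literally and in the listed order, so a column is only tested against a later type once every earlier type has failed to match.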
* In this study, the weights were calculated automatically using a simple algorithmic approach.
* Other studies have used a domain-specific ontology to give different weights to different terms, according to their clinical significance.
* Pij = relative probability

Each such entropy is further normalized by log(n), n being the size of the corpus (the number of documents). This normalization was originally devised to give equal treatment to corpora of different sizes, but since in this project all textual attributes contain the same N.cases number of documents, it has little effect. A possible improvement for future versions is to replace it with a local normalization by the length of the document, so that the entropies being summed are normalized with respect to document length.

The entropy is a measure of how widely the use of a term is dispersed across the corpus. In the end, the sum of all Pij equals 1, but the more evenly they are spread across the corpus, the larger the term's global score becomes.
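As a small worked illustration of the normalized entropy described above, the sketch below assumes that Pij is the count of a term in document j divided by its total count in the corpus, and that the entropy is the usual Shannon entropy divided by log(n). The function name `normalized_entropy` is invented for the example.

```python
import math

def normalized_entropy(term_counts):
    """Entropy of one term's distribution over the documents of a corpus,
    divided by log(n) so that the result lies between 0 and 1.

    term_counts[j] is the number of times the term occurs in document j.
    """
    n = len(term_counts)
    total = sum(term_counts)
    if total == 0 or n < 2:
        return 0.0
    entropy = 0.0
    for count in term_counts:
        if count > 0:
            p = count / total          # Pij: share of the term's uses in doc j
            entropy -= p * math.log(p)
    return entropy / math.log(n)

evenly_spread = normalized_entropy([1, 1, 1, 1])   # term used equally everywhere
concentrated = normalized_entropy([4, 0, 0, 0])    # term used in one document only
```

A term spread evenly over all documents reaches the maximum normalized entropy, while a term confined to a single document scores 0.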
This is used in the Vector Space Model.
Empirical studies show that truncating the lower singular values can reduce noise, so the algorithm set to 0 all singular values in S that were below a certain threshold (set at 10^-3).
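The thresholding step can be illustrated as follows. This is a minimal sketch that operates on an already computed list of singular values (the decomposition itself would come from a linear algebra library); the function name and the sample values are invented for illustration.

```python
def truncate_singular_values(singular_values, threshold=1e-3):
    """Set every singular value below the threshold to 0, keeping the rest."""
    return [s if s >= threshold else 0.0 for s in singular_values]

# Hypothetical diagonal of S, sorted in decreasing order.
s_diagonal = [3.2, 0.5, 1e-3, 4e-4, 0.0]
s_truncated = truncate_singular_values(s_diagonal)
retained_rank = sum(1 for s in s_truncated if s > 0)
```

Multiplying the factor matrices back together with the truncated S then yields a lower-rank approximation of the original matrix.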
Attribute clinical weights:
* High similarity: if more than 80% of cases have a similarity to the test case above 0.8, divide the weight by 2.
* Average input length in a textual attribute: if above 30 characters, multiply the weight by 2.
* Test case value for the attribute is NA: divide the weight by 3.
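The three weight adjustments above can be sketched as a single function. Whether the adjustments compound when more than one rule fires is not stated in the source; the sketch assumes they do. The name `adjust_weight` and its parameters are hypothetical.

```python
def adjust_weight(weight, similarities, avg_text_length=None,
                  test_value_is_na=False):
    """Apply the clinical-weight adjustments to one attribute's weight.

    similarities     -- similarity of every case to the test case on this attribute
    avg_text_length  -- average input length in characters (textual attributes only)
    test_value_is_na -- True if the test case has no value for this attribute
    """
    share_high = sum(s > 0.8 for s in similarities) / len(similarities)
    if share_high > 0.8:            # attribute barely discriminates between cases
        weight /= 2
    if avg_text_length is not None and avg_text_length > 30:
        weight *= 2                 # long free-text inputs carry more information
    if test_value_is_na:
        weight /= 3                 # nothing to compare against in the test case
    return weight

halved = adjust_weight(1.0, [0.9] * 9 + [0.1])
doubled = adjust_weight(1.0, [0.1] * 10, avg_text_length=40)
```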
In choosing the test cases, however, the distribution of diagnoses in the Y matrix was examined. Inspecting the prior probabilities of diagnoses in the case base shows that several diagnoses occur only once in the entire Y matrix, while others occur a handful of times, or tens, hundreds or even thousands of times.
The final subset contained 350 case indexes.

For each test case, a diagnosis probability prediction vector was output for each combination of <Retrieval Method (4 types), Reuse & Adapt Method (2 types), K value (8 values)>. That is, 64 diagnosis probability prediction vectors were generated for each test case.
The above 5 graph types were produced for each combination of K, Retrieval Method and Reuse Scheme (i.e. for the 8 X 4 X 2 = 64 distinct combinations). That is, 320 distinct graphs were produced to graphically assess the aggregated results for all test cases.
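The combination counts can be checked by enumerating the grid. The method names and the K values below are placeholders, since the source gives only their number (4, 2 and 8).

```python
from itertools import product

# Placeholder names and values: only the counts are taken from the text.
RETRIEVAL_METHODS = ["retrieval_1", "retrieval_2", "retrieval_3", "retrieval_4"]
REUSE_METHODS = ["reuse_1", "reuse_2"]
K_VALUES = list(range(1, 9))      # 8 hypothetical K values
N_GRAPH_TYPES = 5

grid = list(product(RETRIEVAL_METHODS, REUSE_METHODS, K_VALUES))
n_combinations = len(grid)                    # prediction vectors per test case
n_graphs = n_combinations * N_GRAPH_TYPES     # aggregated graphs overall
```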
RMSE = $\sqrt{\frac{1}{P+N}\sum_i (y_i - \hat{y}_i)^2}$ = Root-mean-squared error = summing over all diagnoses (for all i values), an aggregated normalized sum of the individual errors between the predictions and real values of the diagnoses vector. For each diagnosis, the error can be either 0 if the prediction is correct or 1 if the prediction is wrong. Since the output of RMSE is just a cutoff-independent scalar, this measure cannot be combined with other measures into a parametric curve.

Accuracy = $P(\hat{Y} = Y)$ = estimated as: (TP + TN)/(P + N) = the number of correct predictions divided by the total number of diagnoses predicted = the probability of the algorithm predicting correctly = the rate of correct predictions attained by the algorithm.
F measure = Weighted harmonic mean of precision (P) and recall (R) = $\frac{1}{\alpha \frac{1}{P} + (1-\alpha)\frac{1}{R}}$ (van Rijsbergen, 1979). If $\alpha = 1/2$, the mean is balanced.

Sensitivity = Recall = TPrate = $P(\hat{Y} = + \mid Y = +)$ = estimated as: TP/P = True Positive Rate = number of true positives divided by the number of overall positives in the real diagnoses vector from the Y matrix = the algorithm's probability of predicting correctly which diagnoses the patient does have.

Precision = PPV = $P(Y = + \mid \hat{Y} = +)$ = estimated as: TP/tstdP = TP/(TP+FP) = Positive Predictive Value = the number of true positives divided by the total number of diagnoses predicted by the algorithm as positives = the probability of a positive "1" prediction being correct.
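The measures defined above can be computed for a single test case as in the sketch below, which assumes binary (0/1) real and predicted diagnosis vectors. The function name `evaluate` and the example vectors are made up for illustration.

```python
import math

def evaluate(y_true, y_pred, alpha=0.5):
    """Accuracy, RMSE, recall, precision and F measure for 0/1 vectors."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = len(pairs)                                   # P + N
    accuracy = (tp + tn) / total
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in pairs) / total)
    recall = tp / (tp + fn) if tp + fn else float("nan")       # TP / P
    precision = tp / (tp + fp) if tp + fp else float("nan")    # TP / (TP + FP)
    if tp > 0:
        f_measure = 1 / (alpha / precision + (1 - alpha) / recall)
    else:
        f_measure = 0.0   # no true positives: the harmonic mean is taken as 0
    return {"accuracy": accuracy, "rmse": rmse, "recall": recall,
            "precision": precision, "f_measure": f_measure}

y_true = [1, 1, 0, 0, 1, 0, 0, 0]   # real diagnoses from the Y matrix
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]   # predicted diagnoses after applying a cutoff
scores = evaluate(y_true, y_pred)
```

Because each individual error is either 0 or 1, the RMSE of binary predictions equals the square root of (1 - accuracy); in this example the accuracy is 0.75 and the RMSE is 0.5.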
P value of the AUC ROC: tests the null hypothesis that the area under the curve really equals 0.50. In other words, the P value answers this question: what is the probability of obtaining the observed AUC ROC (or higher) if the diagnosis algorithm were no better than flipping a coin?
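One common way to obtain such a P value is through the equivalence between the AUC and the Mann-Whitney U statistic. The sketch below is an assumption about how this could be done, not a description of the test actually used in the study: it uses a one-sided normal approximation without tie correction, and the function name is hypothetical.

```python
import math

def auc_and_p_value(y_true, scores):
    """AUC via the Mann-Whitney U statistic, with a one-sided P value
    for the null hypothesis AUC = 0.5 (normal approximation, no tie correction)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    n_pos, n_neg = len(pos), len(neg)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative case")
    # U counts the positive/negative pairs ranked correctly (ties count half).
    u = sum(1.0 if p > q else 0.5 if p == q else 0.0
            for p in pos for q in neg)
    auc = u / (n_pos * n_neg)
    mean_u = n_pos * n_neg / 2
    sd_u = math.sqrt(n_pos * n_neg * (n_pos + n_neg + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z >= z) under the null
    return auc, p_value

y_true = [1] * 5 + [0] * 5
predicted = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1, 0.05]
auc, p_value = auc_and_p_value(y_true, predicted)
```

With small numbers of positive or negative cases the normal approximation is rough, and an exact or permutation-based test would be a safer choice.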
Another reason for choosing ML as an approach for developing a CDSS in NDD is the need for future scalability: no per-clinic rule modifications are needed.