This document proposes a conceptual model for automatically matching individuals with health researchers for research studies using electronic medical record data. The model involves selecting relevant medical measurements for a "candidate" research participant, filtering individuals based on rules, reducing the data dimensions using principal component analysis, and calculating similarity between individuals' medical data using similarity coefficients. A simulation applies the model to a medical data set and demonstrates that it can significantly reduce the data needed to automatically match individuals for health research.
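The pipeline described above (select measurements, filter by rules, reduce dimensions with PCA, score similarity) can be sketched roughly as follows. The data, the number of components, and the use of cosine similarity are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 individuals x 50 medical measurements; the real
# input would be filtered EMR data for rule-eligible individuals.
X = rng.normal(size=(200, 50))
candidate = rng.normal(size=50)

# PCA via SVD of the mean-centred data, keeping k principal components
mu = X.mean(axis=0)
Xc = X - mu
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 5
W = Vt[:k].T                      # 50 x k projection matrix
Xr = Xc @ W                       # individuals in the reduced space
cr = (candidate - mu) @ W         # candidate in the same space

# Cosine similarity between the candidate and every individual
sims = (Xr @ cr) / (np.linalg.norm(Xr, axis=1) * np.linalg.norm(cr))
best = np.argsort(sims)[::-1][:10]   # ten most similar individuals
```

Note how the similarity is computed on 5 numbers per person rather than 50, which is the kind of data reduction the simulation demonstrates.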
HIV Replication Model for the Succeeding Period of Viral Dynamic Studies in a... (inventionjournals)
International Journal of Mathematics and Statistics Invention (IJMSI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJMSI publishes research articles and reviews across the whole field of Mathematics and Statistics, including new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance, and readability. The articles published in the journal can be accessed online.
Austin Statistics is an open-access, peer-reviewed, scholarly journal dedicated to publishing articles in all areas of statistics.
The aim of the journal is to provide a forum for scientists, academicians and researchers to find the most recent advances in the field of statistics.
Austin Statistics accepts original research articles, review articles, case reports and rapid communications on all aspects of statistics.
Experimental Design and Statistical Power in Swine Experimentation: A Review (Kareem Damilola)
A review of experimental design and statistical power in swine experimentation, which helps in gaining more insight into animal experimentation.
Data mining techniques are being rapidly developed for many applications. In recent years, data mining in healthcare has become an emerging field of research, particularly the development of intelligent medical diagnosis systems. Classification is a major research topic in data mining, and decision trees are popular methods for classification. In this paper, several decision tree classifiers are applied to the diagnosis of medical datasets: the AD Tree, J48, NB Tree, Random Tree and Random Forest algorithms. Heart disease, diabetes and hepatitis datasets are used to test the decision tree models. Aung Nway Oo | Thin Naing, "Decision Tree Models for Medical Diagnosis", International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN 2456-6470, Volume 3, Issue 3, April 2019. URL: https://www.ijtsrd.com/papers/ijtsrd23510.pdf
Paper URL: https://www.ijtsrd.com/computer-science/data-miining/23510/decision-tree-models-for-medical-diagnosis/aung-nway-oo
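As a rough illustration of the kind of comparison the paper performs, the sketch below trains a single decision tree and a random forest (scikit-learn's CART-style analogues of J48 and Random Forest) on a synthetic stand-in for a medical dataset; the actual study uses WEKA-style classifiers on the real UCI heart disease, diabetes and hepatitis data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary-diagnosis dataset with 13 features (roughly the size of
# the UCI heart disease feature set); purely illustrative.
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

accuracies = {}
for name, clf in [("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("random forest", RandomForestClassifier(random_state=0))]:
    clf.fit(X_train, y_train)
    accuracies[name] = clf.score(X_test, y_test)  # held-out accuracy
```

The same loop structure extends naturally to the other classifiers and datasets the paper compares.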
August 1, 2010. Design of Non-Randomized Medical Device Trials Based on Sub-Classification Using Propensity Score Quintiles, Topic Contributed Session on Medical Devices, (Greg Maislin and Donald B Rubin). Joint Statistical Meetings 2010, Vancouver Canada.
Innovative Technique for Gene Selection in Microarray Based on Recursive Clus... (AM Publications)
Gene selection is usually the crucial step in microarray data analysis. A great deal of recent research has focused on the challenging task of selecting differentially expressed genes from microarray data ('gene selection'). Numerous gene selection algorithms have been proposed in the literature, but it is often unclear exactly how these algorithms respond to conditions like small sample sizes or differing variances, so choosing an appropriate algorithm can be difficult in many cases. This paper presents a combination of Analysis of Variance (ANOVA), Principal Component Analysis (PCA) and Recursive Cluster Elimination (RCE) as a classification algorithm, employing an innovative method for gene selection that reduces the gene expression data to a minimal gene subset. This new feature selection method uses the ANOVA statistical test, principal component analysis, KNN classification and RCE; at each step, redundant and irrelevant features are eliminated. Classification accuracy reaches up to 99.10%, with less classification time than other conventional techniques.
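A minimal scikit-learn sketch of the pipeline's first stages (ANOVA F-test filtering, then PCA, then KNN) is shown below; the recursive cluster elimination step is omitted, and the data, feature counts and component counts are invented for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Microarray-like setting: few samples, many genes (features)
X, y = make_classification(n_samples=100, n_features=500, n_informative=20,
                           random_state=0)

# ANOVA F-test keeps the 50 most discriminative genes, PCA compresses them
# to 10 components, and KNN classifies in the reduced space.
pipe = make_pipeline(SelectKBest(f_classif, k=50),
                     PCA(n_components=10),
                     KNeighborsClassifier())
scores = cross_val_score(pipe, X, y, cv=5)   # 5-fold accuracy estimates
```

Wrapping the stages in a pipeline ensures the selection and projection are refit inside each cross-validation fold, avoiding selection bias.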
A Bifactor and Item Response Theory Analysis of the Eating Disorder Inventory-3 (David Garner)
The Eating Disorder Inventory-3 (EDI-3; Garner, 2004) is a 91-item, self-report measure scored on 12 scales (three Eating Disorder Risk scales, nine Psychological scales) and six composites. A sample of 1206 female eating disorder patients was divided randomly into calibration (n = 607) and cross-validation (n = 599) samples for confirmatory factor analyses. A bifactor model best fit the data in both samples, but a model with second-order factors corresponding to the risk and psychological scales approached the fit of the bifactor model.
ABSTRACT: This paper critically examines a broad view of the Structural Equation Model (SEM), with a view to pointing out how researchers can employ this model in future research, with specific focus on several traditional multivariate procedures such as factor analysis, discriminant analysis and path analysis. The study employed a descriptive survey and historical research design. Data were computed via descriptive statistics, correlation coefficients and reliability analysis. The study concluded that novice researchers must take care with the assumptions and concepts of structural equation modeling while building a model to test a proposed hypothesis. SEM is an evolving technique that is expanding into new fields, and it is providing new insights to researchers conducting longitudinal investigations.
DIFFERENTIAL OPERATORS AND STABILITY ANALYSIS OF THE WAGE FUNCTION (IJESM Journal)
In this paper, a differential operator is used to solve the wage equation, and the resulting wage function is analyzed and interpreted for stability. The equation incorporates speculative parameters operating in free range, and variations of these parameters cause stability or instability of the wage function in certain circumstances. Where the wage function is exponential, asymptotic stability towards the equilibrium wage rate is observed; where it consists of both exponential and periodic factors, the time path shows periodic fluctuations, with successive cycles giving smaller amplitudes until the ripples die out naturally. It is also observed that, though the differential operator is just as effective as the variation of parameters demonstrated in [6], it is rather simple and fast, with limited algebra.
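The two regimes described (monotone exponential convergence versus damped periodic fluctuation) can be read off the roots of the characteristic polynomial of a linear second-order wage equation. The coefficients below are invented for illustration, not taken from the paper.

```python
import numpy as np

def char_roots(b, c):
    # Roots of r^2 + b r + c = 0, the characteristic polynomial of a
    # hypothetical second-order wage equation w'' + b w' + c w = const.
    return np.roots([1.0, b, c])

# Overdamped case: two real negative roots -> purely exponential decay
# toward the equilibrium wage rate.
mono = char_roots(3.0, 2.0)

# Underdamped case: complex roots with negative real part -> periodic
# fluctuations whose amplitude shrinks each cycle until the ripples die out.
osc = char_roots(0.5, 4.0)
```

Negative real parts in both cases mean asymptotic stability; a nonzero imaginary part is what adds the periodic factor to the time path.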
Classification Accuracy Analyses Using Shannon’s Entropy (IJERA Editor)
There are many methods for determining classification accuracy. This paper shows the significance of the entropy of training signatures in classification. The entropy of the training signatures of a raw digital image represents the heterogeneity of the brightness values of the pixels in different bands. An image comprising a homogeneous land-use/land-cover (lu/lc) category will be associated with nearly the same reflectance values, resulting in a very low entropy value. Conversely, an image characterized by diverse lu/lc categories will consist of widely differing reflectance values, so its entropy will be relatively high. This concept leads to analyses of classification accuracy. Although entropy has been used many times in remote sensing (RS) and GIS, its use in determining classification accuracy is a new approach.
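The entropy contrast described above can be seen in a toy example; `shannon_entropy` is a hypothetical helper computing Shannon entropy in bits over the empirical pixel-value distribution.

```python
import numpy as np

def shannon_entropy(pixels):
    # Shannon entropy (bits) of the empirical distribution of pixel values
    _, counts = np.unique(pixels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

# Homogeneous patch: every pixel has the same brightness -> entropy 0
homogeneous = np.full(1000, 120)

# Diverse patch: brightness values spread over the full 8-bit range
rng = np.random.default_rng(0)
diverse = rng.integers(0, 256, size=1000)

h_low = shannon_entropy(homogeneous)
h_high = shannon_entropy(diverse)
```

A uniform spread over 256 levels approaches the 8-bit maximum of 8 bits, while the single-category patch sits at exactly 0, mirroring the lu/lc argument above.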
Generalized Additive and Generalized Linear Modeling for Children Diseases (QUESTJOURNAL)
ABSTRACT: This paper is necessarily restricted to the application of Generalised Linear Models (GLM) and Generalised Additive Models (GAM), and is intended to give readers some measure of the power of these mathematical tools for modeling health/illness data systems. Illness in general, and childhood illness in particular, is among the most serious socio-economic and demographic problems in developing countries, with great impact on future development. In this paper we focus on some frequently occurring diseases among children under fourteen years of age, using data collected from various hospitals of Jammu district from 2011 to 2016. The success of any policy or health care intervention depends on a correct understanding of the socio-economic, environmental and cultural factors that determine the occurrence of diseases and deaths. Until recently, any available morbidity information was derived from clinics and hospitals. Information on the incidence of diseases obtained from hospitals represents only a small proportion of illness, because many cases do not seek medical attention; thus, hospital records may not be appropriate for estimating the incidence of diseases for programme development. The use of DHS data in understanding childhood morbidity has expanded rapidly in recent years. However, few attempts have been made to address explicitly the problem of nonlinear effects of metric covariates in the interpretation of results. This study shows how the GAM can be adapted to extend the GLM analysis and explain the nonlinear relationships of the covariates. Incorporating nonlinear terms in the model improves the estimates in terms of goodness of fit.
The GLM is explicitly specified by giving a symbolic description of the linear predictor and a description of the error distribution. The GAM is fit using the local scoring algorithm, which iteratively fits weighted additive models by backfitting. The backfitting algorithm is a Gauss-Seidel method of fitting additive models by iteratively smoothing partial residuals. It separates the parametric from the nonparametric parts of the fit, and fits the parametric part using weighted linear least squares within the backfitting loop.
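A minimal numerical sketch of the backfitting idea (Gauss-Seidel cycling over components, each time smoothing partial residuals) is given below. The moving-average smoother stands in for the spline or loess smoother a real GAM fit would use, and the data are synthetic.

```python
import numpy as np

def smooth(x, r, frac=0.1):
    # Crude moving-average smoother over the x-ordering; a stand-in for
    # the spline/loess smoother a real GAM implementation would use.
    order = np.argsort(x)
    k = max(3, int(frac * len(x)))
    sm = np.convolve(r[order], np.ones(k) / k, mode='same')
    out = np.empty_like(sm)
    out[order] = sm
    return out

rng = np.random.default_rng(1)
n = 300
x1, x2 = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
y = np.sin(3 * x1) + x2 ** 2 + rng.normal(0, 0.1, n)   # additive truth

# Gauss-Seidel backfitting: cycle over components, smoothing the partial
# residuals that the other components leave unexplained.
alpha = y.mean()
f1, f2 = np.zeros(n), np.zeros(n)
for _ in range(20):
    f1 = smooth(x1, y - alpha - f2)
    f1 -= f1.mean()          # centre each component for identifiability
    f2 = smooth(x2, y - alpha - f1)
    f2 -= f2.mean()

resid = y - alpha - f1 - f2
```

After a few cycles the two component functions absorb most of the structure, so the residual variance falls well below the variance of y.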
Marginal Regression for a Bi-variate Response with Diabetes Mellitus Study (theijes)
In this paper, we develop a bivariate-response model for a diabetes mellitus study. Diabetes mellitus affects a large number of people of all social conditions throughout the world, and it continues to grow despite advances over the past few years in virtually every field of diabetes research and in patient care for improved treatment. It is sometimes accompanied by symptoms of severe thirst, profuse urination and weight loss. We analyzed 200 samples in SPSS, using logistic regression to estimate the relationship between the response probability of whether a diabetic patient had high blood pressure or not.
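The paper's analysis (a logistic regression of hypertension status in 200 diabetic patients, run in SPSS) can be approximated as follows. The two predictors and their coefficients are invented for illustration, not taken from the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200                                  # matching the study's sample size
X = rng.normal(size=(n, 2))              # e.g. standardized glucose and BMI

# Simulate hypertension status from a hypothetical logistic relationship
logit = 1.2 * X[:, 0] + 0.5 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

model = LogisticRegression().fit(X, y)
probs = model.predict_proba(X)[:, 1]     # P(hypertension) per patient
```

The fitted coefficients play the role of the SPSS output: a positive coefficient means the predictor raises the estimated probability of hypertension.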
Hellinger Optimal Criterion and ℋ_PA-Optimum Designs for Model Discrimination... (inventionjournals)
The Kullback-Leibler (KL) optimality criterion has been considered in the literature for model discrimination. However, the Hellinger distance has many advantages over the KL distance. For that reason, this paper proposes a new criterion based on the Hellinger distance, named the Hellinger (ℋ)-optimality criterion, to discriminate between two rival models. An equivalence theorem is proved for this criterion. Furthermore, a new compound criterion is constructed that possesses both discrimination and high-probability-of-desired-outcome properties. Discrimination between binary and logistic GLMs is suggested based on the new criteria.
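For reference, the Hellinger distance between two discrete distributions p and q, the quantity the proposed ℋ-criterion is built on, is H(p, q) = (1/√2)·‖√p − √q‖₂, which (unlike the KL divergence) is symmetric and bounded in [0, 1]:

```python
import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete probability vectors
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2))

d_same = hellinger([0.5, 0.5], [0.5, 0.5])      # identical models -> 0
d_disjoint = hellinger([1.0, 0.0], [0.0, 1.0])  # disjoint supports -> 1
d_mid = hellinger([0.7, 0.3], [0.4, 0.6])       # rival models, in between
```

Boundedness is one of the advantages the abstract alludes to: KL divergence is infinite for disjoint supports, while the Hellinger distance tops out at 1.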
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE (ijistjournal)
Dichotomous data is a type of categorical data that is binary, with categories zero and one. Health care data is one of the most heavily used kinds of categorical data. Binary data is the simplest form of data used in health care databases, where close-ended questions can be used; it is very efficient in computational cost and memory capacity for representing categorical data. Clustering health care or medical data is very tedious due to its complex data representation models, high dimensionality and data sparsity. In this paper, clustering is performed after transforming the dichotomous data into real values by the Wiener transformation. The proposed algorithm can be used to determine the correlation of the health disorders and symptoms observed in large medical and health binary databases. Computational results show that clustering based on the Wiener transformation is very efficient in terms of objectivity and subjectivity.
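A rough sketch of the approach, taking `scipy.signal.wiener` as the Wiener-transformation step (the abstract does not specify an implementation) and k-means for the clustering; the binary symptom matrix is synthetic, with two planted patient groups.

```python
import numpy as np
from scipy.signal import wiener
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 100 patients x 20 binary symptoms; group A expresses the first ten
# symptoms, group B the last ten.
B = np.vstack([
    rng.random((50, 20)) < 0.8 * (np.arange(20) < 10),
    rng.random((50, 20)) < 0.8 * (np.arange(20) >= 10),
]).astype(float)

# Wiener filtering turns the 0/1 matrix into real-valued, locally
# smoothed data, which distance-based clustering handles better.
R = wiener(B, mysize=(1, 3))

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(R)
```

With such a strong planted structure the two clusters recover the two patient groups; on real data the claim is only that the real-valued transform clusters better than the raw binary matrix.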
Running head: Final Project Data Analysis (Final Project Data A.docx, jeanettehully)
Final Project Data Analysis:
Luz Rodriguez
Southern New Hampshire University
Process and calculations
In completing the research on the influence that gender (male/female) has on the length of hospital stay, we can use several types of statistical tests for a more accurate analysis of the research question. This involves a dot plot and a histogram. In responding to this question, we can place gender in one category but study it under two separate samples, male and female, and examine the effects on length of stay after a myocardial infarction. Since the data are quantitative and we are examining the relationship between the two factors, a dot plot and a histogram would be effective for this analysis.
Research question
To what extent does gender influence length of hospital stay for MI patients?
Response and predictor variables
Response: Length of hospital stay (LOS)
Predictor: Gender (female and male)
Type of variable for predictor variable
Predictor: gender (female or male)
Type of diagram for analysis
Dot plot
Histogram
Data analysis
As shown, the data compares the length of hospital stay between the genders, with males coded 0 and females coded 1. It is clear that the length of hospital stay for males (0) is shorter than that for females (1). A larger difference between the two genders would be meaningful and would reduce the standard deviation (Gerstman, 2015).
Summary statistics:

gender   n    mean   variance   std. dev   std. err.   median   range   min   max   Q1   Q3
0        65   0      0          0          0           0        0       0     0     0    0
1        35   1      0          0          0           1        0       1     1     1    1
Hypothesis test results:

Difference   Sample Diff.   Std. Err.    DF    T-Stat      P-value
μ1 - μ2      6.49           0.59375453   198   10.930443   <0.0001
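The table's t-statistic and p-value can be reproduced from the reported summary quantities (sample difference 6.49, standard error 0.59375453, 198 degrees of freedom):

```python
from scipy import stats

diff, se, df = 6.49, 0.59375453, 198
t_stat = diff / se                          # matches the tabled T-Stat
p_value = 2 * stats.t.sf(abs(t_stat), df)   # two-sided p-value
```

A t-statistic near 11 on 198 df puts the two-sided p-value far below 0.0001, consistent with the table's "<0.0001" entry.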
References
Gerstman, B. B. (2015). Basic Biostatistics Statistics for Public Health (2nd ed.). Burlington, MA: Jones & Bartlett Learning.
[Figures: dot plots and histograms of length of stay for gender 0 and gender 1; only axis-tick and bin-count residue survived extraction.]
Learning from a Class Imbalanced Public Health Dataset: a Cost-based Comparis... (IJECEIAES)
Public health care systems routinely collect health-related data from the population. This data can be analyzed using data mining techniques to find novel, interesting patterns, which could help formulate effective public health policies and interventions. The occurrence of chronic illness is rare in the population, and the effect of this class imbalance on the performance of various classifiers was studied. The objective of this work is to identify the best classifiers for class-imbalanced health datasets through a cost-based comparison of classifier performance. The popular, open-source data mining tool WEKA was used to build a variety of core classifiers as well as classifier ensembles to evaluate classifier performance. The unequal misclassification costs were represented in a cost matrix, and cost-benefit analysis was also performed. In another experiment, sampling methods such as under-sampling, over-sampling, and SMOTE were applied to balance the class distribution in the dataset, and the costs were compared. The Bayesian classifiers performed well, with high recall and a low number of false negatives, and were not affected by the class imbalance. Results confirm that the total cost of Bayesian classifiers can be further reduced using cost-sensitive learning methods. Classifiers built using the randomly under-sampled dataset showed a dramatic drop in costs and high classification accuracy.
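A minimal sketch of the cost-sensitive idea, using scikit-learn rather than WEKA: on an imbalanced synthetic dataset (~5% positives, standing in for a rare chronic illness), reweighting the classes trades some overall accuracy for higher recall on the rare class, which is what drives the cost reductions reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Imbalanced synthetic stand-in: roughly 5% positive ("ill") cases
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Plain classifier vs. one whose loss reweights errors on the rare class
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
costed = LogisticRegression(max_iter=1000,
                            class_weight='balanced').fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_costed = recall_score(y_te, costed.predict(X_te))
```

`class_weight='balanced'` is the scikit-learn analogue of WEKA's cost matrix with inverse-frequency costs; under-sampling and SMOTE are alternative routes to the same rebalancing.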
International Journal of Computational Engineering Research(IJCER)ijceronline
International Journal of Computational Engineering Research(IJCER) is an intentional online Journal in English monthly publishing journal. This Journal publish original research work that contributes significantly to further the scientific knowledge in engineering and Technology.
Reply DB5 w9 research
Reply discussion boards
1-jauregui
Discuss how the quantitative and qualitative data would complement one another and add strength to the study.
Evidently, the use of EBP in healthcare mostly relies on the available qualitative and quantitative data which is supported by scientific or clinical research. In studying the EBP, quantitative data is used to enhance qualitative information and vice versa, because one method complements the other one (Tappen, 2015, p.88). For example, in the selected article the EBP about beliefs and behaviors of nurses showed that the number of the nurses who were certified vs. nurses who were not certified explained why some of the nurses have higher perceived EBP implementation than others (Eaton, Meins, Mitchell, Voss, & Doorenbos, 2015, “Evidence-Based Practice Beliefs and Behaviors”). Quantitative data would improve the study by providing evidence in the form of numbers or amounts such as the scores which show the proficiency of nurses in different areas (Eaton, Meins, Mitchell, Voss, & Doorenbos, 2015, “Evidence-Based Practice Beliefs and Behaviors”). Quantitative data could strengthen the study by providing more detailed information about EBP implementation which will explain certain trends and occurrences as found in the research.
2- rosquete
The qualitative research is exploratory/ descriptive and emphasizes the importance of subjects frame to be referenced and the context of the study. The research will be more concerned with the truth perceived by informants and less concerned with the truth of the objectives. The information from this research will be important in understanding the informants’ behaviors in details. The description of this approach will be used to get the picture and the opinion of nursing caregivers on the use of CNS depressants by the elderly (Susan, Nancy, & Jennifer, 2013).
The method that is used is explorative/descriptive. The strengths of the descriptive method are: effective to analyze non-quantified subjects and issues, the possibility to observe the phenomenon in a natural environment, the opportunity to use qualitative and quantitative method together, and less time consuming than quantitative studies. In the case of exploratory studies, the principal advantage is the flexibility and adaptability to change and it is effective in laying the groundwork that guides to future research. We can find disadvantages in this kind of studies. For example, descriptive studies cannot test or verify the research problem statically, the majority of descriptive studies are not repeatable due to their observational nature, and they are not helpful in identifying cause behind the described phenomenon. Another weak point, that includes exploratory research, is the interpretation of information is subject to bias. These type of studies make use a modest number of samples that may not represent the target population and they are not usually helpful in decision ma.
FedCASIC 2019: Survey Respondent Segmentation: Trust in Government SurveysLew Berman
Surveys are universally experiencing falling response rates and rising data collection costs. Numerous studies focused on nonrespondents and the risk of bias Many people don’t respond to surveys. We explore the reasons why they participate to help inform ways to motivate more people to respond. This work is
Sparked by the market research conducted for the 2010 Census. We investigate attitudinal motivators and barriers to government survey participation, particularly health surveys.
FedCASIC 2019: Topic Salience and Propensity to Respond to Surveys: Findings ...Lew Berman
Main Theories on Nonresponse (Tourangeau and Piewes 2013) include Social Capital , Leverage-Salience Theory, Social Exchange Theory. The origins of Leverage-Salience Theory are from Groves and Couper (1998). Researchers observe that respondents vary in terms of the attributes of a survey request that they judge as relevant to their decision to participate. Thus Expert interviewers tailor the features of their request to heighten the salience of those elements they think will be most favorably received by potential respondent.
We look at topic salience and propensity to respond to surveys using findings from a study using a national mobile panel.
FedCASIC 2019: Designing, implementing, and analyzing Leverage Saliency Theor...Lew Berman
Description of an experiment using LST to determine who is more likely to participate in a survey based on survey topic. Specifically will incentive of $25 or $50 lead to an increase in participation, will increasing survey duration impact participation, do incentive and duration predict willingness to participate, and do the main effects vary across topics and importance.
IFD&TC 2012: Validating in-home Measures for the National Health Interview Su...Lew Berman
Design considerations for an experiment to validate that physical measures and blood can be collected in-home comparable to that collected in the CDC NHANES Mobile Examination Center. Results from the validation study would be used to confirm application of these methods for the CDC NHIS Study.
IFD&TC 2019: Technical Challenges and Solutions in Center ManagementLew Berman
This panel will discuss two sets of challenges and solutions in center management. Telephone based surveys typically use supervisory staff to live-monitor interviewers or manually review recordings for effective speech rate, properly reading a question, and accurately recording a response. General industry practice is to review 5-10% of all calls. However, these practices are labor intensive, subjective, and the evidence for this range of review is anecdotal. This panel will discuss current call center practices and new technical solutions for improving these practices. Spam blockers are having an increasing impact on our ability to contact respondents by phone. Panelists will share research about spam blockers, how they affect centers with different telephone systems, and possible solutions to ensure your calls get through to study participants.
Data Science Training and Workforce DevelopmentLew Berman
Overall, the demand for data science talent is outpacing the current supply, and many who are being trained in data science methods are pursuing careers in sectors other than public service or biomedical/behavioral research. On June 6, 2018 ICF hosted a workshop with participants from academia, industry
organizations and the federal government. This paper summarizes the key findings from this workshop.
Willingness and Reasons for Unlikeliness to Share Child Immunization Records ...Lew Berman
Poster presentation at the 2018 National Immunization Conference on willingness of survey participants to share child immunization records. The survey, The Childhood Immunization Mobile Panel Survey II (ChIMPS II), was a methodology study to assess mode, introduction, and content variations for the National Immunization Survey (NIS). This study used a smart phone panel because it offered easier administration, lower cost, and respondent convenience. One content variation focused on assessing the willingness of a respondent to provide permission for the Centers for Disease Control and Prevention (CDC) to access their children’s medical vaccine records. The objective of the poster analyses was to describes the willingness of respondents to share vaccine records with CDC as part of smartphone survey, and the reasons respondents gave for being unwilling or unsure about sharing their child’s medical records with CDC.
Tom Selleck Health: A Comprehensive Look at the Iconic Actor’s Wellness Journeygreendigital
Tom Selleck, an enduring figure in Hollywood. has captivated audiences for decades with his rugged charm, iconic moustache. and memorable roles in television and film. From his breakout role as Thomas Magnum in Magnum P.I. to his current portrayal of Frank Reagan in Blue Bloods. Selleck's career has spanned over 50 years. But beyond his professional achievements. fans have often been curious about Tom Selleck Health. especially as he has aged in the public eye.
Follow us on: Pinterest
Introduction
Many have been interested in Tom Selleck health. not only because of his enduring presence on screen but also because of the challenges. and lifestyle choices he has faced and made over the years. This article delves into the various aspects of Tom Selleck health. exploring his fitness regimen, diet, mental health. and the challenges he has encountered as he ages. We'll look at how he maintains his well-being. the health issues he has faced, and his approach to ageing .
Early Life and Career
Childhood and Athletic Beginnings
Tom Selleck was born on January 29, 1945, in Detroit, Michigan, and grew up in Sherman Oaks, California. From an early age, he was involved in sports, particularly basketball. which played a significant role in his physical development. His athletic pursuits continued into college. where he attended the University of Southern California (USC) on a basketball scholarship. This early involvement in sports laid a strong foundation for his physical health and disciplined lifestyle.
Transition to Acting
Selleck's transition from an athlete to an actor came with its physical demands. His first significant role in "Magnum P.I." required him to perform various stunts and maintain a fit appearance. This role, which he played from 1980 to 1988. necessitated a rigorous fitness routine to meet the show's demands. setting the stage for his long-term commitment to health and wellness.
Fitness Regimen
Workout Routine
Tom Selleck health and fitness regimen has evolved. adapting to his changing roles and age. During his "Magnum, P.I." days. Selleck's workouts were intense and focused on building and maintaining muscle mass. His routine included weightlifting, cardiovascular exercises. and specific training for the stunts he performed on the show.
Selleck adjusted his fitness routine as he aged to suit his body's needs. Today, his workouts focus on maintaining flexibility, strength, and cardiovascular health. He incorporates low-impact exercises such as swimming, walking, and light weightlifting. This balanced approach helps him stay fit without putting undue strain on his joints and muscles.
Importance of Flexibility and Mobility
In recent years, Selleck has emphasized the importance of flexibility and mobility in his fitness regimen. Understanding the natural decline in muscle mass and joint flexibility with age. he includes stretching and yoga in his routine. These practices help prevent injuries, improve posture, and maintain mobilit
MANAGEMENT OF ATRIOVENTRICULAR CONDUCTION BLOCK.pdfJim Jacob Roy
Cardiac conduction defects can occur due to various causes.
Atrioventricular conduction blocks ( AV blocks ) are classified into 3 types.
This document describes the acute management of AV block.
The prostate is an exocrine gland of the male mammalian reproductive system
It is a walnut-sized gland that forms part of the male reproductive system and is located in front of the rectum and just below the urinary bladder
Function is to store and secrete a clear, slightly alkaline fluid that constitutes 10-30% of the volume of the seminal fluid that along with the spermatozoa, constitutes semen
A healthy human prostate measures (4cm-vertical, by 3cm-horizontal, 2cm ant-post ).
It surrounds the urethra just below the urinary bladder. It has anterior, median, posterior and two lateral lobes
It’s work is regulated by androgens which are responsible for male sex characteristics
Generalised disease of the prostate due to hormonal derangement which leads to non malignant enlargement of the gland (increase in the number of epithelial cells and stromal tissue)to cause compression of the urethra leading to symptoms (LUTS
Explore natural remedies for syphilis treatment in Singapore. Discover alternative therapies, herbal remedies, and lifestyle changes that may complement conventional treatments. Learn about holistic approaches to managing syphilis symptoms and supporting overall health.
Title: Sense of Smell
Presenter: Dr. Faiza, Assistant Professor of Physiology
Qualifications:
MBBS (Best Graduate, AIMC Lahore)
FCPS Physiology
ICMT, CHPE, DHPE (STMU)
MPH (GC University, Faisalabad)
MBA (Virtual University of Pakistan)
Learning Objectives:
Describe the primary categories of smells and the concept of odor blindness.
Explain the structure and location of the olfactory membrane and mucosa, including the types and roles of cells involved in olfaction.
Describe the pathway and mechanisms of olfactory signal transmission from the olfactory receptors to the brain.
Illustrate the biochemical cascade triggered by odorant binding to olfactory receptors, including the role of G-proteins and second messengers in generating an action potential.
Identify different types of olfactory disorders such as anosmia, hyposmia, hyperosmia, and dysosmia, including their potential causes.
Key Topics:
Olfactory Genes:
3% of the human genome accounts for olfactory genes.
400 genes for odorant receptors.
Olfactory Membrane:
Located in the superior part of the nasal cavity.
Medially: Folds downward along the superior septum.
Laterally: Folds over the superior turbinate and upper surface of the middle turbinate.
Total surface area: 5-10 square centimeters.
Olfactory Mucosa:
Olfactory Cells: Bipolar nerve cells derived from the CNS (100 million), with 4-25 olfactory cilia per cell.
Sustentacular Cells: Produce mucus and maintain ionic and molecular environment.
Basal Cells: Replace worn-out olfactory cells with an average lifespan of 1-2 months.
Bowman’s Gland: Secretes mucus.
Stimulation of Olfactory Cells:
Odorant dissolves in mucus and attaches to receptors on olfactory cilia.
Involves a cascade effect through G-proteins and second messengers, leading to depolarization and action potential generation in the olfactory nerve.
Quality of a Good Odorant:
Small (3-20 Carbon atoms), volatile, water-soluble, and lipid-soluble.
Facilitated by odorant-binding proteins in mucus.
Membrane Potential and Action Potential:
Resting membrane potential: -55mV.
Action potential frequency in the olfactory nerve increases with odorant strength.
Adaptation Towards the Sense of Smell:
Rapid adaptation within the first second, with further slow adaptation.
Psychological adaptation greater than receptor adaptation, involving feedback inhibition from the central nervous system.
Primary Sensations of Smell:
Camphoraceous, Musky, Floral, Pepperminty, Ethereal, Pungent, Putrid.
Odor Detection Threshold:
Examples: Hydrogen sulfide (0.0005 ppm), Methyl-mercaptan (0.002 ppm).
Some toxic substances are odorless at lethal concentrations.
Characteristics of Smell:
Odor blindness for single substances due to lack of appropriate receptor protein.
Behavioral and emotional influences of smell.
Transmission of Olfactory Signals:
From olfactory cells to glomeruli in the olfactory bulb, involving lateral inhibition.
Primitive, less old, and new olfactory systems with different path
Prix Galien International 2024 Forum ProgramLevi Shapiro
June 20, 2024, Prix Galien International and Jerusalem Ethics Forum in ROME. Detailed agenda including panels:
- ADVANCES IN CARDIOLOGY: A NEW PARADIGM IS COMING
- WOMEN’S HEALTH: FERTILITY PRESERVATION
- WHAT’S NEW IN THE TREATMENT OF INFECTIOUS,
ONCOLOGICAL AND INFLAMMATORY SKIN DISEASES?
- ARTIFICIAL INTELLIGENCE AND ETHICS
- GENE THERAPY
- BEYOND BORDERS: GLOBAL INITIATIVES FOR DEMOCRATIZING LIFE SCIENCE TECHNOLOGIES AND PROMOTING ACCESS TO HEALTHCARE
- ETHICAL CHALLENGES IN LIFE SCIENCES
- Prix Galien International Awards Ceremony
Report Back from SGO 2024: What’s the Latest in Cervical Cancer?bkling
Are you curious about what’s new in cervical cancer research or unsure what the findings mean? Join Dr. Emily Ko, a gynecologic oncologist at Penn Medicine, to learn about the latest updates from the Society of Gynecologic Oncology (SGO) 2024 Annual Meeting on Women’s Cancer. Dr. Ko will discuss what the research presented at the conference means for you and answer your questions about the new developments.
TEST BANK for Operations Management, 14th Edition by William J. Stevenson, Ve...kevinkariuki227
TEST BANK for Operations Management, 14th Edition by William J. Stevenson, Verified Chapters 1 - 19, Complete Newest Version.pdf
TEST BANK for Operations Management, 14th Edition by William J. Stevenson, Verified Chapters 1 - 19, Complete Newest Version.pdf
These simplified slides by Dr. Sidra Arshad present an overview of the non-respiratory functions of the respiratory tract.
Learning objectives:
1. Enlist the non-respiratory functions of the respiratory tract
2. Briefly explain how these functions are carried out
3. Discuss the significance of dead space
4. Differentiate between minute ventilation and alveolar ventilation
5. Describe the cough and sneeze reflexes
Study Resources:
1. Chapter 39, Guyton and Hall Textbook of Medical Physiology, 14th edition
2. Chapter 34, Ganong’s Review of Medical Physiology, 26th edition
3. Chapter 17, Human Physiology by Lauralee Sherwood, 9th edition
4. Non-respiratory functions of the lungs https://academic.oup.com/bjaed/article/13/3/98/278874
Pulmonary Thromboembolism - etilogy, types, medical- Surgical and nursing man...VarunMahajani
Disruption of blood supply to lung alveoli due to blockage of one or more pulmonary blood vessels is called as Pulmonary thromboembolism. In this presentation we will discuss its causes, types and its management in depth.
Pulmonary Thromboembolism - etilogy, types, medical- Surgical and nursing man...
Berman pcori challenge document
A Conceptual Model of Using Medical Measures
To Match Individuals for Health Research
Note: This work is derived from my Doctoral Dissertation, completed May 2011 at George
Washington University.
Lewis E. Berman, PhD, MS
April 15, 2013
Abstract
Lower survey and study response rates and higher costs pose significant challenges to
carrying out biomedical and public health research. Increasingly, health studies require larger
sample sizes in order to analyze illnesses that occur with low prevalence in the population.
Moreover, sub-group delineation is required in order to assess illness in hard-to-reach groups or
groups that occur with lower frequency in the general population.
The increasing availability of electronic medical information may serve as the foundation
for automatically matching individuals with health researchers for the purposes of advancing
health research. As electronic health records become the norm in the delivery of care, the record
and feature space for this data will become quite large. This will provide the basis for accurately
matching individuals with health researchers and projects.
This paper proposes a conceptual model to match individuals using filtering, data
reduction, and similarity coefficients. The filtering and data reduction steps reduce the scale of
the problem from a computational perspective. A simulation of the conceptual model is
illustrated. The findings from the simulation demonstrate that the record and feature space can be
significantly reduced and automated.
1 Introduction
There has been an increase in the demand for information access due to the widespread
use and ubiquitous nature of the Internet. Concurrently, medicine has undergone significant
change in equipment, procedures, treatments, monitoring, and specialization. In addition, the
federal government of the United States (U.S.) is investing in health information technology
(HIT) and electronic health records (EHR) with the hope that it will improve health [1].
Currently, individuals self-select into online health communities or pre-defined groups.
An alternative to self-selection is automated formation of health communities using medical
measurements. In essence, a “matchmaking mechanism” between patients can be automated
using medical measurements from an electronic health record [2, page 6]. While matching may
be done for social support, it may also be done for the purposes of health research.
1.1 Problem Statement
A common problem across disparate disciplines is matching and grouping objects based
on feature similarity. This is a classification problem. In the biological sciences, classification
has long been used to develop taxonomies such as the well-defined classification of the animal kingdom.
Currently, health studies utilize phone calling, mailing, and door-to-door visits to recruit
and match individuals for health research studies. It is widely agreed that health studies, and
studies in general, are achieving lower response rates for a variety of reasons. Moreover, in
attempting to recruit participants into these studies, the participant selection criterion is typically
limited by time and money. While this approach has some merit when considering the trade-off
between screening detail and cost, it is limiting since a study may need to recruit large
numbers of individuals and may need very detailed information for selection
purposes. Thus, an alternative to manual matching and selection is needed.
Therefore, this paper proposes to build a conceptual model for grouping individuals
based on electronically available medical measurements. The model consists of filtering, data
reduction, and similarity computation.
1.2 Research Approach and Organization of the Paper
The research approach in this paper is to develop the conceptual model and simulate the
model with a database of medical measurements. Section 2 is a review of the relevant literature.
Section 3 presents the conceptual model and a simulation example. Section 4 presents the
simulation results. Section 5 discusses the results. The last section is the conclusion.
2 Literature Review
This chapter is a review of the computational techniques related to the development of a
conceptual model for matching individuals. The topics cover medical measurement data types,
data reduction, and similarity coefficients.
2.1 Medical Measurement Data Types
Measurement is defined as the assignment of a number to an attribute of some instance of
an object. An important consideration in measurement is that the “properties of the attribute are
faithfully represented as numerical properties” as described by Krantz [3, page 1]. Medical
measurements are the result of tests, procedures, treatments, health history questions, or
diagnoses, and articulate an individual’s health state.
In general, there are four measurement types that may be assigned to medical
measurements. The first type is nominal measurement, which separates data into discrete groups
that are mutually exclusive. The second type is ordinal measurement. Ordinal measurement
assigns objects to categories such that these categories have a meaningful rank. In
epidemiological research, people may be pooled into different fitness groups such as poor, good,
and outstanding based on an individual’s perception of fitness level. While there is an ordering
and a sense of the magnitude difference between fitness groups, it is not possible to determine the
actual difference between groups. A third measurement type is interval. An example of an
interval measurement is Fahrenheit temperature. A temperature of 80° F is greater than a
temperature of 60° F. However, temperature, like all interval measurements, has two interesting
distinctions. First, a temperature of 0° F does not suggest the absence of temperature. Secondly,
even though temperature measurements possess equal intervals, there is no true zero point, and
as a result ratios between interval measures do not exist. Thus, 100° F is not twice as hot as
50° F. The fourth measurement type is ratio. Ratio is much like interval except it has an
absolute zero point. Thus, a person who weighs 200 pounds is twice as heavy as a person
weighing 100 pounds and a 50-pound difference between any two weights always has the same
meaning [4, 5].
2.2 Data Reduction
The definition of data reduction is the process of converting large sets of data into a
smaller number of data points. Mathematically, data reduction is the transformation of an n-
dimensional vector of observed data points or measurements, m = (m1, m2, …, mn), to a k-
dimensional vector of variables t = (t1, t2, …, tk) such that k≤n. In addition, the transformation
from m to t adheres to some criterion [6].
Data reduction methods fall into linear and non-linear methods. Some well-used linear
methods include Principal Component Analysis (PCA) and Factor Analysis (FA). Non-linear
methods include Principal Curves (PC), Multidimensional Scaling (MDS), and Neural Networks
(NN). The linear methods are considered easier to implement than non-linear methods [6]. PCA
has been applied in biology, medicine, chemistry, meteorology, and the social sciences [6, 7].
2.3 Similarity
Similarity is the basis for classification and is defined to be the amount of resemblance
between two objects based on the distinct information pertaining to the variables (i.e., features) of
the objects [8]. Similarity coefficients have been applied to several fields such as manufacturing
systems, plant breeding, seed bank management, high throughput screening of chemical datasets,
and determining the molecular markers of genetic relationships between individuals [9, 10, 11,
12].
In 1901, Jaccard created the earliest similarity coefficient [13, 14]. There are a number of
other similarity coefficients. However, some coefficients, such as geometric and ontological ones,
are not suitable for this work because they restrict the measurement types that can be used or
because a single feature may adversely skew the results. Therefore, this paper explores three commonly
used coefficients, developed by Jaccard, Gower, and Tversky, which are not as susceptible to
these issues.
2.3.1 Jaccard Coefficient
The Jaccard Coefficient (JC) is a feature-based model (FBM) which uses common and
unique features to compute similarity between objects. As shown in Equation 1, JC computes the
ratio of the number of features in common between two objects to the total number of features in
common plus the number of features possessed uniquely by each of the two objects.
Jaccard Coefficient:    JC = a / (a + b + c)        (1)
Where:
a = # of features in common
b = # of features possessed only by the 1st object
c = # of features possessed only by the 2nd object
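Representing each individual as a set of binary feature flags, Equation 1 can be sketched
as follows (the feature names are illustrative only, not drawn from the source dataset):

```python
def jaccard(features_a, features_b):
    """Jaccard coefficient: shared features over all features present in either object."""
    a = len(features_a & features_b)   # features in common
    b = len(features_a - features_b)   # features only in the first object
    c = len(features_b - features_a)   # features only in the second object
    return a / (a + b + c)

candidate = {"smoking_history", "high_blood_pressure", "overweight_bmi"}
match = {"smoking_history", "high_blood_pressure", "diabetes_medication"}
similarity = jaccard(candidate, match)  # 2 shared / (2 + 1 + 1) = 0.5
```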
2.3.2 Tversky Feature Contrast Similarity Model
Tversky suggested using a set-theoretical approach known as the feature contrast model.
The Tversky Feature Contrast Model Coefficient (TFCMC) computes similarity as a linear
combination of the common and unique features of individual objects. Thus, for two objects A
and B, there is a similarity function S; non-negative set functions f and g that define the weights
of individual features and how they are combined; and constants θ, α, β ≥ 0 such that [16]:
S(A, B) = θg(A ∩ B) − (αf(A − B) + βf(B − A))        (2)
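The set functions f and g are left abstract in the model. As a minimal sketch, taking both to
be simple set cardinality (an assumption for illustration, not a choice made in the paper),
Equation 2 becomes:

```python
def tversky(features_a, features_b, theta=1.0, alpha=0.5, beta=0.5):
    """Tversky feature contrast model with set cardinality as both f and g.

    theta weights the common features; alpha and beta weight the features
    unique to each object. The default weights are illustrative assumptions.
    """
    common = len(features_a & features_b)   # g(A ∩ B)
    only_a = len(features_a - features_b)   # f(A − B)
    only_b = len(features_b - features_a)   # f(B − A)
    return theta * common - (alpha * only_a + beta * only_b)

candidate = {"smoking_history", "high_blood_pressure"}
match = {"smoking_history"}
s = tversky(candidate, match)  # 1·1 − (0.5·1 + 0.5·0) = 0.5
```

Unlike the Jaccard coefficient, this score is not bounded to [0, 1]; unique features subtract
from, rather than merely dilute, the similarity.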
2.3.3 Gower’s Model
In 1971, Gower proposed a similarity coefficient that could simultaneously use variables
of different measurement scales [8]. Gower computed the similarity between two objects, A and
B, as follows:
S(A, B) = [Σk=1..p S(A, B)k] / [Σk=1..p W(A, B)k]        (3)
For nominal or ordinal data S(A,B)k = 1 when the feature values are the same and 0
otherwise. For interval or ratio data S(A,B)k = 1 - | fAk – fBk | / Rk such that fAk and fBk are the
values of the features for objects A and B; Rk equals the range for feature k across all objects (i.e.,
persons). In essence, this function scales the real valued features. A second feature of the Gower
coefficient (GC) is the denominator, W(A,B)k, which is a type of binary weighting variable. It
takes a value of 1 when the comparison between feature fAk and fBk, for objects A and B, is
considered valid. Otherwise, it is equal to 0.
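Assuming each individual is represented as a mapping from feature names to values, with None
marking a comparison that is not valid (W(A,B)k = 0), Gower's coefficient can be sketched as
below; the feature names and the range Rk are illustrative assumptions:

```python
def gower(rec_a, rec_b, ranges, nominal):
    """Gower similarity over mixed-type features.

    ranges  : feature -> range R_k of that feature across all objects (interval/ratio)
    nominal : features scored 1 on exact match and 0 otherwise (nominal/ordinal)
    """
    num = 0.0
    den = 0.0
    for k in rec_a:
        if rec_a[k] is None or rec_b[k] is None:
            continue                        # W(A,B)_k = 0: comparison not valid
        den += 1                            # W(A,B)_k = 1
        if k in nominal:
            num += 1.0 if rec_a[k] == rec_b[k] else 0.0
        else:                               # scaled real-valued comparison
            num += 1.0 - abs(rec_a[k] - rec_b[k]) / ranges[k]
    return num / den

a = {"gender": "F", "tchol": 185.0}
b = {"gender": "F", "tchol": 260.0}
s = gower(a, b, ranges={"tchol": 150.0}, nominal={"gender"})
# gender: 1.0; tchol: 1 − 75/150 = 0.5 → (1.0 + 0.5) / 2 = 0.75
```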
3 Conceptual Model
This paper proposes a conceptual model to match individuals for medical research. As
illustrated in Figure 1, the conceptual model progresses through candidate measurement vector
(CMV) selection, rule-based filtering, principal component analysis (PCA) data reduction, and
similarity computation. This chapter will describe the steps in the conceptual model, criteria for
selection of a simulation dataset, and a description of the simulation example.
3.1 Candidate Measurement Vector Selection
It is assumed that individuals are being grouped together to match the objective of a
research study proposed by a research scientist. To match individuals, a hypothetical “candidate”
individual is created to represent the features of a typical member of the group. The “candidate”
consists of a specific set of medical measurements related to the features of people needed for the
research study. In a typical research study, the investigator and their team define the features of
interest for the patient population. However, this algorithm allows the selection process to be
sensitive to the desires of the patient population by augmenting the feature set of the “candidate”.
For example, a research scientist might be interested in recruiting individuals with type 2
diabetes into a study on diabetes co-morbidity factors. In this conceptual model the first step is
for the research scientist to prepare a candidate measurement vector (CMV) that includes the type
2 diabetes co-morbidity measurement vector. In this case, a CMV could include measurements
for the history of smoking, high blood pressure, body mass index equaling overweight, and
medication used to control high blood pressure and diabetes. Conversely, the patient population
might be interested in issues such as quality of life and familial history. These patient selected
features are included in the CMV. The data reduction step uses the CMV as input.
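As a sketch, the CMV for this hypothetical diabetes study could be represented as a simple
mapping from measurements to values; the feature names and values are illustrative
assumptions, not taken from any source dataset:

```python
# Hypothetical candidate measurement vector (CMV) for the type 2 diabetes
# co-morbidity example: investigator-selected features plus features added
# to reflect the interests of the patient population.
cmv = {
    # investigator-selected features
    "smoking_history": True,
    "high_blood_pressure": True,
    "bmi_category": "overweight",
    "on_bp_medication": True,
    "on_diabetes_medication": True,
    # patient-population additions
    "quality_of_life_score": 3,        # ordinal, e.g. 1 (poor) to 5 (excellent)
    "familial_diabetes_history": True,
}
```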
Figure 1. Conceptual model for matching individuals.
3.2 Rule-Based Filtering
The second step in the conceptual model is to filter out individuals using a rule set. The
rules are declarative statements that in effect constrain the individuals that may be used for
matching. A rule takes the form shown in Equation 4. The predicates of R, (P1, P2, …, Pj),
express the logic of the filter using comparison operators, typically {>, <, ≠, =, ≥, ≤}.
Filtering is O(N), where N is the number of records in the dataset.
R: If (P1 ∧ P2 … ∧ Pj) then {Retain | Delete}        (4)
Filtering is computed in two ways. First, a database is filtered according to demographic
information such as age ranges, gender, and geography. Secondly, the database is filtered
according to temporal criteria delineating when medical events or measurements must occur. For
example, a CMV containing elevated total cholesterol may be grouped with an individual having
a similar diagnosis during the same time. Total cholesterol measurements less than 200 are
considered desirable [17]. Figure 2 illustrates this situation with a temporal overlap between two
individuals based on a similar total cholesterol value.
Figure 2. Simple events with temporal overlap: the candidate’s total cholesterol
measurements (TCHOL = 185 and TCHOL = 260) overlap in time (t) with a potential match’s
measurement (TCHOL = 265).
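The rule form of Equation 4 can be sketched as a conjunction of predicates applied in a
single O(N) pass over the records; the field names and thresholds below are illustrative
assumptions, not drawn from the source dataset:

```python
# A rule retains a record only if every predicate in the conjunction holds.
def make_rule(*predicates):
    return lambda record: all(p(record) for p in predicates)

retain = make_rule(
    lambda r: 45 <= r["age"] <= 64,      # demographic filter
    lambda r: r["gender"] == "F",
    lambda r: r["tchol_year"] == 2010,   # temporal filter on the measurement
)

records = [
    {"age": 50, "gender": "F", "tchol_year": 2010},
    {"age": 30, "gender": "F", "tchol_year": 2010},
    {"age": 55, "gender": "M", "tchol_year": 2009},
]
kept = [r for r in records if retain(r)]  # single O(N) pass over the dataset
```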
3.3 Data Reduction
The third step in the conceptual model is data reduction. Data reduction is used to
improve efficiency by reducing the number of measurements used to compute similarity.
Principal Component Analysis (PCA) is used specifically for data reduction [6] and has been used
in health research [18].
PCA takes independent measurements and reduces them to a smaller set of elements
known as principal components (PC). The PCs are uncorrelated and represent most of the
information in the original set of measurements [7]. The goal of PCA is to summarize the
interrelationships for a set of measurements with a smaller set of uncorrelated orthogonal PCs that
are linear combinations of the original measurements [19]. The PCs explain the maximum
amount of variance possible in the observed measurements with a smaller set of linearly
transformed variables [6, 7]. If only a few principal components explain a high proportion of the
variance in the observed variables and only a few of the measurements are highly correlated with
these PCs, then the dataset can be reduced with a small loss of information.
PCA results in a correlation matrix in which each element has a range of -1.0 to +1.0,
representing the correlation, rxy, between two elements. The higher the absolute value of rxy the
stronger the relationship is between two types of measurements. An absolute value of rxy between
.50 - .69 is a moderate strength of relationship, between .70 - .89 is considered a strong
relationship, and between .90 – 1.00 is considered a very strong relationship [18].
PCA also produces a solution to the characteristic equation of the correlation matrix.
Solving this equation results in eigenvalues and eigenvectors representing the variance in the
measurements and the loadings associated with each item in the correlation matrix. The loadings
represent the correlation of an item with a PC. The sum of the squared loadings on a PC is equal
to the total variance that is explained by that PC. Similarly, since the total variance is known, the
proportion of the total variance explained by a PC is equal to the sum of the squared loadings on
that PC divided by the total variance, where the total variance is equal to the number of
measurements [18].
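The data reduction step above can be sketched with plain NumPy. This is an illustration on synthetic data, not the paper's SAS implementation; it standardizes the measurements, eigendecomposes the correlation matrix, and applies the eigenvalue-greater-than-one retention criterion used in Section 4.2:

```python
# Minimal PCA-for-data-reduction sketch (Section 3.3) using only NumPy.
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the measurement matrix: 200 individuals x 4 measures,
# with the last two columns strongly correlated so one PC can absorb both.
base = rng.normal(size=(200, 3))
X = np.column_stack([base, base[:, 2] + 0.1 * rng.normal(size=200)])

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each measurement
R = np.corrcoef(Z, rowvar=False)           # correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)       # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings: correlation of each measurement with each PC.  The sum of the
# squared loadings on a PC equals that PC's eigenvalue.
loadings = eigvecs * np.sqrt(eigvals)

# Retention criterion from the paper: keep PCs with eigenvalue > 1.0.
retained = eigvals > 1.0
# Proportion of total variance per PC; total variance = number of measures.
explained = eigvals / X.shape[1]

print("eigenvalues:", np.round(eigvals, 2))
print("proportion of variance:", np.round(explained, 2))
print("PCs retained:", int(retained.sum()))
```

Measurements whose absolute loading on a retained PC exceeds a threshold (0.70 in Section 4.2) would then be kept for the similarity step.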
3.4 Similarity
The last step in the conceptual model is similarity computation. The JC, TFCMC, and
GC coefficients are used and compared in the simulation. GC is appealing since it is computed
on the raw data and can use all measurement types directly. Conversely, a drawback of JC and
TFCMC is that they operate on binary datasets, so each measurement is recoded to a binary value
to accommodate this requirement. Similarity computation yields a value quantifying the degree
of likeness between two objects.
3.4.1 Tolerance Ranges
A single measurement from two individuals, of the same data type, can be an exact
match. However, the two values may differ yet be considered equivalent from a clinical
perspective. For example, an individual with a blood pressure of 110/80 and another with 115/80
would both have normal blood pressure. However, if JC or TFCMC is used, then these two
individuals would not be considered a match unless some procedure is used to account for the
blood pressure readings being essentially the same.
There are two approaches to this problem. The first approach is to define a percentage-
based tolerance range (PBTR). A PBTR is determined by a tolerance level, τ, which is defined
for the set of measurements. The tolerance level establishes a lower and upper value for each
measurement. This establishes the range of values for a measurement that are considered equal to
that in the CMV. As shown in equation 5, the tolerance range for the jth measurement is
determined by the value of that measurement for the CMV, c, and the tolerance level τ.
Tj = (mcj · (1 − τ), mcj · (1 + τ)) (5)
For example, assume a tolerance of 20% is used for body weight. If the CMV has a body
weight measurement of 200 pounds, then by equation 5 the PBTR for body weight is Tj = (160, 240).
Thus, an individual with a body weight in this range is considered similar to the CMV for this
feature. Conversely, someone with a body weight of 245 is not considered similar to the CMV
for this feature.
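Applying equation 5 literally, the PBTR and its membership test reduce to a few lines. The function names here are illustrative:

```python
# PBTR per equation 5: the tolerance range for measurement j is
# (m_cj * (1 - tau), m_cj * (1 + tau)), centered on the CMV value.
def pbtr(cmv_value: float, tau: float) -> tuple:
    return (cmv_value * (1 - tau), cmv_value * (1 + tau))

def within_pbtr(value: float, cmv_value: float, tau: float) -> bool:
    """True if an individual's measurement falls in the CMV's tolerance range."""
    lo, hi = pbtr(cmv_value, tau)
    return lo <= value <= hi

# Body-weight example from the text: a 20% tolerance on a CMV weight of 200
# pounds gives the range (160, 240), so 245 pounds falls outside it.
print(tuple(round(v, 1) for v in pbtr(200.0, 0.20)))  # (160.0, 240.0)
print(within_pbtr(245, 200.0, 0.20))                  # False
```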
The second approach is to set a cut point tolerance range (CPTR) for each of the medical
measures. Often a medical measure has a clinically relevant cut point, which establishes a
threshold between healthy and unhealthy values. For example, the National Heart Lung and
Blood Institute Obesity Education Initiative defined six classifications for body mass index
(BMI). These classifications are cut points ranging from less than 18.5 kg/m2 for underweight,
18.5 - 24.9 kg/m2 for normal weight, to greater than or equal to 40 kg/m2 for extreme obesity
[20]. Thus, for a BMI value of 22 the CPTR Tj = (18.5, 24.9).
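A CPTR lookup for BMI can be sketched as a table of classification ranges. The text names only three of the six NHLBI classifications; the intermediate classes used below (overweight 25.0 - 29.9 and the two obesity classes) are the standard NHLBI categories and are an assumption here:

```python
# Cut-point tolerance ranges (CPTRs) for BMI, following the NHLBI
# classifications cited in the text [20].  BMI values are assumed to be
# reported to one decimal place, matching the closed intervals below.
BMI_CLASSES = [
    ((0.0, 18.4), "underweight"),            # < 18.5 kg/m2
    ((18.5, 24.9), "normal weight"),
    ((25.0, 29.9), "overweight"),            # standard NHLBI class (assumed)
    ((30.0, 34.9), "obesity class I"),       # standard NHLBI class (assumed)
    ((35.0, 39.9), "obesity class II"),      # standard NHLBI class (assumed)
    ((40.0, float("inf")), "extreme obesity"),  # >= 40 kg/m2
]

def bmi_cptr(bmi: float):
    """Return the clinically relevant range containing this BMI, plus its label."""
    for (lo, hi), label in BMI_CLASSES:
        if lo <= bmi <= hi:
            return (lo, hi), label
    return None  # value falls between reported cut points

print(bmi_cptr(22))    # ((18.5, 24.9), 'normal weight'), as in the text
```

An individual is then considered similar to the CMV on this feature when both BMI values fall in the same range.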
Both the PBTR and the CPTR approaches can be applied to interval and ratio data. For
ordinal data, a tolerance range can be chosen as a range on the ordinal scale of potential values.
For example, Figure 3 illustrates a question on mental health. The responses are ordered in
ascending order of intensity. If the CMV includes item response two to this question, then
grouping would be with people who have the same response or perhaps a subset of the possible
categories. For instance, the tolerance set might be categories 2 and 3, represented as Tj = (2, 3).
Figure 3. Ordinal measurement type.
For nominal data, there are two approaches. First, each response category of a nominal
data item may be converted into an independent item. For example, a checklist of 10
prescription medications used by an individual can be converted into 10 binary data items on the
usage of each specific medication (e.g., using Lipitor / not using Lipitor, using aspirin / not using
aspirin). Disuniting each element of a nominal data item in this manner
has the possibility of overwhelming the similarity computation. An alternative approach for
nominal data is to associate a tolerance with this feature such as "X out of the Y nominal
categories must be the same" for the binary data item to show agreement. This would preclude
the possibility of overwhelming the similarity computation by a disunited single nominal
variable.
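Both treatments of a nominal checklist can be sketched briefly. The category names below are illustrative, and the "X out of Y" threshold is the tolerance described above:

```python
# Two ways to handle a nominal checklist (e.g., medication use), per the text.
def split_binary(checked: set, categories: list) -> dict:
    """Approach 1: disunite the checklist into one binary item per category."""
    return {c: (c in checked) for c in categories}

def x_of_y_agree(a: set, b: set, categories: list, x: int) -> bool:
    """Approach 2: the whole checklist contributes a single binary item that
    shows agreement only when at least x of the y categories match
    (both checked or both unchecked)."""
    matches = sum((c in a) == (c in b) for c in categories)
    return matches >= x

meds = ["lipitor", "aspirin", "metformin"]   # illustrative category names
cmv_meds = {"lipitor", "aspirin"}
other = {"lipitor", "metformin"}
print(split_binary(cmv_meds, meds))
print(x_of_y_agree(cmv_meds, other, meds, x=2))  # only 1 of 3 agree -> False
```

Approach 2 keeps the checklist's weight in the similarity computation equal to that of any other single measurement.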
3.4.2 Similarity Computation
For a dataset of individuals I = {I1, I2, … In} each with a set of measurements M = {m1,
m2, … mk} an NxN similarity matrix can be computed between each pair of objects. This is
O(N²). The computation can be simplified under three conditions. First, pair-wise computation
of an object with itself (i.e., on the diagonal) is not needed. Second, it is reasonable to assume
that there is a symmetric relationship between two objects, thus S(A, B) = S(B, A). Under these
two conditions, the computation is reduced to the lower half of the matrix and thus there are
(N² − N) / 2 computations. Third, note that the objective of this work is to match similar
individuals. As such, the computation can be reduced to O(N) since only the similarity
coefficient between the CMV and the list of individuals is computed.
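The O(N) pass against the CMV can be sketched over the binary-recoded data. JC is assumed here in its standard binary form a / (a + b + c), and the TFCMC is shown as an unweighted contrast score (matches minus mismatches), consistent with the floors discussed later in the paper; the paper's weighted variant may differ. Missing values are recoded as never matching, per Section 3.4.4:

```python
# O(N) similarity pass: compare the CMV's binary vector (1 = measurement in
# the CMV's tolerance range, 0 = outside, None = missing) to each individual.
from typing import List, Optional

Bin = Optional[int]

def jaccard(u: List[Bin], v: List[Bin]) -> float:
    """JC = a / (a + b + c); a missing value never matches (Section 3.4.4)."""
    a = b = c = 0
    for x, y in zip(u, v):
        if x == 1 and y == 1:
            a += 1
        elif x == 1:
            b += 1      # covers y == 0 and y missing
        elif y == 1:
            c += 1
    return a / (a + b + c) if (a + b + c) else 0.0

def tfcmc(u: List[Bin], v: List[Bin]) -> int:
    """Unit-weight contrast score: matches minus mismatches (an assumed
    simplification of the TFCMC; zero when half the items agree)."""
    matches = sum(1 for x, y in zip(u, v) if x is not None and x == y)
    return matches - (len(u) - matches)

cmv = [1, 1, 1, 1]   # the CMV trivially falls inside its own tolerance ranges
individuals = [[1, 1, 0, 1], [1, None, 0, 0]]
print([(jaccard(cmv, ind), tfcmc(cmv, ind)) for ind in individuals])
# [(0.75, 2), (0.25, -2)]
```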
3.4.3 Simulation Dataset
The United States National Institutes of Health (NIH) and the United States Centers for
Disease Control and Prevention (CDC) operate clinical trials, cross-sectional studies, and
surveillance activities either through intramural or extramural research. For the purposes of this
work the dataset must be public use, contain a large number of individuals, and contain a variety
of measures. Therefore, data from the National Health and Nutrition Examination Survey
(NHANES) have been selected.
NHANES is a nationally representative cross-sectional survey of the non-institutionalized
population of the United States. Each year the NHANES enrolls approximately 5,000 individuals
of all age ranges, genders, races, and ethnicities. Participants complete an interview in
their home. After the home interview, a participant receives an extensive physical exam at one of
three mobile examination centers. Content on the study includes cardiovascular disease,
environmental exposures, eye disease, kidney disease, obesity, physical fitness, physical
functioning, and many other health indicators [21, 22].
3.4.4 Missing Data
Surveys such as NHANES may have missing data for some individuals' measurements.
This can arise because individuals refuse to participate in the survey or because they refuse to
participate in portions of the survey [23]. Missing data affects two elements of the computational
model. First, it affects the data reduction piece, as PCA requires complete records for
computation. However, PCA will automatically remove incomplete records to determine the
variance structure.
Secondly, similarity computation needs to account for missing data. Conceptually, it is
unknown if a measurement is missing because it was never observed or recorded, it is a feature
that does not exist for an individual, or some other reason. The reasons for missing data are not
encoded in the NHANES database and therefore it cannot be concluded that a person with a
missing measurement has a value similar to the CMV. In this research, missing data is re-coded
to NULL and is considered different from another person’s measurement.
3.5 SHN Simulation
Publicly available data from NHANES 1999-2004 are used in the simulation. The dataset
includes 31,124 individuals from birth onward. This dataset comprises measures related to
self-report questions on health, physical measures, and the results of laboratory tests [24, 25, 26].
The simulation is evaluated on type 2 diabetes. Tables 3 and 4 describe the data items and the
data files used for the simulation.
3.5.1 Type 2 Diabetes
Type 2 diabetes (T2D) usually occurs in individuals who are older, obese, or lacking in
physical activity. It occurs as insulin resistance such that the muscle, liver, and fat cells do not
use insulin properly. As a result, the body needs additional insulin to get glucose into cells for
energy [27]. T2D can be controlled with healthy eating habits, physical activity, weight loss, and
for some individuals, with the use of medications [28].
A primary risk factor for T2D is age, with those individuals over 45 being at increased
risk. Some other risk factors associated with type 2 diabetes are abdominal obesity, ethnicity,
HDL values lower than the normal range, history of gestational diabetes, hypertension, insulin
resistance, overweight, physical inactivity, and a family history of diabetes [29, 30].
Symptoms of T2D include infections, blurry vision, and tingling or numbness in the
hands and feet [31]. There are numerous health effects resulting from diabetes such as cataracts,
glaucoma, or retinopathy; foot ulcers, amputations; hearing loss; heart disease, or hypertension;
nervous system diseases; skin infections; or stroke [31, 32, 33].
Diabetes is diagnosed with a fasting plasma glucose (FPG) test, a regular plasma glucose
test or an oral glucose tolerance test (OGTT). All three tests assess the level of glucose in the
blood. A normal value is less than 100 mg/dL for people without diabetes. Values between 100
and 125 mg/dL are labeled as "impaired fasting glucose", while values greater than 125 mg/dL are
given a label of "provisional diagnosis of diabetes". A non-fasting plasma glucose test may also
be used. If the value from this test is above 200 mg/dL, then an individual may have diabetes.
Confirmatory tests are usually required [34, 35, 36].
T2D is monitored with laboratory tests such as total cholesterol, HDL cholesterol, LDL
cholesterol, triglycerides, and insulin [37]. Many of the T2D related self-reported questions,
physical measures, and laboratory tests are available in the NHANES dataset.
3.5.2 Simulation Software
The computational model and software for the simulation runs on a Hewlett-Packard
model p6210y personal computer with an AMD Athlon ™ II X4 620 Processor. The processor
runs at 2.60 GHz and there is 6GB of installed RAM. Windows 7 64-bit operating system is
installed on the personal computer. Filtering and data reduction are computed with software
written in SAS Statistical Software v9.1. Similarity is computed with software written in
Java.
4 Results
The dataset was prepared by merging several datasets from NHANES 1999-2000,
NHANES 2001-2002, and NHANES 2003-2004. As shown in Table 3, the dataset includes 28
medical measurements. One can imagine that a research scientist studying T2D would select the
items in this dataset. Perhaps the patient population would select items related to family history
and pain. Therefore, both the researcher and patients can influence the matching process without
affecting the conceptual model.
The simulation is examined from two different perspectives: 1) reduction in the record
and feature space resulting from filtering and PCA, and 2) the correlation between the three
similarity coefficients.
4.1 Filtering
T2D occurs mostly in adults, thus the datasets were filtered in the first stage for
individuals ages 20 and above. This resulted in the original dataset of 31,124 individuals being
reduced to 49.2% of the original size. The dataset does not include temporal information due to
confidentiality and disclosure concerns. Therefore, temporal matching is not utilized for this
problem.
4.2 Data Reduction
The second step in the process is to conduct the principal component analysis (PCA) to
reduce the scale of the feature space (i.e., medical measures). Figure 4 shows the value of the
principal components for T2D. The first 11 principal components (PC) are greater than 1.0.
Figure 5 shows the unique and cumulative proportion that each PC contributes to the overall
variance. The first 11 PCs uniquely contribute between 3.8% and 12.2% of the overall variance.
In addition, the T2D PCs cumulatively contribute 70.7% to the overall variance. Thus, following
the criteria for selection of PCs the first 11 T2D PCs are used for data reduction.
Figure 4. Type 2 diabetes principal component values.
Figure 5. Type 2 diabetes principal component unique and cumulative
proportions.
Figure 6 shows 18 of the original 28 measures related to T2D. Fourteen of these
measures have a loading of 0.70 or greater on a PC. Four measures are loaded very close to 0.70
and are thus retained. Thus, PCA reduces the measurement space for T2D by 35.7%.
Figure 6. Type 2 diabetes loadings for LBXSKSI, LBXSGL, LBXGLU, BPXSY1, URXUMA,
DIQ070, DIQ080, LBXHCT, LBDLDL, LBDHDL, LBXSPH, LBXHGB, BMXBMI, FAMDIA,
LBXGH, LBXTC, LBXTR, and BMXWAIST.
4.3 Similarity
Similarity coefficients are computed in the third step of the model. TFCMC, JC, and GC
are used. For the TFCMC and JC the binary datasets are computed with PBTRs of 5%, 10%,
25%, and 50% as shown in Table 1. The PBTRs for each measurement (i.e., variable) are
calculated as described in equation 5. Thus, in the T2D example the CMV has a body mass index
(BMI) measurement of 27.5 and a 5% PBTR of approximately (26.2, 28.9). As the tolerance level
increases, the tolerance range around each measure becomes larger. For categorical data,
individual categories may be selected; for ordinal data, ranges may be selected. FAMDIA is an
example of a categorical measurement, which can be coded with a value of zero or one. A zero
represents a CMV without a family history of diabetes. In the T2D example, the FAMDIA PBTR
range across all tolerance levels is essentially (0, 0).
For the CPTR approach, the tolerance range used is one that is medically relevant. For
example, the CMV has a BMI measurement of 27.5 and a systolic blood pressure reading of 120.
The literature describes a BMI of 27.5 to be in the overweight classification range of 25.0 – 29.9
[20]. Thus, the CPTR for BMI is (25, 29.9). Similarly, systolic blood pressure is considered
normal if it is less than or equal to 120 mmHg. The CMV blood pressure is exactly 120, so the
CPTR can be set as less than or equal to 120 mmHg.
Table 1 also delineates the CPTRs. For several measurements, the literature describes a
CPTR delineating healthy and unhealthy levels (refer to the references noted in Table 1). Some
measurements do not have a specific set of cut points for healthy and unhealthy values. Instead,
these measurements have a reference range that denotes where the values of the measurement fall
for a large percentage of the population. All reference ranges for these measurements are
consistent with the CMV age and are inclusive of differences between males and females.
Table 2, Figure 7, and Figure 8 illustrate the descriptive statistics for the example.
TFCMC can produce negative similarity scores when the majority of measurements between the
CMV and an individual are dissimilar. In both figures, the similarity score at each percentile
increases as the PBTR tolerance level increases. For example, at the 5% tolerance level and 95th
percentile, the TFCMC similarity score results in a value of negative six; and at the 50%
tolerance level and 95th percentile TFCMC has a similarity score of 14. Thus, higher similarity
scores occur by increasing the tolerance level around a measurement. One must be careful in
setting the tolerance level because high similarity scores can result between the CMV and an
individual who is in all likelihood dissimilar. In addition, the cut point tolerance ranges produce
similarity scores at the different percentiles that fall between the 10% and 50% tolerance level.
Figure 7. Type 2 diabetes TFCMC descriptive similarity statistics.
Figure 8. Type 2 diabetes JC and GC descriptive similarity statistics.
Figure 9 illustrates the correlation coefficients between each combination of similarity
coefficients at PBTRs of 5%, 10%, 25%, and 50% and the correlation coefficient for the CPTR.
This figure shows that the correlation strength between (TFCMC, GC) and (JC, GC) increases as
the tolerance level increases. Note however that (TFCMC, JC) are strongly correlated at all
PBTRs and the CPTR. For (TFCMC, GC), and (JC, GC) the correlation coefficient for CPTR is
between the correlation coefficients at the 25% and 50% tolerance levels.
Figure 9. Correlation coefficients associated with type 2 diabetes similarity
coefficients.
5 Discussion
The purpose of this paper is to propose a conceptual model for grouping similar
individuals together, based on their medical measurements, and demonstrate it with an example.
The conceptual model consists of candidate measurement vector (CMV) selection, rule-based
filtering, principal component analysis (PCA) data reduction, and similarity computation.
Different techniques for computing similarity were compared. This research is significant
because, to date, a conceptual model for the purpose of automatically grouping individuals for
health research has not been defined.
The simulation uses a publicly available dataset and successfully demonstrates that the
scale of the problem, in terms of the number of observations and feature space, can be reduced
using filtering and principal component analysis (PCA). In the example chosen, filtering for a
specific age range reduced the number of observations by about one-half. This will vary based on
the filtering criteria and the population of individuals in the dataset. The feature space was
reduced from 28 to 18 medical measurements using PCA, a reduction of 35.7%.
The mean similarity scores for TFCMC, JC, and GC all increased as the PBTR increased.
The increased scores imply a higher degree of likeness between the CMV and each of the other
observations (i.e., individuals) in the dataset. The mean similarity score for TFCMC is low at all
PBTRs and with the CPTR. It should be noted however, that a higher similarity score is balanced
against the tolerance level used with PBTRs. Using a high tolerance level may in practice bring
dissimilar individuals together into an RHN. Therefore, caution is recommended in setting the
tolerance level.
The strong correlation between TFCMC and JC is an unexpected finding, as TFCMC
lowers the similarity score due to dissimilar measurements. However, JC in some sense takes
dissimilar features into account as well in the denominator (refer to equation 1). Thus, the two
similarity coefficients track together and are thus correlated. This may not be the case, however,
if TFCMC is weighted.
Similarity computation showed strong positive correlations between JC and GC for
PBTRs of 10%, 25%, and 50%, and for the CPTR. For the 5% PBTR the correlation was slightly
below moderate. Using a 50% PBTR is not likely to be a good approach as it may result in
ranges that cross over many cut points of healthy values for a specific medical measurement. For
example, one person with an unhealthy blood pressure level might pool with people who have
healthy blood pressure levels.
The similarity score results highlight two points. First, in the case of TFCMC and JC it is
important to establish a threshold similarity score for grouping individuals. This may be based on
a minimum number of measurements that are considered the same. Arbitrary assignment of a
threshold value should be avoided. Intuitively, one might consider that at least half the
measurements should be equivalent. This would establish a TFCMC floor of zero and a JC floor
of 0.50. An alternative approach is to consider the statistical distribution of the similarity scores
and choose those scores at the 95th percentile or higher. In practice, the assignment of a
threshold may be based on empirical evidence. Second, GC scales each measurement by the
range and is conceptually appealing as it is designed to work with mixed data types. It is true that
as the GC score increases two individuals are considered more similar. However, it is not clear
how the scores are to be interpreted, and thus GC presents a problem. Moreover, the
interpretation of the GC similarity score is not as intuitive as that of JC and TFCMC.
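The range scaling that makes GC attractive for mixed data types can be sketched in the standard Gower form [8]: interval measurements contribute 1 − |x − y| / range, nominal measurements contribute an exact-match indicator, and the per-measurement scores are averaged. This is the textbook Gower coefficient and may differ in detail from the paper's implementation; the measurement ranges below are assumed values for illustration:

```python
# Gower-style similarity sketch [8] for mixed measurement types.
def gower(u, v, ranges, nominal):
    """u, v: measurement lists; ranges[k]: observed range of interval/ratio
    measurement k; nominal: set of nominal measurement indices.  Missing
    values (None) are skipped, per the standard Gower definition."""
    scores = []
    for k, (x, y) in enumerate(zip(u, v)):
        if x is None or y is None:
            continue
        if k in nominal:
            scores.append(1.0 if x == y else 0.0)   # exact-match indicator
        else:
            scores.append(1.0 - abs(x - y) / ranges[k])  # range-scaled
    return sum(scores) / len(scores) if scores else 0.0

# Two individuals over (systolic BP, BMI, family history of diabetes), with
# assumed observed ranges of 100 mmHg for BP and 40 kg/m2 for BMI.
print(gower([120, 27.5, 0], [130, 27.5, 0], {0: 100.0, 1: 40.0}, {2}))  # ~0.967
```

The score lands in [0, 1], but as the discussion notes, a value like 0.967 has no clinical interpretation of its own; it only orders individuals by likeness.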
6 Conclusion
Developing a conceptual model for matching individuals with the appropriate research
program is an important contributor to improving the research process and engaging individuals.
While research programs have selected individuals for participation in their programs for many
years, it is plausible to re-think this approach to improve matching of a study respondent and
researcher. Therefore, this paper proposes a conceptual model that automatically groups
individuals by filtering the data space, reducing the feature space with PCA, and
computing the likeness between individuals with similarity coefficients. An example was used to
simulate the conceptual model, and illustrate the effectiveness of filtering and PCA in reducing
the scale of the problem. Based on the results, two next steps include evaluation of the
conceptual model with a large-scale problem and temporal filtering to refine the matching.
References
[1] Blumenthal D. Launching HITECH. New England Journal of Medicine, vol. 362, no. 5,
February 2010, pp. 382-385.
[2] Halamka JD, Mandl KD, Tang PC. Early experiences with personal health records.
Journal of the American Medical Informatics Association, vol. 15, no. 1, Jan / Feb 2008, pp. 1-7.
[3] Krantz DH, Luce RD, Suppes P, and Tversky A. Foundations of Measurement: Volume
1, Additive and Polynomial Representations. Dover Publications, Mineola, NY, 1999.
[4] McCall RB. Fundamental Statistics for Psychology. 2nd Edition. Harcourt Brace
Jovanovich, Inc., New York, 1975, pp. 6-9.
[5] Friedman CP, Wyatt JC. Evaluation Methods in Medical Informatics. Springer-Verlag,
New York, 1997, pp. 107-108.
[6] Fodor IK. A Survey of Dimension Reduction Techniques. U.S. Department of Energy,
Lawrence Livermore National Laboratory, UCRL-ID-148494. May, 9, 2002.
[7] Dunteman GH. Principal Component Analysis, Series: Quantitative Applications in the
Social Sciences, Sage Publications, 1989, Newbury Park, CA.
[8] Gower JC. A general coefficient of similarity and some of its properties. Biometrics,
December 1971, vol. 27, pp. 857-874.
[9] Yin Y and Yasuda K. Similarity coefficient methods applied to cell formation problem: a
comparative investigation. Computers & Industrial Engineering, 2005, vol. 48, pp. 471-489.
[10] Reif, JC, Melchinger, AE, Frisch, M. Genetical and Mathematical Properties of Similarity
and Dissimilarity Coefficients Applied in Plant Breeding and Seed Bank Management Crop Sci,
2005, vol. 45, pp. 1-7.
[11] Willett P. Similarity-based virtual screening using 2D fingerprints. Drug Discovery
Today, December 2006, vol 11, no. 23/24, pp. 1046-1053.
[12] Kosman E., Leonard KJ. Similarity coefficients for molecular markers in studies of
genetic relationships between individuals for haploid, diploid, and ployploid species. Molecular
Ecology, 2005, vol. 14, pp. 415-424.
[13] Goodall DW. A new similarity index based on probability. Biometrics, December 1966,
pp. 882-907.
[14] Jaccard P. The distribution of the flora in the alpine zone. The New Phytologist, vol. XI,
no. 2, pp. 37-50, Feb. 1912.
[15] Aldenderfer MS and Blashfield RK. Cluster Analysis, Series: Quantitative Applications
in the Social Sciences. Series/Number 07-044. Newbury Park: Sage Publications, 1984.
[16] Tversky A. Features of Similarity. Psychological Review, July 1977, vol. 84, no. 4, pp.
327 – 352.
[17] National Cholesterol Education Program. Detection, Evaluation, and Treatment of High
Cholesterol in Adults (Adult Treatment Panel III): Executive Summary. U.S. Department of
Health and Human Services, NIH Publication No. 01-3670, May 2001, pp. 3.
http://www.nhlbi.nih.gov/guidelines/cholesterol/atp3xsum.pdf. Accessed on April 6, 2010.
[18] Pett MA, Lackey NR, Sullivan JJ. Making Sense of Factor Analysis: The Use of Factor
Analysis for Instrument Development in Health Care Research. Sage Publications, Thousand
Oaks, California, 2003.
[19] Goddard J and Kirby A. An introduction to factor analysis. Norwich, UK: Geo
Abstracts, 1976.
[20] The Practical Guide Identification, Evaluation, and Treatment of Overweight and Obesity
in Adults. U.S. Department of Health and Human Services, Public Health Service, National
Institutes of Health, National Heart, Lung, and Blood Institute. NIH Publication No. 00-4084.
October 2000. Available at http://www.nhlbi.nih.gov/guidelines/obesity/prctgd_c.pdf. Accessed
on January 4, 2011.
[21] About the National Health and Nutrition Examination Survey (NHANES). United States
Centers for Disease Control and Prevention, National Center for Health Statistics.
http://www.cdc.gov/nchs/nhanes/about_nhanes.htm. Accessed on April 6, 2010.
[22] National Health and Nutrition Examination Survey: 1999-2010 Survey Content. United
States Centers for Disease Control and Prevention, National Center for Health. Statistics.
http://www.cdc.gov/nchs/data/nhanes/survey_content_99_10.pdf. Accessed April 6, 2010.
[23] Brick JM and Kalton G. Handling missing data in survey research. Stat Methods Med
Res. September 1996, vol. 5, pp. 215-238.
[24] National Health and Nutrition Examination Survey: NHANES 1999-2000. Centers for
Disease Control and Prevention. http://www.cdc.gov/nchs/nhanes/nhanes1999-
2000/nhanes99_00.htm. Accessed on January 4, 2011.
[25] National Health and Nutrition Examination Survey: NHANES 2001-2002. Centers for
Disease Control and Prevention. http://www.cdc.gov/nchs/nhanes/nhanes2001-
2002/nhanes01_02.htm. Accessed on January 4, 2011.
[26] National Health and Nutrition Examination Survey: NHANES 2003-2004. Centers for
Disease Control and Prevention. http://www.cdc.gov/nchs/nhanes/nhanes2003-
2004/nhanes03_04.htm. Accessed on January 4, 2011.
[27] Diagnosis of Diabetes. National Institutes of Health, National Institute of Diabetes and
Digestive and Kidney Diseases. http://diabetes.niddk.nih.gov/dm/pubs/diagnosis/index.htm.
Accessed on January 4, 2011.
[28] National Diabetes Fact Sheet, 2007. Centers for Disease Control and Prevention.
http://www.cdc.gov/diabetes/pubs/pdf/ndfs_2007.pdf. Accessed on January 4, 2011.
[29] Medline Plus: Type 2 Diabetes - Risk Factors. National Institutes of Health, National
Library of Medicine. http://www.nlm.nih.gov/medlineplus/ency/article/002072.htm. Accessed
on January 4, 2011.
[30] Diabetes Health Center: Risk Factors for Diabetes. WedMD.
http://diabetes.webmd.com/risk-factors-for-diabetes. Accessed on January 4, 2011.
[31] Diabetes Basics: Symptoms. American Diabetes Association.
http://www.diabetes.org/diabetes-basics/symptoms/. Accessed on January 4, 2011.
[32] Living with Diabetes: Complications. American Diabetes Association.
http://www.diabetes.org/living-with-diabetes/complications/. Accessed on January 4, 2011.
[33] Complications of Diabetes. National Institutes of Health, National Institute of Diabetes
and Digestive and Kidney Diseases. http://diabetes.niddk.nih.gov/complications/. Accessed on
January 4, 2011.
[34] Diabetes Guide: Diabetes Testing. WebMD.
http://diabetes.webmd.com/guide/diagnosing-type-2-diabetes. Accessed on January 4, 2011.
[35] Mayfield, J. Diagnosis and Classification of Diabetes Mellitus: New Criteria. American
Family Physician. http://www.aafp.org/afp/981015ap/mayfield.html. Accessed on January 4,
2011.
[36] American Diabetes Association. Position Statement: Diagnosis and Classification of
Diabetes Mellitus. Diabetes Care. Volume 27, Supplement 1, January 2004, pp. s5-s10.
http://care.diabetesjournals.org/content/27/suppl_1/s5.full.pdf+html. Accessed on January 4,
2011.
[37] Diabetes. Lab Tests Online.
http://www.labtestsonline.org/understanding/conditions/diabetes-6.html. Accessed on January 4,
2011.
[38] Healthy Weight - it's not a diet, it's a lifestyle!: About BMI for Adults.
http://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html. Accessed on January 4,
2011.
[39] Weight-control Information Network: Weight and Waist Measurement: Tools for Adults.
National Institutes of Health, National Institute of Diabetes and Digestive and Kidney Diseases.
http://www.win.niddk.nih.gov/publications/tools.htm#circumf. Accessed on January 4, 2011.
[40] Medline Plus: High Blood Pressure. National Institutes of Health, National Library of
Medicine. http://www.nlm.nih.gov/medlineplus/highbloodpressure.html. Accessed on January 4,
2011.
[41] Tietz NW. Clinical Guide to Laboratory Tests. 3rd Edition. Edited by Norbert W.
Tietz. W. B. Saunders Company, Philadelphia, 1995.
[42] Diabetes Health Center: Blood Glucose. WebMD. http://diabetes.webmd.com/blood-
glucose?page=3. Accessed on January 4, 2011.
[43] Diabetes Health Center: Microalbumin Urine Test. WebMD.
http://diabetes.webmd.com/microalbumin-urine-test?page=2. Accessed on January 4, 2011.
[44] Diabetes Health Center: Hyperglycemia and Diabetes. WebMD.
http://diabetes.webmd.com/diabetes-hyperglycemia. Accessed on January 4, 2011.
APPENDIX A
Table 1. Percentage-based and clinically relevant cut-point tolerance ranges for type 2 diabetes measures.

| Variable | CMV Value | Clinically relevant cut-point | τ = 5% (Min - Max) | τ = 10% (Min - Max) | τ = 25% (Min - Max) | τ = 50% (Min - Max) |
| BMXBMI | 27.5 | ≥ 25 is overweight, thus 25 - 29.9 is used [20] | 26.2 - 28.9 | 24.8 - 30.3 | 20.6 - 34.4 | 13.7 - 41.3 |
| BMXWAIST | 101 | Higher risk category is ≥ 88 for women and ≥ 101 for men, thus ≥ 88 is used [38, 39] | 95.9 - 106.0 | 90.9 - 111.1 | 75.7 - 126.2 | 50.5 - 151.5 |
| BPXSY1 | 120 | ≤ 120 is normal [40] | 114 - 126 | 108 - 132 | 90 - 150 | 60 - 180 |
| LBXGH | 7.4 | > 5.2% [41] | 7.0 - 7.77 | 6.6 - 8.14 | 5.5 - 9.25 | 3.7 - 11.1 |
| LBXGLU | 178.4 | > 99 is abnormal [42] | 169.4 - 187.3 | 160.5 - 196.2 | 133.8 - 223 | 89.2 - 267.6 |
| LBXTC | 167 | < 200 is normal [41] | 158.6 - 175.3 | 150.3 - 183.7 | 125.2 - 208.7 | 83.5 - 250.5 |
| LBDHDL | 32 | < 35 is at risk [41] | 30.4 - 33.6 | 28.8 - 35.2 | 24 - 40 | 16 - 48 |
| LBXTR | 218 | < 250 is desirable [41] | 207.1 - 228.9 | 196.2 - 239.8 | 163.5 - 272.5 | 109 - 327 |
| LBDLDL | 92 | < 130 is desirable [41] | 87.4 - 96.6 | 82.8 - 101.2 | 69 - 115 | 46 - 138 |
| URXUMA | 26.4 | ≥ 20 is abnormal [43] | 25.0 - 27.7 | 23.7 - 29.0 | 19.8 - 33 | 13.2 - 39.6 |
| LBXSGL | 179 | > 180 is abnormal [44] | 170.0 - 187.9 | 161.1 - 196.9 | 134.2 - 223.7 | 89.5 - 268.5 |
| LBXSPH | 2.6 | Reference range is 2.8 - 4.1 for women and 2.3 - 3.7 for men; thus 2.3 - 4.1 is used [41] | 2.4 - 2.7 | 2.34 - 2.8 | 1.95 - 3.25 | 1.3 - 3.9 |
| LBXSKSI | 3.8 | Reference range is 3.5 - 5.1 [41] | 3.6 - 4.0 | 3.5 - 4.27 | 2.91 - 4.86 | 1.9 - 5.8 |
| LBXHGB | 17 | Reference range is 11.7 - 16.0 for women and 13.1 - 17.2 for men; thus 11.7 - 17.2 is used [41] | 16.1 - 17.8 | 15.3 - 18.7 | 12.7 - 21.2 | 8.5 - 25.5 |
| LBXHCT | 51.1 | Reference range is 35 - 47 for women and 39 - 50 for men; thus 35 - 50 is used [41] | 48.5 - 53.6 | 45.9 - 56.2 | 38.3 - 63.8 | 25.5 - 76.6 |
| DIQ080 | 1 | | 1 - 1 | 1 - 1 | 1 - 1 | 1 - 1 |
| DID060MN | 0 | | 0 - 0 | 0 - 0 | 0 - 0 | 0 - 0 |
| FAMDIA | 0 | | 0 - 0 | 0 - 0 | 0 - 0 | 0 - 0 |
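The percentage-based ranges in Table 1 appear to follow a simple rule: for a candidate measure value v and tolerance τ, the range is v·(1 − τ) to v·(1 + τ). A minimal sketch of that rule (the helper name is illustrative, not from the paper's code):

```python
def tolerance_range(cmv, tau):
    """Percentage-based tolerance range around a candidate measure value (CMV)."""
    return cmv * (1.0 - tau), cmv * (1.0 + tau)

# BMXBMI example from Table 1 (CMV = 27.5)
for tau in (0.05, 0.10, 0.25, 0.50):
    lo, hi = tolerance_range(27.5, tau)
    print(f"tau = {tau:.0%}: {lo:.2f} - {hi:.2f}")
```

Small discrepancies against the printed table (e.g. 26.12 here vs. 26.2 in Table 1) reflect rounding in the original.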
| # | Measure | Variable (cycle 1) | Variable (cycle 2) | Variable (cycle 3) | Notes |

Questionnaire Data:
| 7 | How long taking insulin | DIQ060U/Q | DIQ060U/Q | DIQ060U/Q | Recoded to months on insulin, which is measure variable name DID060MN |
| 8 | Take diabetic pills to lower blood sugar | DIQ070 | DIQ070 | DIQ070 | |
| 9 | Diabetes affected eyes / had retinopathy | DIQ080 | DIQ080 | DIQ080 | |
| 10 | Ulcers / sores not healed within 4 weeks | DIA090 | DIA090 | DIA090 | Merged into 1 data item reflecting pain / numbness / tingling |
| | Numbness in hands / feet past 3 months | DIQ100 | DIQ100 | DIQ100 | |
| | Numbness in hands / feet or both | DIQ110 | DIQ110 | DIQ110 | |
| | Pain in hands / feet past 3 months | DIQ120 | DIQ120 | DIQ120 | |
| | Where was pain or tingling | DIQ130 | DIQ130 | DIQ130 | |
| | Pain in either leg while walking | DIQ140 | DIQ140 | DIQ140 | |
| | Pain in calf or calves | DIQ150 | DIQ150 | DIQ150 | |
| 11 | Mother with diabetes | MCQ260AA | MCQ260AA | MCQ260AA | Merged into 1 data item reflecting family history of diabetes |
| | Father with diabetes | MCQ260AB | MCQ260AB | MCQ260AB | |
| | Mat. grandmother with diabetes | MCQ260AC | MCQ260AC | MCQ260AC | |
| | Mat. grandfather with diabetes | MCQ260AD | MCQ260AD | MCQ260AD | |
| | Pat. grandmother with diabetes | MCQ260AE | MCQ260AE | MCQ260AE | |
| | Pat. grandfather with diabetes | MCQ260AF | MCQ260AF | MCQ260AF | |
| | Brother with diabetes | MCQ260AG | MCQ260AG | MCQ260AG | |
| | Sister with diabetes | MCQ260AH | MCQ260AH | MCQ260AH | |
| | Other relative with diabetes | MCQ260AI | MCQ260AI | MCQ260AI | |
| 12 | Mother with hypertension | MCQ260FA | MCQ260FA | MCQ260FA | Merged into 1 data item reflecting family history of hypertension |
| | Father with hypertension | MCQ260FB | MCQ260FB | MCQ260FB | |
| | Mat. grandmother with hypertension | MCQ260FC | MCQ260FC | MCQ260FC | |
| | Mat. grandfather with hypertension | MCQ260FD | MCQ260FD | MCQ260FD | |
| | Pat. grandmother with hypertension | MCQ260FE | MCQ260FE | MCQ260FE | |
| | Pat. grandfather with hypertension | MCQ260FF | MCQ260FF | MCQ260FF | |
| | Brother with hypertension | MCQ260FG | MCQ260FG | MCQ260FG | |
| | Sister with hypertension | MCQ260FH | MCQ260FH | MCQ260FH | |
| | Other relative with hypertension | MCQ260FI | MCQ260FI | MCQ260FI | |
| 13 | Told to take medicine for BP | BPQ040A | BPQ040A | BPQ040A | |

Laboratory Data:
| 14 | Glycohemoglobin | LBXGH | LBXGH | LBXGH | |
| 15 | High Density Lipoprotein | LBDHDL | LBDHDL | LBXHDD | |
| 16 | Hematocrit | LBXHCT | LBXHCT | LBXHCT | |
| 17 | Hemoglobin | LBXHGB | LBXHGB | LBXHGB | |
| 18 | Hepatitis C | LBDHCV | LBDHCV | LBDHCV | |
| 19 | Insulin | LBXIN | LBXIN | LBXIN | |
| 20 | Low Density Lipoprotein | LBDLDL | LBDLDL | LBDLDL | |
| 21 | Phosphorus | LBXSPH | LBDSPH | LBXSPH | |
| 22 | Plasma Glucose | LBXGLU | LBXGLU | LBXGLU | |
| 23 | Potassium | LBXSKSI | LBXSKSI | LBXSKSI | |
| 24 | Serum Glucose | LBXSGL | LBXSGL | LBXSGL | |
| 25 | Total Cholesterol | LBXTC | LBXTC | LBXTC | |
| 26 | Triglyceride | LBXTR | LBXTR | LBXTR | |
| 27 | Urine Albumin | URXUMA | URXUMA | URXUMA | |
| 28 | White Blood Cell Count | LBXWBCSI | LBXWBCSI | LBXWBCSI | |
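Because several measures change variable names between survey cycles (e.g. Phosphorus appears as LBDSPH in one cycle and LBXSPH in the others, and High Density Lipoprotein becomes LBXHDD), records must be renamed to a single harmonized name before analysis. A sketch of one way to do this (the mapping covers only two measures, and the dict and function names are illustrative, not the paper's code):

```python
# Map each harmonized measure name to its per-cycle NHANES variable names
# (cycle order matches the columns of the appendix table).
HARMONIZED = {
    "Phosphorus": ("LBXSPH", "LBDSPH", "LBXSPH"),
    "High Density Lipoprotein": ("LBDHDL", "LBDHDL", "LBXHDD"),
}

def harmonize(record, cycle):
    """Rename cycle-specific variables in a record to harmonized measure names."""
    out = {}
    for measure, names in HARMONIZED.items():
        name = names[cycle]
        if name in record:
            out[measure] = record[name]
    return out

# A cycle-3 record (index 2) stores HDL under LBXHDD:
print(harmonize({"LBXHDD": 32}, 2))  # {'High Density Lipoprotein': 32}
```

Merging the family-history items (rows 11 and 12) into single indicator variables such as FAMDIA can then be done on the harmonized records.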