2. Q:1
Clean the data. Remove zombies (visits from dead patients), diagnoses
before birth (negative age at diagnosis), patients with more than 365 diagnoses per
year of age, diagnoses after age 110, records with no age at diagnosis or no diagnosis,
and patients with less than 365 days between first and last diagnosis. Report the
number of patients and diagnoses excluded from the study as you cleaned the data.
3. Starting data: 17,443,442 Dx
- Removed wrong data (patients who had visits after their death): 168 IDs
- Removed patients who had > 365 Dx per year: 56 IDs
- Removed patients who had < 365 days between first and last Dx: 829,632 IDs
- Removed from AgeAtDx:
  1. Null values
  2. Negative values
  3. AgeAtDx >= 110
  4. ICD9 is NULL
Result (#DATACLN): 17,379,218 Dx
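The cleaning rules above can be sketched in Python as follows. This is a minimal illustration, not the SQL used for the slides: the record keys (`id`, `icd9`, `age_at_dx`, `dx_date`) and the `death_dates` lookup are hypothetical, visits after death are dropped row by row, and "diagnoses per year of age" is assumed to mean the diagnosis count divided by the patient's oldest recorded age.

```python
from datetime import date

def clean(records, death_dates):
    """Apply the slide's exclusion rules to a list of diagnosis records.

    records: dicts with hypothetical keys id, icd9, age_at_dx, dx_date.
    death_dates: dict mapping patient id -> date of death (absent if alive).
    """
    # Row-level rules: missing code, missing/negative/implausible age,
    # and visits dated after the patient's death ("zombies").
    rows = [
        r for r in records
        if r["icd9"] is not None
        and r["age_at_dx"] is not None
        and 0 <= r["age_at_dx"] < 110
        and not (r["id"] in death_dates and r["dx_date"] > death_dates[r["id"]])
    ]

    # Group the surviving rows by patient.
    by_id = {}
    for r in rows:
        by_id.setdefault(r["id"], []).append(r)

    # Patient-level rules: too short a history, or too many diagnoses per year.
    kept = []
    for dx in by_id.values():
        dates = [d["dx_date"] for d in dx]
        span_days = (max(dates) - min(dates)).days
        years = max(max(d["age_at_dx"] for d in dx), 1)  # assumption: oldest age
        if span_days < 365 or len(dx) / years > 365:
            continue
        kept.extend(dx)
    return kept
```

Counting the rows and distinct IDs before and after each rule would give the exclusion numbers the question asks for.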
4. Q:2
For patients who have diabetes, identify the average age of diabetes and standard
deviation of the age of diabetes. Select a baseline period (e.g. diagnoses before 50)
that is relatively diabetes free. Exclude patients who had diabetes during the
baseline period. Report the baseline period and exclusions
5. Calculate the AVG and STDEV of the age of diabetes for
patients with diabetes
AvgAge   SdDevAge
62.2     12.4
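A minimal Python sketch of this calculation is shown below. It assumes diabetes is identified by ICD-9 codes beginning with `250` and that the sample standard deviation is wanted; the slides do not state either choice, and the record keys are hypothetical.

```python
from statistics import mean, stdev

def first_dm_age(records):
    """Return {patient id: youngest age at a diabetes diagnosis}."""
    first = {}
    for r in records:
        # Assumption: ICD-9 codes starting with "250" mark diabetes mellitus.
        if r["icd9"].startswith("250"):
            pid, age = r["id"], r["age_at_dx"]
            first[pid] = min(age, first.get(pid, age))
    return first

def dm_age_stats(first):
    """Average and sample standard deviation of age at first diabetes."""
    ages = list(first.values())
    return mean(ages), stdev(ages)
```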
6.
- Baseline period is when AgeAtFirstDM <= 50.
- Patients excluded so that the baseline period is relatively free of DM:
  40,432 IDs
- Patients with diabetes at age greater than 50, after removing the
  NULL values:
  4,519,842 Dx
  194,157 IDs
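The baseline exclusion can be sketched as a set operation. The function below is illustrative only: `first_dm` is a hypothetical mapping from patient ID to age at first diabetes diagnosis, and the cutoff of 50 is the baseline chosen on this slide.

```python
BASELINE_AGE = 50  # baseline period from the slide: AgeAtFirstDM <= 50

def apply_baseline(all_ids, first_dm):
    """Split patients into (kept, excluded) by diabetes during the baseline.

    all_ids: iterable of patient ids in the cleaned data.
    first_dm: dict mapping patient id -> age at first diabetes diagnosis;
              patients without diabetes are simply absent.
    """
    excluded = {pid for pid, age in first_dm.items() if age <= BASELINE_AGE}
    kept = set(all_ids) - excluded
    return kept, excluded
```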
7. Q:3
Calculate the likelihood ratio of diabetes for each diagnosis from a 90% randomly
selected training set. The medical history for diabetic patients is any diagnosis that
precedes diabetes in the baseline period. For non-diabetic patients, it is any
diagnosis they had in the baseline period. In this analysis we are
doing a case-control study, where cases are patients with diabetes and controls are
patients without diabetes.
8. Training Set and Validation Set
TS (90%)
WHERE Rand(ID) <= .9
15,566,207 Dx
743,734 IDs
VS (10%)
FROM #DATACLN a LEFT JOIN
#TrainID b
1,813,011 Dx
85,843 IDs
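A rough Python analogue of the split is sketched below. It assigns whole patients (not individual diagnoses) to one set, using a pseudo-random draw seeded by the patient ID so the assignment is repeatable. The function name and the integer-ID assumption are illustrative; the slides only show the SQL fragments above.

```python
import random

def split_ids(ids, train_fraction=0.9):
    """Assign each patient id to the training or validation set.

    Seeding the generator with the id makes the draw deterministic,
    so a patient always lands in the same set.
    """
    train, valid = set(), set()
    for pid in ids:
        draw = random.Random(pid).random()  # uniform in [0, 1)
        (train if draw <= train_fraction else valid).add(pid)
    return train, valid
```

Splitting by patient rather than by diagnosis keeps all of a patient's history on one side, which avoids leaking validation patients into the training likelihood ratios.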
9. Likelihood Ratio
$$
LR = \frac{\text{Number of patients with Dx and DM} \,/\, \text{Total number of patients with DM}}{\text{Number of patients with Dx and not DM} \,/\, \text{Total number of patients without DM}}
$$
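The ratio can be computed per diagnosis code with two counters, as in this hedged sketch. The input shapes (`history`, `has_dm`) are hypothetical, and codes that never occur in one of the two groups are skipped because the slides do not say how a zero count was handled.

```python
from collections import Counter

def likelihood_ratios(history, has_dm):
    """Likelihood ratio of diabetes for each diagnosis code.

    history: dict mapping patient id -> set of baseline diagnosis codes.
    has_dm: dict mapping patient id -> True for cases, False for controls.
    """
    n_dm = sum(1 for pid in history if has_dm[pid])
    n_no = len(history) - n_dm

    # Count patients (not visits) carrying each code, by group.
    with_dm, without_dm = Counter(), Counter()
    for pid, codes in history.items():
        (with_dm if has_dm[pid] else without_dm).update(set(codes))

    lrs = {}
    for code in set(with_dm) | set(without_dm):
        if with_dm[code] == 0 or without_dm[code] == 0:
            continue  # LR would be 0 or undefined; handling is an open choice
        lrs[code] = (with_dm[code] / n_dm) / (without_dm[code] / n_no)
    return lrs
```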
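12. Q:4
Use the likelihood ratios for the medical history to predict the probability of diabetes in
the 10% set-aside validation set.
$$
\text{Probability of DM} = \frac{\text{Posterior Odds}}{1 + \text{Posterior Odds}}
$$
$$
\text{Posterior Odds} = \text{Prior Odds} \times \prod_{i} LR_i
$$
where the product runs over the likelihood ratios of the diagnoses in the patient's medical history.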
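As a worked illustration of the odds form of Bayes' theorem used here, the sketch below multiplies the prior odds by the likelihood ratios of a patient's diagnoses and converts the result back to a probability. The function name and the choice of prior (for example, the prevalence of diabetes in the training set) are assumptions, not taken from the slides.

```python
from math import prod

def probability_of_dm(prior_probability, patient_lrs):
    """Posterior probability of diabetes from a prior and a list of LRs."""
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = prior_odds * prod(patient_lrs)  # empty list leaves odds unchanged
    return posterior_odds / (1 + posterior_odds)
```

For example, a prior of 0.2 (odds 0.25) combined with two diagnoses whose likelihood ratios are both 2 gives posterior odds of 1, that is, a probability of 0.5.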
16. Sensitivity: The true positive rate. It measures the proportion of positives that
are correctly identified as such (i.e., the percentage of sick people, with DM,
who are correctly identified as having the condition).
Sensitivity refers to the test's ability to correctly detect patients who do have
the condition.
17. Specificity: The true negative rate. It measures the proportion of negatives that
are correctly identified as such (i.e., the percentage of healthy people, who do not
have DM, who are correctly identified as not having the condition).
Equivalently, the specificity of a test is the proportion of healthy patients, known
not to have the disease, who test negative for it.
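Both rates follow directly from the four cell counts of a 2x2 table. The helper names below are illustrative.

```python
def sensitivity(tp, fn):
    """True positive rate: share of patients with DM who test positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: share of patients without DM who test negative."""
    return tn / (tn + fp)
```

With 80 true positives and 20 false negatives, sensitivity is 0.8; with 90 true negatives and 10 false positives, specificity is 0.9.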
18. Contingency Table
A contingency table is a type of table in a matrix format that
displays the (multivariate) frequency distribution of the
variables. Contingency tables are heavily used in survey research, business
intelligence, engineering and scientific research. They provide
a basic picture of the interrelation between two variables and
can help find interactions between them.
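For this study, the two variables are the predicted and the actual diabetes status. A minimal sketch of building that 2x2 table from predicted probabilities is given below; the function name and the `cutoff` parameter are assumptions for illustration.

```python
def contingency_table(probabilities, actual, cutoff):
    """Count TP, FP, TN, FN for predictions thresholded at `cutoff`.

    probabilities: predicted probability of DM per patient.
    actual: True if the patient really has DM, in the same order.
    """
    table = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for p, has_dm in zip(probabilities, actual):
        predicted = p >= cutoff
        # First letter: was the prediction right? Second: what was predicted?
        key = ("T" if predicted == has_dm else "F") + ("P" if predicted else "N")
        table[key] += 1
    return table
```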
20. Receiver Operating Characteristic
(ROC) Curve
The graph at right shows the number of patients
with and without a disease, arranged according to the
value of a diagnostic test. The distributions overlap:
the test (like most) does not distinguish normal from
disease with 100% accuracy. The area of overlap
indicates where the test cannot distinguish normal
from disease.
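An ROC curve can be traced by sweeping the cutpoint over the predicted probabilities and recording the false positive and true positive rates at each step. The sketch below is a simple, unoptimized illustration that assumes both classes are present; the area under the curve is approximated with the trapezoid rule.

```python
def roc_points(probabilities, actual):
    """Return (false positive rate, true positive rate) pairs, one per cutpoint."""
    pairs = list(zip(probabilities, actual))
    positives = sum(1 for _, a in pairs if a)
    negatives = len(pairs) - positives
    points = [(0.0, 0.0)]  # cutpoint above every score: nothing flagged
    for cut in sorted(set(probabilities), reverse=True):
        tp = sum(1 for p, a in pairs if p >= cut and a)
        fp = sum(1 for p, a in pairs if p >= cut and not a)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoid-rule area under a list of ROC points ordered by cutpoint."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

An area of 0.5 corresponds to a test that cannot separate the two groups at all, and an area of 1.0 to perfect separation.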
22. Q:6
Is the model accurate enough to guide individual patients? Public policy? Prepare a
presentation about the value of predictive medicine and your effort.
23. Conclusion
- Predictive medicine is a branch of medicine that aims to identify patients at risk
  of developing a disease, thereby enabling either prevention or early treatment of
  that disease.
- To guide individual patients: use the predictive value, the likelihood that a given
  test result correlates with the presence or absence of disease.
- For public policy: use probability, which describes how likely it is that an
  event will occur.
Editor's Notes
Values between 0 and 1 decrease the probability of disease, and values greater than 1 increase the probability of disease.
The first thing to realize about LRs is that an LR > 1 indicates an increased probability that the target disorder is present, and an LR < 1 indicates a decreased probability that the target disorder is present. Correspondingly, an LR = 1 means that the test result does not change the probability of disease at all.
Probability and odds are two basic statistical terms that describe how likely it is that an event will occur. Probability is defined as the fraction of desired outcomes out of every possible outcome, with a value between 0 and 1, where 0 would be an impossible event and 1 would represent an inevitable event.
Odds can have any value from zero to infinity, and they represent a ratio of desired outcomes versus the field. Odds are a ratio and can be given in two different ways: "odds in favor" and "odds against". "Odds in favor" describe the odds that an event will occur, while "odds against" describe the odds that an event will not occur.
Likelihood ratios can be used to compute the post-test probability of disease. They are more useful than sensitivity and specificity in that they can be used for diagnostic tests with more than two results, they can more easily be applied to a series of diagnostic tests, their values convey intuitive meaning, and the likelihood ratio form of Bayes' theorem is easier to remember.
Pre-test probability and post-test probability (alternatively spelled pretest and posttest probability) are the probabilities of the presence of a condition (such as a disease) before and after a diagnostic test, respectively. Post-test probability, in turn, can be positive or negative, depending on whether the test falls out as a positive test or a negative test, respectively. In some cases, it is used for the probability of developing the condition of interest in the future.
Test, in this sense, can refer to any medical test (but usually in the sense of diagnostic tests), and in a broad sense also including questions and even assumptions (such as assuming that the target individual is a female or male). The ability to make a difference between pre- and post-test probabilities of various conditions is a major factor in the indication of medical tests.
When many patients with disease have a negative test (false negatives), the sensitivity (SN) decreases: the test fails to identify patients who have the disease.
When many disease-free patients have a positive test (false positives), the specificity (SP) decreases, which is also undesirable.
True positive: Sick people correctly identified as sick
False positive: Healthy people incorrectly identified as sick
True negative: Healthy people correctly identified as healthy
False negative: Sick people incorrectly identified as healthy
We choose a cutpoint (indicated by the vertical black line) above which we consider the test to be abnormal and below which we consider the test to be normal. The position of the cutpoint will determine the number of true positive, true negatives, false positives and false negatives. We may wish to use different cutpoints for different clinical situations if we wish to minimize one of the erroneous types of test results.
We plot the true positive rate against the false positive rate for the different possible cutpoints of a diagnostic test.
The accuracy of the test depends on how well the test separates the group being tested into those with and without the disease in question.
Predictive value is an expression of the likelihood that a given test result correlates with the presence or absence of disease. A positive predictive value is the ratio of patients with the disease who test positive to the entire population of those with a positive test result; a negative predictive value is the ratio of patients without the disease who test negative to the entire population of those with a negative test result.
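In terms of the four cell counts, the two predictive values can be computed as in this small sketch (the function name is illustrative):

```python
def predictive_values(tp, fp, tn, fn):
    """Return (positive predictive value, negative predictive value)."""
    ppv = tp / (tp + fp)  # patients with disease among all positive tests
    npv = tn / (tn + fn)  # patients without disease among all negative tests
    return ppv, npv
```

Unlike sensitivity and specificity, both values depend on how common the disease is in the tested population, so they should be computed on data with a realistic prevalence.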