
- 1. Clinical Trials Versus Health Outcomes Research: SAS/STAT Versus SAS Enterprise Miner
  - Patricia B. Cerrito, University of Louisville
- 2. Objectives
  - To examine some issues with traditional statistical models and their basic assumptions
  - To examine the Central Limit Theorem and its necessity in statistical models
  - To look at the differences and similarities between clinical trials and health outcomes research
- 3. Surrogate Versus Real Endpoints
  - Because clinical trials tend to be short term, they use high-risk patients and surrogate endpoints
  - Use of statins reduces cholesterol levels, but does it increase longevity and disease-free survival?
  - Health outcomes data can examine real endpoints from the general population
- 4. One Versus Many Endpoints
  - Clinical trials generally have one survival endpoint: time to recurrence, time to death, or time to disease progression
  - Health outcomes research can examine multiple endpoints simultaneously using survival data mining
- 5. Homogeneous Versus Heterogeneous Data
  - Clinical trials generally use inclusion/exclusion criteria to define a homogeneous sample
  - Health outcomes research has to rely upon heterogeneous data
    - Populations tend to follow gamma rather than normal distributions, and this must be taken into consideration
- 6. Large Versus Small Samples
  - Clinical trials tend to use the smallest sample possible to achieve the desired power
    - The database is designed for analysis and the data are very clean
  - Health outcomes research has an abundance of data and variables
    - Power is not an issue
    - Data are very messy and require considerable preprocessing
- 7. Rare Occurrences
  - Clinical trials are not large enough to find all potential rare occurrences
  - Health outcomes research has enough data to find rare occurrences and to predict the probability of occurrence
    - Requires modifications to standard linear models
    - Predictive modeling is much better at actual prediction
- 8. Example 1
  - Ottenbacher, Kenneth J.; Ottenbacher, Heather R.; Tooth, Leigh; Ostir, Glenn V.
  - A review of two journals found that articles using multivariable logistic regression frequently did not report commonly recommended assumptions. Journal of Clinical Epidemiology. 57(11):1147-52, 2004 Nov.
- 9. Example 1
  - Statistical significance testing or confidence intervals were reported in all articles. Methods for selecting independent variables were described in 82%, and specific procedures used to generate the models were discussed in 65%.
- 10. Example 1
  - Fewer than 50% of the articles indicated if interactions were tested or met the recommended events per independent variable ratio of 10:1.
  - Fewer than 20% of the articles described conformity to a linear gradient, examined collinearity, reported information on validation procedures, goodness-of-fit, discrimination statistics, or provided complete information on variable coding.
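The events-per-variable guideline cited in Example 1 is simple arithmetic. The following is a minimal Python sketch rather than anything from the slides (which use SAS); the function names and the counts are hypothetical, and it assumes the common convention of counting the rarer outcome class as the "events".

```python
def events_per_variable(n_events: int, n_nonevents: int, n_predictors: int) -> float:
    """Events in the rarer outcome class per candidate predictor (assumed convention)."""
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    return min(n_events, n_nonevents) / n_predictors


def meets_epv_guideline(n_events: int, n_nonevents: int, n_predictors: int,
                        minimum: float = 10.0) -> bool:
    """True when the model has at least `minimum` events per predictor (10:1 by default)."""
    return events_per_variable(n_events, n_nonevents, n_predictors) >= minimum


# Hypothetical study: 120 deaths, 880 survivors, 15 candidate predictors.
epv = events_per_variable(120, 880, 15)
print(f"events per variable: {epv:.1f}")
print("meets 10:1 guideline:", meets_epv_guideline(120, 880, 15))
```

With these made-up counts the model would need to drop to 12 or fewer predictors to reach the 10:1 ratio.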
- 11. Example 2
  - Brown, James M.; O'Brien, Sean M.; Wu, Changfu; Sikora, Jo Ann H.; Griffith, Bartley P.; Gammie, James S.
  - Isolated aortic valve replacement in North America comprising 108,687 patients in 10 years: changes in risks, valve types, and outcomes in the Society of Thoracic Surgeons National Database. Journal of Thoracic & Cardiovascular Surgery. 137(1):82-90, 2009 Jan.
- 12. Example 2
  - 108,687 isolated aortic valve replacements were analyzed. Time-related trends were assessed by comparing distributions of risk factors, valve types, and outcomes in 1997 versus 2006.
  - Differences in case mix were summarized by comparing average predicted mortality risks with a logistic regression model.
  - Differences across subgroups and time were assessed.
- 13. Example 2
  - RESULTS: There was a dramatic shift toward use of bioprosthetic valves.
  - Aortic valve replacement recipients in 2006 were older (mean age 65.9 vs 67.9 years, P < .001) with higher predicted operative mortality risk (2.75 vs 3.25, P < .001).
  - Observed mortality and permanent stroke rate fell (by 24% and 27%, respectively).
- 14. Example 2
  - Female sex, age older than 70 years, and ejection fraction less than 30% were all related to higher mortality, higher stroke rate and longer postoperative stay.
  - There was a 39% reduction in mortality with preoperative renal failure.
- 15. Central Limit Theorem
  - As the sample size $n$ increases to infinity, the distribution of the sample average approaches a normal distribution with mean $\mu$ and variance $\sigma^2/n$.
  - As $n$ approaches infinity, the variance approaches zero.
  - Therefore, if $n$ is too large, the distribution of the sample average starts to look like a vertical line at the point $\mu$.
- 16. Central Limit Theorem
  - In addition, the sample mean is very susceptible to the influence of outliers.
  - Moreover, the confidence limits are defined based upon the assumption of normality and symmetry. Therefore, the existence of many outliers will skew the confidence interval.
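To make the point about outliers concrete, here is a small Python sketch (not from the slides) using made-up length-of-stay values. It assumes the usual normal-approximation interval, mean ± 1.96 × s/√n, and the helper name `normal_ci` is hypothetical.

```python
import math
import statistics


def normal_ci(values, z=1.96):
    """Normal-approximation confidence interval for the mean: mean +/- z * s / sqrt(n)."""
    mean = statistics.fmean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return mean - half_width, mean + half_width


# Hypothetical lengths of stay in days; the second list adds one extreme stay.
clean = [2, 3, 3, 4, 4, 5, 5, 6]
with_outlier = clean + [120]

for label, values in (("clean", clean), ("with outlier", with_outlier)):
    low, high = normal_ci(values)
    print(f"{label}: mean={statistics.fmean(values):.2f} "
          f"median={statistics.median(values)} CI=({low:.2f}, {high:.2f})")
```

In this example a single extreme stay roughly quadruples the mean while leaving the median unchanged, and the symmetric interval extends below zero, which is impossible for a length of stay.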
- 17. Nonparametric Statistics
  - Many nonparametric models (the Wilcoxon signed-rank test, for example) still require symmetry.
  - Many populations are highly skewed, so these models also have problems.
- 18. Dataset
  - We use data from the National Inpatient Sample from 2005
  - A stratified sample from 1000 hospitals from 37 states
  - Approximately 8 million inpatient stays
- 19. Distribution of Patient Stays
- 20. Normal Estimate
- 21. Kernel Density Estimation
  - Instead of assuming that the population follows a known distribution, we can estimate it.
  - Kernel density estimation is an excellent method to use to do this.
- 22. Kernel Density Estimation
- 23. Proc KDE
  - proc kde data=nis.diabetesless50los;
  - univar los / gridl=0 gridu=50 method=srot out=nis.kde50 bwm=3;
  - run;
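For readers without SAS, the idea behind the PROC KDE call can be sketched in plain Python. This is an illustration under stated assumptions, not a reimplementation of PROC KDE: it uses a Gaussian kernel, the textbook form of Silverman's rule of thumb for the bandwidth (which may differ in detail from what METHOD=SROT computes), a multiplier that plays the role of BWM=, and a grid from 0 to 50 mirroring GRIDL= and GRIDU=. The data values and function names are hypothetical.

```python
import math
import statistics


def silverman_bandwidth(data):
    """Textbook Silverman rule of thumb: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    n = len(data)
    sd = statistics.stdev(data)
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    return 0.9 * spread * n ** (-1 / 5)


def gaussian_kde(data, grid, bandwidth_multiplier=1.0):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    h = silverman_bandwidth(data) * bandwidth_multiplier
    scale = 1.0 / (len(data) * h * math.sqrt(2 * math.pi))
    return [
        scale * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
        for x in grid
    ]


# Hypothetical, right-skewed lengths of stay in days.
los = [1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 9, 14, 21, 30]
grid = [i * 0.5 for i in range(101)]  # 0 to 50 in steps of 0.5
density = gaussian_kde(los, grid)  # rule-of-thumb bandwidth
density_smooth = gaussian_kde(los, grid, bandwidth_multiplier=3.0)  # analogous to bwm=3
```

A larger multiplier widens each kernel, so the estimate becomes smoother and its peak lower. Because a Gaussian kernel places some mass below zero, the area over the 0 to 50 grid is slightly less than one.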
- 24. Kernel Estimate of Length of Stay
- 25. Sampling from NIS
  - Given that the National Inpatient Sample has 8 million records, we can consider it to be an infinite population. Therefore, we can sample from this population to see if it can be estimated by the Central Limit Theorem.
  - We start with extracting 100 different samples of size N=5.
- 26. Examine Central Limit Theorem
  - proc surveyselect data=nis.nis_205 out=work.samples method=srs n=5 rep=100 noprint;
  - run;
  - proc means data=work.samples noprint;
  - by replicate;
  - var los;
  - output out=out mean=mean;
  - run;
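The same resampling experiment can be sketched in Python for readers who want to try it without the NIS data. Everything here is an assumption made for illustration: a synthetic gamma-distributed population stands in for length of stay, `replicate_means` is a hypothetical helper, and simple random sampling without replacement plays the role of METHOD=SRS.

```python
import random
import statistics


def replicate_means(population, sample_size, replicates, rng):
    """Draw `replicates` simple random samples and return the mean of each one."""
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(replicates)
    ]


rng = random.Random(2005)
# Synthetic right-skewed population (gamma with shape 1.5, scale 3.0), not NIS data.
population = [rng.gammavariate(1.5, 3.0) for _ in range(100_000)]

spread = {}
for n in (5, 30, 100, 1000):
    means = replicate_means(population, n, 100, rng)
    spread[n] = statistics.stdev(means)
    print(f"n={n}: mean of sample means={statistics.fmean(means):.2f}, "
          f"sd of sample means={spread[n]:.3f}")
```

The spread of the sample means shrinks roughly like σ/√n as n grows, even though the population itself stays just as skewed.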
- 27. Sample Size=5
- 28. Sample Size=30
- 29. Sample Size=100
- 30. Sample Size=1000
- 31. Confidence Limit
  - The confidence limit excludes much of the actual population distribution
- 32. Confidence Limit With Larger n
- 33. Discussion
  - An over-reliance on the Central Limit Theorem can give a very misleading picture of the population distribution.
  - Kernel density estimation (PROC KDE) allows an examination of the entire population distribution instead of just using the mean to represent the population.
  - Without the assumption of normality, we need to use predictive modeling.
- 34. Discussion
  - This is true for both logistic and linear regression where the assumption of normality is required.
  - The two regression techniques do not work well with skewed populations.
  - We first look at logistic regression for rare occurrences.
- 35. Problems With Regression
  - Logistic regression is not designed to predict rare occurrences.
  - With a rare occurrence, logistic regression will predict virtually all observations as non-occurrences.
  - The accuracy will be high but the predictive ability of the model will be virtually nil.
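A few lines of Python show why high accuracy can coexist with no predictive ability. This sketch assumes a hypothetical 2% event rate and a model that predicts every case as a non-occurrence; the function name is made up for illustration.

```python
def accuracy_and_sensitivity(actual, predicted):
    """Return (accuracy, sensitivity) for 0/1 outcome and prediction lists."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    positives = sum(actual)
    sensitivity = true_positives / positives if positives else float("nan")
    return correct / len(actual), sensitivity


# Hypothetical data: 20 deaths among 1,000 patients, all predicted to survive.
actual = [1] * 20 + [0] * 980
predicted = [0] * 1000

accuracy, sensitivity = accuracy_and_sensitivity(actual, predicted)
print(f"accuracy={accuracy:.2%}, sensitivity={sensitivity:.2%}")
```

The model is right 98% of the time yet identifies none of the patients who died.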
- 36. Regression Equation
- 37. Threshold Value
  - For logistic regression, a threshold value is defined, and regression values above the threshold are predicted as 1.
  - Regression values below the threshold are predicted as 0.
  - The threshold value is chosen to minimize the error rate.
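The thresholding rule can be sketched as follows. This is an illustrative Python example, not the procedure SAS uses internally; the predicted probabilities, outcomes, candidate thresholds, and function names are all hypothetical.

```python
def classify(probabilities, threshold):
    """Predict 1 when the fitted probability is above the threshold, otherwise 0."""
    return [1 if p > threshold else 0 for p in probabilities]


def error_rate(actual, predicted):
    """Fraction of cases where the prediction differs from the observed outcome."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)


def best_threshold(actual, probabilities, candidates):
    """Pick the candidate threshold with the lowest error rate (first one on ties)."""
    return min(candidates,
               key=lambda t: error_rate(actual, classify(probabilities, t)))


# Hypothetical fitted probabilities and observed outcomes.
probabilities = [0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.70, 0.90, 0.15]
actual = [0, 0, 0, 0, 1, 1, 1, 1, 1]
candidates = [i / 10 for i in range(1, 10)]

threshold = best_threshold(actual, probabilities, candidates)
error = error_rate(actual, classify(probabilities, threshold))
print(f"threshold={threshold}, error rate={error:.3f}")
```

Minimizing the overall error rate treats both kinds of mistake as equally costly, which is exactly what goes wrong when the outcome is rare.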
- 38. Simple Regression
- 39. Classification Table
- 40. Classification With 3 Variables continued...
- 41. Classification With 3 Variables
- 42. Models
  - Linear regression:
    - $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$
  - Logistic regression:
    - $\log_e\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$
  - Poisson regression:
    - $\log_e(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$
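It may help to see how the logistic and Poisson equations are inverted to recover the quantity being modeled. The shorthand $\eta$ for the linear predictor is introduced here for convenience and does not appear on the slide.

```latex
\begin{align}
\eta &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n \\
\log_e\left(\frac{p}{1-p}\right) = \eta
  \quad &\Longrightarrow \quad
  p = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}} \\
\log_e(Y) = \eta
  \quad &\Longrightarrow \quad
  Y = e^{\eta}
\end{align}
```

The logistic form keeps $p$ between 0 and 1 for any value of $\eta$, and the Poisson form keeps $Y$ positive.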
- 43. Poisson Distribution
  - The parameter of the Poisson distribution, $\lambda$, will represent the average mortality rate, say 2%.
  - Then the sample size times 2% will give the estimate for the number of deaths, say 1,000,000 × 0.02 = 20,000.
  - However, the problem still persists.
  - For example, septicemia has a 26% mortality rate and pneumonia has a 7.5% rate.
- 44. Parameters
  - The three conditions include approximately 25% of total hospitalizations, leaving 75% not accounted for.
  - The Poisson distribution can be accurate on those patients but cannot determine anything about the remaining 75%.
  - If more patient conditions are added, the 25% will increase but not to the point that the model will have good predictability.
- 45. Predictive Modeling
  - Takes a different approach
  - Uses equal group sizes
    - 100% of the rarest level
    - Equal sample size of other level
    - Randomizes the selection of the sampling
  - Uses prior probabilities to choose the optimal model
- 46. 50/50 Split in the Data
  - Filter data to mortality outcome
  - Filter data to non-mortality outcome
  - Use PROC SURVEYSELECT to extract a subsample of non-mortality outcome
  - Append the mortality outcome data to subsample
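The four steps above can be sketched in Python as a stand-in for the SAS workflow. The function name, the record layout, and the `ratio` parameter are hypothetical; `ratio` is the assumed number of non-events kept per event, so 1.0 gives a 50/50 split, 3.0 gives 25% events, and 9.0 gives 10% events.

```python
import random


def balanced_sample(records, is_event, ratio=1.0, seed=0):
    """Keep every event and a random subsample of non-events, `ratio` per event."""
    rng = random.Random(seed)
    events = [r for r in records if is_event(r)]          # step 1: mortality outcome
    nonevents = [r for r in records if not is_event(r)]   # step 2: non-mortality outcome
    k = min(len(nonevents), round(len(events) * ratio))
    combined = events + rng.sample(nonevents, k)          # steps 3 and 4: subsample, append
    rng.shuffle(combined)
    return combined


# Hypothetical data: 20 deaths among 1,000 inpatient stays.
records = [{"id": i, "died": i < 20} for i in range(1000)]

half_and_half = balanced_sample(records, lambda r: r["died"])
one_in_four = balanced_sample(records, lambda r: r["died"], ratio=3.0)
print(len(half_and_half), sum(r["died"] for r in half_and_half))
print(len(one_in_four), sum(r["died"] for r in one_in_four))
```

Because the sampled data no longer reflect the true event rate, predicted probabilities need to be adjusted back with the prior probabilities, as the Predictive Modeling slide notes.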
- 47. 75/25 Split in the Data
- 48. 90/10 Split in the Data
- 49. Validation
  - The reduced sample is partitioned into training/validation/testing sets
  - Only need training/testing sets for regression models
  - Model is validated on the testing set
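A random partition of this kind can be sketched in a few lines of Python. The 60/20/20 fractions and the function name are assumptions chosen for illustration; the slides do not state which proportions were used.

```python
import random


def partition(records, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle the records and split them into training, validation, and testing sets."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * fractions[0])
    n_valid = round(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Hypothetical reduced sample of 100 record identifiers.
train, valid, test = partition(list(range(100)))
print(len(train), len(valid), len(test))
```

For a regression model that needs only two sets, fractions such as `(0.7, 0.0, 0.3)` would leave the validation set empty.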
- 51. Sampling Node
- 52. Misclassification in Regression
- 53. ROC Curves
- 55. Rule Induction Results
- 56. Variable Selection
- 58. ROC Curves
- 59. Decile
  - Data are sorted and divided into deciles.
  - True positive patients with highest confidence come first.
  - Next, positive patients with lower confidence.
  - True negative cases with lowest confidence come next.
  - Next, negative cases with highest confidence.
- 60. Lift
  - Target density = the number of actually positive instances in that decile divided by the total number of instances in the decile.
  - The lift = the ratio of the target density for the decile to the target density over all the test data.
  - A way to find the patients most at risk for mortality (or infection).
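The two definitions above translate directly into code. This Python sketch is illustrative only: the scores and outcomes are made up, `decile_lift` is a hypothetical name, and it assumes at least ten cases with at least one positive outcome.

```python
def decile_lift(actual, scores):
    """Return the lift for each decile, ordered from highest to lowest score."""
    ranked = [a for _, a in sorted(zip(scores, actual),
                                   key=lambda pair: pair[0], reverse=True)]
    n = len(ranked)
    overall_density = sum(ranked) / n
    lifts = []
    for d in range(10):
        chunk = ranked[d * n // 10:(d + 1) * n // 10]
        target_density = sum(chunk) / len(chunk)
        lifts.append(target_density / overall_density)
    return lifts


# Hypothetical test set: 100 patients, 10 deaths, most of them given high scores.
scores = [i / 100 for i in range(100)]
actual = [1 if i >= 92 or i in (85, 88) else 0 for i in range(100)]

lifts = decile_lift(actual, scores)
print([round(x, 2) for x in lifts])
```

A lift well above 1 in the first decile means the model concentrates the at-risk patients among its highest-scoring cases.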
- 61. Discussion
  - Predictive modeling in Enterprise Miner has some capabilities that are possible, but extremely difficult, in SAS/STAT
    - Sampling a rare occurrence to a 50/50 split
    - Partitioning to validate the results
    - Comparing multiple models to find the one that is optimal
    - Variable selection
- 62. Summary
  - Clinical trials do differ from health outcomes research, and the statistical techniques required must be adapted to outcomes research.
  - Model assumptions are important, but too often ignored.
  - We need to look at results in detail.
  - Superficial consideration of results can lead to very erroneous conclusions.
