Thesis.doc


  1. 1. DATA MINING THE 1997 NATIONAL AMBULATORY MEDICAL CARE SURVEY By Johnathan P. Durbin B.S., University of Louisville, 1995 A Thesis Submitted to the Faculty of the Graduate School of the University of Louisville in Partial Fulfillment of the Requirements for the Degree of Master of Arts Department of Mathematics University of Louisville Louisville, Kentucky August 2001
  2. 2. A PRACTICE IN DATA MINING USING THE 1997 NATIONAL AMBULATORY MEDICAL CARE SURVEY By Johnathan P. Durbin B.S., University of Louisville, 1995 A thesis Approved on _________July 12, 2001________ by the following Reading Committee: __________________________________ Thesis Director __________________________________ __________________________________ ii
  3. 3. ABSTRACT Data mining is a technique with a number of methods used to explore large datasets from a variety of angles with a wide spectrum of analytical tools. There are techniques for finding data, cleaning data, and validating results. For years new data have been collected by educational, research, commercial, and governmental entities for future analysis. The 1997 National Ambulatory Medical Care Survey dataset (NAMCS) is such a dataset available in the public domain at the C.D.C. [Centers for Disease Control and Prevention] for public consumption. Once this dataset was found, imported, and cleaned, it was analyzed. Although statistical packages have become extremely sophisticated, commercial statistical packages do not do everything needed for data mining. For this reason, a program was written (DFEPP) to analyze the data to display the results in a different manner using visualization techniques to present the significant results in an easily digested yet informative manner. iii
  4. 4. TABLE OF CONTENTS
      ABSTRACT
      CHAPTER
      I. Introduction
      II. Acquiring and Importing Data
          2.1 Acquisition of Data
          2.2 Importing Data
      III. Data Visualization using the Difference From Expected Percentage Plot (DFEPP): Program Design and Use
          3.1 The Graph Design
          3.2 The Design of the DFEPP Program
          3.3 The Use of the DFEPP Program
      IV. Data Mining of the 1997 National Ambulatory Medical Care Survey (NAMCS) Dataset
          4.1 Analysis of Payee Type by Practice Type
              4.1.1 Workers Compensation
              4.1.2 Medicare
              4.1.3 Medicaid
              4.1.4 Self-Pay
              4.1.5 Privately Insured
              4.1.6 All Other
          4.2 HMOs
          4.3 Modeling
              4.3.1 Age Group Models
              4.3.2 Modeling Classification of Pregnant
  5. 5. V. Conclusions
      REFERENCES
      APPENDIX – A (Variable list)
      VITA
  6. 6. LIST OF IMAGES
      IMAGE 1. A sample run of a web search tool (Copernic2000)
      IMAGE 2. A working example of the output for data visualization
      IMAGE 3. An output example
      IMAGE 4. A working example of the input for data visualization
      IMAGE 5. A working example of the output for data visualization
      IMAGE 6. Clementine Code
      IMAGE 7. Neural Network for Age Group Model Output
      IMAGE 8. Refined Neural Network for Age Group Model Output
      IMAGE 9. Neural Network for Age Group Model
      IMAGE 10. C5 Model for Pregnant Output
      IMAGE 11. Refined C5 Model for Pregnant Output
      IMAGE 12. Refined Rule Set for Pregnant Model Rule Set
      IMAGE 13. Refined C5 Model (2) for Pregnant Output
      IMAGE 14. Refined Rule Set (2) for Pregnant Model Rule Set
  7. 7. LIST OF PLOTS
      PLOT 1. Workers Compensation by Physician Specialty
      PLOT 2. Adjusted plot after removal
      PLOT 3. Medicare by Physician Specialty
      PLOT 4. Medicaid by Physician Specialty
      PLOT 5. Distribution of Medicaid population
      PLOT 6. Age of Pediatric Patients
      PLOT 7. Modified Medicaid by Physician Specialty
      PLOT 8. Age of Dermatology Patients
      PLOT 9. Self Pay by Physician Specialty
      PLOT 10. Privately Insured by Physician Specialty
      PLOT 11. All Other Payees by Physician Specialty
      PLOT 12. HMO Membership Percent by Age
      PLOT 13. HMO Membership by Age Group
      PLOT 14. HMO Membership by Payee Type
      PLOT 15. HMO Membership by Physician Specialty
      PLOT 16. HMO Membership by Race
      PLOT 17. Distribution of Asian/Pacific Islander Age
  8. 8. LIST OF TABLES
      TABLE 1. Payee Types
      TABLE 2. Workers Compensation ICD-9 Grouped Codes for Orthopedic Visits
      TABLE 3. Workers Compensation ICD-9 Codes for Orthopedic Visits
      TABLE 4. Workers Compensation ICD-9 Grouped Codes for Neurology Visits
      TABLE 5. Workers Compensation ICD-9 Codes for Neurology Visits
      TABLE 6. New Proportions after WC Orthopedic Surgeon Visits are removed
      TABLE 7. Workers’ Comp “Other” Physician Visits
      TABLE 8. Age Statistics by Physician Type
      TABLE 9. ICD-9 Codes tabled by Medicare Use
      TABLE 10. Age Statistics by Payee Type
      TABLE 11. HMO Membership by Physician Specialty
      TABLE 12. Has Insurance by Physician Specialty
      TABLE 13. Has Insurance by Physician Specialty
      TABLE 14. Has Insurance by ICD-9 Codes/Pediatric
      TABLE 15. All Pay Methods by Insurance/Pediatrics
      TABLE 16. Has Insurance by ICD-9 Codes/Neurology
  9. 9. TABLE 17. All Pay Methods by Insurance/Neurology
      TABLE 18. HMO Membership
      TABLE 19. Adjusted HMO Membership
      TABLE 20. Age Statistics by Race
  10. 10. CHAPTER I INTRODUCTION The purpose of this paper is to describe the process of data mining through an example. The primary purpose of data mining is to generate hypotheses to be examined for validity either with fresh data or by withholding a portion of the initial dataset for investigation. Data mining is a technique with a number of methods used to explore large datasets from a variety of angles with a wide spectrum of analytical tools. There are techniques for finding data, cleaning data, and validating results. For many years new data have been collected by educational, research, commercial, and governmental entities so that data mining can be used to find trends and patterns. Many of these data have been stored in data warehouses (collections of datasets) or put away by an organization, possibly to be examined in the future. The 1997 National Ambulatory Medical Care Survey (NAMCS) dataset analyzed in this paper is available in the public domain from the C.D.C. [Centers for Disease Control and Prevention], along with several other medical datasets. Chapter II covers how to find medical datasets and import them into various statistical packages. Although statistical
  11. 11. packages have become extremely sophisticated, commercial statistical packages do not do everything needed for data mining. For this reason, a program was written (Chapter III) to analyze the data and display the results in a different manner, using visualization techniques to present the significant results in an easily digested yet informative way. The NAMCS dataset is analyzed in Chapter IV using various statistical packages and the program developed in Chapter III. The NAMCS dataset consists of 24,615 patient visit records, each containing 224 variables. The data cover personal physical attributes, the physician’s practice and location, reasons and diagnoses for visits, medication given, insurance types, tests given, types of medical personnel seen, and other visit data (see Appendix – A for a full variable list). These data can be analyzed in a variety of ways: differences between patient types in common practices or pay methods; whether certain practice types favor using staff over physicians; which practices or pay methods favor using screenings or tests; or simple analyses of various physical attributes of the different patient types. In this thesis, the ways in which different payee types visited the different practices are analyzed. Different payee types disproportionately visited certain practice types. Some of these disproportions are expected and others are less explainable. HMO membership and its distribution across age groups, practice types, pay methods, and races are also analyzed. Older patients and the practices that serve them had a lower rate of HMO membership, but the privately insured, “All Other” payees, and Asian/Pacific Islanders all had higher rates of membership. Modeling techniques were used for two further analyses: one to determine a patient’s age group and another to determine whether the patient was pregnant. A model was found with ~90% accuracy in determining whether someone was pregnant using age, reason for visit (Non-Illness care),
  12. 12. and sex of patient. A model to determine which age group the patient was in was much less accurate (~50%). In this thesis, techniques to find, import, clean, and analyze data are discussed. Some of the techniques are applied to the NAMCS dataset while others are only discussed. A program was also written by the author of this thesis, following the visualization guidelines discussed in Chapter III, to analyze the NAMCS dataset.
  13. 13. CHAPTER II ACQUIRING AND IMPORTING DATA The first step in a data mining process is to collect the data. A collection mechanism can be set up to obtain data or the data may already exist in a dataset from an outside source. After the necessary data have been acquired, they must be put into a format that can be imported into any statistical packages that will be used to analyze them. Once the data have been imported, they need to be cleaned for analysis. 2.1 Acquisition of Data The first step in the data mining process is to acquire the data. Depending on what is studied, a collection mechanism for data may have to be set up or the data can come from an outside source. Collecting data can be very expensive and time consuming, but it is sometimes necessary. When the data are collected during the study, the validity of the collection mechanism and of the data is known. The necessary data may already exist. Studies on many topics have been done over time and the data for these studies may still be available. With the invention, and now widespread use, of the computer, much of the data
  14. 14. for these studies are on magnetic media, easily copied, and transferable for fellow researchers to use. Governmental agencies, such as the Census Bureau (www.census.gov) and C.D.C. (www.cdc.gov), have collected data for years and have large datasets in the public domain online for downloading. The Freedom of Information Act (FOIA http://www.usdoj.gov/foia/) gives access to governmental data with some restrictions. These data may or may not be in an easily usable format and the restrictions may not allow all of the desired data to be made available due to privacy or security issues. Data from other countries are less restricted and are available in a variety of formats. Data can be bought from outside sources. Some companies can be contracted to collect data or the data may have already been collected and are available for sale to researchers. When the data come from an outside source, the validity of the data should be considered. There are pros and cons to both ways of acquiring data but it is up to the researcher to find the data and to discuss their validity. For the purpose of this data mining project, a public domain database was used; one that was closely related to an aspect of health care. Much of the public domain data are already available on the Internet and the various search engines make it easy to find relevant datasets. Many of the search engines will point to Internet sites that give or sell data. The Lycos search engine (www.lycos.com) was developed at Carnegie Mellon University and tends to point to more research-oriented web sites than other search engines. Other search engines providing pointers to data include Excite (www.excite.com), Alta-Vista (www.altavista.com), and MSN (www.msn.com).
  15. 15. A new generation of web tools has been developed to make searching easier and more thorough. Copernic2000 is one of these web tools; it searches many different search engine databases for whatever topic is being queried. These web search tools are highly configurable and can be modified to the individual preferences of users. Web users tend to prefer certain search engines, and web search tools allow the user to focus on the search engines of their choice. The depth of the search can also be set by choosing how many hits from each search engine database are allowed. These web search tools can also search other types of Internet sites such as news groups, email databases, online businesses, news, and many other focused sites. A sample run of a web search tool (Copernic2000) (Image 1) Whether a web search engine or web search tool is used, there are certain guidelines that should be followed. First, use a keyword such as “dataset” and avoid words such as “data” or “database”. Keywords “data” and “database” will point to
  16. 16. results, database programs, or databases of articles but the keyword “dataset” will focus on collections of data. Use the option of searching for all words in a query and if that does not work, use a search on any words in a query. When a URL is found, consider the source of the site and its possible biases. There is no optimal way to find data on the Internet but with the development and refinement of web search tools, locating data is becoming an easier task. 2.2 Importing Data. Once a dataset has been found, the dataset needs to be imported into statistical programs for analysis. The data mining process used to investigate the data relies on standard statistical packages such as SAS 8® (SAS Institute Inc.), SPSS 10®, and SPSS Clementine 5.2® (SPSS Inc.). In order to make the investigations, the statistical packages must be able to read the data. Data are not always in a format that the different statistical packages can automatically import. Many sites, such as the C.D.C., put their public data in an ASCII (text) format with rules of how to import the file correctly. Otherwise, the data are released in a database format or another standard type file. The dataset analyzed in this paper was in a self-extracting ZIP file that contained 12 ASCII files, one being a file that explained how the data file was arranged. There are many different file formats used to save data and to import data. There are pros and cons to each type of file format. ASCII files are generally either character delimited files or columnar fixed width files. Character delimited files use a special xvi
  17. 17. character such as a comma or tab to separate variable columns. When importing these types of files, errors can occur when a special character is included in a text field, or the spacing may be shifted enough to confuse tab-delimited imports. Fixed width columnar ASCII files are not as easy to import, but the import allows the user to work with each variable and to define variable names, labels, and text related to each variable. The user can format and label the data to individual preference. The user should become very familiar with the data variables in the dataset. There are many standard file types that can be imported, including spreadsheet, database, and portable files. Spreadsheets are the easiest to import but they sometimes have record number limitations. The variable names can be included in the first row for ease of importation. In this study, the dataset used was imported into SPSS 10 from a columnar ASCII file. An attempt to write the 24,610 records to an Excel spreadsheet file failed and only wrote 16,383 of the records. This may be an issue with SPSS 10 and older restrictions on spreadsheet files. Database files are another type of file that can be imported. Flat file databases (all data contained in one table) and well designed relational databases (multiple tables related by keys) are not a problem to import, but some relational databases are not structured well and create importing problems. Different tables within the same relational database may contain identical variable names that are not meant to be linked, but the import features in some statistical programs try to link them anyway. Other table links may need to be defined in a certain way, such as one to one, one to many, or many to many, and these links do not always import the data correctly. Outside of having the data in the statistical packages’
  18. 18. file format, portable files are the best choice for importing data. The data with their variable names are stored in this portable file type for ease of import but the only failing of this portable file type is that it does not include variable labels or text related to nominal data. Usually researchers do not have much say in what format the data will be found, but if possible, they should request data in a portable format or the native format of their statistical package. Once the data are imported, they may need to be cleaned. Unless the data were formatted during the import, the variable labels and text related to nominal data have not been defined. It is not necessary to define them but the labels and nominal data text make the analysis easier to comprehend. Some data records may contain missing or invalid information and the records need to be either corrected or removed. Some variables may not be necessary and can also be removed. The dataset used in this paper initially contained 224 variables that were reduced to 33 variables as the analysis was refined. For a full list of variables, see Appendix A. Many variables contained information about the “marked” status of another variable and could be removed. Some removed variables were lengthy text entries that were rarely used. Other removed variables contained medicine codes. Many of the variables were removed after initial analyses showed little promise for them. Some categorical data can also be refined to be a more manageable size. One of the variables in the dataset contained more than 300 different categories that could have easily been refined to a more manageable 9 categories. Some data may also need some editing to fix errors such as missed decimal placements, text in numeric fields forcing numeric variables to import as text, and converting variable types to correct types of data. xviii
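The columnar (fixed width) import and the cleaning steps described in Section 2.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column layout, the variable names, and the value labels below are hypothetical and are not the actual NAMCS file layout, which is defined in the documentation file distributed with the data.

```python
# Hypothetical column layout: (variable name, start, end) as 0-based slices.
# The real layout must be taken from the dataset's documentation file.
LAYOUT = [
    ("age", 0, 3),
    ("sex", 3, 4),
    ("payee", 4, 5),
]

# Hypothetical value labels for a nominal variable.
SEX_LABELS = {"1": "Female", "2": "Male"}


def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width line into a dict of stripped text fields."""
    return {name: line[start:end].strip() for name, start, end in layout}


def clean_record(rec):
    """Convert types and attach labels; return None for invalid records."""
    try:
        age = int(rec["age"])  # fails if text sits in a numeric field
    except ValueError:
        return None
    sex = SEX_LABELS.get(rec["sex"])
    if sex is None:  # code outside the defined categories
        return None
    return {"age": age, "sex": sex, "payee": rec["payee"]}


# Four made-up records: two valid, one with text in the age field,
# one with an undefined sex code.
raw_lines = ["03421", " 0713", "abc12", "04591"]
records = [clean_record(parse_record(line)) for line in raw_lines]
valid = [r for r in records if r is not None]
```

Here invalid records are simply dropped; correcting them instead, as the text also allows, would need dataset-specific rules.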
  19. 19. Data mining tools examine data from a variety of angles with a number of different statistical methods. Not all of these statistical tools or programs can read or write to common file types without loss of some formatting. Therefore trading data between programs can sometimes become a problem. SAS programs cannot read native SPSS 10 SAV files and SPSS programs cannot read native SAS files. Both programs can read and write to common file types but the difficulties described previously can still occur. Saving data in an ASCII file from one program, then importing the data into another program, can give delimiting problems, or, if the columnar format is used, the variables have to be redefined. Transferring data from one statistical program to another using a spreadsheet format will work better, but the constraint on sheet size may limit the number of records transferred. Portable and database files are the best options currently available but these formats do not save the variable labels or the text related to nominal data. An ideal situation would be a format that all statistical packages could export to and import from without the loss of variable labels and text related to nominal data. Unfortunately, this ideal currently does not exist.
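The delimiting problem mentioned above can be illustrated with a short Python sketch. The record below is made up for the example; it shows how a comma inside a text field breaks a naive comma-delimited export, and how quoting (here through Python's standard csv module, not any of the statistical packages named in this chapter) avoids the problem.

```python
import csv
import io

# A made-up record whose first field contains the delimiter itself.
row = ["Pain, chest", "Cardiology", "45"]

# Naive export: join on commas without quoting, then split on import.
naive_line = ",".join(row)
naive_fields = naive_line.split(",")  # the text field is split in two

# Quoted export and import using the standard csv module.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
buffer.seek(0)
safe_fields = next(csv.reader(buffer))

print(len(naive_fields), len(safe_fields))
```

The naive round trip yields four fields instead of three, which is the kind of column shift that corrupts a delimited import.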
  20. 20. CHAPTER III DATA VISUALIZATION USING THE DIFFERENCE FROM EXPECTED PERCENTAGE PLOT (DFEPP): PROGRAM DESIGN AND USE One very important aspect of data mining is visualization, usually in graphical form. There are many different statistical programs that analyze data and have a number of graphical formats but these programs may not analyze the data in the desired way or present results in the best manner. Presenting information in a useful and digestible form is very important in the data mining process. Most papers are written for audiences with varying degrees of statistical knowledge and should be written to accommodate most, if not all, of the audience. Visual representation of information is the simplest way to digest results for the general population and technical detail can be added to validate information for those with greater statistical knowledge. The statistical packages used give effective analyses and reporting but they do not always present significant results in a manner desired by the investigator. For this reason, a program was written and designed, by the author of this thesis, in Visual Basic 6 using some guidelines in visualization. In this chapter the design and use of the Difference From Expected Percentage Plot (DFEPP) program will be covered. xx
  21. 21. 3.1 The Graph Design Presenting results from a data analysis in a format that is easily read is a necessity when analyzing and reporting on data. Analysis results should be presented in layers of detail from the most general to the most in-depth. Graphs and plots are easily understood and are used for a quick, less detailed, analysis of data. Tables and associated numeric information can also be used in the presentation of data for greater detail but are generally less easy to understand. A mix of the two types of presentations is an ideal way to present data analysis results to a general audience with varying degrees of statistical knowledge. There have been very few publications on data presentation and graphic design but the few publications written provide some basic guidelines. (Tufte, 1997 and White, 1984) xxi
  22. 22. According to Tufte’s “The Visual Display of Quantitative Information” (Tufte, 1997): Excellence in statistical graphics consists of complex ideas communicated with clarity, precision, and efficiency. Graphical displays should:
      • Show the data.
      • Induce the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production, or something else.
      • Avoid distorting what the data have to say.
      • Present many numbers in a small space.
      • Make large data sets coherent.
      • Encourage the eye to compare different pieces of data.
      • Reveal the data at several levels of detail, from a broad overview to the fine structure.
      • Serve a reasonably clear purpose: description, exploration, tabulation, or decoration.
      • Be closely integrated with the statistical and verbal descriptions of a data set.
  23. 23. Jan V. White’s “Using Charts and Graphs” (White, 1984) suggested some other concepts to include:
      • Sort from most to least significant.
      • Make sure plot segments are connected to associated text.
      • Make significances stand out.
      This graph (a plot from a program discussed later in this chapter) uses many of the concepts included in the books by White and Tufte. A working example of the output for data visualization: (Image 2) The above graph shows the data and is simple enough that the design of the graph is not a distraction from the data presentation. It reduces a large dataset to a simple plot, stratum information, count, chi square, and associated p-value to provide much information in a
  24. 24. small space and makes a large data set coherent. It encourages the eye to compare different pieces of data through the use of color and by listing the categories by significance. It serves a reasonably clear purpose (description, exploration, and tabulation) and it is closely integrated with the statistical and verbal descriptions of a data set. For ease of readability, the text for each category (actual %, category name, count, chi square value, and associated p-value) is connected by a line to the associated bar plot. 3.2 The Design of the DFEPP Program There are a variety of factors to consider when writing any program: who will use the program, what operating systems will be used, what type of data will be used, the intended purpose, and the intended output. Some specialized programs can be written cryptically, but they are usually for a very limited audience that is generally familiar with their use. Graphical user interface (GUI) programs are much less cryptic and are the easiest type of program for a novice to use. Older programs had the user interact through a command line interface, which could intimidate some users, but most GUI based programs use standardized graphic and menu controls familiar to most computer users. A GUI makes programs extremely easy to use. Any program that might be used by the general public should be GUI based. Many different programming languages were considered in the development of this DFEPP program. ANSI (American National Standards Institute) C and C++ are very
  25. 25. powerful programming languages and can be compiled to run on many different operating systems, but they lack some of the features needed to copy a generated graph into a clipboard for pasting into other applications. The Visual C and C++ packages have a better user interface with the ability to copy graphs onto a clipboard but these languages are not ANSI compliant and will only run on a few types of operating systems. Java, developed by Sun Microsystems (www.sun.com), was another language considered for its portability but it is fairly limited with respect to pasting results into a Windows clipboard. Microsoft Visual Basic 6® (VB6) was used to write the DFEPP program. VB6 programs are easy to write and, with their Windows based controls and interface, easy to use. Anyone who is somewhat familiar with Windows can use a VB6 coded program. In this program, the interface is familiar and the cutting and pasting of the generated graph into another program is a simple matter due to the tools included in VB6. The only downside of VB6 is that it only works on a limited number of operating systems (MS Windows based), but those few operating systems are on more than 90% of all PCs. Creating visualization programs requires consideration of how the data are entered, processed, and used. A majority of programming languages can read and write to a variety of file types and structured files. Input from sources such as keyboards and scanners, and output to devices such as monitors and printers, can be easily accomplished by most programming languages. The DFEPP program merely required some simple input into text boxes and a mouse click to plot the graph. VB6 gives an easy input method for the user as well as easy access to clipboard controls. The graphical output of the DFEPP program needed to be pasted into other Windows based programs (such as Word
  26. 26. and Excel) and the tools in the VB6 programming environment allowed for easy copying and pasting of a graph. Other languages would also do all of the necessary processing but the input and output needed would not be as user-friendly. 3.3 The Use of the DFEPP Program The DFEPP program was written to show significant differences between expected and actual values of one stratum of a categorical variable across all strata of another categorical variable. The dataset to be analyzed in Chapter IV was reduced to 33 categorical variables containing data on patient demographics, types of physician practices, payment for services, and other information on ambulatory visits. An example of the use of this program is to look at how the different payee types disproportionately go to different practices. For instance, assuming that each payee type visits every practice type at the same rate as its overall percent of the population, the privately insured should be 51% of the patients of each type of physician practice. An output example (Image 3)
  27. 27. The program sorts the categorical data (practice type (J)) by the difference between the actual percentage (the percent of privately insured patients in a practice type) and the expected percentage (the 51% of all patients who are privately insured (H)), from greatest to least, and generates a difference from expected percentage plot using the expected percent value (51%) as a baseline and the actual percent values to plot a bar graph. The user defines the major and minor percentage differences to their preference (L). The program highlights the major percentage differences in red and minor percentage differences in blue within the bar plot (I). Chi Square values are also derived using the number of elements in each stratum (practice types), and the actual percentages and expected percentages of the isolated strata (payee type (K)). For example, the privately insured were 51% of all patients but only 32% of 1418 cardiology patients, giving a Chi Square value of 204.84:

      \[
      \chi^2 = \frac{\big((51\% - 32\%) \times 1418\big)^2}{51\% \times 1418} + \frac{\big((49\% - 68\%) \times 1418\big)^2}{49\% \times 1418} = 204.84
      \]

      The Chi Square values are also highlighted by color for significance. In this paper an alpha of 0.01 is considered the cut-off point for major significance and the Chi Square values greater than or equal to 6.635 are highlighted red. Chi Square values between 3.841 and 6.635 are associated with an alpha of 0.05 and have lesser significance but are highlighted blue in case the user chooses to point out those significances with the lower alpha. The p-values that are associated with the Chi Square values with one degree of freedom are also given and highlighted according to their significance. If there is no significance then “No Sig” is displayed in the p-value column.
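The Chi Square calculation above can be reproduced with a short Python sketch. This is not the DFEPP source code (which was written in VB6 and is not shown here); the function names are hypothetical, and treating a value of exactly 3.841 as "blue" is an assumption. The p-value uses the standard identity for one degree of freedom, p = erfc(sqrt(x / 2)).

```python
import math


def dfepp_chi_square(expected_pct, actual_pct, n):
    """Two-cell Chi Square for one stratum: in-group and out-of-group counts.

    Percentages are passed as fractions (0.51 for 51%); n is the stratum size.
    """
    exp_in = expected_pct * n
    exp_out = (1 - expected_pct) * n
    obs_in = actual_pct * n
    obs_out = (1 - actual_pct) * n
    return (obs_in - exp_in) ** 2 / exp_in + (obs_out - exp_out) ** 2 / exp_out


def p_value_1df(chi_sq):
    """Upper-tail probability of a Chi Square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_sq / 2))


def highlight(chi_sq):
    """Color rule from the text: red at alpha 0.01, blue at alpha 0.05."""
    if chi_sq >= 6.635:
        return "red"
    if chi_sq >= 3.841:  # assumption: the lower boundary is inclusive
        return "blue"
    return "No Sig"


# Cardiology example from the text: 51% expected, 32% actual, 1418 patients.
chi = dfepp_chi_square(0.51, 0.32, 1418)
print(round(chi, 2), highlight(chi))
```

For the cardiology example this gives a value that rounds to 204.84, matching the figure in the text, and the result falls in the "red" band.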
  28. 28. A working example of the input for data visualization: (Image 4) Box A takes the title, B gives the baseline percentage, and E takes the categories. The major and minor percentage differences are entered in C. Column D contains the actual percentage of A in each of the associated categories in column E. Column F is the actual count of each of the associated categories in column E. Column G contains a series of check boxes that select which categories in column E are to be analyzed. Once the user has provided all of the necessary information, the plot option is chosen in the menu bar to give the following hanging plot:
  29. 29. A working example of the output for data visualization: (Image 5) H gives the stratum analyzed with its expected percentage value. I is the difference from expected percentage plot using the expected percentage value as a baseline and the actual percentage values contained in J. J contains each category, the percentage of stratum H in each category, and the number of total members per category. K contains the Chi Square values and associated p-values for the corresponding categories in J, highlighting the values with some significance by color. L is a legend for graph I explaining the major and minor significance lines. If the user is satisfied with the graph then the copy option may be chosen in the menu bar to copy the graph into the clipboard to paste into another program. Otherwise the user closes the graph window, modifies the initial data entry window, adjusts the graph options, and then plots the updated graph.
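The ordering and highlighting of categories in the plot can be sketched as follows. This is a minimal illustration, not the DFEPP implementation: the function name is hypothetical, sorting by the absolute size of the difference is an assumption (the text says only "from greatest to least"), and apart from the cardiology figures quoted earlier, the category values and the major/minor thresholds are made up.

```python
def dfepp_rows(expected_pct, categories, major, minor):
    """Return rows sorted by size of difference from the expected percentage.

    categories: list of (name, actual_pct, count), percentages in percent units.
    major/minor: user-chosen thresholds, in percentage points.
    """
    rows = []
    for name, actual_pct, count in categories:
        diff = actual_pct - expected_pct
        if abs(diff) >= major:
            level = "major"   # drawn in red in the plot
        elif abs(diff) >= minor:
            level = "minor"   # drawn in blue in the plot
        else:
            level = "none"
        rows.append((name, actual_pct, count, diff, level))
    # Assumption: order by absolute difference, largest first.
    rows.sort(key=lambda r: abs(r[3]), reverse=True)
    return rows


# Cardiology values come from the text; the other two rows are hypothetical.
categories = [
    ("Practice B", 52.0, 1200),
    ("Cardiology", 32.0, 1418),
    ("Practice A", 58.0, 900),
]
rows = dfepp_rows(51.0, categories, major=10.0, minor=5.0)
for name, actual, count, diff, level in rows:
    print(f"{name:12s} {actual:5.1f}% n={count:5d} diff={diff:+6.1f} {level}")
```

Each returned row carries what columns J and I of the output need: the category, its actual percentage and count, and the signed difference that sets the bar's length and direction from the baseline.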
  30. 30. CHAPTER IV DATA MINING OF THE 1997 NATIONAL AMBULATORY MEDICAL CARE SURVEY (NAMCS) DATASET The 1997 National Ambulatory Medical Care Survey (NAMCS) is a national probability sample survey conducted by the Division of Health Care Statistics, National Center for Health Statistics (NCHS), and Centers for Disease Control and Prevention (CDC). The survey consists of 24,715 patient records from visits to 1,247 physicians in the year 1997. Initially, each patient visit record consisted of 224 variables, including demographic information, diagnoses, drugs prescribed, types of visits, types of medical professionals seen, medical tests and screenings done, and location of physician office. During the data cleanup phase of the project, many of these variables were removed as the focus of the analysis narrowed, leaving 33 variables with information about pay method, race, age group, practice types, and other categorical information. Other variables were reduced from 500+ different categories, using the SPSS Transform/Compute feature, to far fewer categories. Once the data were cleaned, they were analyzed using a variety of statistics packages and methods. In section 4.1, the xxx
  31. 31. relationship of patient payee types to practice types was analyzed to look for disproportionate relationships. SAS 8® (SAS Institute Inc.), SPSS 10®, and SPSS Clementine 5.2® (SPSS Inc.) were all used to analyze the dataset, but the DFEPP program was used for much of the visualization.
4.1 Analysis of Payee Type by Practice Type
In this section, the ways the different payee types visited the various practice types were analyzed. Initially there were fourteen practice types (Cardiologists, Dermatologists, General/Family Practice, General Surgery, Internal Medicine, Neurology, OB/GYN, Ophthalmology, Orthopedic Surgery, Otolaryngology, Pediatrics, Psychiatry, Urology, and Other) and nine types of payees (privately insured, Medicare, Medicaid, workers compensation, self-pay, no charge, other, unknown, and blank). Since the payee types identified as no charge, other, unknown, and blank all had a limited number of records, they were collected into one “all other” payee type, yielding the following distribution of payee types:
Payee Types (Table 1)
Payee Type          Count   Col %
Private Insurance   12562   51.0%
Medicare             5395   21.9%
Medicaid             1945    7.9%
Worker's Comp.        503    2.0%
Self-Pay             2176    8.8%
All Other            2029    8.2%
Each payee type is given as a percentage of the overall study population. If payee type and practice type were unrelated, each payee type would make up about the same percentage of every practice type’s patient load, but this is not always the case. Many of the payee types were associated with particular practice types, but some of these
  32. 32. preferences are expected and some are not. Although statistical methods can be used to investigate specific hypotheses, the primary purpose of data mining is to generate hypotheses, which are then examined for validity either with fresh data or with a portion of the initial dataset withheld for that purpose. The first investigation examined all of the cases and the relationship between payee type and the number of visits to each practice type. The DFEPP plot below gives an indication of this relationship:
Workers Compensation by Physician Specialty (Plot 1)
4.1.1 Workers Compensation
Workers compensation payees were 2% of all payee types and, if there were no relationship between payee type and practice type, would be expected to be near 2% of patient visits to each type of practice. The workers compensation payees go to orthopedic surgeons at a much higher rate than the expected 2% of visits to orthopedic surgeons. They were 18.1% of the 1222 orthopedic surgery patients and the probability that the null hypothesis is true (actual
  33. 33. number of visits was the expected 2% of visits to orthopedic surgeons) is less than 0.0005 (χ2=1616.1, p<0.0005), showing that these payees go to orthopedic surgeons at a significantly higher rate. This is not completely unexpected, since people go to orthopedic surgeons for breaks and bruises and these are the main types of injuries that occur at work. By using the filter and general tables/frequency features in SPSS 10, the reasons that workers compensation payees visited orthopedic surgeons can be determined.
Workers Compensation ICD-9 Grouped Codes for Orthopedic Visits (Table 2)
ICD-9 Code  Category                                  Count       %
140-239     Neoplasms                                     1     .5%
240-279     Endocrine, nutritional and metabolic          1     .5%
            diseases, and immunity disorders
320-389     Diseases of the nervous system               18    8.1%
            and sense organs
680-709     Diseases of the skin and                      3    1.4%
            subcutaneous tissue
710-739     Diseases of the musculoskeletal              70   31.7%
            system and connective tissue
780-799     Symptoms, signs, and ill-defined              1     .5%
            conditions
800-999     Injury and poisoning                        109   49.3%
V           Supplementary classification of factors      18    8.1%
            influencing health status and contact
            with health services
The preceding table is somewhat vague; using a less broad categorical variable for the ICD-9-CM (International Classification of Diseases, 9th Revision, Clinical Modification) codes gives a better understanding of these visits. The following table shows in more detail why the workers compensation payees went to orthopedic surgeons.
  34. 34. Workers Compensation ICD-9 Codes for Orthopedic visits (Table 3)
ICD-9-CM Codes for Workers Comp. to Orthopedic Surgeons
Code     Description                                             Count       %
00       intestinal infectious diseases                              1     .5%
215.3    benign neoplasms Lower limb, including hip                  1     .5%
278.0    Obesity                                                     1     .5%
337.21   Reflex sympathetic dystrophy of the upper limb              1     .5%
35x.xx   Carpal tunnel syndrome(13), Lesion of ulnar nerve(2),      17    7.7%
         Lesion of ulnar nerve(1), & Mononeuritis(1)
68x.xx   Diseases of the skin and subcutaneous tissue                2     .9%
70x.xx                                                               1     .5%
71x.xx   Diseases of the musculoskeletal system and                 26   11.8%
         connective tissue
72x.xx                                                              43   19.5%
73x.xx                                                               1     .5%
79x.xx   ill-defined and unknown causes of morbidity and             1     .5%
         mortality
80x.xx   fractures                                                   1     .5%
81x.xx                                                              13    5.9%
82x.xx                                                              16    7.2%
83x.xx   dislocations                                               12    5.4%
84x.xx   sprains and strains of joints and adjacent muscles         50   22.6%
87x.xx   open wound                                                  1     .5%
88x.xx                                                               5    2.3%
905.9    Late effect of traumatic amputation                         1     .5%
92x.xx   contusion with intact skin surface or crushing injury       5    2.3%
95x.xx   injury to nerves and spinal cord                            4    1.8%
996.6    Infection and inflammatory reaction due to internal         1     .5%
         prosthetic device, implant, and graft
V1       persons with potential health hazards related to            5    2.3%
         personal and family history
V4       persons with a condition influencing their health status    6    2.7%
V5       persons encountering health services for specific           1     .5%
         procedures and aftercare
V6       persons encountering health services in other               4    1.8%
         circumstances
V9       missing                                                     1     .5%
  35. 35. Initially this table was created using the first 2 characters in the ICD-9-CM codes, but categories that had just a few visits could be better described by extracting the full ICD-9-CM code from a complete, non-abbreviated table of workers compensation visits to orthopedic surgeons. The table shows that 73.3% of the visits were for sprains, strains, breaks, and bruises, while 9.5% of the visits were for nerve damage (7.7% carpal tunnel, 1.8% nerve/spinal cord damage), 4.1% were for cuts, and 13.1% were for all other reasons. The workers compensation payees also go to neurologists at a higher rate than the expected 2% of visits. They comprised 4.1% of the 703 neurology patients and the probability that the null hypothesis (actual % = 2% expected) is true is less than 0.0005 (χ2=15.82, p<0.0005). Therefore the null hypothesis is rejected in favor of the alternative hypothesis (workers compensation patients go to neurologists at a significantly higher rate than expected). By filtering the data and then tabulating it, the reasons the workers compensation payees visited this practice type can be determined. The following frequency table of workers compensation neurology visits shows why they went:
Workers Compensation ICD-9 Grouped Codes for Neurology Visits (Table 4)
ICD-9 Code  Category                                  Count       %
290-319     Mental disorders                              2    6.9%
320-389     Diseases of the nervous system                3   10.3%
            and sense organs
710-739     Diseases of the musculoskeletal               9   31.0%
            system and connective tissue
780-799     Symptoms, signs, and ill-defined              5   17.2%
            conditions
800-999     Injury and poisoning                         10   34.5%
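The grouping of individual diagnosis codes into the broad ICD-9 categories used in Tables 2 and 4 can be sketched in Python. This is only an illustration under stated assumptions, not the SPSS procedure used for the thesis: the helper icd9_group is hypothetical, codes are assumed to be stored as strings whose first three characters give the ICD-9 category (for example 8471- for 847.1), and the chapter ranges that do not appear in the tables are taken from the standard ICD-9 chapter layout. The counts are the code-level workers compensation neurology counts listed in Table 5.

```python
from collections import Counter

# (low, high, label) for the numeric ICD-9 chapters.
ICD9_GROUPS = [
    (1, 139, "Infectious and parasitic diseases"),
    (140, 239, "Neoplasms"),
    (240, 279, "Endocrine, nutritional and metabolic diseases, and immunity disorders"),
    (280, 289, "Diseases of the blood and blood-forming organs"),
    (290, 319, "Mental disorders"),
    (320, 389, "Diseases of the nervous system and sense organs"),
    (390, 459, "Diseases of the circulatory system"),
    (460, 519, "Diseases of the respiratory system"),
    (520, 579, "Diseases of the digestive system"),
    (580, 629, "Diseases of the genitourinary system"),
    (630, 679, "Complications of pregnancy, childbirth, and the puerperium"),
    (680, 709, "Diseases of the skin and subcutaneous tissue"),
    (710, 739, "Diseases of the musculoskeletal system and connective tissue"),
    (740, 759, "Congenital anomalies"),
    (760, 779, "Certain conditions originating in the perinatal period"),
    (780, 799, "Symptoms, signs, and ill-defined conditions"),
    (800, 999, "Injury and poisoning"),
]

def icd9_group(code):
    """Map a diagnosis code string to its broad ICD-9 category label."""
    code = code.strip().upper()
    if code.startswith("V"):
        return "Supplementary classification (V codes)"
    prefix = code[:3]
    if not prefix.isdigit():
        return "Unclassified"
    number = int(prefix)
    for low, high, label in ICD9_GROUPS:
        if low <= number <= high:
            return label
    return "Unclassified"

# Code-level counts for workers compensation neurology visits (Table 5).
table5_counts = {
    "3102-": 2, "3530-": 1, "3540-": 2, "72210": 1, "72280": 1, "7231-": 1,
    "7242-": 2, "7244-": 1, "7245-": 1, "7292-": 1, "7299-": 1, "7803-": 1,
    "7820-": 4, "8471-": 4, "8472-": 3, "8479-": 1, "8489-": 1, "8840-": 1,
}

grouped = Counter()
for code, count in table5_counts.items():
    grouped[icd9_group(code)] += count

for label, count in grouped.most_common():
    print(f"{count:3d}  {label}")
```

Grouping the Table 5 codes this way gives the same five category counts as Table 4.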
  36. 36. Again, for such a small number of cases, a more in-depth analysis can be done by comparing the complete ICD-9-CM code of each patient’s visit.
Workers Compensation ICD-9 Codes for Neurology Visits (Table 5)
Physician's diagnoses for Workers' Comp. Neurology visits
Code    Diagnosis                                           Count       %
3102-   Concussion                                              2    6.9%
3530-   Nerve Dmg                                               1    3.4%
3540-   Carpal Tunnel                                           2    6.9%
72210   Back Injuries                                           1    3.4%
72280                                                           1    3.4%
7231-                                                           1    3.4%
7242-                                                           2    6.9%
7244-                                                           1    3.4%
7245-                                                           1    3.4%
7292-   Soft Tissue Dmg                                         1    3.4%
7299-                                                           1    3.4%
7803-   Convulsions                                             1    3.4%
7820-   Disturbance of skin sensation                           4   13.8%
8471-   SPRAINS AND STRAINS OF JOINTS AND ADJACENT MUSCLES      4   13.8%
8472-                                                           3   10.3%
8479-                                                           1    3.4%
8489-                                                           1    3.4%
8840-   Upper Limb Wound                                        1    3.4%
The table shows that 55.1% of the visits were for sprains, strains, breaks, and bruises; 24.1% of the visits were for nerve damage (6.9% carpal tunnel, 17.2% nerve/spinal cord damage), 10.3% were for cuts, and 10.3% were for all other reasons. The workers compensation payees go to orthopedic surgeons for many of the same reasons but in somewhat different proportions. There were many other practice types that had significantly lower percentages of workers compensation visits, but this is not unexpected. Children are not generally involved with work and therefore would not use workers compensation to pay for
  37. 37. pediatric visits. OB/GYN and urology visits would also rarely be paid for by workers compensation. Excluding the visits to these practices and to the practices with a disproportionately higher percentage of visits gives a better representation of how the other types of practices are visited by workers compensation payees. The visits to the remaining practice types are distributed by payee type as follows:
New Proportions after WC Orthopedic Surgeon Visits are removed (Table 6)
Payee Type          Count       %
Private Insurance    7962   47.0%
Medicare             4483   26.5%
Medicaid             1098    6.5%
Worker's Comp.        251    1.5%
Self-Pay             1753   10.3%
All Other            1393    8.2%
The workers compensation payees are reduced to 1.5% of the patient population. When the 1.5% value is used as the expected value to analyze the data with the DFEPP program, the following plot is generated:
Adjusted plot after removal (Plot 2)
  38. 38. The plot shows that there is not much deviation from the expected percentage, but there are significant differences when the chi-square values are considered. Visits to ‘Other’ physicians show the greatest deviation from the expected value, with a significantly higher-than-expected number of visits. ‘Other’ physicians treated workers compensation payees for a variety of reasons but mainly for the same reasons as the orthopedic surgeon visits: sprains, strains, breaks, cuts, and bruises (see Table 7).
  39. 39. Workers’ Comp “Other” Physician Visits (Table 7)
ICD-9-CM
Code    Count   Col %   Group
1119-       1    1.2%   Related Disease
25000       1    1.2%
33720       1    1.2%
33722       1    1.2%
3540-       2    2.5%
37205       1    1.2%
49390       1    1.2%
515--       1    1.2%
55092       1    1.2%
71885       1    1.2%
71943       1    1.2%
7210-       1    1.2%
7217-       1    1.2%
72210       4    4.9%
72252       1    1.2%
72280       1    1.2%
7234-       1    1.2%
72400       1    1.2%
7242-       2    2.5%
7244-       1    1.2%
7245-       4    4.9%
7246-       1    1.2%
7248-       1    1.2%
72632       2    2.5%
7294-       1    1.2%
75612       1    1.2%
7804-       1    1.2%
7809-       1    1.2%
7820-       1    1.2%
81600       2    2.5%   Breaks, Bruises, Strains, Sprains, Cuts
8360-       1    1.2%
8404-       1    1.2%
8409-       2    2.5%
8449-       1    1.2%
8460-       2    2.5%
8469-       1    1.2%
8470-       3    3.7%
8471-       2    2.5%
8472-       3    3.7%
8489-       3    3.7%
8793-       1    1.2%
8820-       1    1.2%
8830-       2    2.5%
8860-       1    1.2%
9064-       1    1.2%
9069-       1    1.2%
9248-       3    3.7%
9300-       1    1.2%
9404-       1    1.2%
94420       1    1.2%
9556-       1    1.2%
9594-       1    1.2%
9595-       1    1.2%
V135-       1    1.2%   Personal History
V155-       1    1.2%
V583-       1    1.2%
V6759       1    1.2%   Follow-up
V703-       1    1.2%
V990-       1    1.2%   Blank
Psychiatry and general surgery practices were also visited at a significantly higher rate than the expected 1.5% visit rate for the workers compensation payees. The main reason for psychiatric visits was depression (74%). General surgery visits tended to be for cuts, burns, and other wounds (33.3%) and for breaks, bruises, strains, and sprains (43%). Notably, there are significantly lower numbers of visits to ophthalmology (0.3% actual vs. 1.5% expected) and otolaryngology (0.1% actual vs. 1.5% expected) practices. This may show that the OSHA (Occupational Safety & Health Administration
  40. 40. http://www.osha.gov/) rules guarding against vision and hearing loss work effectively to reduce such injuries. The dermatology visits were also significantly lower (0.1% actual vs. 1.5% expected), but many burns and other skin problems were treated by general surgery practices. This would explain, in part, the significantly higher number of general surgery visits and the correspondingly lower number of dermatology visits. Because the workers compensation payees account for 18.1% of the orthopedic surgery visits but only 2.0% of the total population, the other payee types will tend to show a lower share of visits to orthopedic surgeons. A lower-than-expected share for those payee types is therefore less meaningful than the DFEPP plots suggest. Although there were other significant disproportions, the workers compensation payee visits were too few to create any major disproportion in other payee types’ visits to the various practices.
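The 1.5% baseline used for Plot 2 follows directly from the counts in Table 6. The short Python sketch below recomputes each payee type's share of the reduced population; the variable names are illustrative. As a side check, the Table 6 total equals the full 24,610 visits minus the Table 8 visit counts for pediatrics, OB/GYN, urology, orthopedic surgery, and neurology. That match suggests these were the practices excluded, but this is an inference from the totals, since the text does not list them explicitly.

```python
# Payee counts after exclusion, copied from Table 6.
table6 = {
    "Private Insurance": 7962,
    "Medicare": 4483,
    "Medicaid": 1098,
    "Worker's Comp.": 251,
    "Self-Pay": 1753,
    "All Other": 1393,
}

total = sum(table6.values())
shares = {payee: round(100.0 * count / total, 1) for payee, count in table6.items()}

# Visit counts (N) from Table 8 for the practices assumed to be excluded.
excluded = {
    "Pediatrics": 2651,
    "Obstetrics and gynecology": 2022,
    "Urology": 1072,
    "Orthopedic surgery": 1222,
    "Neurology": 703,
}
remaining = 24610 - sum(excluded.values())

for payee, share in shares.items():
    print(f"{payee:<18} {share:5.1f}%")
print("Table 6 total:", total, "| total after assumed exclusions:", remaining)
```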
  41. 41. 4.1.2 Medicare
Medicare payees were 21.9% of all payee types and, if they showed no preference, would be expected to be near 21.9% of patient visits to each type of practice. This is not the case. Medicare patients went to cardiologists, urologists, ophthalmologists, and internal medicine practices at significantly higher rates.
Medicare by Physician Specialty (Plot 3)
The probability that they were the expected 21.9% of the visit load for each of these practices is less than 0.0005. They were 53.4% of the 1418 visits to cardiologists (χ2=822.63, p<0.0005), 38.9% of the 1072 visits to urologists (χ2=181.13, p<0.0005), 38.6% of the 1437 visits to ophthalmologists (χ2=234.31, p<0.0005), and 33.1% of the 2358 visits for internal medicine (χ2=172.94, p<0.0005). All but the internal medicine visits are expected. The Medicare population consists of retired or disabled individuals. The average age of the Medicare population is 71.6 years with a standard
  42. 42. deviation of 13.01 years. Cardiology, ophthalmology, and urology practices treat heart, eyesight, and urinary problems, and these are the problems that occur in an older population.
Age statistics by Physician Type (Table 8)
AGE
Physician Specialty             Mean       N   Std. Deviation
General and family practice    42.73    3834            23.33
Internal medicine              55.09    2358            20.06
Pediatrics                      5.34    2651             7.33
General surgery                49.67    1270            20.47
Obstetrics and gynecology      35.82    2022            13.91
Orthopedic surgery             45.55    1222            21.91
Cardiovascular disease         65.41    1418            15.26
Dermatology                    46.38    1409            22.49
Urology                        57.25    1072            20.16
Psychiatry                     43.29    1461            16.96
Neurology                      46.23     703            21.83
Ophthalmology                  58.65    1437            22.51
Otolaryngology                 39.84    1175            24.81
All other                      52.26    2578            19.62
Total                          43.89   24610            24.81
The average age of the entire population is 43.89 years with a standard deviation of 24.81 years. The Medicare payees are significantly older; therefore they will disproportionately visit those practices. The significantly higher number of internal medicine visits by this population is harder to explain. Table 9 shows why Medicare and non-Medicare payees visited internal medicine practices:
  43. 43. ICD-9 Codes tabled by Medicare Use (Table 9)
Uses Medicare (False / True)
ICD-9 Code Category                                           False Count  False %  True Count  True %
Infectious and parasitic diseases                                      47     3.0%           8    1.0%
Neoplasms                                                              17     1.1%          10    1.3%
Endocrine, nutritional and metabolic diseases, and immunity           159    10.1%          86   11.0%
Diseases of the blood and blood-forming organs                         12      .8%           5     .6%
Mental disorders                                                       47     3.0%          13    1.7%
Diseases of the nervous system and sense organs                        68     4.3%          26    3.3%
Diseases of the circulatory system                                    208    13.2%         247   31.7%
Diseases of the respiratory system                                    246    15.6%          76    9.7%
Diseases of the digestive system                                       55     3.5%          22    2.8%
Diseases of the genitourinary system                                   62     3.9%          27    3.5%
Complications of pregnancy, childbirth, and the puerperium              1      .1%           1     .1%
Diseases of the skin and subcutaneous tissue                           54     3.4%          12    1.5%
Diseases of the musculoskeletal system and connective tissue          142     9.0%          63    8.1%
Congenital anomalies                                                    1      .1%
Symptoms, signs, and ill-defined conditions                           158    10.0%          85   10.9%
Injury and poisoning                                                   85     5.4%          21    2.7%
Supplementary classification of factors influencing health s          216    13.7%          78   10.0%
Medicare payees went to internal medicine practices for diseases of the circulatory system at a very disproportionate rate. Of the non-Medicare visits to this practice type, 13.2% were for diseases of the circulatory system, but 31.7% of the Medicare visits to this practice type were for the same
  44. 44. diseases. The other categorical reasons for the visits to this practice by Medicare and non-Medicare payees were not that different. When the number of Medicare visits for diseases of the circulatory system is reduced to the non-Medicare percentage rate, the Medicare share of internal medicine visits falls to 28.7%. The new χ2 value of 59.8 (p<0.0005) computed from the reduced visits shows that there was still a significantly higher number of visits to this practice type by Medicare payees. Other practices were visited at significantly lower rates: Medicare payees accounted for only 0.8% of the 2651 visits to pediatricians (χ2=690.05, p<0.0005) and 4.7% of the 2022 OB/GYN visits (χ2=349.74, p<0.0005). The Medicare visits to pediatricians are probably due to recording errors. The low rate of visits to OB/GYNs for Medicare payees is not an unexpected result. The average age of OB/GYN patients is 35.82 years with a standard deviation of 13.91 years. The average age for Medicare patients is 71.6 years, which is over two standard deviations from the average OB/GYN patient’s age. Because the Medicare payees are responsible for 53.4% of the visits to cardiologists but only 21.9% of the total population, the other payee types will tend to show a lower share of visits to cardiologists, and a lower-than-expected share for those payee types is less meaningful than the DFEPP plots suggest. Urology and ophthalmology visits by Medicare payees were also significantly higher, though to a lesser degree, and will also skew downward the rates of other payee types’ visits to these practices.
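The circulatory-system adjustment described above can be reproduced approximately from the Table 9 counts. The thesis does not spell out the arithmetic, so the Python sketch below is one reading of it, chosen because it yields the reported 28.7%: the Medicare circulatory visits are cut to the non-Medicare rate applied to the Medicare internal medicine total, and the removed visits are dropped from the practice total as well. The two-cell chi-square statistic with one degree of freedom is likewise an assumption; it gives a value near, but not identical to, the reported 59.8.

```python
import math

# Column totals obtained by summing the Table 9 counts.
medicare_total = 780          # "True" column
non_medicare_total = 1578     # "False" column
medicare_circulatory = 247
non_medicare_circulatory = 208
expected_share = 0.219        # Medicare share of all visits (Table 1)

# Assumed adjustment: apply the non-Medicare circulatory rate to the
# Medicare visits and drop the excess from both numerator and denominator.
non_medicare_rate = non_medicare_circulatory / non_medicare_total
adjusted_circulatory = round(non_medicare_rate * medicare_total)
removed = medicare_circulatory - adjusted_circulatory
adjusted_medicare = medicare_total - removed
adjusted_visits = medicare_total + non_medicare_total - removed
adjusted_share = adjusted_medicare / adjusted_visits

# Two-cell goodness-of-fit chi-square against the expected 21.9% share.
expected_in = adjusted_visits * expected_share
expected_out = adjusted_visits - expected_in
chi2 = ((adjusted_medicare - expected_in) ** 2 / expected_in
        + ((adjusted_visits - adjusted_medicare) - expected_out) ** 2 / expected_out)
p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail, 1 degree of freedom

print(f"adjusted Medicare share: {100 * adjusted_share:.1f}%")
print(f"chi-square: {chi2:.1f}, p-value: {p_value:.3g}")
```

Under this reading the adjusted share is still well above 21.9%, which is consistent with the conclusion that the excess of Medicare visits to internal medicine is not explained by circulatory diagnoses alone.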
  45. 45. 4.1.3 Medicaid
Medicaid payees accounted for 7.9% of all payee types and, if they showed no preference, would be expected to be near 7.9% of patient visits to each type of practice. The Medicaid payees go to pediatricians at a much higher rate than the expected 7.9% of visits to pediatricians. They were responsible for 20.0% of the 2651 pediatric patients and the probability that the null hypothesis is true (actual number of visits was the expected 7.9% of visits to pediatricians) is less than 0.0005 (χ2=533.45, p<0.0005), showing that these payees go to pediatricians at a significantly higher rate.
Medicaid by Physician Specialty (Plot 4)
The higher rate of Medicaid payees among pediatric visits is not unexpected. The average age for Medicaid payees is 27.47 years with a standard deviation of 24.43 years. The distribution for this population is not normal, as plot 5 shows.
  46. 46. Distribution of Medicaid population (Plot 5)
[Histogram of AGE for Medicaid payees: Mean = 27.5, Std. Dev = 24.43, N = 1945]
The distribution of Medicaid payees is skewed towards the younger ages and it is the youngest of all payee types.
Age Statistics by Payee Type (Table 10)
AGE
Payee Type            Mean       N   Std. Deviation
Private Insurance    36.22   12562            21.24
Medicare             71.61    5395            13.01
Medicaid             27.47    1945            24.43
Worker's Comp.       41.23     503            12.91
Self-Pay             36.88    2176            19.34
All Other            41.59    2029            22.02
Total                43.89   24610            24.81
95% of the pediatric visits were by patients 20 years or younger (plot 6) and since the Medicaid population is the youngest, it would carry a disproportionately higher rate of visits.
  47. 47. Age of Pediatric Patients (Plot 6)
[Histogram of AGE of pediatric patients: Mean = 5.3, Std. Dev = 7.33, N = 2651]
There were other practices that the Medicaid payees visited at lower than expected rates. By removing the pediatric visits, it can be determined how the Medicaid population visited the other practices. A new expected percentage of 6.4% of Medicaid payees is used to re-evaluate the data with the DFEPP plot.
  48. 48. Modified Medicaid by Physician Specialty (Plot 7)
Urology and orthopedic surgery practices were visited at lower than expected rates, but this is merely a reflection of the disproportionately higher visits by the Medicare and workers compensation payees to these practice types, respectively. The OB/GYN visits are significantly higher than expected, but this population contains a greater percentage of women of childbearing age and, with the significantly lower number of Medicare patients attending this practice, a higher-than-expected share of OB/GYN visits is to be expected. Surprisingly, the visits to dermatologists by the Medicaid payees are significantly lower than expected. Many people believe that dermatology patients are mainly children with acne problems. Plot 8 shows how the dermatology visits are distributed by age:
  49. 49. Age of Dermatology Patients (Plot 8)
[Histogram of AGE of all dermatologists' patients: Mean = 46.4, Std. Dev = 22.49, N = 1409]
The average age of dermatology patients is 46.4 years with a standard deviation of 22.49 years. The population of dermatology patients is far older than the Medicaid payees and would therefore include fewer Medicaid payees. Although the Medicaid population had a disproportionately higher number of pediatric visits, the lack of pediatric visits in the Medicare population offsets this higher rate, so the remaining payee types can still be near their expected distribution for pediatric visits. The other preferences shown by the Medicaid population are merely a reflection of other payee types disproportionately visiting those practices.
  50. 50. 4.1.4 Self-Pay
Self-Pay payees were 8.8% of all payee types and, if they showed no preference, would be expected to be near 8.8% of patient visits to each type of practice. The Self-Pay payees go to psychiatrists at a much higher rate than expected.
Self Pay by Physician Specialty (Plot 9)
They represented 26.0% of the 1461 psychiatric patients and the probability that the null hypothesis is true (actual number of visits was the expected 8.8% of visits to psychiatrists) is less than 0.0005 (χ2=538.55, p<0.0005), showing that these payees go to psychiatrists at a significantly higher rate. This suggests two possibilities: that the uninsured have more problems that require psychiatric visits, or that insurance will not pay for psychiatric visits. The first possibility is hard to explore, but the second can be explored indirectly. There was no variable to determine if someone was insured
  51. 51. but there was a variable to determine if the patient was a member of an HMO. Overall 25.1% of patients were HMO members, but only 4.4% of Self-Pay payees were members of an HMO. When isolating the self-pay psychiatric visits, 8.4% of the patients were members of an HMO. This shows that the self-pay patients going to psychiatrists were more apt to be HMO members when compared to the entire self-pay population and therefore were more apt to have insurance. Another consideration is that people did not pay for these visits with insurance and therefore would not mark down whether or not they belonged to an HMO. Regardless, self-pay payees do visit psychiatrists at a significantly higher rate and at least 8.4% of the visits were by insured patients. Self-pay payees also visited dermatologists at a significantly higher rate. They represented 16.9% of the 1409 dermatology patients and the probability that the null hypothesis is true (actual number of visits was the expected 8.8% of visits to dermatologists) is less than 0.0005 (χ2=115.19, p<0.0005), showing that self-pay payees go to dermatologists at a significantly higher rate. Only 5.0% of the dermatology self-pay payees were members of an HMO, which is not significantly higher than the 4.4%; however, there is another way to evaluate the HMO data. There were four different responses to the HMO question: Yes, No, Unknown, and Blank.
  52. 52. HMO Membership by Physician Specialty (Table 11)
Does the patient belong to an HMO? (Row %)
Physician Specialty                 yes       no  unknown    blank
General and family practice        2.2%    77.1%    19.6%     1.1%
Internal medicine                  7.0%    84.2%     8.8%
Pediatrics                         3.0%    89.1%     6.7%     1.2%
General surgery                    1.6%    90.5%     7.9%
Obstetrics and gynecology          1.5%    92.6%     5.9%
Orthopedic surgery                 5.0%    82.5%     2.5%    10.0%
Cardiovascular disease                     95.7%     4.3%
Dermatology                        5.0%    72.3%    22.3%      .4%
Urology                                    96.4%     3.6%
Psychiatry                         8.4%    54.7%    35.5%     1.3%
Neurology                                  89.1%    10.9%
Ophthalmology                      3.9%    76.6%    18.0%     1.6%
Otolaryngology                     6.1%    89.0%     3.7%     1.2%
All other                          5.4%    59.5%    35.1%
The “Unknown” responses are more than likely insured patients who do not know if they have an HMO plan, not uninsured patients who are unsure if they have an HMO (insurance) plan. By collecting the “Yes” and “Unknown” responses into an “Insured” response and the “No” responses into a “Possible but No HMO” response, the patients who are likely insured and those who are possibly insured can be separated, yielding the following distribution (Table 12):
  53. 53. Has Insurance by Physician Specialty (Table 12)
Has Insurance (Row %)
Physician Specialty             Insured   Possible but No HMO   Blank
General and family practice       21.8%                 77.1%    1.1%
Internal medicine                 15.8%                 84.2%
Pediatrics                         9.7%                 89.1%    1.2%
General surgery                    9.5%                 90.5%
Obstetrics and gynecology          7.4%                 92.6%
Orthopedic surgery                 7.5%                 82.5%   10.0%
Cardiovascular disease             4.3%                 95.7%
Dermatology                       27.3%                 72.3%     .4%
Urology                            3.6%                 96.4%
Psychiatry                        43.9%                 54.7%    1.3%
Neurology                         10.9%                 89.1%
Ophthalmology                     21.9%                 76.6%    1.6%
Otolaryngology                     9.8%                 89.0%    1.2%
All other                         40.5%                 59.5%
Total                             24.3%                 74.8%     .9%
The modified distribution gives a better idea of which patients went to the various practices with insurance. On average, at least 24.3% of self-pay payees had insurance, but self-pay visits to psychiatrists had a much higher rate of insured patients (43.9%), possibly showing that insurance companies tend not to cover psychiatric services and the patients have to pick up the cost. The dermatology patients also show a disproportionately higher number of self-pay payees, but their insured percentage is not that different from that of the rest of the self-pay payees.
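The recoding of the HMO responses into the “Has Insurance” variable amounts to a simple mapping. The Python sketch below is illustrative only, not the procedure actually used for the thesis, and the names RECODE and recode_row are hypothetical. It applies the mapping to the self-pay psychiatry and dermatology row percentages from Table 11 and recovers the corresponding Table 12 values.

```python
# Mapping from the original HMO responses to the "Has Insurance" categories.
RECODE = {
    "yes": "Insured",
    "unknown": "Insured",
    "no": "Possible but No HMO",
    "blank": "Blank",
}

def recode_row(row_percentages):
    """Combine HMO response percentages into the 'Has Insurance' categories."""
    combined = {}
    for response, percent in row_percentages.items():
        label = RECODE[response]
        combined[label] = combined.get(label, 0.0) + percent
    return {label: round(value, 1) for label, value in combined.items()}

# Row percentages for self-pay visits, copied from Table 11.
psychiatry = {"yes": 8.4, "no": 54.7, "unknown": 35.5, "blank": 1.3}
dermatology = {"yes": 5.0, "no": 72.3, "unknown": 22.3, "blank": 0.4}

print("Psychiatry: ", recode_row(psychiatry))
print("Dermatology:", recode_row(dermatology))
```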
  54. 54. 4.1.5 Privately Insured
The privately insured went to many practice types at highly disproportionate rates. They were the second youngest population within this study, and would be expected to favor certain practices.
Privately Insured by Physician Specialty (Plot 10)
They were not old enough to tend to see cardiologists for heart disease or ophthalmologists for failing eyesight, but they were young enough to be of childbearing age and would tend to see OB/GYNs and pediatricians. The disproportionate visits to cardiologists, OB/GYNs, ophthalmologists, and pediatricians within the privately insured population are nearly in inverse proportion to the Medicare population’s visits for these four practices. The otolaryngology visits are the only disproportionately high visit rate in the privately insured population that is not readily explained. The privately insured go to otolaryngologists at a much higher rate than the expected 51.0% of otolaryngology visits. They were responsible for 63.3% of the 1175 otolaryngology patients and the probability that the null hypothesis is true (actual number
  55. 55. of visits was the expected 51.0% of visits to otolaryngologists) is less than 0.0005 (χ2=71.13, p<0.0005), showing that these payees go to otolaryngologists at a significantly higher rate. Other practice types outside of the five mentioned also showed statistically significant differences, but none deviated from the expected percentage by more than 10%, so they are not analyzed.
4.1.6 All Other
The last payee type analyzed is the “All Other” payee type, which consists of the no charge, other, unknown, and blank payee types. There were too few visits in each of these subcategories to analyze effectively, but the combined “All Other” payee type had a sufficient number of visits to analyze. The all other payees were 8.2% of the population and would be expected to be near 8.2% of patient visits to each practice type.
All Other Payees by Physician Specialty (Plot 11)
  56. 56. The visits by the all other payee type to each of the practice types were all within 10% of the expected percentage of 8.2%. Only the neurology visits approached the 10% difference threshold used as a cutoff point. By using the HMO variable, the patients with insurance can be extracted.
Has Insurance by Physician Specialty (Table 13)
Has Insurance (Row %)
Physician Specialty             Insured   Possible but No HMO   Blank
General and family practice       59.2%                 35.0%    5.8%
Internal medicine                 61.9%                 31.9%    6.2%
Pediatrics                        78.0%                 18.1%    4.0%
General surgery                   25.3%                 50.0%   24.7%
Obstetrics and gynecology         60.4%                 37.9%    1.6%
Orthopedic surgery                52.0%                 40.0%    8.0%
Cardiovascular disease            73.3%                 19.8%    7.0%
Dermatology                       44.7%                 48.5%    6.8%
Urology                           51.2%                 41.5%    7.3%
Psychiatry                        49.1%                 47.3%    3.6%
Neurology                         73.3%                 25.0%    1.7%
Ophthalmology                     63.8%                 28.9%    7.2%
Otolaryngology                    28.6%                 63.3%    8.2%
All other                         58.0%                 40.6%    1.3%
Total                             57.8%                 35.8%    6.4%
Note that 57.8% of the all other payee type may have had some insurance. The variable “Has Insurance” was previously defined in section 4.1.4. Similarly, 73.3% of the neurology patients may have had insurance. This is significantly higher than the overall 57.8% average for this payee type, suggesting that insurance tends not to cover neurology visits as well as the other practice types. Pediatric visits by this payee type also had a
  57. 57. significantly higher number of insured visitors. By reviewing the ICD-9 codes, the reasons patients went to the different practices can be determined.
  58. 58. Has Insurance by ICD-9 Codes/Pediatric (Table 14)
Physician Specialty: Pediatrics
Layer % by Has Insurance (Insured, Possible but No HMO, Blank), listed per ICD-9 Code Category; empty cells are omitted
Infectious and parasitic diseases: 4.5%, 1.7%
Neoplasms:
Endocrine, nutritional and metabolic diseases, and immunity: .6%, .6%
Diseases of the blood and blood-forming organs: .6%
Mental disorders: .6%, .6%
Diseases of the nervous system and sense organs: 10.7%, 2.8%
Diseases of the circulatory system:
Diseases of the respiratory system: 16.9%, 5.1%, .6%
Diseases of the digestive system: 2.8%, .6%
Diseases of the genitourinary system: .6%
Complications of pregnancy, childbirth, and the puerperium:
Diseases of the skin and subcutaneous tissue: 2.8%, .6%, .6%
Diseases of the musculoskeletal system and connective tissue: .6%
Congenital anomalies: .6%
Symptoms, signs, and ill-defined conditions: 2.8%
Injury and poisoning: 2.8%, .6%
Supplementary classification of factors influencing health s: 32.2%, 5.1%, 2.3%
  59. 59. As shown, 26.6% of the pediatric visits in the insured all other payee type were for diagnosis V20.2 (routine infant or child health check), a subset of the 32.2% classified under “Supplementary classification of factors influencing health”. This group also went for diseases of the nervous system and sense organs (10.7%; hearing loss and ear infections) and of the respiratory system (16.9%; sore throats, tonsillitis, and colds).
All Pay Methods by Insurance/Pediatrics (Table 15)
Physician Specialty: Pediatrics
Primary expected source   Insured               Possible but No HMO   Blank
of payment for the visit      Count     Row %       Count     Row %       Count     Row %
Private Insurance               924     52.6%         822     46.8%          11       .6%
Medicare                          6     28.6%          15     71.4%
Medicaid                        137     25.8%         393     74.0%           1       .2%
Worker's Compensation
Self-pay                         16      9.7%         147     89.1%           2      1.2%
No charge                         1     25.0%           3     75.0%
Other                           122     81.9%          25     16.8%           2      1.3%
Unknown                           8     72.7%           3     27.3%
Blank                             7     53.8%           1      7.7%           5     38.5%
When looking at the expanded list of pay methods, 81.9% of the pediatric visits paid for by “Other” means were by insured patients. This could merely be families using local government funded health clinics for pediatric visits.
  60. 60. Has Insurance by ICD-9 Codes/Neurology (Table 16)
Physician Specialty: Neurology
Layer % by Has Insurance (Insured, Possible but No HMO, Blank), listed per ICD-9 Code Category; empty cells are omitted
Infectious and parasitic diseases: .8%, .8%
Neoplasms: 1.7%
Endocrine, nutritional and metabolic diseases, and immunity:
Diseases of the blood and blood-forming organs:
Mental disorders: 2.5%, .8%
Diseases of the nervous system and sense organs: 30.8%, 5.0%, 1.7%
Diseases of the circulatory system: .8%, 1.7%
Diseases of the respiratory system:
Diseases of the digestive system:
Diseases of the genitourinary system:
Complications of pregnancy, childbirth, and the puerperium:
Diseases of the skin and subcutaneous tissue:
Diseases of the musculoskeletal system and connective tissue: 5.8%, 5.0%
Congenital anomalies: .8%, .8%
Symptoms, signs, and ill-defined conditions: 17.5%, 3.3%
Injury and poisoning: .8%, 3.3%
Supplementary classification of factors influencing health s: 11.7%, 4.2%
Of the all other visits to neurologists, 30.8% were for diseases of the nervous system and sense organs; these visits were not covered by insurance even though the patient probably had insurance. Another 17.5% went for symptoms, signs, and ill-defined conditions
  61. 61. (apnea/convulsions/nervous system injury) and 11.7% of the visits were for follow-ups and paperwork.

All Pay Methods by Insurance/Neurology (Table 17)
Physician Specialty: Neurology
Rows: primary expected source of payment for the visit. Columns: Has Insurance.

                         Insured            Possible but No HMO    Blank
                         Count    Row %     Count    Row %         Count    Row %
Private Insurance        134      42.7%     180      57.3%
Medicare                 16       12.7%     109      86.5%         1        .8%
Medicaid                 3        5.1%      56       94.9%
Worker's Compensation    11       37.9%     18       62.1%
Self-pay                 6        10.9%     49       89.1%
No charge                2        100.0%
Other                    16       39.0%     25       61.0%
Unknown                  71       97.3%     2        2.7%
Blank                    1        25.0%     1        25.0%         2        50.0%

A majority of the neurology visits in this payee type had an unknown method of payment. This may show a tendency for insurance not to cover neurological disorders, leaving patients to pay for these problems themselves. Although visits to neurologists by the all other payee type do show significance, this payee type has the least significant difference of all types.

Each payee type showed significant disproportions in the practices its patients visited. Many were expected, but a few were not easily explained. The workers’ compensation payees went for bumps and bruises; the Medicare population went to practices that serve ailments in older patients. The Medicaid population is very young and sees practices that serve children and adults of childbearing age. The privately insured visited many practices disproportionately but most of the differences could be
  62. 62. attributed to the other payee types’ disproportions. Self-pay patients tended to pay for psychiatry and dermatology visits at a significantly higher rate, showing that visits to these practices are not covered by insurance as well as visits to the other practices. The “All Other” payees went to neurologists at a significantly higher rate, and a majority of them had an unknown way of paying for these services.

4.2 HMOs

What do HMOs pay for? Who are members of HMOs? Are the practices visited significantly different when compared to the non-HMO population? These are all questions that can be answered by an analysis of this dataset. Initially there were four different types of responses to the question of whether or not the patient was a member of an HMO (yes, no, unknown, and left blank).

HMO Membership (Table 18)
Does the patient belong to an HMO?   Count    Col %
yes                                  6187     25.1%
no                                   15853    64.4%
unknown                              2242     9.1%
blank                                328      1.3%

By assuming that the unknown and blank responses are distributed proportionately across the yes and no responses, and removing them, the real proportion of HMO membership may be estimated.

Adjusted HMO Membership (Table 19)
  63. 63. Does the patient belong to an HMO?   Count    Col %
yes                                  6187     28.1%
no                                   15853    71.9%

Using the adjusted figures, 28.1% of this population are aware that they are members of an HMO and 71.9% are aware that they are not.

Who are members of HMOs? The younger patients (under 65 years) were more apt to be members of HMOs than patients in the oldest two age groups (65-74 years and 75 years and over).

HMO Membership Percent by Age (Plot 12): % of HMO Membership by age group (Under 15 years, 15-24 years, 25-44 years, 45-64 years, 65-74 years, 75 years and over)

In all, 28.1% of patients were members of an HMO, but three of the age groups deviated significantly from the expected proportion.

HMO Membership by Age Group (Plot 13)
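The adjusted percentages in Table 19 follow from dropping the unknown and blank responses in Table 18 and re-percentaging the remaining yes and no counts. A minimal sketch of that step is given here, assuming nothing beyond the published counts; the function name `adjusted_shares` is hypothetical.

```python
def adjusted_shares(yes, no):
    """Drop unknown/blank responses and re-percentage the yes/no counts."""
    known = yes + no
    if known == 0:
        raise ValueError("no yes/no responses to re-percentage")
    return round(100.0 * yes / known, 1), round(100.0 * no / known, 1)


# Counts from Table 18; the 2242 unknown and 328 blank responses are excluded.
print(adjusted_shares(6187, 15853))  # (28.1, 71.9)
```

The 28.1% figure produced this way is the baseline used as the expected rate in the comparisons that follow.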
  64. 64. The membership in the two older age groups is significantly lower than for the other groups (patients 75 years and over: 13.7% actual vs. 28.1% expected, χ²=281.11, p<0.0005; patients 65-74 years: 17.5% actual vs. 28.1% expected, χ²=167.34, p<0.0005). The youngest age group had a significantly higher rate of membership in HMOs (38.9% actual vs. 28.1% expected, χ²=214.07, p<0.0005). A majority of the patients in the two older age groups are eligible for Medicare.

HMO Membership by Payee Type (Plot 14)

Surprisingly, the Medicare population does not have the lowest rate of HMO membership (6.7% actual vs. 28.1% expected, χ²=1133.35, p<0.0005). People who had to self-pay had the lowest membership rate (5.5% actual vs. 28.1% expected, χ²=435.58, p<0.0005). This could be for a variety of reasons: self-pay patients without insurance would not have an HMO membership, or, if the patient was a member and the visit was not covered, they may not have marked being an HMO member. The significantly lower rate for workers’ compensation visits (11.7% actual vs. 28.1% expected, χ²=44.46, p<0.0005) may be due to workers’ compensation, rather than the patient’s private insurance, paying for the visit. The patients therefore may not have marked HMO coverage even if they were members. The Medicaid population is the youngest population and should
  65. 65. follow the younger age groups’ higher level of HMO membership, but its membership rate is significantly lower than expected (14.6% actual vs. 28.1% expected, χ²=164.62, p<0.0005). A reason for the low rate could be that Medicaid programs are state run and only some of the states have HMO options. Also, these data were collected in 1997, when the concept of Medicaid HMOs was not widely implemented. The significantly higher rate for the privately insured (39.9% actual vs. 28.1% expected, χ²=7692.22, p<0.0005) is partially explained by the lack of HMO coverage in the state-run programs, which reduces the expected average. The higher rate does show that the privately insured are much more likely to have been members of an HMO than any other insured type. The all other payees show the greatest deviation from the expected membership rate. They have a significantly higher rate of HMO membership (52.9% actual vs. 28.1% expected, χ²=469.71, p<0.0005). Many of the reasons why people were in this group were explored in the previous section (publicly funded family clinics, uncovered neurology visits).

What do HMOs pay for?

HMO Membership by Physician Specialty (Plot 15)
  66. 66. Plot 15 shows the rates of HMO membership for each physician type. HMO members account for 43.6% of visits to pediatricians, which is significantly higher than the expected rate (43.6% actual vs. 28.1% expected, χ²=297.16, p<0.0005). Pediatric patients are young and would be expected to follow the higher rate of membership of the younger age groups, but the higher rate cannot be completely explained by this. Many of the lower than expected rates can be attributed to age group preferences, such as the older age groups’ preferred practice types (urology, cardiology, and ophthalmology) with their lower rate of membership. OB/GYN visits are mainly for a younger population and that rate would be expected to be higher. There are other differences, but most of them can be correlated to age group preferences.

Does any race favor HMOs? When each race is compared to the 28.1% baseline, the Asian/Pacific Islander population has a significantly higher rate of HMO membership.

HMO Membership by Race (Plot 16)

The age of Asian/Pacific Islander patients is not significantly different from that of the other races, so age cannot explain the higher HMO rate. Another factor such as location or culture may play a role.

Age Statistics by Race (Table 20)
  67. 67. AGE
RACE                      Mean     N        Std. Deviation
White                     44.42    19186    25.08
Black                     39.58    2154     24.42
Asian/Pacific Islander    41.62    700      23.91
Total                     43.86    22040    25.03
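As a consistency check on Table 20, the Total mean should equal the N-weighted average of the three race means. The sketch below shows that arithmetic only; the function name `pooled_mean` is hypothetical, and the inputs are the rounded means and counts printed in the table.

```python
def pooled_mean(groups):
    """Return the N-weighted mean of (mean, n) pairs."""
    total_n = sum(n for _, n in groups)
    if total_n == 0:
        raise ValueError("no observations")
    return sum(mean * n for mean, n in groups) / total_n


# (mean age, N) for White, Black, and Asian/Pacific Islander from Table 20.
table_20 = [(44.42, 19186), (39.58, 2154), (41.62, 700)]
print(round(pooled_mean(table_20), 2))  # 43.86
```

The three group counts also sum to the 22040 shown in the Total row.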
  68. 68. Distribution of Asian/Pacific Islander Age (Plot 17): histogram of AGE, from 0 to 90 years

The age distribution of the Asian/Pacific Islander population is not skewed toward the younger ages.

Membership in an HMO differed greatly when looking at the six age groups and three races. The older the patient, the less likely they were to be a member of an HMO. Most of this is due to the Medicare population’s lack of HMO membership. The youngest age group was most likely to be a member of an HMO. If the patient is an Asian/Pacific Islander, they are more likely to be an HMO member than a member of another race. Privately insured and “All Other” payees had the greatest membership, while Medicare, Medicaid, self-pay, and workers’ compensation payees had significantly lower than average membership rates. Different practices were disproportionately visited by HMO members at significant rates. Much of this is due to the type of practice and the patients’ ages. Practices that see predominantly older patients will have a lower rate of HMO members. Conversely, practices that see predominantly younger patients will have a higher rate of HMO members.
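The comparisons in this section each set a group's actual HMO membership rate against the 28.1% baseline and report a χ² statistic. The following sketch shows one way such a statistic could be computed, assuming a goodness-of-fit test with one degree of freedom on member and non-member counts; this section does not spell out the exact computation, so the function `chi_square_vs_baseline` and the counts in the usage line are hypothetical illustrations rather than the thesis's own procedure or data.

```python
import math


def chi_square_vs_baseline(members, total, baseline):
    """Goodness-of-fit chi-square (1 df) of a group's membership vs. a baseline rate."""
    expected_members = total * baseline
    expected_nonmembers = total * (1.0 - baseline)
    nonmembers = total - members
    statistic = ((members - expected_members) ** 2 / expected_members
                 + (nonmembers - expected_nonmembers) ** 2 / expected_nonmembers)
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical group: 200 HMO members among 1000 visits, baseline 28.1%.
stat, p = chi_square_vs_baseline(members=200, total=1000, baseline=0.281)
print(f"chi-square = {stat:.2f}, significant at 0.0005: {p < 0.0005}")
```

Because the statistic grows with the number of visits in the group, the same gap between actual and expected rates yields a larger χ² for a larger group.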
