Data collection & management
Medicine & Society II: Collecting & Managing Data
Dr Azmi Mohd Tamil, Dept. of Community Health, Faculty of Medicine, UKM
Notes partially based on a lecture by Assoc. Prof. Dr. Roslina Abd. Manap
Sampling
Choosing a relatively small subset that adequately represents the entire spectrum of population subjects.
The aim is to extrapolate the results back to a substantially larger population, which saves time and money and improves efficiency and safety.
SAMPLING
PROBABILITY SAMPLING - every subject has a known, non-zero chance of being selected (an equal chance under simple random sampling)
• simple random
• systematic
• stratified
• multistage
• cluster
NON-PROBABILITY SAMPLING
• convenience
• quota
• purposive
SAMPLING & TYPE OF POPULATION
Is the selection representative of the population?
Sampling methods:
- simple random sampling (may not be practical in a national study)
- stratified random sampling (for a heterogeneous population, sampled by stratum)
- multistage sampling (national, state, district, sub-district, village)
- cluster sampling
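The first three probability methods can be sketched in a few lines of Python. This is only a minimal illustration using the standard library. The sampling frame of 100 subjects, the "area" stratum and the function names are all hypothetical and are not part of the lecture.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement; every unit has the same chance."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def systematic_sample(frame, n, seed=None):
    """Pick a random start, then take every k-th unit, where k = N // n."""
    k = len(frame) // n
    if k < 1:
        raise ValueError("sample size exceeds the size of the sampling frame")
    rng = random.Random(seed)
    start = rng.randrange(k)
    return [frame[start + i * k] for i in range(n)]

def stratified_sample(frame, stratum_of, fraction, seed=None):
    """Take a simple random sample of the same fraction within each stratum."""
    rng = random.Random(seed)
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for units in strata.values():
        size = max(1, round(len(units) * fraction))
        sample.extend(rng.sample(units, size))
    return sample

# Hypothetical sampling frame: 100 subjects, alternating urban/rural.
frame = [{"id": i, "area": "urban" if i % 2 else "rural"} for i in range(1, 101)]

srs = simple_random_sample(frame, 10, seed=1)
sys_ = systematic_sample(frame, 10, seed=1)
strat = stratified_sample(frame, lambda u: u["area"], 0.1, seed=1)
```

In this made-up frame the list alternates between urban and rural, so a systematic sample with an interval of 10 picks subjects from one area only. The stratified sample takes five subjects from each area.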
Data Collection
Data collection begins after deciding on the design of the study and the sampling strategy.
Data Collection
Sample subjects are identified and the required individual information is obtained in an item-wise and structured manner.
Data Collection
Information is collected on certain characteristics, attributes and qualities of interest from the samples. These data may be quantitative or qualitative in nature.
Types of Variables
Qualitative - categorised based on the characteristics that differentiate them, e.g. ethnicity: Malay, Chinese, Indian etc. Qualitative variables can be classed as nominal or ordinal.
Quantitative - numerical values collected by observation, by measurement or by counting. Can be either discrete or continuous.
Variable Classification
Quantitative
• Discrete - from counting, e.g. number of children/wives.
• Continuous - can be in fractions, from measurement, e.g. blood pressure, haemoglobin level.
Qualitative
• Nominal - no rank nor specific order, e.g. ethnicity: M, C, I & O.
• Ordinal - has rank/order between categories, but the difference cannot be measured.
Types of Data
Table 1.1 Examples of types of data
Quantitative
  Continuous: blood pressure, height, weight, age
  Discrete: number of children; number of attacks of asthma per week
Categorical
  Ordinal (ordered categories): grade of breast cancer; better, same, worse; disagree, neutral, agree
  Nominal (unordered categories): sex (male/female); alive or dead; blood group O, A, B, AB
Source: http://www.bmj.com/collections/statsbk/
SO WHAT!
So what's the big deal about data types?
Statistical Tests - Qualitative
Type of Data Dictates Type of Analysis - Quantitative
Data Collection Techniques
• Use available information
• Observation
• Interviews
• Questionnaires
• Focus group discussion
Using Available Information
Existing records:
• Hospital records - case notes
• National registry of births & deaths
• Census data
• Data from other surveys
Disadvantages of using existing records
• Incomplete records
• Cause of death may not be verified by a physician/MD
• Missing vital information
• Difficult to decipher
• May not be representative of the target group - only severe cases go to hospital
Disadvantages of using existing records
• Delayed publication - obsolete data
• Different methods of data recording between institutions, states and countries make comparison and pooling of data difficult
• Comparisons across time are difficult due to differences in classification, diagnostic tools etc.
Advantages of using existing records
• Cheap
• Convenient
• In some situations it is the only data source, e.g. accidents & suicides
Observation
• Involves systematically selecting, watching & recording the behaviour and characteristics of living beings, objects or phenomena
• Done using defined scales
• Participant observation, e.g. PEF and asthma symptom diary
• Non-participant observation, e.g. cholesterol levels
Interviews
• Oral questioning of respondents, either individually or as a group
• Can be loosely structured, or highly structured using a questionnaire
Administering Written Questionnaires
Self-administered:
• via mail
• by gathering the respondents in one place and getting them to fill it in
• by hand-delivering the questionnaires and collecting them later
Large non-response can distort results
Questionnaires
• Influenced by the education & attitude of the respondent, especially when self-administered
• Interviewers need to be trained
• Open-ended vs closed-ended questions
• The need for pre-testing or a pilot study
Issues at stake
• Content validity
• Structural validity
• Criterion validity
Focus group discussion
Selecting parties relevant to the research questions at hand and discussing with them in focus groups.
Examples in your own field of interest?
Sources of bias during data collection
Defective instruments
• closed-ended questions with a poor choice of options
• open-ended questions with no guidelines
• vaguely phrased questions
• illogical sequence of questions
• weighing scales that are not standardised
Sources of bias during data collection
Observer bias
• reporting of radiographs
Effect of interview on respondent
Attitude of respondent
• cough may be ignored by a smoker
• stigmatised diseases may not be disclosed
Plan for data collection
• Permission to proceed
• Logistics - who will collect what, when and with what resources
• Quality control
Quality of Data
How well do the variables designed for the study represent the phenomena of interest?
E.g. how well does FBS (fasting blood sugar) represent control of diabetes?
Accuracy & Reliability
Accuracy - the degree to which a measurement actually measures the characteristic it is supposed to measure.
Reliability - the consistency of replicate measures.
Reliability & Accuracy
Accuracy & Reliability
Reliability is reduced mainly by random error and accuracy mainly by systematic error. Both kinds of error arise from the same sources of variability:
• the data collectors
• the respondents
• the instrument
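As a rough illustration of the two kinds of error, the sketch below summarises replicate readings of a known reference weight on three weighing scales. The readings and the 70 kg reference are invented for this example. Using the mean deviation from the true value for systematic error and the standard deviation of replicates for random error is one simple choice, not the only possible one.

```python
import statistics

def bias(measurements, true_value):
    """Systematic error: how far the average reading is from the true value."""
    return statistics.mean(measurements) - true_value

def spread(measurements):
    """Random error: sample standard deviation of the replicate readings."""
    return statistics.stdev(measurements)

true_weight = 70.0  # hypothetical reference weight in kg

# Hypothetical replicate readings from three scales.
scale_a = [70.1, 69.9, 70.0, 70.2, 69.8]  # accurate and reliable
scale_b = [72.1, 71.9, 72.0, 72.2, 71.8]  # reliable, but reads about 2 kg high
scale_c = [67.0, 73.5, 70.5, 66.5, 72.5]  # correct on average, but unreliable

for name, readings in [("A", scale_a), ("B", scale_b), ("C", scale_c)]:
    print(name, round(bias(readings, true_weight), 2), round(spread(readings), 2))
```

Scale B shows that consistent replicate readings do not guarantee accuracy. Scale C shows that a correct average can hide poor reliability.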
Strategies to enhance accuracy & reliability
• Standardise procedures and measurement methods
• Train & certify the data collectors
• Repetition
• Blinding
Data handling
• Check the data gathered
• Storing of data - backup, backup & backup some more!
Data Management
Data processing
• Categorising
• Coding
• Data entry
• Verification/validation
Labels & Coding
Variable Labels
• Unique
• Not more than 8 characters
• Consists of letters and numbers only
• Begins with a letter instead of a number
• Try to give a label that means something
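The label rules above can be checked mechanically before data entry starts. The sketch below is a minimal Python example. The function name and the sample labels are hypothetical, and treating labels that differ only in letter case as duplicates is an assumption made here, not a rule stated in the lecture.

```python
import re

# 1 to 8 characters, letters and digits only, first character a letter.
LABEL_RULE = re.compile(r"[A-Za-z][A-Za-z0-9]{0,7}")

def check_labels(labels):
    """Return a dict mapping each problematic label to the rule it breaks."""
    problems = {}
    seen = set()
    for label in labels:
        if not LABEL_RULE.fullmatch(label):
            problems[label] = "must be 1-8 letters/digits and begin with a letter"
        elif label.lower() in seen:
            # Assumption: uniqueness is checked ignoring letter case.
            problems[label] = "duplicate label"
        seen.add(label.lower())
    return problems

# Hypothetical variable labels for a small study.
problems = check_labels(["age", "sex", "bp_sys", "2ndvisit", "haemoglobin", "age"])
```

Here "bp_sys" contains an underscore, "2ndvisit" begins with a digit, "haemoglobin" is longer than 8 characters, and the second "age" is a duplicate.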
Coding
• Determine the coding to be used for each variable.
• For qualitative variables, it is recommended to use numerical codes to represent the groups, e.g. 1 = male and 2 = female; this also simplifies the data entry process. The "danger" of using string/text is that a lower-case "male" is treated as different from a capitalised "Male".
• See Table I.
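The problem with free-text entries can be seen in a short Python sketch. The codebook follows the 1 = male, 2 = female example above. The function name and the raw responses are made up for illustration.

```python
# Codebook from the example: 1 = male, 2 = female.
SEX_CODES = {"male": 1, "female": 2}

def encode(response, codebook):
    """Map a text response to its numerical code, ignoring case and stray spaces."""
    key = response.strip().lower()
    if key not in codebook:
        raise ValueError(f"no code defined for response {response!r}")
    return codebook[key]

# Hypothetical raw entries typed as text by different operators.
raw = ["male", "Male", " FEMALE ", "female"]

distinct_as_text = len(set(raw))             # 4 "different" groups as text
coded = [encode(r, SEX_CODES) for r in raw]  # [1, 1, 2, 2]
distinct_as_codes = len(set(coded))          # 2 groups once coded
```

Compared as text, the four entries form four groups. After coding, only the two intended groups remain.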
Coding for Dichotomous Variables
It is advisable to use 1 = present, 0 = absent, or 1 = higher risk, 0 = lower risk.
Coding for Missing Values or Blank Responses
Usually required only for qualitative variables.
Conventionally coded using a value that is not part of a valid response. For example:
• Gender: M = 1, F = 2, MV = 9
• Ethnicity in East Malaysia: codes 1 to 14 for races, MV = 99
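Missing-value codes have to be excluded before analysis, otherwise 9 or 99 would be counted as if it were a real category. Below is a minimal Python sketch using the codes from the examples above. The records and the function names are hypothetical.

```python
from collections import Counter

# Missing-value codes from the examples: gender MV = 9, ethnicity MV = 99.
MISSING = {"gender": {9}, "ethnic": {99}}

def valid_values(records, variable):
    """Return the values of a variable, leaving out its missing-value codes."""
    missing_codes = MISSING.get(variable, set())
    return [r[variable] for r in records if r[variable] not in missing_codes]

def count_missing(records, variable):
    """Count how many records carry a missing-value code for the variable."""
    return len(records) - len(valid_values(records, variable))

# Hypothetical coded records.
records = [
    {"gender": 1, "ethnic": 3},
    {"gender": 2, "ethnic": 99},
    {"gender": 9, "ethnic": 14},
    {"gender": 2, "ethnic": 1},
]

gender_counts = Counter(valid_values(records, "gender"))  # 1 male, 2 female
gender_missing = count_missing(records, "gender")         # 1 record
ethnic_missing = count_missing(records, "ethnic")         # 1 record
```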
Advantages of Coding
• Reduces the time needed for data entry.
• Makes analysis possible, e.g. SPSS won't analyse string responses of more than 8 characters.
• Needs a proper coding manual.
Guides on how to define variables and coding for applications such as SPSS and Excel are available at the dept website:
http://184.108.40.206/spss/
http://220.127.116.11/excel/
Data Entry
• Independent operator verification
• Random check of data entered against the original
• <5% error by convention
• Some checks are built into the software, e.g. EpiInfo
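A random check of entered data against the original forms might look like the Python sketch below. The records, the deliberately mistyped value and the function names are hypothetical. Counting the error rate per field is an assumption made here, since the slide does not say whether the 5% convention applies to fields or to whole records.

```python
import random

def records_to_check(n_records, n_check, seed=None):
    """Randomly choose which entered records to compare with the source forms."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_records), n_check))

def entry_error_rate(entered, original, indices):
    """Share of checked fields where the entered value differs from the original."""
    fields = mismatches = 0
    for i in indices:
        for key, true_value in original[i].items():
            fields += 1
            if entered[i].get(key) != true_value:
                mismatches += 1
    return mismatches / fields

# Hypothetical source forms: 10 records with 4 fields each.
originals = [
    {"id": i, "sex": 1 + i % 2, "age": 20 + i, "ethnic": 1 + i % 3}
    for i in range(10)
]
# Entered data with one mistyped value (age 23 keyed in as 32).
entered = [dict(r) for r in originals]
entered[3]["age"] = 32

full_rate = entry_error_rate(entered, originals, range(10))  # 1 error in 40 fields
subset = records_to_check(10, 5, seed=0)
subset_rate = entry_error_rate(entered, originals, subset)
```

Checking every record gives 1 error in 40 fields, or 2.5%, which is under the conventional 5% limit. A random subset gives only an estimate, which depends on whether the mistyped record happens to be drawn.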