3. Learning Objectives
By the end of the session, participants will be able to:
1. Understand the general rules of appropriate data
management
2. Understand how to define roles and
responsibilities regarding data management
3. Utilize the information featured in the session to
implement a system for good data management
4. Introduction
• Data management is an important component in M&E
and deserves extra attention and diligence
• M&E teams should invest a significant part of their
time and effort in data management
• M&E teams should understand the basic concepts of
data management
• Data management policies and procedures should be
clearly defined
6. Data Capture, cont.
• Plan data capture carefully
• Decide on which software you will be using
• Define your database structure (tables or data files)
• Develop data entry screen (should be user-friendly and
include checks for plausible values)
• Make provision for double-entry
8. Form/Questionnaire Flow
Field Supervisor
Filing Officer
Data Entry Supervisor
Interviewer
Data Entry
Database Manager
Questionnaire
completed
Questionnaire
fully entered
Questionnaire
with problems
9. Data Cleaning
• Check completeness of the data
• Check consistency (compare variables)
• Check plausibility (values within acceptable range)
• Check for duplicates
• Check for outliers (run basic freq, mean)
11. Data Security
• Access to data should be restricted (Password)
• Final analytical data should be anonymous
• Make sure to do regular data backups (daily,
weekly, monthly…)
• If possible, store a copy of your backup off-site
12. Other Aspects to Consider
Data Ownership: Who has the legal rights to the data
and who retains the data
Data Retention: Length of time one needs to keep the
project data
Data Sharing: How project data and results are
disseminated, and when data should not be shared
14. Session Objectives
1. Strengthen knowledge of terminology used in data
analysis and interpretation
2. Strengthen skills in data analysis and interpretation
3. Improve capacity to summarize data
4. Strengthen effective communication methods
15. What is Data Analysis?
• The process of understanding and explaining
what findings actually mean. Turning raw data
into useful information
• Provide answers to questions being asked at a
program site or research questions being
studied
• The greatest amount and best quality of data
mean nothing if not properly analyzed, or if not
analyzed at all
16. What is Data Analysis?, cont.
How would you analyze data to determine, “Is my
program meeting its objectives?”
Question Data Analysis Answers
Analysis is looking at the data in light of the questions
you need to answer
17. Is Our Program on Track?
• Analysis: Compare program targets and actual
program performance to learn how far you
are from target
• Interpretation: Why you have or have not
achieved the target and what this means for
your program
• May require more information
18. Examples of Analysis
Compare actual performance against targets
Indicator | Progress (6/12/13) | Target (1/30/14)
Number of persons trained on case management | 15 | 100
Compare current performance to prior year
Indicator | 2011 | 2012
No. of LLIN distributed | 50,000 | 167,000
Compare performance between sites or groups
Indicator | District A | District B
Number of fever cases tested for malaria by clinics | 3,500 | 8,000
19. Statistical Measures
• Measure of central tendency
– Mean
– Median
– Mode
• Measure of variation
– Range
– Variance and standard deviation
– Interquartile range
• Proportion, Percentage
• Ratio, Rate
20. Mean
• Sum of the values divided by the number of cases. Also called the average
• Very sensitive to variation
$\bar{y} = \frac{\sum y_i}{n}$
Average number of confirmed malaria cases per month
Month | Cases 2008
Jan | 30
Feb | 45
Mar | 38
April | 41
May | 37
Jun | 40
Jul | 70
Aug | 270
Sep | 280
Oct | 200
Nov | 100
Dec | 29
Total number of cases: $\sum y_i = 1{,}180$
Number of observations: $n = 12$
Mean number of cases: $\bar{y} = \frac{1{,}180}{12} \approx 98.3$
21. Median
• Represents the middle of the ordered sample data
• For odd sample size, the median is the middle value
• For even, the median is the midpoint/mean of the two middle values
• Not sensitive to variation
Median number of confirmed malaria cases
Month | Cases 2008 | Cases 2009
Dec | 29 | 24
Jan | 30 | 29
May | 37 | 32
Mar | 38 | 35
Jun | 40 | 39
April | 41 | 39
Feb | 45 | 42
Jul | 70 | 65
Nov | 100 | 80
Oct | 200 | 150
Aug | 270 | 200
Sep | 280 | -
Median for 2008: $\text{median} = \frac{41 + 45}{2} = 43$
Median for 2009: $\text{median} = 39$
22. Mode
• Value that occurs most frequently
• It is the least useful (and least used) of the three
measures of central tendency
Mode number of confirmed malaria cases
Month | Cases 2008 | Cases 2009
Dec | 29 | 24
Jan | 30 | 29
May | 37 | 32
Mar | 38 | 35
Jun | 40 | 39
April | 41 | 39
Feb | 45 | 42
Jul | 70 | 65
Nov | 100 | 80
Oct | 200 | 150
Aug | 270 | 200
Sep | 280 | -
Mode for 2008: none
Mode for 2009: 39
23. Practice Calculations
• What is the mode, mean
and median parasitemia
for the following set of
observations?
1.5, 1.8, 2.5, 4.1, 8.3, 1.2,
1.9, 0.6
• Answers:
– Mean = 2.74
– Median = 1.85
– Mode=none
– Would you use Mean or
Median?
– Answer: Median
– Use Median when you
have a large variation
between high and low
numbers
– Use Mean when there is
not a huge variation
between the values
24. Ratio
• Comparison of two numbers
• Expressed as:
– a to b, a per b, a:b
– 2 household members per (one) mosquito net, a
ratio of 2:1
• All individuals included in the numerator are not
necessarily included in the denominator
25. Proportion
• A ratio in which all individuals in the numerator are
also in the denominator
• Example: If a clinic has 12 female clients and 8 male
clients, then the proportion of male clients is 8/20 or
2/5
F F F F
F F F F
F F F F
M M M M
M M M M
26. Percentage
• A way to express a proportion
• Proportion multiplied by 100
• Example: Males comprise 2/5 of the
clients or, 40% of the clients are male
(0.40 x 100)
Important to know: What is the whole? An
orange? An apple? All clients? All clients
with a fever?
27. Why do we want to know the
percentage?
• Helps us standardize so that we are able to
compare data across facilities, regions,
countries
• Better conceptualize what needs to be done
– Percentage helps us to track progress on our
targets
28. Rate
• A quantity measured with respect to another measured quantity
• Number of cases that occur over a given time period
divided by population at risk in the same time period
Probability of Dying Under Age Five per 1,000 Live Births (Under five mortality rate)
Nation | Under five mortality rate per 1,000 live births in 2008
France | 4
Ghana | 76
Sierra Leone | 194
Afghanistan | 257
Source: UNICEF: Statistics and Monitoring by Country
29. Annual Parasite Incidence (API)
Number of microscopically confirmed malaria cases
detected during one year per unit population
$\text{API} = \frac{\text{Confirmed malaria cases during 1 year}}{\text{Population under surveillance}} \times 1000$
30. Most Common Software
• Microsoft Access
• Microsoft Excel
• Epi-Info
• SPSS
• Stata
• SAS
32. Learning objectives
1. Learn to calculate descriptive statistics and
run cross tabs in Excel and EpiInfo
2. Identify situations in which more
complicated analysis is necessary
33. MEASURE Evaluation is a MEASURE program project funded by the
U.S. Agency for International Development (USAID) through
Cooperative Agreement GHA-A-00-08-00003-00 and is implemented
by the Carolina Population Center at the University of North Carolina
at Chapel Hill, in partnership with Futures Group International, John
Snow, Inc., ICF Macro, Management Sciences for Health, and Tulane
University.
Visit us online at http://www.cpc.unc.edu/measure
Editor's Notes
Speaker Notes:
Today we will learn about basic data management and analysis.
Speaker Notes:
We will start by discussing data management. We discussed this briefly in the data quality session and will go into further detail here.
Speaker Notes:
By the end of the session, participants will be able to: [READ BULLETS]
Speaker Notes:
Data management is an important component in M&E and deserves extra attention and diligence. M&E teams should invest a significant part of their time and effort in data management because a well-functioning data management system can improve the quality of the data collected and facilitate analysis. At a minimum, M&E teams should understand the basic concepts of data management. Data management policies and procedures should be clearly defined at the top or central level and at lower levels of a data system or study. If only the top-level individuals know the policies and procedures for data management, but the individuals entering data or managing the system on a day-to-day basis do not, you are likely to find extensive problems in data quality.
Speaker Notes:
There are two main methods of data capture, one for paper-based data and another for paperless electronic data. Paper-based data generally includes forms, registries and questionnaires. These data must be put into a database in one manner or another. This is usually done using a computer and keyboard; however, smart phones and SMS technology are sometimes used to enter data and transmit it to a database. Issues in your data can arise at any step in this process.
Paperless data is collected electronically, usually using a computer, PDA or smartphone. It is then put directly into a database using cables, internet, discs or USB drives. Since there are generally fewer steps in capturing paperless data, the process can be quicker and introduce fewer errors than paper-based data capture. Unfortunately, this technology is not always well understood. Individuals who are not tech savvy are likely to make errors. If an error is found, it may also be more difficult to find the original record than with paper-based data.
Speaker Notes:
To avoid errors and resulting data quality issues, it is important to plan data capture carefully. One of the first steps is to decide on which software, and sometimes hardware, you will be using. Some examples of software that can be used for data capture include CS-Pro, Microsoft Access or Excel, and EpiInfo. Next, you will need to define your database structure and create a data entry screen, which should be user-friendly and include checks for plausible values. When possible, you should make provision for double-entry. This is most important for paper-based data, as errors are often introduced in the keying process.
Speaker Notes:
It is important right from the beginning to set data quality guidelines with clearly defined standards which will benchmark the quality of your data. By doing so, other users will trust your data. Three key steps should be considered.
Double entry refers to entering the same information twice by two independent persons. The two entries are then compared. You should ensure that the two entries match 100% for all the variables. For any mismatching information you have to refer to the original questionnaire/form to fix the problem.
Consistency/validation refers to conflicting information between two variables/questions. All related questions/variables must have at least 99% consistency. The data processing tools (data entry screen) should be configured to maximize consistency.
Error (range check) refers to values outside acceptable ranges. Overall, 100% of all responses or values should be within the acceptable range. For example, if your study population includes women aged 15-49 years, then any age value outside this range should not be accepted. This check should be done at two levels: field supervision and data entry.
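The double-entry comparison and the range check described above can be sketched in a few lines of Python. This is only an illustration: the record IDs, field names and values are invented, and a real data entry package (for example CS-Pro or EpiInfo) would normally provide its own verification tools.

```python
# Hypothetical records keyed by questionnaire ID; all names and values are invented.
first_entry = {
    "Q001": {"age": 23, "sex": "F"},
    "Q002": {"age": 51, "sex": "F"},
    "Q003": {"age": 34, "sex": "F"},
}
second_entry = {
    "Q001": {"age": 23, "sex": "F"},
    "Q002": {"age": 15, "sex": "F"},
    "Q003": {"age": 34, "sex": "F"},
}


def double_entry_mismatches(first, second):
    """Return (record_id, field, first_value, second_value) for every disagreement."""
    mismatches = []
    for record_id in sorted(set(first) | set(second)):
        a = first.get(record_id, {})
        b = second.get(record_id, {})
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                mismatches.append((record_id, field, a.get(field), b.get(field)))
    return mismatches


def out_of_range(records, field, low, high):
    """Return IDs whose value for `field` is missing or outside [low, high]."""
    flagged = []
    for record_id, record in sorted(records.items()):
        value = record.get(field)
        if value is None or not (low <= value <= high):
            flagged.append(record_id)
    return flagged


# Every mismatch should be resolved against the original paper form.
print(double_entry_mismatches(first_entry, second_entry))
# Study population assumed to be women aged 15-49, as in the example above.
print(out_of_range(first_entry, "age", 15, 49))
```

Note that the second entry of Q002 (age 15) passes the range check even though it disagrees with the first entry, which is why both checks are needed.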
Speaker Notes:
Having a questionnaire or form aids the management of data and can enhance data quality. In this diagram you can see an ideal questionnaire or form flow. The data is collected by the interviewer or field personnel. The completed questionnaire is passed from the interviewer to the field supervisor, then to the filing officer and, skipping the database manager, on to the data entry personnel.
If a questionnaire with problems is discovered during this process, it is passed back down the chain, with each person examining the problem until it gets to the person who can resolve it. In this direction the database manager is not skipped.
Finally the fully entered questionnaire is sent from the data entry personnel to the filing officer.
Speaker Notes:
After data is in your database it will need to be cleaned. This process involves checking: completeness of the data; consistency, which can be checked by comparing variables; plausibility, to determine whether data values fall within an acceptable range; and whether or not there are duplicate entries or outliers. Outliers can be found by running analysis on basic frequencies and means.
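As a rough illustration of the cleaning checks listed above, the sketch below looks for incomplete records, duplicate IDs and basic frequencies in a small invented dataset. The field names and values are assumptions made for the example, not part of any real system.

```python
from collections import Counter

# Hypothetical cleaned-up extract; IDs, fields and values are invented.
records = [
    {"id": "Q001", "age": 23, "district": "A"},
    {"id": "Q002", "age": None, "district": "B"},
    {"id": "Q003", "age": 34, "district": "A"},
    {"id": "Q003", "age": 34, "district": "A"},
]
required_fields = ["id", "age", "district"]


def incomplete_records(rows, fields):
    """Completeness: IDs of rows with at least one required field missing."""
    return [row.get("id") for row in rows
            if any(row.get(field) is None for field in fields)]


def duplicate_ids(rows):
    """Duplicates: IDs that appear more than once."""
    counts = Counter(row.get("id") for row in rows)
    return sorted(i for i, count in counts.items() if count > 1 and i is not None)


def frequencies(rows, field):
    """Basic frequency table for one field; unusual values stand out here."""
    return Counter(row.get(field) for row in rows)


print("Incomplete:", incomplete_records(records, required_fields))
print("Duplicate IDs:", duplicate_ids(records))
print("District frequencies:", dict(frequencies(records, "district")))
```

Each flagged record still has to be checked against the source form before it is corrected or dropped.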
Speaker Notes:
This is a data cleaning trade-off curve. You can see that as the time and cost spent cleaning data increase, the data accuracy improves. However, there are clearly diminishing returns on investment. As more time is spent past a certain level, you see smaller improvements in data accuracy.
Speaker Notes:
Data security is important to maintain both confidentiality and the integrity of the data. Access to data should be restricted using a password and paper documents should be kept in a secured/locked location. Unique identifiers should be removed from the data set and the final analytical data should be anonymous. This is not to say that unique identifiers should be disposed of in all cases. In some studies, ethical review allows these to be kept for certain purposes. However, the person analyzing the data should not be able to identify individuals.
Make sure to do a regular data backup; this can occur daily, weekly and/or monthly, depending on the nature of the data. If possible, you should store a copy of your backup off-site. This guards you against threats to the integrity of the data, as you will always be able to go back to the original data set even if the one you have on site is corrupted or destroyed.
Speaker Notes:
There are a few other aspects regarding data management that need to be considered when undertaking any data collection effort. Data ownership, or who has the legal rights to the data and who retains the data, should be decided before any data is collected. This will avoid confusion and conflict later on.
A policy for data retention should specify the length of time one needs to keep the project data. For some projects, especially those on which a significant number of publications are based, this may be a very considerable amount of time. This has implications for your data storage arrangements.
Finally, all parties should agree on how data and results will be shared and disseminated. A data sharing policy should clearly stipulate when data should and should not be shared.
Speaker Notes:
In the previous module, we discussed the importance of using data to make informed decisions. One of the limitations of data use mentioned was the limited ability of program staff to analyze and interpret data. For data to be useful, they need to be processed and summarized to become meaningful as they relate to the program.
The focus of this session is to present key concepts in data analysis. This session will review the most common data analysis terms and techniques used for descriptive data analysis. Then, in the next session, we’ll apply these techniques to the monitoring of health service delivery.
Speaker Notes:
The four objectives of this session are to:
1) Strengthen knowledge of terminology used in data analysis
2) Improve capacity to summarize data and present information
3) Strengthen skills in select data analysis and interpretation
4) Strengthen effective communication methods
Speaker Notes:
It is important to note that while the terms data and information are often used interchangeably, there is a distinction.
Data refers to raw, unprocessed numbers, measurements, or text.
Information refers to data that is processed, organized, structured or presented in a specific context. The process of transforming data into information is data analysis.
The purpose of data analysis is to provide answers to questions being asked at a program site or to research questions being studied.
Speaker Notes:
While computer packages may aid in performing some analysis, analysis does not mean using a complicated computer analysis package. It means taking the data that you collect and looking at it in comparison to the questions that you need to answer. For example, if what you need to know is if your program is meeting its objectives – or if it is on track – you would look at your program targets and compare them to the actual program performance. This is analysis. We will later take this one step further and talk about interpretation (e.g., through analysis you find that your program achieved only 10% of its target; now you have to figure out why).
Speaker Notes:
Suppose you need to know if your program is on track. You would probably look at your program targets and compare them to the actual program performance. This is analysis.
Interpretation is using the analysis to further understand your findings and the implications for your program. In many cases, this means using additional information, such as vital statistics, population-based surveys, and qualitative data, to supplement the routine service statistics. We will talk more about this later in the workshop.
Speaker Notes:
In this course, we will not be addressing all possible analyses that can be conducted at the program level – as there are many. We will be looking at some that MEASURE Evaluation staff have noted as common to the projects we work with. There are some common comparisons you can make when analyzing data:
Comparing actual performance (of specific indicators) against targets
Examples: No. of persons trained as of 6/12/13: 15 persons
Targeted no. of persons trained by 1/30/14: 100 persons
Comparing current performance to prior year
Example:
No. of LLIN distributed in 2011: 50,000
No. of LLIN distributed in 2012: 167,000
Compare performance between sites or groups
Example:
No. of fever cases tested for malaria by clinics in district A: 3,500
No. of fever cases tested for malaria by clinics in district B: 8,000
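The three comparisons above each reduce to a small calculation. The sketch below uses the figures from the examples; the function names are invented for illustration, and the choice of summary (percent of target, percent change, simple ratio) is one reasonable option rather than a prescribed method.

```python
def percent_of_target(actual, target):
    """Share of the target achieved so far, as a percentage."""
    return 100 * actual / target


def percent_change(previous, current):
    """Change from the prior period, as a percentage of the prior value."""
    return 100 * (current - previous) / previous


def ratio_between(a, b):
    """How many times larger a is than b."""
    return a / b


# Persons trained: 15 as of 6/12/13 against a target of 100 by 1/30/14.
print(f"Training target achieved: {percent_of_target(15, 100):.0f}%")

# LLIN distributed: 50,000 in 2011 and 167,000 in 2012.
print(f"Change in LLIN distributed: {percent_change(50_000, 167_000):.0f}%")

# Fever cases tested: 3,500 in district A and 8,000 in district B.
print(f"District B tested {ratio_between(8_000, 3_500):.1f} times as many cases as A")
```

The district comparison uses raw counts only; interpreting it would need more information, such as the number of fever cases seen in each district.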
Speaker Notes:
This slide lists the basic statistical terms used in data analysis that we will cover in this session.
Speaker Notes:
The most commonly investigated characteristic of a collection of data (or dataset) is its center, or the point around which the observations tend to cluster. The mean is the most frequently used measure to look at the central values of a dataset.
The mean takes into consideration the magnitude of every value, which makes it sensitive to extreme values. If there are data in the dataset with extreme values – extremely low or high compared to most other values in the dataset – the mean may not be the most accurate method to use in assessing the point around which the observations tend to cluster.
Use the mean when the data is normally distributed (symmetric).
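To show how the mean is calculated and how strongly extreme values pull it, here is a short sketch using the 2008 monthly case counts from the slide. Dropping the three peak months is done purely for illustration and is not a recommended analysis step.

```python
# Monthly confirmed malaria cases for 2008, copied from the slide.
cases_2008 = {
    "Jan": 30, "Feb": 45, "Mar": 38, "Apr": 41, "May": 37, "Jun": 40,
    "Jul": 70, "Aug": 270, "Sep": 280, "Oct": 200, "Nov": 100, "Dec": 29,
}


def mean(values):
    """Sum of the values divided by the number of values."""
    values = list(values)
    return sum(values) / len(values)


total = sum(cases_2008.values())
overall_mean = mean(cases_2008.values())

# Illustration only: leave out the three peak months to see how far the mean moves.
peak_months = {"Aug", "Sep", "Oct"}
off_peak_mean = mean(v for m, v in cases_2008.items() if m not in peak_months)

print(f"Total cases: {total}, months: {len(cases_2008)}")
print(f"Mean, all months: {overall_mean:.1f}")
print(f"Mean, without Aug-Oct: {off_peak_mean:.1f}")
```

The mean falls from about 98 to about 48 cases per month when the three high months are left out, which is what "sensitive to extreme values" means in practice.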
Speaker Notes:
The median is the middle value of a set of data when the data points are arranged from least to greatest value.
The median is another measure of central tendency, but it is not as sensitive to extreme values as the mean because it depends on the ordering of the values rather than on the exact magnitude of each one. We therefore use the median when data are skewed (not symmetric).
If a list of values is ranked from smallest to largest, then half of the values are greater than or equal to the median and the other half are less than or equal to it.
When there is an even number of values, the median is the average of the two mid-point values.
For example, for the first list of cases on the slide (2008), there are an even number of values and we have taken the average of the 6th and 7th highest of 12 values. You add 41+45 to get 86, and then divide that by 2 to get 43.
When there is an odd number of values, the median is the middle value.
For example, for the 2nd list (2009), the median is 39.
Remember: with the median, you have to rank (or order) the figures before you can calculate it.
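The ranking step and the odd/even rule can be written out as follows. This sketch uses the 2008 and 2009 values from the slide and compares the hand-written function with Python's standard `statistics.median` as a check.

```python
import statistics

# Values from the slide; 2009 has only 11 reported values.
cases_2008 = [30, 45, 38, 41, 37, 40, 70, 270, 280, 200, 100, 29]
cases_2009 = [24, 29, 32, 35, 39, 39, 42, 65, 80, 150, 200]


def median(values):
    """Rank the values, then take the middle one (or the mean of the middle two)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


print("Median 2008:", median(cases_2008))
print("Median 2009:", median(cases_2009))

# Cross-check against the standard library.
assert median(cases_2008) == statistics.median(cases_2008)
assert median(cases_2009) == statistics.median(cases_2009)
```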
Speaker Notes:
The mode, which is used less often, is the value that occurs most frequently. If no values are repeated, there can be no mode.
The mode is the least useful (and least used) of the three measures of central tendency.
Speaker Notes:
Ask participants to use the example provided to calculate the mean and median. Wait a few minutes then ask participants to share their answers. Address any confusion.
The mean is the sum of the values divided by the number of values
The median is the average of the two middle values (1.8 and 1.9) because there is an even number of values; otherwise the median would be the middle value.
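Facilitators who want to check the practice answers can use a sketch like the one below. The `mode_or_none` helper is written for this example: it returns None when no value repeats, and when several values tie for most frequent it returns only the one counted first.

```python
from collections import Counter
import statistics

# Parasitemia observations from the practice slide.
parasitemia = [1.5, 1.8, 2.5, 4.1, 8.3, 1.2, 1.9, 0.6]


def mode_or_none(values):
    """Most frequent value, or None when every value occurs only once."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > 1 else None


mean_value = statistics.mean(parasitemia)
median_value = statistics.median(parasitemia)
mode_value = mode_or_none(parasitemia)

print(f"Mean:   {mean_value:.2f}")
print(f"Median: {median_value:.2f}")
print(f"Mode:   {mode_value}")
```

The single high value (8.3) pulls the mean well above most of the observations, which is why the median is the better summary here.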
Speaker Notes:
A ratio is a comparison of two numbers and is expressed as “a to b” or “a per b”. In a ratio, the individuals included in the numerator are not necessarily included in the denominator.
If we were to say there are 3 staff per clinic, the ratio is expressed numerically as 3:1. It is not the same as saying 1 to 3 or 1:3. The order of the numbers matters.
Speaker Notes:
A proportion is a ratio in which all individuals included in the numerator must also be included in the denominator.
For example, if a clinic has 12 female clients and 8 male clients, the denominator is total clients, male and female or 20. The proportion of male clients is eight-twentieths or two-fifths.
Speaker Notes:
A percentage is a way to express a proportion multiplied by 100. By calculating a percentage, we can compare data across facilities, regions, and countries.
Using the previous example, we saw that two-fifths of the clients are male. To make this a percentage, we convert the fraction to a decimal and multiply by 100 – 40%.
Remember, in this example the denominator includes all of the clients both male and female. It is important to know and express the nature of the denominator. What is the whole? Are we talking about all clients? All pregnant clients? All clients with a fever?
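The clinic example can be worked through in code to make the role of the denominator explicit. This is a small sketch using the numbers from the slide; the male-to-female ratio is added only to contrast a ratio with a proportion.

```python
from fractions import Fraction

female_clients = 12
male_clients = 8
total_clients = female_clients + male_clients  # the "whole": all clients

# Proportion: the numerator (males) is part of the denominator (all clients).
male_proportion = Fraction(male_clients, total_clients)
male_percentage = float(male_proportion) * 100

# Ratio: males compared with females; the numerator is not part of the denominator.
male_to_female = Fraction(male_clients, female_clients)

print(f"Proportion of male clients: {male_proportion}")
print(f"Percentage of male clients: {male_percentage:.0f}%")
print(f"Ratio of male to female clients: "
      f"{male_to_female.numerator}:{male_to_female.denominator}")
```

Changing the denominator (for example, to all clients with a fever) changes the percentage, so the whole should always be stated alongside the figure.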
Speaker Notes:
A percentage is used to express a quantity relative to another quantity. It allows us to compare different population groups, facilities, countries which may have different denominators.
Percentages also allow us to understand what needs to be done by helping us track progress against targets and to look at our performance against quality of care indicators.
Speaker Notes:
In public health a rate is the number of cases that occur in a given time period divided by the population at risk during that time period. Since the number of occurrences of a specified outcome depends upon the size of the population being considered, dividing by their population sizes makes two groups more comparable.
A rate is often expressed per 1,000, 10,000 or 100,000 population.
A rate would be used most often in public health for expressing events which occur infrequently, such as maternal mortality. It is easier to express 8 per 100,000 than 0.008%. This is quite different from a ratio. In a ratio, the individuals included in the numerator are not necessarily included in the denominator.
The under-five mortality rate is the probability (expressed as a rate per 1,000 live births) of a child born in a specified year dying before reaching the age of five if subject to current age-specific mortality rates.
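To illustrate why dividing by population size makes two groups more comparable, the sketch below computes a rate per 100,000 for two districts. The district figures are invented for the example and do not come from any real dataset.

```python
def rate_per(events, population_at_risk, per=100_000):
    """Events in a period divided by the population at risk, scaled to `per` people."""
    if population_at_risk <= 0:
        raise ValueError("population_at_risk must be positive")
    return events * per / population_at_risk


# Hypothetical districts: B reports more cases but has a much larger population.
district_a = {"cases": 50, "population": 10_000}
district_b = {"cases": 120, "population": 60_000}

rate_a = rate_per(district_a["cases"], district_a["population"])
rate_b = rate_per(district_b["cases"], district_b["population"])

print(f"District A: {rate_a:.0f} per 100,000")
print(f"District B: {rate_b:.0f} per 100,000")
```

District B has more cases in absolute terms, but once population is taken into account district A has the higher rate.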
Speaker Notes:
For example, to calculate the API, divide the confirmed malaria cases during one year by the total number of people under surveillance and multiply by 1,000. The people under surveillance may be your entire population at risk or the number of people in your project area.
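The API formula from the slide (confirmed cases divided by population under surveillance, times 1,000) can be sketched as a small function. The case count and population below are invented purely to show the arithmetic.

```python
def annual_parasite_incidence(confirmed_cases, population_under_surveillance):
    """Confirmed malaria cases in one year per 1,000 population under surveillance."""
    if population_under_surveillance <= 0:
        raise ValueError("population_under_surveillance must be positive")
    return confirmed_cases * 1000 / population_under_surveillance


# Hypothetical project area: 450 confirmed cases among 30,000 people in one year.
api = annual_parasite_incidence(450, 30_000)
print(f"API: {api:.1f} per 1,000 population")
```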
Speaker Notes:
Some of the most common data analysis software packages are Microsoft Access, Microsoft Excel, Epi-Info, SPSS, Stata and SAS. Most of you probably have access to the Microsoft programs on your computers. Epi-Info is free. The other three packages are not free and some have a considerable cost. Fortunately, much of the data analysis for programs can be done using Microsoft Excel.
Speaker Notes:
Now we will do an exercise to demonstrate some of these data analysis concepts.
Speaker Notes:
By the end of the session, participants will be able to: [READ BULLETS]