This presentation covers the basics of data collection, from defining data types to exploring measurement scales, and outlines the various sources for data collection. Text, tables, and graphs are effective media for presenting and conveying data and information. They help readers understand the content of research, sustain their interest, and present large quantities of complex information effectively.
2. CONTENTS
• OBJECTIVES
• INTRODUCTION AND DEFINITION
• CLASSIFICATION OF DATA
• METHODS OF COLLECTION OF DATA
• DATA PRESENTATION
• CONCLUSION
• REFERENCES
3. LEARNING OBJECTIVES
1. To know about data.
2. To enumerate various types of data
3. To know about scales of measurement
4. To enumerate the methods for collection of data
5. To know about various methods of data presentation
4. INTRODUCTION
Data, the plural of datum, are facts expressed in numerical terms.
In statistical language, a datum is also called a variable, as it is a character, characteristic or quality that varies. Data do not convey any meaning by themselves. Hence, they have to be worked upon using a set of statistical tools to convert them into meaningful information.
DATA → STATISTICS → INFORMATION
5. Characteristics of Data Collection
Data collection can be characterized by several important characteristics that help to ensure the
quality and accuracy of the data gathered. These characteristics include:
•Validity
•Reliability
•Objectivity
•Precision
•Timeliness
•Ethical considerations
6. ADVANTAGES OF DATA
•Better decision-making
•Improved understanding
•Evaluation of interventions
•Identifying trends and patterns
•Validation of theories
•Improved quality
7. Limitations of Data Collection
While data collection has several advantages, it also has some limitations that must be considered.
These limitations include:
•Bias
•Sampling bias.
•Cost
•Limited scope
•Ethical considerations
•Data quality issues
8. CLASSIFICATION OF DATA
A. Based on nature of variable
• Qualitative Data
• Quantitative Data
• Discrete Data
• Continuous Data
B. Based on sources
• Primary Data
• Secondary Data
C. According to highest level which it fits
• Nominal Data
• Ordinal Data
• Interval Data
• Ratio Data
D. Based on presentation
• Grouped Data
• Ungrouped Data
9. •Qualitative data ~
Classification of data according to qualitative characteristics such as sex, honesty,
intelligence, literacy, colour, religion, marital status etc.
Fig-01
11. •Discrete data -
Classification of data which takes exact numerical values (whole numbers).
Eg: Number of children in a family, shoe size
Fig-03
12. •Continuous data -
Classification of data which can take any numerical value within a certain range.
Eg: The weight of a one-month-old baby girl is given as 3.8 kg, but the exact weight could lie anywhere between 3.2 and 5.4 kg.
Fig-04
13. • Primary data- Data which is directly collected by the researcher/investigator.
• Secondary data- Data which is not directly collected by the researcher/investigator.
• Primary Quantitative Data: Questionnaires, structured interviews
• Secondary Quantitative Data: Official statistics
• Primary Qualitative Data: Participant observation, unstructured interviews
• Secondary Qualitative Data: Letters, articles, newspapers
14. •Grouped data- Data which is presented in group Eg: Age: 20-25 (12 persons),25-30 (8 persons)…..
•Ungrouped data- Data which is presented individually
Eg: Age: 28 years, 27 years, 23 years, 25 years, 26 years.....
Another classification, according to the highest level which the data fit:
→ Nominal - Lowest level; only names are meaningful. For example, a student in a classroom can be Hindu, Muslim, Christian, etc., so each student belongs to one category.
→ Ordinal - Adds an order to the names. For example, post-surgical pain can be classified by severity: 0 means no pain, 1 means mild pain, 2 means moderate pain, 3 means severe pain.
→ Interval - Adds meaningful differences. No true zero. For example, Knoop hardness number for composites.
→ Ratio - Adds a true zero or starting point, so that ratios are meaningful. For example, height, weight, length, etc.; one can say "twice the weight".
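As a rough illustration of how the level of measurement limits which summaries are meaningful, the sketch below uses Python's standard `statistics` module. The sample values are hypothetical and chosen only for illustration. The pairing used here (mode for nominal data, median for ordinal data, mean and ratios for ratio data) is the conventional one and is an assumption added to these slides, not something they state.

```python
import statistics

# Hypothetical sample values, for illustration only.
religion = ["Hindu", "Muslim", "Hindu", "Christian", "Hindu"]  # nominal
pain_score = [0, 1, 1, 2, 3, 3, 3]   # ordinal: 0 = no pain ... 3 = severe pain
weight_kg = [40.0, 60.0, 80.0]       # ratio: has a true zero

# Nominal: categories can only be counted, so the mode is the usual summary.
most_common_religion = statistics.mode(religion)

# Ordinal: the order is meaningful, so the median can be used.
median_pain = statistics.median(pain_score)

# Ratio: differences and ratios are meaningful ("twice the weight").
mean_weight = statistics.mean(weight_kg)
weight_ratio = weight_kg[2] / weight_kg[0]
```

Interval data would allow means and differences but not ratios, because the zero point is arbitrary.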
15. Main sources for collection of data
A. Experiments
B. Surveys
C. Records
A. Experiments- Experiments are performed in the laboratories of various branches of medical science, such as physiology, biochemistry, pharmacology and clinical pathology, or in the hospital ward or in the community.
B. Surveys- Surveys are carried out for epidemiological studies in the field by trained teams.
Experiments and surveys are specially applied to generate data needed for specific purposes and comprise primary data.
Records provide ready-made data for routine and continuous information, which may be used for research as secondary data.
Surveys are used:
• To find the incidence or prevalence of health or disease statistics in a community, such as the incidence of malaria or the prevalence of leprosy
• To identify risk factors associated with disease occurrence
• In operational research, such as assessment of existing conditions of a programme, health service or facility
• To evaluate new strategies for prevention and control of health problems
16. Surveys provide useful information such as:
A) Changing trends in health statistics: morbidity, mortality, health practices, etc.
B) Feedback to modify policy and the system, and to redefine objectives
C) Timely warning of public health hazards
C. Records- Records are maintained as a routine in registers or books over a long period of time, for various purposes such as vital statistics (births, marriages and deaths) or for illness in hospitals.
There are various methods of data collection:-
Experiments
Surveys
Observation method
Interview method
Questionnaire method
Schedule method
• Other methods include warranty cards, pantry audits, distributor audits, consumer panels, use of mechanical devices, projective techniques, depth interviews and content analysis.
17. Data can be collected either through primary sources or secondary sources
Primary Sources- Here the data is obtained by the investigator himself. This is the first hand
information.
1. Observation method- This is the most frequently used in practice. Observation is said to be a
scientific tool and a means of data collection for the researcher.(9)
Types of Observation Methods
• Structured Observation
• Unstructured Observation
• Controlled Observation
• Uncontrolled Observation
• Participant Observation
• Non-participant Observation
• Disguised Observation
18. 2. Health interview survey- It is an invaluable method of measuring subjective phenomena, such as perceived morbidity, disability and impairments; opinions, beliefs and attributes; and some behavioural characteristics.
• Direct personal investigation
• Indirect oral investigation
• Easy to conduct in urban areas
• Little use in developing countries
19. •Structured interviews: The questions are predetermined in both topic and order.
•Semi-structured interviews: A few questions are predetermined, but other questions aren’t planned.
•Unstructured interviews: None of the questions are predetermined.
•Focussed interview: focus attention on the given experience of the respondent.
3. Questionnaire Method- Standard method of data collection in clinical, epidemiological, psychosocial
and demographic research. It is used for measuring subjective phenomena.
20. WHAT IS QUESTIONNAIRE?
"A document containing set of questions logically related to the problem under
study.”
◦If the questions are filled in by the respondents, it is called a 'Questionnaire'
◦If filled in by enumerators, it is called a 'Schedule'
STRUCTURED QUESTIONNAIRES
Questionnaires in which there are definite, concrete and pre-determined questions. The questions are
presented with exactly the same wording and in the same order to all respondents.
The form of the question may be either closed (i.e., of the type 'yes' or 'no') or open (i.e., inviting free
response).
UNSTRUCTURED QUESTIONNAIRES
Interviewer is provided with a general guide on type of information to be obtained. Question formulation
is his own responsibility and replies taken down in respondent's own words.
21. The 2 types of scales most commonly used are the Likert and Guttman scales.
GUTTMAN SCALE: (Cumulative)
◦Contains a series of statements that express increasing intensity of a characteristic.
◦Respondent is asked to agree or disagree with each statement.
◦Respondent's score is the total number of items with which he agrees or disagrees.
LIKERT SCALE: (Summative)
◦Commonly used to quantify attitudes and behaviour.
◦Respondents are asked to select a response that best represents the rank or degree of their answer.
◦Eg: respondent may be asked to indicate whether he strongly agrees, agrees, neither, disagrees, or strongly disagrees with the statement.
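The difference between a summative and a cumulative score can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the 1 to 5 coding of the Likert labels is a common convention rather than something given in these slides, the function names are hypothetical, and the Guttman score is taken here as the count of statements agreed with.

```python
# Assumed 1-5 coding of the Likert response labels (a common convention).
LIKERT_CODES = {
    "strongly disagree": 1,
    "disagree": 2,
    "neither": 3,
    "agree": 4,
    "strongly agree": 5,
}

def likert_score(responses):
    """Summative score: add up the coded responses to all items."""
    return sum(LIKERT_CODES[r] for r in responses)

def guttman_score(agreements):
    """Cumulative score: count the statements the respondent agrees with."""
    return sum(1 for agreed in agreements if agreed)

# Hypothetical respondent, for illustration only.
likert_total = likert_score(["agree", "strongly agree", "neither"])
guttman_total = guttman_score([True, True, False, False])
```

Because Guttman statements increase in intensity, a respondent who agrees with the first two and rejects the rest gets a score of 2, which also indicates where on the scale agreement stops.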
23. HOW TO CONSTRUCT A QUESTIONNAIRE
Researcher should note the following with regard to the main aspects of a questionnaire:
General form
Question sequence
Determine the type of the questions:
A) Direct Question
B) Indirect Question
C) Open Form Questionnaire
D) Closed Form Questionnaire
E) Dichotomous Questions
F) Multiple Choice Questions (MCQ)
24. 5. SCHEDULE METHOD
◦A schedule is a structure of set of questions on a given topic which are asked by the interviewer or
investigator personally.
◦Like questionnaire but filled by enumerators who are especially appointed for filling questionnaire.
Questionnaire vs schedule
Questionnaire:
• Questionnaires are generally sent through mail, with no further assistance from the sender.
• Questionnaire is a cheaper method.
• Non-response is high.
• In questionnaires the identity of the respondent is unknown.
• Very slow method
• No personal contact
Schedule:
• Schedule is generally filled by an enumerator or research worker.
• Costly, requires field workers
• Non-response is low.
• In a schedule the identity of the person is known.
• Information is collected well in time.
• Direct personal contact
25. Other Methods of Data Collection
•Warranty Cards: They are also called feedback cards. They are usually a postal size card with
some questions along with a request to the consumers to fill and return them.
•Distributor or Store Audit: This is performed by distributors or manufacturers through their sales representatives, commonly to study seasonal purchasing patterns.
•Pantry Audit: It is applied to estimate consumption of basket of goods at the consumer level.
•Consumer Panel: It is an extension of pantry audit. It is approached on a regular basis.
•Use of Mechanical Devices: Eye camera, pupilometric camera, psychogalvanometer, motion
picture camera
26. SECONDARY SOURCES
Secondary data means data that are already available i.e., they refer to the data which have already been
collected and analyzed by someone else. When the researcher utilizes secondary data, then he has to look into
various sources from where he can obtain them.
Published data
◦books, magazines and newspapers
◦reports prepared by research scholars and universities
◦historical documents
Unpublished data
◦diaries, letters, unpublished biographies and autobiographies
27. 1) Published sources
a. Reports and official publications of
i. International bodies such as World Health Organization
ii. Central and state governments such as Census data
iii. Reports of committees and commissions appointed by government
b. Semi-official publications of various local bodies such as municipal corporations.
c. Publications of autonomous and private institutes such as
• Trade and professional bodies.
• Financial and economic journals.
• Annual reports of companies and corporations.
• Publications brought out by various autonomous research institutes and scholars.
2) Unpublished sources:- There are various unpublished data sources such as records
maintained by various government and private agencies, studies conducted by research
institutions, scholars etc... like dissertations of medical students of health university.
28. Factors to be considered before using secondary data
Reliability of data - Who collected the data, when, by which methods, etc.
Suitability of data - The object, scope and nature of the original inquiry should be studied; if the original study had a different objective, the data are not suitable for the current study.
Adequacy of data - If the level of accuracy is found inadequate, or the data relate to a different area, the data are not adequate for the current study.
29. Selection of proper Method for collection of Data
1. Nature, scope and object of inquiry
2. Availability of funds
3. Time factor
4. Precision required
30. SOURCES FOR COLLECTION OF DATA:
1) Census:- In India, a census has been taken every 10 years since the first census of 1881. It is defined as "the total process of collecting, compiling and publishing demographic, economic and social data pertaining to all persons in a country or delimited territory at a specified time or times". The last census was held in March 2011. The data on age, sex, income and other basic information obtained in the census provide a base for planning, action and research in the field of medicine as well as other sectors.
2) Registration of vital events:- In India, registration of births, deaths and marriages is mandatory by law. This forms the foundation of health and vital statistics.
3) Sample Registration System (SRS):- It is a dual record system, consisting of continuous enumeration of births and deaths by an enumerator and an independent survey every 6 months by an investigator-supervisor. Due to complete coverage of our country by SRS, we are able to get more reliable information on birth and death rates, age-specific fertility, mortality rates and infant mortality.
31. 4) Notification of diseases:- It is a valuable source of morbidity data, such as incidence, prevalence and distribution of certain specified diseases which are notifiable. Diseases to be notified differ between countries as well as between states in the same country. Cholera, plague and yellow fever are internationally notifiable diseases.
5) Hospital records:- These form a basic and primary source of information about diseases prevalent in the community, because in India registration of vital events is faulty and notification of infectious diseases is far from adequate.
A serious limitation of hospital data is that they represent only those individuals who seek medical care, and we do not know the denominator due to the lack of precise boundaries of the catchment area of a hospital. Still, they give useful information regarding time, place and person distribution of various diseases.
6) Epidemiological surveillance:- Special surveillance activities are conducted for diseases like malaria and AIDS in our country. This provides considerable morbidity and mortality data for the specific diseases. E.g. sentinel surveillance data.
32. 7) Surveys:- Population surveys supplement routinely collected statistics. The term "health survey" is used for surveys relating to any aspect of health: morbidity, mortality, nutritional status, etc. When the main emphasis is on disease in the community, the survey is labelled a "morbidity survey". These surveys can be conducted for evaluating the health status of a population, for investigating factors affecting health and disease, or for improving the administration of health services. They can be cross-sectional or longitudinal; descriptive or analytic or both. Methods used for data collection in surveys include health interview, health examination, study of health records and mailed questionnaires. E.g. NFHS data.
8) Research findings:- In various departments of medical college hospitals, experiments are performed for investigations and research. Similarly, in biomedical institutions and pharmaceutical industries, a lot of research activities are conducted with specific objectives. These data are useful for planning and implementation of health activities in general. E.g. dissertations, research papers.
33. DATA PRESENTATION
The objective of classification of data is to make the data simple, concise, meaningful and
interesting and helpful in further analysis.
DATA COLLECTED FROM VARIOUS EXPERIMENTS → COMPILATION AND CLASSIFICATION → PRESENTATION
34. Principles of presentation of data
• Data should be arranged in such a way that it will arouse interest in reader.
• The data should be made sufficiently concise without losing important details.
• The data should be presented in a simple form to enable the reader to form quick impressions and to draw some conclusions, directly or indirectly.
• Should facilitate further statistical analysis.
• It should define the problem and suggest its solution.
35. METHODS OF PRESENTATION OF DATA:
The main methods of presenting frequencies of a variable or data:-
1. Textual
2. Tabulation
3. Charts and Diagrams
TEXTUAL PRESENTATION OF DATA
In textual presentation, data are described within the text. When the quantity of data is not too large
this form of presentation is more suitable. Look at the following cases:
Case 1
In a bandh call given on 08 September 2005 protesting the hike in prices of petrol and diesel, 5 petrol
pumps were found open and 17 were closed whereas 2 schools were closed and remaining 9 schools
were found open in a town of Bihar.
36. Case 2
Census of India 2001 reported that Indian population had risen to 102 crore of which only 49 crore were
females against 53 crore males. Seventy-four crore people resided in rural India and only 28 crore lived in
towns or cities. While there were 62 crore non-worker population against 40 crore workers in the entire
country. Urban population had an even higher share of non-workers (19 crore) against workers (9 crore) as
compared to the rural population where there were 31 crore workers out of a 74 crore population...
In both the cases data have been presented only in the text. A serious drawback of this method of
presentation is that one has to go through the complete text of presentation for comprehension. But, it is
also true that this matter often enables one to emphasise certain points of the presentation.
Tabulation :-
It is the first step before the data is used for analysis or interpretation.
In the process of tabulation the following type of classification are encountered.
• Geographical i.e area wise
• Chronological i.e on the basis of time
• Qualitative i.e. according to attribute
• Quantitative i.e. in terms of magnitude
37. MEANING OF VARIOUS TERMS
Grouped Frequency Distribution: a frequency distribution when several numbers are grouped
in one class.
Class limits: Separates one class in a grouped frequency distribution from another. The limits
could actually appear in the data and have gaps between the upper limits of one class and lower limit
of the next.
Class boundaries: Separate one class in a grouped frequency distribution from another. The boundaries have one more decimal place than the raw data and therefore do not appear in the data. There is no gap between the upper boundary of one class and the lower boundary of the next class. The lower class boundary is found by subtracting U/2 from the corresponding lower class limit, and the upper class boundary is found by adding U/2 to the corresponding upper class limit, where U is the gap between the upper limit of one class and the lower limit of the next.
Class width: the difference between the upper and lower class boundaries of any class. It is also
the difference between the lower limits of any two consecutive classes or the difference between any
two consecutive class marks.
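The relationship between class limits, class boundaries, class width and class mark can be checked with a small sketch. The classes 70-79 and 80-89 below are hypothetical, and U is assumed to be 1 (data recorded to the nearest whole number); the function names are illustrative only.

```python
def class_boundaries(lower_limit, upper_limit, unit):
    """Return (lower boundary, upper boundary): each limit is moved out by unit / 2."""
    return lower_limit - unit / 2, upper_limit + unit / 2

def class_mark(lower_limit, upper_limit):
    """Class mark (mid-point): average of the lower and upper class limits."""
    return (lower_limit + upper_limit) / 2

# Hypothetical class 70-79, followed by 80-89, so U = 80 - 79 = 1.
lower_boundary, upper_boundary = class_boundaries(70, 79, unit=1)
width = upper_boundary - lower_boundary   # difference between the boundaries
mark = class_mark(70, 79)
```

The width obtained from the boundaries (10) matches the difference between the lower limits of the two consecutive classes (80 - 70), as the definition of class width requires.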
38.
Class mark (Mid points): it is the average of the lower and upper class limits or the average of
upper and lower class boundary.
Cumulative frequency: is the number of observations less than/more than or equal to a specific
value.
Relative frequency (rf): it is the frequency divided by the total frequency.
Relative cumulative frequency (rcf): it is the cumulative frequency divided by the total
frequency.
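As a quick sketch of these definitions, the snippet below derives cumulative, relative and relative cumulative frequencies from a list of class frequencies. The frequencies are hypothetical, and the cumulative frequencies are of the "less than or equal to" type.

```python
from itertools import accumulate

# Hypothetical class frequencies, for illustration only.
frequencies = [5, 10, 15, 20]
total = sum(frequencies)

# Running totals: observations up to and including each class.
cumulative = list(accumulate(frequencies))

# rf = frequency / total frequency
relative = [f / total for f in frequencies]

# rcf = cumulative frequency / total frequency
relative_cumulative = [c / total for c in cumulative]
```

The relative frequencies add up to 1, and the last relative cumulative frequency is always 1.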
Classification and tabulation are not two distinct processes; they go together, and classification is the first step in tabulation.
39. A) Tabulation :
It is usually the first step in presentation and analysis of data. A table can be simple or complex depending upon the number of measurements of a single set or multiple sets of items. Let us take an example to understand tabulation. The number of deaths due to neonatal tetanus in 97 districts of India in one year is given below:-
70, 71, 72, 79, 84, 92,
141, 70, 73, 73, 77, 79,
84, 93, 146, 74, 75, 77,
84, 70, 72, 88, 88, 141,
75, 77, 82, 93, 109, 134,
147, 79, 79, 87, 95, 107,
125, 140, 148, 76, 78, 83,
106, 124, 135, 141, 71, 70,
79, 87, 97, 117, 140, 150,
160, 73, 78, 82, 103, 116,
135, 148, 160, 160, 160, 160,
150, 88, 72, 76, 78, 84,
73, 80, 98, 113, 137, 157,
74, 75, 81, 88, 99, 102,
113, 158, 84, 73, 76, 78,
82, 101, 113, 158, 88, 75,
108
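Counting how often each value occurs in the 97 figures above can be sketched with Python's `collections.Counter`. The list is transcribed from the figures as listed; the plain `|` tally strings are a simplification and do not draw the cross stroke used for every fifth observation.

```python
from collections import Counter

# Neonatal tetanus deaths in 97 districts, transcribed from the list above.
deaths = [
    70, 71, 72, 79, 84, 92, 141, 70, 73, 73, 77, 79,
    84, 93, 146, 74, 75, 77, 84, 70, 72, 88, 88, 141,
    75, 77, 82, 93, 109, 134, 147, 79, 79, 87, 95, 107,
    125, 140, 148, 76, 78, 83, 106, 124, 135, 141, 71, 70,
    79, 87, 97, 117, 140, 150, 160, 73, 78, 82, 103, 116,
    135, 148, 160, 160, 160, 160, 150, 88, 72, 76, 78, 84,
    73, 80, 98, 113, 137, 157, 74, 75, 81, 88, 99, 102,
    113, 158, 84, 73, 76, 78, 82, 101, 113, 158, 88, 75,
    108,
]

# Frequency (f) of each value of the variable (x).
frequency = Counter(deaths)

# One tally stroke per occurrence, in ascending order of the variable.
tally = {value: "|" * frequency[value] for value in sorted(frequency)}
```

Printing `tally` line by line gives an ungrouped frequency distribution, which is still long because the variable takes many distinct values.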
40. • It is obvious that we can understand very little from the figures. A better way can be to arrange the figures in ascending or descending order, i.e. from 70 to 160, but still the bulk of the data remains.
• A simpler method of reducing the bulk of data is the tally mark method. In this method a vertical bar (I) is put against the concerned number when it occurs. So if 70 occurs four times we represent it by IIII. For the fifth observation, instead of a vertical bar we put a cross tally (/) on the first four tallies. Thus we get sets of five each. This representation of the data is known as a frequency distribution.
• Neonatal deaths are called the variable (x) and number of districts against the neonatal deaths are
known as frequency (f) of the variable. The term 'frequency' is derived from 'how frequently' a variable
occurs.
Fig.05 example
41. In this example the frequency of 73 neonatal deaths is 5, whereas the frequency of 158 deaths is 3. Though this method reduces the data to some extent, it still cannot be called the best method.
In such a case, to condense the data further, the observed range of the variable can be divided into a suitable number of class intervals, and the number of observations in each class is recorded. Such a table (Fig. 06), showing the distribution of frequencies in the different classes, is called a frequency distribution table. The manner in which the class frequencies are distributed over the class intervals is called the grouped frequency distribution of the variable.
The merits of a frequency distribution table are that,
• It shows at a glance how many individual observations are in
a group, and where the main concentration lies.
• It also shows the range, and the shape of the distribution.
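Condensing whole-number observations into class intervals can be sketched as follows. The sample values, the starting point of 70 and the class width of 10 are hypothetical choices for illustration, and `grouped_frequency` is an illustrative helper, not a function from any library.

```python
def grouped_frequency(values, start, width):
    """Count whole-number observations in classes start..start+width-1, and so on."""
    counts = {}
    for v in values:
        index = (v - start) // width          # which class the value falls in
        lower = start + index * width         # lower class limit
        label = (lower, lower + width - 1)    # (lower limit, upper limit)
        counts[label] = counts.get(label, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical observations, for illustration only.
sample = [70, 73, 79, 80, 84, 88, 92, 101, 108]
table = grouped_frequency(sample, start=70, width=10)
```

Each observation falls in exactly one class, so the class frequencies add up to the number of observations.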
Fig.06
42. Rules and guidelines for tabular presentation -
• A number should be assigned to the table (Table No.).
• A title should be given to the table, it should be concise and self explanatory.
• Contents of the table should be defined clearly.
• Subtitles should be properly mentioned with columns and rows
• Class intervals in columns and rows should be neither too narrow nor too wide. They should also be mutually exclusive and non-overlapping.
• Unit of measurement must be mentioned clearly wherever necessary.
• Number of classes should be neither too large nor too small. There can be 10 to 20 classes. The following formula can be used to find the approximate number of classes, K:
• K = 1 + 3.322 log10 N, where N is the total frequency.
• Footnotes should be given whenever necessary, providing additional information, source or explanatory notes.
• Any short forms or symbols, if used, should be explained in the footnote.
• No cell should be left blank in the body of the table.
• There should be logical arrangement of data in the table.
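The formula for the approximate number of classes, commonly known as Sturges' rule, can be tried on the neonatal tetanus example with N = 97 districts. The function name is illustrative, and rounding up to the next whole number is an assumption; the slides only call the result approximate.

```python
import math

def approximate_class_count(n):
    """K = 1 + 3.322 * log10(N), where N is the total frequency."""
    return 1 + 3.322 * math.log10(n)

k = approximate_class_count(97)      # 97 districts in the tabulation example
suggested_classes = math.ceil(k)     # assumed rounding: next whole number
```

For 97 observations the formula suggests about 8 classes, so it serves as a starting point that is then adjusted to give convenient class limits.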
46. 1. Classification by Space (geographical) :-
• Data are classified by location of occurrence.
• Arranging the categories in alphabetical order of the terms defining them, or in the order of their geographical location, may be found suitable in many cases.
Fig-11
47. 2. Chronological i.e. On the basis of time :-
• In this case data are classified by time of occurrence of the observations
• Arrangement of categories is almost always in chronological order
Fig-12
48. 3. Classification by attribute :-
• When the data represent observations made on a qualitative characteristic, the classification is made according to these qualities.
• Alphabetical arrangement of categories may be suitable for general purpose table.
• In the case of special purpose table arrangement may be made in the order of importance of these
categories.
Fig-13
49. 4. Classification by the size of observations :-
• When the data represent observations of some characteristic on a numerical scale, classification is
made on the basis of the individual observations.
• The range of observations is suitably divided into smaller divisions called class intervals.
• The numerical scale adopted may be either discrete or continuous.
Fig-14
50. Advantages of tabular presentation
• It is a convenient and sufficient form for presenting the statistical information.
• It summarises the information and displays important features of it.
• Unnecessary repetitions that may appear in texts are avoided.
• Comparison between localities, age groups etc. can be made easily.
• Errors and omissions in the information can be easily detected.
• Reference to any details of the data is facilitated.
51. B) Presentation by Graphs and Diagrams:-
After class wise or group wise tabulation, the frequencies of a characteristic can be presented by two
kinds of drawings: Graphs and diagrams.
They may be shown either by lines and dots or by figures.
The drawings are meant for non-statistically-minded people who want to study the relative values or frequencies of persons or events.
For statistically-minded persons, they are for quick eye readings.
Diagrams and graphs are extremely useful because:-
• They are attractive to the eyes.
• They give a bird's-eye view of the entire data.
• Have a lasting impression on the mind of the layman.
• Facilitate comparison of data.
52. Demerits of Diagrams:
Simplicity vs. Details: Diagrams often prioritize simplicity over details and accuracy.
Loss of Original Data: The simplicity in charts and diagrams may lead to the loss of
crucial details from the original data.
Need for Original Data: In-depth studies may require referring back to the original data.
Guidelines for Graphs, Figures, and Pictures:
Clear Titles: Ensure all graphs, figures, and pictures have clearly stated and informative
titles.
Labeling: Clearly label all classes and keys for better understanding.
Unit of Measurement: Include the appropriate unit of measurement for clarity.
53. Presentation of quantitative, continuous or measured data is through graphs.
The common graphs in use are:
Histogram
Frequency polygon
Frequency curve
Line chart or graph
Cumulative frequency diagram
Scatter or dot diagram
Bland–Altman plot
Forest plot
Presentation of qualitative, discrete or counted data is through diagrams.
The common diagrams in use are:
Bar diagram
Pie or sector diagram
Venn diagram
Pictogram or picture diagram
Map diagram or spot map.
54. Histogram
It is a graphical presentation of frequency distribution.
Variable characters of the different groups are indicated on the horizontal line (X-axis) called abscissa
while frequency, i.e. number of observations is marked on the vertical line (Y-axis) called ordinate.
Frequency of each group will form a column or rectangle. Such a diagram is called 'histogram' and
is made use of in presenting any quantitative data.
It is a bar diagram without gap between bars.
If we draw frequencies of each group or class intervals in the form of columns or rectangles such a
diagram is called histogram.
It represents a frequency distribution.
55. The histogram is constructed as follows:
• On the X axis, the size of the observation is marked.
• Starting from 0 the limit of each class interval is marked, the width corresponding to the width of
the class interval in the frequency distribution.
• On the Y axis the frequencies are marked.
• A rectangle is drawn above each class interval with height proportional to the frequency of that
interval.
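The construction steps above can be approximated without any plotting library by a text histogram. This is only a sketch: the bars are drawn horizontally, so the bar length rather than the rectangle height is proportional to the frequency, and the classes and frequencies are hypothetical.

```python
def text_histogram(class_frequencies, symbol="#"):
    """Return one line per class interval; bar length is proportional to frequency."""
    lines = []
    for (lower, upper), freq in class_frequencies:
        lines.append(f"{lower:>4}-{upper:<4} {symbol * freq} ({freq})")
    return lines

# Hypothetical equal-width class intervals with their frequencies.
classes = [((70, 79), 3), ((80, 89), 5), ((90, 99), 2)]
rows = text_histogram(classes)

for row in rows:
    print(row)
```

Because the class intervals have equal width, comparing bar lengths is equivalent to comparing the areas of the rectangles in a drawn histogram.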
Advantages of Histogram:
Easy to understand
Disadvantages of Histogram:
Only 1 histogram can be placed at a time.
More time consuming to construct than a frequency polygon.
56. Assessing the relationship between two variables
The forms of data presentation that have been
described up to this point illustrated the distribution
of a given variable, whether categorical or numerical.
In addition, it is possible to present the relationship
between two variables of interest, either categorical or
numerical.
The relationship between categorical variables
may be investigated using a contingency table, which
has the purpose of analyzing the association between
two or more variables. The lines of this type of table
usually display the exposure variable (independent
variable), and the columns, the outcome variable
(dependent variable). For example, in order to study
the effect of sun exposure (exposure variable) on the
development of skin cancer (outcome variable), it is
TABLE 3: Weight distribution among 18-year-old young males (n = 2,194). Pelotas, Brazil, 2010
Weight at 18 years of age (in kg)    Absolute frequency (n)    Relative frequency (%)
40.5 to 59.9                         554                       25.25
60.0 to 65.8                         543                       24.75
65.9 to 74.6                         551                       25.11
74.7 to 147.8                        546                       24.89
Total                                2,194                     100.00
FIGURE 4: Weight distribution at 18 years of age among youngsters from the city of Pelotas (n = 2,194), Brazil, 2010. [Histogram; vertical axis: percentage]
Fig-15: Weight distribution at 18 years of age among young males from the city of Pelotas, Brazil, 2010 (n = 2,194).[12]
58. Frequency polygon:
1. The most commonly used graphic device to illustrate statistical distribution.
2. Used to represent frequency distribution of quantitative data.
3. Useful to compare 2 or more frequency distributions.
• A frequency polygon is a variation of a histogram, in
which the bars are replaced by lines connecting the
midpoints of the tops of the bars.
• Advocates of the frequency polygon argue that the
purpose of a histogram is to show the shape of the
data distribution and removing the bars makes the
shape clearer and smoother.
Fig-17
59. Construction of frequency polygon:
• The variable is taken along the X axis and frequencies along the Y axis.
• Class frequencies are plotted against the class mid-values, and these points are then joined by straight lines, which gives the frequency polygon.
• Total area under the frequency polygon represents the total frequency.
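The points that are joined to form a frequency polygon can be computed directly from a grouped frequency table. The class limits and frequencies below are hypothetical, and `polygon_points` is an illustrative helper name.

```python
def polygon_points(class_limits, frequencies):
    """Pair each class mid-value (x) with its class frequency (y)."""
    return [((lo + hi) / 2, f) for (lo, hi), f in zip(class_limits, frequencies)]

# Hypothetical grouped frequency table, for illustration only.
limits = [(70, 79), (80, 89), (90, 99)]
freqs = [3, 5, 2]
points = polygon_points(limits, freqs)
```

Joining these points in order with straight lines gives the polygon; plotting a second list of points on the same axes allows two distributions to be compared.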
Advantages of frequency polygon:
• It is very easy to construct and very easy to interpret.
• It is useful in portraying more than two distributions on the same graph paper with different
colours. So it is very useful to compare 2 or more than 2 distributions.
60. Frequency curve:-
When the number of observations is very large and the class intervals are very much reduced, the frequency polygon tends to lose its angulation and forms a smooth curve known as a frequency curve.
• The variable is taken along the X axis and frequency along the Y axis.
• Frequencies are plotted against the class mid-values and then, these points are joined by a smooth
curve.
• The curve so obtained is the frequency curve.
• Total area under the frequency curve represents total frequency.
Fig-18
61. Line diagram:
• This diagram is useful for studying changes in the value of a variable over time.
• It is the simplest type of diagram.
• Time, such as hours, days, weeks, months or years, is represented on the X axis.
• The value of the quantity being studied is represented along the Y axis.
Fig-19 MTPs during 2002 to 2022
62. Cumulative frequency diagram or Ogive
• An ogive is a graph of the cumulative relative frequency distribution.
• To draw it, an ordinary frequency distribution table for quantitative data has to be converted into a cumulative frequency table.
• The cumulative frequency of a class interval is the total number of persons from the lowest value of the characteristic up to the highest value of the class under consideration. It is obtained by adding the frequency of the class in question to the frequencies of all preceding classes.
• Cumulative frequencies are plotted against the upper limits of the class intervals.
• These points are joined by a smooth freehand curve to give the cumulative frequency diagram, or ogive.
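As a small worked illustration of the conversion to a cumulative frequency table, the sketch below uses the four weight classes and frequencies from the Pelotas table shown earlier in this section (n = 2,194). The variable names are illustrative only.

```python
from itertools import accumulate

# Upper class limits and class frequencies from the weight table (n = 2,194)
upper_limits = [59.9, 65.8, 74.6, 147.8]
frequencies = [554, 543, 551, 546]

# Each cumulative frequency adds the class in question to all preceding classes
cumulative = list(accumulate(frequencies))
total = cumulative[-1]
cumulative_percent = [round(100 * c / total, 2) for c in cumulative]

# Points to plot for the ogive: (upper class limit, cumulative frequency)
ogive_points = list(zip(upper_limits, cumulative))
for limit, cum, pct in zip(upper_limits, cumulative, cumulative_percent):
    print(f"up to {limit:6.1f}: {cum:5d} ({pct:6.2f}%)")
```

The last cumulative frequency always equals the total number of observations, which is a quick check on the table.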
64. Scatter diagram or dot diagram:
• It is a graphic presentation of data.
• It is used to show the nature of the correlation between two variables.
Also called a correlation diagram, it is useful for representing the relationship between two numeric measurements, each observation being represented by a point corresponding to its value on each axis.
If the data points make a straight line going from the origin out to high x- and y-values, the variables are said to have a positive correlation. If the line goes from a high value on the y-axis down to a high value on the x-axis, the variables have a negative correlation. If no trend is shown, there is no correlation.[10]
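The direction of the trend seen in a scatter diagram can also be summarized numerically. The sketch below computes the Pearson correlation coefficient by hand for three hypothetical data sets; the data, the function names and the 0.3 cut-off used to label "no correlation" are assumptions made for illustration, not values from the source.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def describe(r, threshold=0.3):
    """Label the trend; the threshold is an arbitrary illustrative cut-off."""
    if abs(r) < threshold:
        return "no correlation"
    return "positive correlation" if r > 0 else "negative correlation"


# Hypothetical observations (not from the source)
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]
scattered = [3, 1, 4, 1, 3]

for name, ys in [("rising", rising), ("falling", falling), ("scattered", scattered)]:
    r = pearson_r(xs, ys)
    print(f"{name}: r = {r:.2f}, {describe(r)}")
```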
Fig-22
65. BLAND–ALTMAN PLOT
A Bland–Altman plot (difference plot) is a method of data plotting used in analyzing the agreement between two different assays. In the Bland–Altman plot, the differences between the two methods are plotted against the averages of the two methods. Alternatively, the differences can be plotted against one of the two methods, if that method is a reference method.
Columns: (A) potassium level (mEq/L) obtained from venous blood gas analysis; (B) potassium level (mEq/L) obtained from blood electrolyte levels; (C) mean potassium level (mEq/L); (D) difference between potassium levels (mEq/L).
Patient Nr.   (A)   (B)   (C)    (D)
1             4.5   4.7   4.6    0.2
2             3.8   4.2   4.0    0.4
3             5.1   5.1   5.1    0.0
4             4.9   5.3   5.1    0.4
5             3.9   4.0   3.95   0.1
6             4.0   3.8   3.9    -0.2
7             4.1   4.0   4.05   -0.1
8             4.3   4.0   4.15   -0.3
9             5.3   5.3   5.3    0.0
10            5.2   5.1   5.15   -0.1
11            3.9   4.0   3.95   0.1
12            4.1   4.4   4.25   0.3
13            4.0   4.2   4.1    0.2
14            5.3   5.1   5.2    -0.2
15            5.5   5.3   5.4    -0.2
16            4.4   4.2   4.3    -0.2
17            4.9   5.0   4.95   0.1
18            3.7   3.9   3.8    0.2
19            3.9   3.7   3.8    -0.2
20            4.8   4.7   4.75   -0.1
21            5.5   5.2   5.35   -0.3
22            3.7   3.8   3.75   0.1
23            3.7   3.9   3.80   0.2
24            4.8   4.2   4.5    -0.6
25            5.1   5.6   5.35   0.5
Dataset for potassium levels in venous blood gases and blood electrolyte work-up.
66. For our dataset, the mean difference (mean bias) was found to be 0.012 with an SD of 0.260. A scatterplot should be drawn to understand the dispersion of the variables, with the average on the X-axis and the difference on the Y-axis. The limits of agreement (LOA) can be drawn manually if the statistical software does not display them automatically. In our data set, the upper limit can be calculated as mean + 1.96 × SD (0.012 + 1.96 × 0.260 = 0.522) and the lower limit as mean − 1.96 × SD (0.012 − 1.96 × 0.260 = −0.498). An appropriate statement for the manuscript would be: The Bland-Altman plot showed the mean bias ± SD between the first and second potassium levels as 0.012 ± 0.260 mEq/L, and the limits of agreement were −0.498 and 0.522.[13]
Fig-22
Agreement between two potassium measurements (Bland-Altman plot).
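The figures quoted above can be reproduced from the 25 pairs of potassium values in the dataset. The sketch below is one way to do so, assuming the difference is taken as the blood electrolyte value minus the venous blood gas value (which matches the signs in the table) and that the SD is the sample standard deviation. Variable names are illustrative.

```python
from statistics import mean, stdev

# (venous blood gas, blood electrolyte) potassium levels in mEq/L, patients 1 to 25
pairs = [
    (4.5, 4.7), (3.8, 4.2), (5.1, 5.1), (4.9, 5.3), (3.9, 4.0),
    (4.0, 3.8), (4.1, 4.0), (4.3, 4.0), (5.3, 5.3), (5.2, 5.1),
    (3.9, 4.0), (4.1, 4.4), (4.0, 4.2), (5.3, 5.1), (5.5, 5.3),
    (4.4, 4.2), (4.9, 5.0), (3.7, 3.9), (3.9, 3.7), (4.8, 4.7),
    (5.5, 5.2), (3.7, 3.8), (3.7, 3.9), (4.8, 4.2), (5.1, 5.6),
]

# X-axis of the plot: average of the two methods; Y-axis: their difference
averages = [(venous + electrolyte) / 2 for venous, electrolyte in pairs]
differences = [electrolyte - venous for venous, electrolyte in pairs]

bias = mean(differences)          # mean difference (mean bias)
sd = stdev(differences)           # sample standard deviation of the differences
upper_loa = bias + 1.96 * sd      # upper limit of agreement
lower_loa = bias - 1.96 * sd      # lower limit of agreement

print(f"mean bias = {bias:.3f}, SD = {sd:.3f}")
print(f"limits of agreement: {lower_loa:.3f} to {upper_loa:.3f}")
```

Plotting `averages` against `differences`, with horizontal lines at the bias and at the two limits, gives the Bland-Altman plot itself.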
67. FOREST PLOT
A forest plot, also known as a blobbogram, is a graphical display of estimated results from a number of scientific studies addressing the same question, along with the overall results. It is a graphical representation of a meta-analysis. It is usually accompanied by a table listing references (author and date) of the studies included in the meta-analysis, with their estimated results.[10]
Fig-24
68. [Figure: dot plot with error bars; y-axis: factors f1 to f5; x-axis: odds ratio (95% CI), ticks 0.0 to 2.5]
An example of a dot plot with an error bar. For each level of factors (y-axis), the corresponding odds ratio (OR) and 95% CI are presented using a dot and an accompanying horizontal error bar. The dotted line indicates the reference value of 1. The estimated OR would not be different from 1.0 statistically if its error bar crossed this reference line.
Table 6. Estimated OR and 95% CI of Logistic Regression Model
Factor   OR (95% CI)          P value
F1       1.24 (1.12, 1.38)*   < 0.001
F2       1.76 (1.26, 2.51)*   0.001
F3       1.10 (0.80, 1.50)    0.557
F4       1.00 (0.98, 1.02)    0.810
F5       1.09 (0.99, 1.20)    0.083
OR: odds ratio. *Two-sided P < 0.05.
Survival analysis
Survival analysis is a statistical method that can be applied to mortality data and various types of longitudinal data. There are various methods, from the nonparametric Kaplan-Meier method to more complex methods involving different parametric models. Kaplan-Meier survival analysis and Cox regression models are widely used in the medical field. Survival analysis results are usually accompanied by a survival curve, which can increase the reader's understanding of the results through visualization.
Estimated OR and 95% CI of Logistic Regression Model
Fig-25
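The reading rule for such a plot (an odds ratio is not statistically different from 1 when its error bar crosses the reference line) can be checked directly against the values in Table 6. The sketch below is illustrative only; the helper name `crosses_reference` is an assumption.

```python
# (factor, odds ratio, lower 95% CI limit, upper 95% CI limit) from Table 6
table6 = [
    ("F1", 1.24, 1.12, 1.38),
    ("F2", 1.76, 1.26, 2.51),
    ("F3", 1.10, 0.80, 1.50),
    ("F4", 1.00, 0.98, 1.02),
    ("F5", 1.09, 0.99, 1.20),
]


def crosses_reference(lower, upper, reference=1.0):
    """True if the confidence interval contains the reference value."""
    return lower <= reference <= upper


for factor, odds_ratio, lower, upper in table6:
    verdict = "crosses 1" if crosses_reference(lower, upper) else "does not cross 1"
    print(f"{factor}: OR {odds_ratio:.2f} ({lower:.2f}, {upper:.2f}) {verdict}")

# Factors whose 95% CI excludes 1
significant = [factor for factor, _, lower, upper in table6
               if not crosses_reference(lower, upper)]
```

The factors flagged this way are the same ones marked with an asterisk in Table 6.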
69. Bar diagram
1. This diagram is used to represent qualitative data.
2. It represents only one variable.
3. The width of the bars remains the same and only the length varies according to the frequency in each category.
There are 3 types of bar diagram:
• Simple bar
• Multiple bar or compound bar
• Component bar, proportional bar or stacked bar
70. Simple bar:
The limitation of this method is that it can represent only one classification and hence cannot be used for comparison.
Fig-26 Mortality due to various cases
Fig-27 Cases of gastroenteritis in a hospital in 2022
71. Multiple bar or compound bar:
Here two or more bars are grouped together. In Fig-28, the population of a country is shown with three bars, for Hindus, Muslims and others, over two censuses. Fig-29 shows the sex-wise and standard-wise distribution of students passing from a school.
Fig-28 Population of a country as per the religion
Fig-29 % of students passing in school
72. Component bar diagram:
• This diagram is used to represent qualitative data.
• It is used when both the number of cases in the major groups and the number in their subgroups are to be represented simultaneously.
Fig-30 Expenditures on various items in two communities
Fig-31 Proportion of energy obtained from various food stuffs by rich and poor community
73. Pie diagram:
• Pie diagrams are popularly used to show percentage breakdowns for qualitative data.
• The diagram is so called because the entire graph looks like a pie and its components represent slices cut from a pie.
• A circle is divided into sectors corresponding to the frequencies of the distribution.
• Some knowledge of circles and degrees is necessary.
• The total angle at the center of the circle is 360 degrees, and it represents the total frequency.
• After the angles are calculated, the sectors are drawn in the circle and shaded with different shades or colors, and an index is provided for the shades or colors.
• It cannot be used to represent two or more data sets.
Fig-32 Pattern of expenditure in an urban community
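Since the full circle of 360 degrees stands for the total frequency, each sector's angle is its frequency divided by the total, multiplied by 360. A minimal sketch of that calculation follows; the expenditure categories and values are hypothetical and are not taken from Fig-32.

```python
def sector_angles(frequencies):
    """Map each category to its sector angle in degrees (full circle = 360)."""
    total = sum(frequencies.values())
    return {label: 360 * value / total for label, value in frequencies.items()}


# Hypothetical expenditure breakdown (not from the source)
expenditure = {"Food": 40, "Housing": 30, "Education": 20, "Other": 10}

angles = sector_angles(expenditure)
for label, angle in angles.items():
    print(f"{label}: {angle:.1f} degrees")
```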
74. [Figure: pie chart; legend: RMW, Blue wrap, Clear wrap, Plastics, Cardboard; values shown: 29,344 g, 2,102 g, 2,838 g, 2,388 g, 1,564 g]
Fig-33 Pie chart. Total weight of each component from the three operations. RMW: regulated medical waste (Adapted from Korean J Anesthesiol 2017; 70: 100-4).[11]
75. Venn Diagram
• It shows the degrees of overlap and exclusivity for two or more characteristics or factors
within a sample or population (in which case each characteristic is represented by a
whole circle) or for a characteristic or factor among two or more samples or populations
(in which case each sample or population is represented by a whole circle).
• The sizes of the circles (or other symbols) need not be equal and may represent the
relative size for each factor or population.
Fig-34 No. of COVID cases as per reporting agency
76. Pictogram
• Display of data through pictograms was initiated by Dr Otto Neurath in 1923.
• Data are displayed by the pictures of the items to which the data pertain.
• A single picture represents a fixed number of units.
• They are the least satisfactory type of diagram.
• They are also inaccurate.
Fig-35
77. Map diagram or spot map or cartograms:
1. These maps are used to show the geographical distribution of frequencies of a characteristic such as IMR, MMR, etc.
Fig-29 Estimated Infant Mortality Rate-2015
78. Other types of presentation of data
STEM AND LEAF
• It is mainly used for the presentation of quantitative data.
• It is used to study the shape of the distribution.
• It can be used to compare two or more distributions.
• It is useful for smaller data sets.
• Each value is displayed in two parts, one for the stem and one for the leaf.
Consider this example of two groups of patients with hypertension having weights as given below:
Group I: 50, 51, 60, 62, 63, 65, 68, 74, 78, 82, 83, 84, 85
Group II: 51, 52, 53, 54, 56, 58, 61, 63, 65, 67, 68, 71, 72, 80, 85
We can present this in tabular form as below:
Fig-30
79. Class intervals are represented by stems. For group I, the class intervals 50 to 59, 60 to 69, 70 to 79 and 80 to 89 are represented by stems 5, 6, 7 and 8 respectively. The weights 51 and 68 are thus represented by leaf 1 on stem 5 and leaf 8 on stem 6 respectively.
The stem and leaf plot for group I data can be shown as below:
Fig-31
The stem and leaf plot for group I and group II data can be shown as below:
Fig-32
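The splitting of each weight into a stem (tens digit) and a leaf (units digit) can be sketched as follows, using the Group I and Group II weights from the example. The function name and the text layout of the printed plot are illustrative choices, not the layout of the original figures.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group two-digit values by stem (tens digit) with sorted leaves (units digit)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)


group_1 = [50, 51, 60, 62, 63, 65, 68, 74, 78, 82, 83, 84, 85]
group_2 = [51, 52, 53, 54, 56, 58, 61, 63, 65, 67, 68, 71, 72, 80, 85]

for name, group in [("Group I", group_1), ("Group II", group_2)]:
    print(name)
    for stem, leaves in sorted(stem_and_leaf(group).items()):
        print(f"  {stem} | {' '.join(str(leaf) for leaf in leaves)}")
```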
80. Box and whisker plot :
It is a representation of the quartiles (25%, 50% and 75%) and the range of a continuous and ordered data set. The y-axis can be arithmetic or logarithmic. Box plots can be used to compare different distributions of data values.
Steps for drawing box and whisker plots:
• Determine from the given data set the smallest value, the largest value, and Q1, Q2 and Q3, i.e. the first, second and third quartiles respectively.
• Mark the scale on the X or Y axis.
• Draw a box (a rectangle of any convenient width and of length Q3 − Q1) with its ends at the points for the first and third quartiles.
• Draw a line through the box at the median point (Q2), perpendicular to the scale axis.
• Draw the whiskers (lines) from each end of the box to the smallest and largest values.
81. Box plots characterize a sample using the minimum, the 25th, 50th and 75th percentiles, and the maximum. The interquartile range (IQR = Q3 − Q1, where Q1 is the first quartile or 25th percentile and Q3 is the third quartile or 75th percentile) covers the central 50% of the data. Quartiles are insensitive to outliers and preserve information about the center and spread (variation). If a data point is below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, it is viewed as being too far from the central values (median) and is called an outlier.
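The quartiles, IQR and outlier limits described above can be computed as in the sketch below, which reuses the Group II weights from the stem and leaf example. Note that several conventions exist for computing quartiles; this sketch uses the default ("exclusive") method of Python's `statistics.quantiles`, so other software may give slightly different Q1 and Q3. The function name is illustrative.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Five-number summary plus the 1.5 x IQR outlier limits."""
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4)  # default "exclusive" method
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    outliers = [v for v in data if v < lower_limit or v > upper_limit]
    return {
        "min": data[0], "q1": q1, "median": q2, "q3": q3, "max": data[-1],
        "iqr": iqr, "lower_limit": lower_limit, "upper_limit": upper_limit,
        "outliers": outliers,
    }


# Group II weights from the stem and leaf example
group_2 = [51, 52, 53, 54, 56, 58, 61, 63, 65, 67, 68, 71, 72, 80, 85]

summary = box_plot_summary(group_2)
for key, value in summary.items():
    print(f"{key}: {value}")
```

For these data no value falls outside the limits, so the whiskers simply run to the minimum and maximum.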
Fig-33 An example of a box-whisker plot. Estimated median (Q1, Q3) [min:max] from the sample data is 1.1 (0.8, 1.3) [0.1:2.1]. This graph includes explanations of the components of the box-whisker plot. These are not necessary for the general purpose of publication. A significance marker can be added, though it was not used in this graph. If a significance marker is added, it should be located on the shoulder or alongside the whisker. If markers are located over the mid-top of the whiskers, these could be interpreted as outliers if no detailed explanation is provided. The limits of the whiskers can be varied depending on the purpose.
82. Fig-33 Box & whisker plot showing the distribution of height of boys in two classes A & B
83. Types of Charts Depending on the Method of Analysis of the Data
Analysis       Subgroup             Number of variables                         Type
Comparison     Among items          Two per item                                Variable width column chart
                                    One per item                                Bar/column chart
               Over time            Many periods                                Circular area/line chart
                                    Few periods                                 Column/line chart
Relationship                        Two                                         Scatter chart
                                    Three                                       Bubble chart
Distribution                        Single                                      Column/line histogram
                                    Two                                         Scatter chart
                                    Three                                       Three-dimensional area chart
Comparison     Changing over time   Only relative differences matter            Stacked 100% column chart
                                    Relative and absolute differences matter    Stacked column chart
               Static               Simple share of total                       Pie chart
                                    Accumulation                                Waterfall chart
                                    Components of components                    Stacked 100% column chart with subcomponents
84. CONCLUSION
In conclusion, we have covered the basics of data collection, from defining data types to exploring measurement scales. We discussed and outlined various sources for data collection. Text, tables, and graphs are effective communication media that present and convey data and information. They aid readers in understanding the content of research, sustain their interest, and effectively present large quantities of complex information. As journal editors and reviewers will scan through these presentations before reading the entire text, their importance cannot be disregarded. For this reason, authors must pay as close attention to selecting appropriate methods of data presentation as they do to collecting data of good quality and analyzing them. In addition, a well-established understanding of the different methods of data presentation and their appropriate use enables one to recognize and interpret inappropriately presented data, or data presented in a way that deceives readers' eyes.
85. REFERENCES
1. Jay S. Kim and Ronald J. Dailey. Biostatistics for Oral Healthcare. Blackwell Publishing Company; 2008.
2. C.R. Kothari. Research Methodology: Methods and Techniques. 4th edition. New Age International Private Ltd Publishers; 2019. Reprint 2021.
3. Khanal AB. Mahajan’s methods in biostatistics for medical students and research workers. 9th ed. New Delhi, India: Jaypee Brothers Medical; 2015.
4. Dr. J.V. Dixit. Principles and Practice of Biostatistics. 8th edition. Bhanot.
5. Rao TB. Methods of biostatistics. 3rd ed. Hyderabad: Paras Medical Publisher; 2010.
6. C.M. Marya. A textbook of public health dentistry. 1st edition. New Delhi: Jaypee Brothers Medical Publishers; 2011.
7. Mazhar SA, Anjum R, Anwar AI, Khan AA. Methods of Data Collection: A Fundamental Tool of Research. J Integ Comm Health. 2021;10(1):6-10.
8. Researchgate.net. [cited 2023 Dec 18]. Available from: https://www.researchgate.net/publication/325846997_METHODS_OF_DATA_COLLECTIONenrichId=rgreqf6733eb7ba5b1666d4b32342979ead09XXX&enrichSource=Y292ZXJQYWdlOzMyNTg0Njk5NztBUzo2NDE0NjI5MDc3MTU1ODVAMTUyOTk0ODA4MzU4Ng%3D%3D&el=1_x_2&_esc=publicationCoverPdf
9. Bhandari P. Data collection [Internet]. Scribbr. 2020 [cited 2023 Dec 19]. Available from: https://www.scribbr.com/methodology/data-collection/
86. 10. Mishra P, Pandey CM, Singh U, Gupta A. Scales of measurement and presentation of statistical data. Ann Card Anaesth. 2018;21:419-22.
11. Shinn HK, Hwang Y, Kim BG, Yang C, Na W, Song JH, et al. Segregation for reduction of regulated medical waste in the operating room: a case report. Korean J Anesthesiol. 2017;70:100-4.
12. Duquia RP, Bastos JL, Bonamigo RR, González-Chica DA, Martínez-Mesa J. Presenting data in tables and charts. An Bras Dermatol. 2014;89(2):280-5.
13. Nurettin Özgür Doğan. Bland-Altman analysis: A paradigm to understand correlation and agreement. Turkish Journal of Emergency Medicine. 2018;18(4):139-141.