3. Contents
• Data preprocessing
– Data cleaning
– Data integration and transformation
• Types of Analysis
• Model diagnosis and model building
4/19/2023 3
4. Introduction
• The findings presented need to be consistent with
the methods (the procedures, the tests to be used and
the selected data analysis).
• The protocol should provide information on how
the data will be managed, including data coding for
computer analysis, monitoring and verification.
5. Introduction
What is expected from a plan of analysis:
Methods and models of data analysis according to
types of variables
– Based on the proposed objectives and types of
variables, the investigator should specify how the
variables will be measured and how they will be
presented (quantitative and/or qualitative),
indicating the analytical models and techniques.
6. Introduction
What is expected from a plan of analysis:
– The investigator should provide a preliminary
scheme for tabulating the data.
– It is recommended that special attention be given
to the key variables that will be used in the
statistical models.
7. Programs to be used for data analysis:
– Briefly describe the software packages that will be
used and their anticipated applications.
– State the power of the study, the level of significance
to be used, and the procedures for accounting for any
missing or spurious data.
– For projects involving qualitative approaches, specify
in sufficient detail how the data will be analyzed.
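As an illustration of how power and significance level enter a plan of analysis, here is a minimal sketch of a sample-size calculation for comparing two means. It assumes a two-sided test, equal group sizes and a normal approximation; the function name and the values of sigma and delta are hypothetical and do not come from the slides.

```python
from math import ceil
from statistics import NormalDist


def sample_size_two_means(sigma, delta, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect a mean
    difference `delta` when the outcome has standard deviation `sigma`.

    Assumption: two-sided test, equal groups, normal approximation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for the power
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return ceil(n)


# Hypothetical planning values: SD of 10 kg, smallest difference of interest 5 kg.
n_per_group = sample_size_two_means(sigma=10.0, delta=5.0)
print(n_per_group)
```

Raising the power or lowering the significance level increases the required sample size, which is why both should be fixed in the protocol before data collection.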
8. Data Analysis
Steps in Data Analysis
• Data Collection & Preparation
• Exploration of Data
• Data Analysis Method(s)/Techniques
9. Data Preparation
• Collect data
• Preparation of code books
• Set up structure of data
• Enter data
• Screen data for errors
10. Exploration Of Data
• Graphs
• Descriptive statistics
Data processing
• Any operation or set of operations performed upon
data, whether or not by automatic means, such as
collection, recording, organization, storage,
adaptation or alteration to convert it into useful
information.
11. Why Data Processing?
• Data in the real world is dirty
– Incomplete: lacking attribute values, lacking
certain attributes of interest, or containing only
aggregate data
– Noisy: containing errors or outliers
– Inconsistent: containing discrepancies in codes
or names
• No quality data, no quality mining results!
• Quality decisions must be based on quality data
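The three kinds of dirty data listed above can be screened for with simple checks. The following is a minimal sketch; the records, the valid codes and the plausible weight range are all hypothetical values chosen for illustration.

```python
# Hypothetical study records; None marks a value that was not recorded.
records = [
    {"id": 1, "sex": "F", "age": 34, "weight_kg": 61.5},
    {"id": 2, "sex": "female", "age": None, "weight_kg": 58.0},
    {"id": 3, "sex": "M", "age": 29, "weight_kg": 720.0},
]

VALID_SEX_CODES = {"F", "M"}   # assumed coding scheme
WEIGHT_RANGE = (30.0, 250.0)   # assumed plausible range in kg


def find_problems(rec):
    """Return the list of data-quality problems found in one record."""
    problems = []
    # Incomplete: at least one attribute value is missing.
    if any(value is None for value in rec.values()):
        problems.append("incomplete")
    # Noisy: a recorded value falls outside the plausible range.
    weight = rec["weight_kg"]
    if weight is not None and not (WEIGHT_RANGE[0] <= weight <= WEIGHT_RANGE[1]):
        problems.append("noisy")
    # Inconsistent: a code that does not match the agreed scheme.
    if rec["sex"] not in VALID_SEX_CODES:
        problems.append("inconsistent")
    return problems


report = {rec["id"]: find_problems(rec) for rec in records}
print(report)
```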
12. Steps of data processing
• There are 6 steps included in data processing:
– Editing
– Coding
– Classification
– Data Entry
– Validation
– Tabulation
13. Data Editing
• Editing of data is a process of examining the
collected raw data to detect errors and
omissions and to correct these when possible.
• With regard to the stage at which it is done, editing is of two kinds:
1. Field Editing
2. Central Editing
14. Coding
• Coding refers to the process of assigning numerals or
other symbols to answers so that responses can be
put into a limited number of categories or classes.
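A minimal sketch of coding, assuming a hypothetical five-point agreement scale and a code book that reserves 9 for missing or uncodable answers (both are illustrative choices, not taken from the slides):

```python
# Hypothetical code book mapping answer text to numeric codes.
CODEBOOK = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}
MISSING_CODE = 9  # assumed code for missing or uncodable answers


def code_response(answer):
    """Translate one free-text answer into its numeric code."""
    if answer is None:
        return MISSING_CODE
    # Normalise case and surrounding spaces before the lookup.
    return CODEBOOK.get(answer.strip().lower(), MISSING_CODE)


answers = ["Agree", " neutral", None, "Strongly Agree", "n/a"]
coded = [code_response(a) for a in answers]
print(coded)
```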
Data Entry
• After the data has been properly arranged and
coded, it is entered into the software that performs
the eventual cross tabulation.
• Trained data entry staff can do this task efficiently.
15. Validation
• The validation process comes after the cleaning
phase.
• It refers to the process of thoroughly checking the
collected data to ensure optimal quality levels.
• All the accumulated data are double-checked in
order to ensure that they contain no inconsistencies
and are relevant.
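One common way to double-check entered data is to key it in twice and compare the two versions. The sketch below assumes that approach; the two entry sets and the function name are hypothetical.

```python
# Two hypothetical, independently keyed versions of the same records,
# indexed by record id.
entry_a = {
    1: {"age": 34, "weight_kg": 61.5},
    2: {"age": 41, "weight_kg": 70.2},
}
entry_b = {
    1: {"age": 34, "weight_kg": 61.5},
    2: {"age": 14, "weight_kg": 70.2},  # digits transposed during keying
}


def compare_entries(a, b):
    """List (record id, field, value in a, value in b) for every mismatch
    among the records present in both versions."""
    mismatches = []
    for rec_id in sorted(a.keys() & b.keys()):
        for field, value_a in a[rec_id].items():
            value_b = b[rec_id].get(field)
            if value_a != value_b:
                mismatches.append((rec_id, field, value_a, value_b))
    return mismatches


mismatches = compare_entries(entry_a, entry_b)
print(mismatches)
```

Each mismatch is then resolved against the original questionnaire rather than by guessing which version is right.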
16. Tabulation
• Tabulation is the process of summarizing raw data and
displaying it in compact form for further
analysis.
Benefits:
1. It reduces explanatory statements to a minimum
2. It facilitates the process of comparison
3. It facilitates the summation of items and detection of
errors
4. It provides a basis for various statistical computations
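A minimal sketch of tabulation, turning raw (sex, answer) pairs into a two-way table with row and column totals. The data are hypothetical.

```python
from collections import Counter

# Hypothetical raw data: one (sex, answer) pair per respondent.
rows = [("F", "yes"), ("F", "no"), ("M", "yes"),
        ("F", "yes"), ("M", "no"), ("M", "no")]

counts = Counter(rows)
row_labels = sorted({sex for sex, _ in rows})
col_labels = sorted({answer for _, answer in rows})

# Build the cross-tabulation; Counter returns 0 for combinations never seen.
table = {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}
row_totals = {r: sum(table[r].values()) for r in row_labels}
col_totals = {c: sum(table[r][c] for r in row_labels) for c in col_labels}

# Print the table in compact form.
print("sex", *col_labels, "total", sep="\t")
for r in row_labels:
    print(r, *(table[r][c] for c in col_labels), row_totals[r], sep="\t")
print("total", *(col_totals[c] for c in col_labels), len(rows), sep="\t")
```

The totals double as an error check: the row totals and the column totals must both add up to the number of respondents.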
18. Descriptive statistics
• Descriptive statistics is the term given to the
analysis of data that helps describe, show or
summarize data in a meaningful way such that
patterns might emerge from the data.
• Does not allow us to make conclusions beyond the
data we have analyzed or reach conclusions
regarding any hypotheses we might have made.
• Simply a way to describe the data.
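A minimal sketch of descriptive statistics using Python's standard `statistics` module. The weights are hypothetical.

```python
import statistics

# Hypothetical weights (kg) of study participants.
weights = [58.0, 61.5, 64.0, 70.2, 70.2, 82.4]

summary = {
    "n": len(weights),
    "mean": statistics.mean(weights),
    "median": statistics.median(weights),
    "mode": statistics.mode(weights),
    "sd": statistics.stdev(weights),  # sample standard deviation
    "min": min(weights),
    "max": max(weights),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}" if isinstance(value, float) else f"{name}: {value}")
```

These numbers describe only the six participants observed; they say nothing by themselves about a wider population.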
19. Inferential statistics
• Inferential statistics is concerned with making
predictions or inferences about a population from
observations and analysis of a sample.
• We can take the results of an analysis using a
sample and can generalize it to the larger population
that the sample represents.
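As one example of generalizing from a sample to a population, the sketch below computes a confidence interval for a population mean. It assumes a simple random sample and uses a normal approximation; for a sample this small a t critical value would usually be preferred. The data and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical sample of participant weights (kg).
sample = [58.0, 61.5, 64.0, 66.3, 70.2, 72.8, 75.0, 77.1, 79.4, 82.4]


def mean_confidence_interval(values, confidence=0.95):
    """Confidence interval for the population mean.

    Assumption: normal approximation (z critical value), simple random sample.
    """
    n = len(values)
    centre = mean(values)
    standard_error = stdev(values) / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return centre - z * standard_error, centre + z * standard_error


low, high = mean_confidence_interval(sample)
print(f"sample mean: {mean(sample):.1f} kg")
print(f"95% CI for the population mean: {low:.1f} to {high:.1f} kg")
```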
20. Major Tasks in Data Preprocessing
Data cleaning
• Fill in missing values, smooth noisy data, identify or
remove outliers and duplicate records, and resolve
inconsistencies
Data integration
• Integration of multiple databases, data cubes, or files
Data transformation
• Normalization and aggregation
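The three tasks can be sketched on a toy dataset as follows. The records, the second data source, mean imputation and min-max scaling are illustrative assumptions; other cleaning and normalization choices are equally valid.

```python
from statistics import mean

# Hypothetical weight records; id 2 was entered twice and has no weight.
raw = [
    {"id": 1, "weight_kg": 60.0},
    {"id": 2, "weight_kg": None},
    {"id": 2, "weight_kg": None},
    {"id": 3, "weight_kg": 80.0},
    {"id": 4, "weight_kg": 100.0},
]
# Hypothetical second source holding ages, keyed by the same id.
ages = {1: 34, 2: 41, 3: 29, 4: 52}

# Cleaning, step 1: drop duplicate records, keeping the first per id.
seen, cleaned = set(), []
for rec in raw:
    if rec["id"] not in seen:
        seen.add(rec["id"])
        cleaned.append(dict(rec))

# Cleaning, step 2: fill missing weights with the mean of observed weights.
observed = [r["weight_kg"] for r in cleaned if r["weight_kg"] is not None]
fill_value = mean(observed)
for rec in cleaned:
    if rec["weight_kg"] is None:
        rec["weight_kg"] = fill_value

# Integration: attach the age from the second source.
for rec in cleaned:
    rec["age"] = ages.get(rec["id"])

# Transformation: min-max normalization of weight to the range 0..1.
lowest = min(r["weight_kg"] for r in cleaned)
highest = max(r["weight_kg"] for r in cleaned)
for rec in cleaned:
    rec["weight_scaled"] = (rec["weight_kg"] - lowest) / (highest - lowest)

for rec in cleaned:
    print(rec)
```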
22. Three types of analysis
Univariate analysis
Analyzing and presenting the information relating to a
single variable (e.g., weight of study participants)
Bivariate analysis
The examination of two variables simultaneously (e.g., the
relation between gender and weight of study participants)
Multivariate analysis
The examination of more than two variables simultaneously
(e.g., the relationship between gender, race and weight of
study participants)
23. Types of analysis and their purpose
I. Univariate analysis
Purpose:
– Mainly description
– To gain an understanding of the distribution of data
– To exclude variables from further analysis if they
have
• Little variability
• A high number of missing observations
24. I. Univariate analysis
• To inspect the distribution of explanatory variables
– For categorical variables, create their
frequency tables (these will reveal any cells with low
(<5) or zero frequency).
– For continuous variables, estimate means and
standard deviations.
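A minimal sketch of this univariate inspection: a frequency table that flags low-count categories, and a mean and standard deviation with a count of missing values. The data and helper names are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical explanatory variables; None marks a missing observation.
sex = ["F", "F", "F", "F", "F", "F", "M", "M", "M", None]
weight_kg = [58.0, 61.5, 64.0, None, 70.2, 66.3, 59.9, 82.4, 75.0, 77.1]


def frequency_table(values, low=5):
    """Count each category and flag those with fewer than `low` observations."""
    counts = Counter(v for v in values if v is not None)
    return {cat: {"n": n, "low": n < low} for cat, n in counts.items()}


def continuous_summary(values):
    """Mean, standard deviation and number of missing observations."""
    observed = [v for v in values if v is not None]
    return {
        "n": len(observed),
        "missing": len(values) - len(observed),
        "mean": mean(observed),
        "sd": stdev(observed),
    }


freq = frequency_table(sex)
summary = continuous_summary(weight_kg)
print(freq)
print(summary)
```

A variable with a very small standard deviation, or with a large share of missing observations, is a candidate for exclusion from further analysis.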
25. Types of analysis and their purpose
II. Bivariate analysis
Purpose:
– Determining the empirical relationship between
the two variables
– We test the association without worrying about
other variables or confounders
– This is essential in order to shortlist variables for
multivariable analysis
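Two simple bivariate summaries are sketched below: group means for a categorical and a continuous variable (gender and weight), and Pearson's correlation coefficient for two continuous variables (height and weight). The data are hypothetical, and neither summary takes other variables or confounders into account.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean

# Hypothetical participants: (gender, height in cm, weight in kg).
participants = [
    ("F", 150.0, 52.0), ("F", 160.0, 58.0), ("M", 170.0, 69.0),
    ("M", 180.0, 77.0), ("M", 190.0, 90.0),
]


def group_means(pairs):
    """Mean of the continuous value within each category."""
    groups = defaultdict(list)
    for category, value in pairs:
        groups[category].append(value)
    return {category: mean(values) for category, values in groups.items()}


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


weight_by_gender = group_means((g, w) for g, _, w in participants)
heights = [h for _, h, _ in participants]
weights = [w for _, _, w in participants]
r = pearson_r(heights, weights)

print(weight_by_gender)
print(f"r = {r:.3f}")
```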
26. Types of analysis and their purpose
III. Multivariate analysis
Purpose:
– In this step we test associations of variables with the
outcome after accounting for other variables and
confounders.
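One simple way to account for another variable is stratification: the association is estimated within each level of the third variable and the stratum results are then combined. The sketch below uses that approach on hypothetical data with age group as the third variable; in practice a regression model is the more usual tool, so this is an illustration of the idea rather than a recommended method.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical participants: (gender, age group, weight in kg).
rows = [
    ("M", "young", 70.0), ("M", "young", 72.0),
    ("M", "old", 80.0), ("M", "old", 82.0), ("M", "old", 84.0), ("M", "old", 86.0),
    ("F", "young", 64.0), ("F", "young", 66.0), ("F", "young", 68.0), ("F", "young", 62.0),
    ("F", "old", 76.0), ("F", "old", 78.0),
]


def mean_difference(subset):
    """Mean weight of men minus mean weight of women in the given rows."""
    by_gender = defaultdict(list)
    for gender, _, weight in subset:
        by_gender[gender].append(weight)
    return mean(by_gender["M"]) - mean(by_gender["F"])


# Crude (bivariate) estimate that ignores age group.
crude = mean_difference(rows)

# Stratum-specific estimates, one per age group.
strata = defaultdict(list)
for row in rows:
    strata[row[1]].append(row)
stratum_diffs = {age: mean_difference(subset) for age, subset in strata.items()}

# Adjusted estimate: average of stratum differences weighted by stratum size.
adjusted = sum(len(strata[age]) * diff for age, diff in stratum_diffs.items()) / len(rows)

print(f"crude difference: {crude:.1f} kg")
print(f"stratum differences: {stratum_diffs}")
print(f"age-adjusted difference: {adjusted:.1f} kg")
```

In this toy dataset the men are on average older, so part of the crude difference reflects age rather than gender, and the adjusted difference is smaller.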