2. *
Data processing occurs when data is collected and translated into
usable information.
Usually performed by a data scientist or a team of data scientists, data
processing must be done correctly so as not to
negatively affect the end product, or data output.
Data processing is concerned with Editing, Coding, Classifying,
Tabulating and Charting and Diagramming research data.
Data Processing in research consists of five important steps:
1. Editing of Data
2. Coding of Data
3. Classification of Data
4. Tabulation of Data
5. Data diagram
3. *
*Data editing is the application of checks to detect missing, invalid
or inconsistent entries or to point to data records that are
potentially in error. No matter what type of data you are working
with, certain edits are performed at different stages or phases of
data collection and processing.
Purpose of Data Editing
1. Clarify responses
2. Check for omissions
3. Avoid biased editing
4. Make judgements
5. Make logical adjustments
4. *
*Data coding in research methodology is a preliminary step to
analyzing data. The data obtained from surveys,
experiments or secondary sources is in raw form.
*This data needs to be refined and organized to evaluate and draw
conclusions.
*Data coding is not an easy job and the person or persons involved
in data coding must have knowledge and experience of it.
*Data coding is the process of converting data into a form that can
be analyzed. It involves assigning numerical or categorical codes
to data items, such as responses to survey questions or
demographic information. Coded data can then be analyzed using
statistical software or other tools.
5. *
*Classification is the way of arranging the data in different classes in
order to give a definite form and a coherent structure to the data
collected, facilitating their use in the most systematic and
effective manner.
Objectives of classification of data
*To group heterogeneous data under homogeneous groups with
common characteristics
*To bring out the similarity within various groups
*To facilitate effective comparison
*To present complex, haphazard and scattered data in a concise,
logical, homogeneous, and intelligible form
*To maintain clarity and simplicity of complex data
*To identify independent and dependent variables and establish
their relationship
6. *
*Tabulation is a method of presenting numeric data in rows and columns in
a logical and systematic manner to aid comparison and statistical analysis.
It allows for easier comparison by putting relevant data closer together,
and it aids in statistical analysis and interpretation.
*Tabulation, in other terms, is the process of arranging organized data into
a tabular format.
*Depending on the nature of the classification, tabulation might be
simple, double, or complex.
*The goal of a tabulation chart/data is to present a significant amount of
complicated data in a systematic way that allows readers to derive logical
conclusions and interpretations from it.
Objectives of Tabulation
*For the Purpose of Data Simplification
*To Draw Attention to Important Information
*To Make Comparisons Easier
*To Assist with Data Statistical Analysis
*Conserves space
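The tabulation ideas above can be sketched in code. This hypothetical example builds a simple (one-way) and a double (two-way) frequency table from raw records using only the standard library.

```python
from collections import Counter

# Hypothetical raw records: (gender, response) pairs from a survey.
records = [("male", "yes"), ("female", "no"), ("female", "yes"),
           ("male", "yes"), ("female", "yes")]

# Simple tabulation: frequencies by a single attribute (gender).
simple = Counter(g for g, _ in records)

# Double tabulation: frequencies by two attributes jointly.
double = Counter(records)

print(simple["female"], simple["male"])  # 3 2
print(double[("female", "yes")])         # 2
```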
7. *
*Diagrams have been used to collect data from research subjects
by asking them to either draw a diagram themselves or modify a
prototypic diagram supplied by the researcher.
*The use of diagrams in data collection has been viewed favorably
in helping to gather rich data on healthcare topics.
*Diagrams and charts are important because they present
information visually.
*The adage “a picture is worth a thousand words” applies when it
comes to diagrams and charts, which help readers understand
information visually.
8. Creative presentation of data is possible. Data diagrams are
classified into:
*Charts: A chart is a diagrammatic form of data presentation. Bar
charts, rectangles, squares and circles can be used to present data.
Bar charts are one-dimensional, while rectangles, squares and
circles are two-dimensional.
*Graphs: The method of presenting numerical data in visual form is
called a graph. A graph shows the relationship between two variables by
means of either a curve or a straight line. Graphs may be divided
into two categories:
(1) Graphs of Time Series and
(2) Graphs of Frequency Distribution.
In graphs of time series, one of the factors is time and the other factor
or factors are the study factors. Graphs of frequency distribution show
how a variable, such as the income or age of executives, is distributed.
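A graph of frequency distribution rests on a frequency table. The sketch below, with hypothetical ages, groups values into class intervals and prints a crude text "bar chart".

```python
# Hypothetical ages to be grouped into class intervals.
ages = [23, 31, 36, 28, 45, 52, 39, 41, 29, 33]

# Class intervals [lo, hi) and their frequencies.
bins = [(20, 30), (30, 40), (40, 50), (50, 60)]
freq = {b: sum(1 for a in ages if b[0] <= a < b[1]) for b in bins}

for (lo, hi), n in freq.items():
    print(f"{lo}-{hi}: {'*' * n}")  # a crude text bar chart
```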
9. *Collection of Data: The very first challenge in data processing
comes in the collection or acquisition of the correct data for the
input.
The result directly depends on the input data, so it is vital to
collect the correct data to get the desired result.
*Duplication of data: As the data is collected from different data
sources, it often happens that there is duplication in data. The
same entries and entities may be present a number of times during
the data encoding stage. This duplicate data is redundant and
may produce an incorrect result.
Hence, we need to check the data for duplication and
proactively remove the duplicate data.
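The duplicate-removal step described above can be sketched as follows; the records are hypothetical.

```python
# Hypothetical records collected from multiple sources, with duplicates.
raw = [("A101", "Mumbai"), ("A102", "Delhi"),
       ("A101", "Mumbai"), ("A103", "Pune"), ("A102", "Delhi")]

# Remove duplicates while preserving first-seen order.
seen = set()
deduped = []
for rec in raw:
    if rec not in seen:
        seen.add(rec)
        deduped.append(rec)

print(deduped)  # [('A101', 'Mumbai'), ('A102', 'Delhi'), ('A103', 'Pune')]
```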
10. *Inconsistency of Data: When we collect a huge amount of data,
there is no guarantee that the data would be complete or all the
fields that we need are filled correctly. Moreover, the data may
be ambiguous.
As the input/raw data is heterogeneous in nature and is
collected from autonomous data sources, the data may conflict
with each other.
*Variety of data: The input data, as it is collected from different
sources, can come in many different forms. The data is not limited
to the rows and columns of a relational database.
The data varies from application to application and source to
source. Much of this data is unstructured and cannot fit into a
spreadsheet or a relational database.
*Data Integration: Data integration means to combine the data
from various sources and present it in a unified view.
With the increased variety of data and different formats of
data, the challenge to integrate the data becomes bigger.
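The integration challenge can be sketched in miniature: combining records about the same entities from two hypothetical sources into one unified view, keyed on a shared identifier.

```python
# Two hypothetical sources describing the same respondents,
# keyed on a shared identifier.
survey = {"A101": {"age": 34}, "A102": {"age": 29}}
registry = {"A101": {"city": "Mumbai"}, "A103": {"city": "Pune"}}

# Combine both sources into one unified view per respondent.
unified = {}
for key in survey.keys() | registry.keys():
    merged = {}
    merged.update(survey.get(key, {}))
    merged.update(registry.get(key, {}))
    unified[key] = merged

print(unified["A101"])  # {'age': 34, 'city': 'Mumbai'}
```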
11. *
Data analysis is an aspect of data science that is all about analyzing
data for different kinds of purposes. It involves inspecting, cleaning,
transforming and modeling data to draw useful insights from it.
WHAT ARE THE DIFFERENT TYPES OF DATA ANALYSIS?
*Descriptive analysis
*Exploratory analysis
*Inferential analysis
*Predictive analysis
*Causal analysis
*Mechanistic analysis
12. 1. DESCRIPTIVE ANALYSIS
The goal of descriptive analysis is to describe or summarize a set of data. Here’s
what you need to know:
Descriptive analysis is the very first analysis performed.
It generates simple summaries about samples and measurements.
It involves common, descriptive statistics like measures of central tendency,
variability, frequency, and position.
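The descriptive statistics named above can be computed directly with Python's standard library; the sample is hypothetical.

```python
import statistics

# Hypothetical sample of measurements.
sample = [12, 15, 11, 15, 18, 14, 15, 20]

# Measures of central tendency.
print(statistics.mean(sample))    # 15
print(statistics.median(sample))  # 15.0
print(statistics.mode(sample))    # 15

# A measure of variability (sample standard deviation).
print(round(statistics.stdev(sample), 2))  # 2.93
```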
2. EXPLORATORY ANALYSIS (EDA)
Exploratory analysis involves examining or exploring data and finding
relationships between variables that were previously unknown. Here’s what you
need to know:
EDA helps you discover relationships between measures in your data, but
these correlations are not evidence of causation, as the phrase
“Correlation doesn’t imply causation” reminds us.
It’s useful for discovering new connections and forming hypotheses. It drives
design planning and data collection.
13. 3. INFERENTIAL ANALYSIS
Inferential analysis involves using a small sample of data to infer information about a
larger population of data.
The goal of statistical modeling itself is all about using a small amount of information
to extrapolate and generalize information to a larger group. Here’s what you need to
know:
Inferential analysis involves using sample data that is representative of a
population and attaching a measure of uncertainty (such as a standard error)
to your estimation.
The accuracy of inference depends heavily on your sampling scheme: if the
sample isn’t representative of the population, the generalization will be
inaccurate.
4. PREDICTIVE ANALYSIS
Predictive analysis involves using historical or current data to find patterns and make
predictions about the future. Here’s what you need to know:
The accuracy of the predictions depends on the input variables.
Accuracy also depends on the types of models. A linear model might work well in some
cases, and in other cases it might not.
Using a variable to predict another one doesn’t denote a causal relationship.
14. 5. CAUSAL ANALYSIS
Causal analysis looks at the cause-and-effect relationships between variables
and is focused on finding the cause of a correlation. Here’s what you need to know:
To find the cause, you have to question whether the observed correlations driving
your conclusion are valid. Just looking at the surface data won’t help you discover
the hidden mechanisms underlying the correlations.
Causal analysis is applied in randomized studies focused on identifying causation.
6. MECHANISTIC ANALYSIS
Mechanistic analysis is used to understand exact changes in variables that lead to
other changes in other variables. Here’s what you need to know:
It’s applied in the physical or engineering sciences: situations that require
high precision and leave little room for error, where the only noise in the
data is measurement error.
It can also be used to understand a biological or behavioral process, the
pathophysiology of a disease, or the mechanism of action of an intervention.
15. Descriptive analysis summarizes the data at hand and presents
your data in a comprehensible way.
Exploratory data analysis helps you discover correlations and
relationships between variables in your data.
Inferential analysis is for generalizing from a smaller sample of
data to a larger population.
Predictive analysis helps you make predictions about the future
with data.
Causal analysis emphasizes finding the cause of a correlation
between variables.
Mechanistic analysis is for measuring the exact changes in
variables that lead to other changes in other variables.
16. *
*A hypothesis is an assumption that is made based on some evidence.
*This is the initial point of any investigation that translates the research
questions into predictions.
*It includes components like variables, population and the relation
between the variables.
*A research hypothesis is a hypothesis that is used to test the relationship
between two or more variables.
Sources of Hypothesis
Following are the sources of hypothesis:
*The resemblance between phenomena.
*Observations from past studies, present-day experiences and from the
competitors.
*Scientific theories.
*General patterns that influence the thinking process of people.
17. Characteristics of Hypothesis
Following are the characteristics of the hypothesis:
1. The hypothesis should be clear and precise to consider it to be reliable.
2. If the hypothesis is a relational hypothesis, then it should be stating the relationship between
variables.
3. The hypothesis must be specific and should have scope for conducting more tests.
4. The way of explanation of the hypothesis must be very simple and it should also be understood
that the simplicity of the hypothesis is not related to its significance.
Examples of Hypothesis
Following are the examples of hypotheses based on their types:
1. Consumption of sugary drinks every day leads to obesity is an example of a simple hypothesis.
2. All lilies have the same number of petals is an example of a null hypothesis.
3. If a person gets 7 hours of sleep, then he will feel less fatigued than if he sleeps less. This is an
example of a directional hypothesis.
18. Types of Hypothesis
There are six forms of hypothesis and they are:
1. Simple hypothesis
2. Complex hypothesis
3. Directional hypothesis
4. Non-directional hypothesis
5. Null hypothesis
6. Associative and causal hypothesis
1. Simple Hypothesis
It shows a relationship between one dependent variable and a single independent variable. For
example – If you eat more vegetables, you will lose weight faster. Here, eating more vegetables
is an independent variable, while losing weight is the dependent variable.
2. Complex Hypothesis
It shows the relationship between two or more dependent variables and two or more
independent variables. For example, eating more vegetables and fruits leads to weight loss, glowing
skin, and a reduced risk of many diseases such as heart disease.
19. 3. Directional Hypothesis
It predicts the direction of the expected relationship between the variables, reflecting the
researcher’s commitment to a particular outcome. For example, children aged four years who eat
proper food over a five-year period have higher IQ levels than children who do not have a proper
meal. This shows both the effect and the direction of the effect.
4. Non-directional Hypothesis
It is used when there is no theory involved. It is a statement that a relationship exists
between two variables, without predicting the exact nature (direction) of the relationship.
5. Null Hypothesis
It provides a statement which is contrary to the research hypothesis. It is a negative statement
in which no relationship exists between the independent and dependent variables. It is denoted
by “H0”.
6. Associative and Causal Hypothesis
Associative hypothesis occurs when there is a change in one variable resulting in a change in
the other variable. Whereas, the causal hypothesis proposes a cause and effect interaction
between two or more variables.
20. *
Hypothesis testing is a systematic procedure for deciding whether the
results of a research study support a particular theory which applies to a
population. Hypothesis testing uses sample data to evaluate a hypothesis
about a population.
Hypothesis testing in statistics refers to analyzing an assumption about a
population parameter. It is used to make an educated guess about an
assumption using statistics. With the use of sample data, hypothesis testing
makes an assumption about how true the assumption is for the entire
population from where the sample is being taken.
For example, you might implement protocols for performing intubation on
pediatric patients in the pre-hospital setting.
To evaluate whether these protocols were successful in improving
intubation rates, you could measure the intubation rate over time in one
group randomly assigned to training in the new protocols, and compare this
to the intubation rate over time in another control group that did not
receive training in the new protocols.
21. Five Steps in Hypothesis Testing:
1.Specify the Null Hypothesis
2.Specify the Alternative Hypothesis
3.Set the Significance Level (α)
4.Calculate the Test Statistic and Corresponding P-Value
5.Draw a Conclusion
22. Step 1: Specify the Null Hypothesis
The null hypothesis (H0) is a statement of no effect, relationship, or difference
between two or more groups or factors. In research studies, a researcher is usually
interested in disproving the null hypothesis.
Examples:
There is no difference in intubation rates across ages 0 to 5 years.
The intervention and control groups have the same survival rate (or, the intervention
does not improve survival rate).
Step 2: Specify the Alternative Hypothesis
The alternative hypothesis (H1) is the statement that there is an effect or
difference. This is usually the hypothesis the researcher is interested in proving. The
alternative hypothesis can be one-sided (only provides one direction, e.g., lower) or
two-sided.
We often use two-sided tests even when our true hypothesis is one-sided because it
requires more evidence against the null hypothesis to accept the alternative
hypothesis.
Examples: The intubation success rate differs with the age of the patient being
treated (two-sided).
The time to resuscitation from cardiac arrest is lower for the intervention group than
for the control (one-sided).
23. Step 3: Set the Significance Level (a)
The significance level (denoted by the Greek letter alpha, α) is generally set at
0.05. This means that there is a 5% chance that you will reject your null
hypothesis when it is actually true.
The smaller the significance level, the greater the burden of proof needed to reject
the null hypothesis, or in other words, to support the alternative hypothesis.
Step 4: Calculate the Test Statistic and Corresponding P-Value
In another section we present some basic test statistics to evaluate a hypothesis.
Hypothesis testing generally uses a test statistic that compares groups or examines
associations between variables.
When describing a single sample without establishing relationships between
variables, a confidence interval is commonly used.
The p-value describes the probability of obtaining a sample statistic as or more
extreme by chance alone if your null hypothesis is true.
This p-value is determined based on the result of your test statistic. Your conclusions
about the hypothesis are based on your p-value and your significance level.
24. Step 5: Drawing a Conclusion
P-value <= significance level (a) => Reject your null hypothesis in favor of your
alternative hypothesis. Your result is statistically significant.
P-value > significance level (a) => Fail to reject your null hypothesis. Your result
is not statistically significant.
Hypothesis testing is not set up so that you can absolutely prove a null
hypothesis. Therefore, when you do not find evidence against the null hypothesis,
you fail to reject the null hypothesis. When you do find strong enough evidence
against the null hypothesis, you reject the null hypothesis.
Your conclusions also translate into a statement about your alternative
hypothesis.
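The decision rule in Step 5 can be sketched as a small helper function; the p-values passed in are hypothetical, for illustration.

```python
# Step 5 as code: compare the p-value with the significance level.
def draw_conclusion(p_value, alpha=0.05):
    if p_value <= alpha:
        return "reject the null hypothesis (statistically significant)"
    return "fail to reject the null hypothesis (not significant)"

print(draw_conclusion(0.03))  # reject the null hypothesis (statistically significant)
print(draw_conclusion(0.20))  # fail to reject the null hypothesis (not significant)
```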
25. *
*In statistics, a Type I error is a false positive conclusion, while a Type II
error is a false negative conclusion.
*Making a statistical decision always involves uncertainties, so the risks of
making these errors are unavoidable in hypothesis testing.
*The probability of making a Type I error is the significance level, or alpha
(α), while the probability of making a Type II error is beta (β). These risks
can be minimized through careful planning in your study design.
*Example (Type I vs. Type II error): You decide to get tested for COVID-19
based on mild symptoms. There are two errors that could potentially occur:
*Type I error (false positive): the test result says you have coronavirus, but
you actually don’t.
*Type II error (false negative): the test result says you don’t have
coronavirus, but you actually do.
27. Type I Error
A Type I error occurs when the null hypothesis (H0) of an experiment is true but
is nevertheless rejected. It amounts to asserting something that is not present:
a false hit.
A Type I error is often called a false positive (a result indicating that a given
condition is present when it is actually absent). In the classic illustration, a
person may see a bear when there is none (raising a false alarm), where the null
hypothesis (H0) contains the statement: “There is no bear”.
The Type I error rate, or significance level, is the probability of rejecting the
null hypothesis given that it is true. It is represented by the Greek letter α
(alpha) and is also known as the alpha level.
Usually, the significance level, or probability of a Type I error, is set to 0.05 (5%),
meaning it is considered acceptable to have a 5% probability of incorrectly rejecting
the null hypothesis.
28. Type II Error
A Type II error occurs when the null hypothesis is false but mistakenly fails to be
rejected. It amounts to failing to detect something that is present: a miss.
A Type II error is also known as a false negative (a real effect that the test fails
to detect), in an experiment checking for a condition with an outcome of true or false.
A Type II error occurs when a true alternative hypothesis is not acknowledged. In
other words, the investigator may fail to spot the bear when a bear is in fact present
(and hence fail to raise the alarm).
Again, the null hypothesis H0 consists of the statement “There is no bear”; failing to
notice the bear when it is indeed present is a Type II error on the part of the
investigator. The bear either exists or does not exist within the given circumstances;
the question is whether it is correctly identified, either missing it when it is
present (a Type II error) or detecting it when it is not (a Type I error).
The rate of the Type II error is represented by the Greek letter β (beta) and is
linked to the power of a test (which equals 1 − β).
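As a hedged sketch of the Type I error rate, the simulation below repeatedly runs a two-sample z test when H0 is really true (both groups drawn from the same distribution); at α = 0.05 it should falsely reject roughly 5% of the time. All numbers are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(42)
alpha, n, trials = 0.05, 30, 2000
rejections = 0

for _ in range(trials):
    # H0 is true here: both samples come from the same N(0, 1) population.
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    z = (mean(a) - mean(b)) / sqrt(2 / n)      # known sigma = 1
    p = 2 * (1 - NormalDist().cdf(abs(z)))     # two-sided p-value
    if p <= alpha:
        rejections += 1  # a false positive: a Type I error

print(rejections / trials)  # should be close to alpha (about 0.05)
```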
29. Chi-Square Test
A chi-squared test (symbolically represented as χ2) is a data analysis performed on
observations of a random set of variables, usually comparing two statistical data sets.
This test was introduced by Karl Pearson in 1900 for categorical data analysis and
distribution, so it is also referred to as Pearson’s chi-squared test.
The chi-square test is used to estimate how likely the observations that were made
would be, under the assumption that the null hypothesis is true.
A hypothesis is a consideration that a given condition or statement might be true,
which we can then test.
Chi-squared statistics are usually constructed from a sum of squared errors, or
through the sample variance.
Finding the P-Value
P stands for probability. In statistics, the chi-square statistic is used to
calculate the p-value. The different values of p indicate different interpretations
of the hypothesis:
P ≤ 0.05: reject the null hypothesis
P > 0.05: fail to reject the null hypothesis
30. Formula
The chi-squared test is done to check if there is any difference between the
observed value and expected value. The formula for chi-square can be written as;
χ² = ∑ (Oi − Ei)² / Ei
where Oi is the observed value and Ei is the expected value.
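Applying the formula above to hypothetical observed and expected counts (in a real test the statistic would be compared with a critical value from a chi-square table, about 7.815 for df = 3 at the 0.05 level):

```python
# Hypothetical observed and expected frequencies for 4 categories.
observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]

# Chi-square statistic: sum of (Oi - Ei)^2 / Ei over all categories.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 2))  # 12.32
```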
31. *An F-test is any statistical test in which the test statistic has an F-
distribution under the null hypothesis. It is most often used when
comparing statistical models that have been fitted to a data set, in
order to identify the model that best fits the population from which
the data were sampled.
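One common use of the F statistic is comparing two sample variances: F is the ratio of the larger to the smaller variance. The samples below are hypothetical; a real test would compare F with a critical value from an F table.

```python
from statistics import variance

# Hypothetical samples from two groups.
x = [14, 16, 15, 18, 13, 17]
y = [22, 30, 19, 27, 35, 25]

# Variance-ratio F statistic: larger sample variance over smaller.
f_stat = max(variance(x), variance(y)) / min(variance(x), variance(y))
print(round(f_stat, 2))  # 9.33
```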
33. T- Test
A t test is a statistical test that is used to compare the means of two groups.
It is often used in hypothesis testing to determine whether a process or treatment
actually has an effect on the population of interest, or whether two groups are
different from one another.
When to use a t test?
A t test can only be used when comparing the means of two groups (a.k.a. pairwise
comparison). If you want to compare more than two groups, or if you want to do
multiple pairwise comparisons, use an ANOVA test or a post-hoc test.
The t test is a parametric test of difference, meaning that it makes the same
assumptions about your data as other parametric tests.
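A hedged sketch of an independent two-sample t test, computed by hand with the pooled-variance formula on hypothetical treatment and control scores (with df = n1 + n2 − 2 = 10, the two-sided critical value at the 0.05 level is about 2.228):

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical scores for a treatment group and a control group.
treatment = [24, 27, 31, 29, 25, 28]
control = [21, 22, 25, 20, 23, 24]

n1, n2 = len(treatment), len(control)

# Pooled variance, then the t statistic for the difference in means.
sp2 = ((n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
t_stat = (mean(treatment) - mean(control)) / sqrt(sp2 * (1 / n1 + 1 / n2))
print(round(t_stat, 2))  # 3.71
```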
35. Z Test
Z test is a statistical test that is conducted on data that approximately follows a
normal distribution.
The z test can be performed on one sample, two samples, or on proportions for
hypothesis testing.
It checks whether the means of two large samples are different when the
population variance is known, provided the data follows a normal distribution.
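A hedged sketch of a one-sample z test on hypothetical scores, assuming the population standard deviation is known (σ = 15) and testing H0: μ = 100. The sample is kept small for readability; in practice a z test assumes a large sample or a known σ.

```python
from math import sqrt
from statistics import NormalDist, mean

# Hypothetical sample; sigma is assumed known for the population.
sample = [112, 108, 95, 104, 118, 99, 107, 110, 103, 116]
mu0, sigma = 100, 15

# z statistic and two-sided p-value under the standard normal.
z = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(z, 2))     # 1.52
print(p_value > 0.05)  # True: fail to reject H0 at the 0.05 level
```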
36. Z Test vs. T-Test
Z Test:
*A statistical test used to check if the means of two data sets are different
when the population variance is known.
*The sample size is greater than or equal to 30.
*The data follows a normal distribution.
T-Test:
*A statistical test used to check if the means of two data sets are different
when the population variance is not known.
*The sample size is less than 30.
*The data follows a Student’s t-distribution.