2. PROCESSING OPERATIONS
• Data processing is a crucial stage in research. After collecting data from the
field, the researcher has to process and analyze it in order to arrive at
conclusions which may confirm or invalidate the hypothesis formulated at the
beginning of the research work.
• The processing of data includes editing, coding, classification and tabulation. The
collected data should be organized in such a way that tables and charts can be
prepared for presentation. The processing of data is necessary because the
collected data must be examined, and errors and mistakes rectified, so that no
difficulty is experienced at the stage of data analysis. The steps involved in
processing data are editing, coding, classification and tabulation.
3. PROBLEMS IN PROCESSING
• The most important problem in data processing is making sense of the available data.
When an existing database must be explored with only a poor description of the
information stored inside, the complexity of the problem is significant. Information
in a relational database, by contrast, is basically well organized.
4. MAJOR DATA PROCESSING CHALLENGES
FACED BY DATA SCIENTISTS
• Ensuring that the Right Data is Being Collected
• Collecting Data from Multiple Sources
• Dealing with Unstructured Data
• Ensure Data is Stored Securely and Efficiently
• Distributed and Parallel Processing Infrastructure
5. TYPES OF ANALYSIS
• Different Types of Data Analysis
• Descriptive analysis.
• Diagnostic analysis.
• Exploratory analysis.
• Inferential analysis.
• Predictive analysis.
• Causal analysis.
• Mechanistic analysis.
• Prescriptive analysis.
6. DESCRIPTIVE ANALYSIS
• Descriptive analysis is an important phase in data exploration that involves
summarizing and describing the primary properties of a dataset. It provides vital
insights into the data's frequency distribution, central tendency, dispersion, and
position.
• Descriptive analytics can help to identify the areas of strength and weakness in an
organization. Examples of metrics used in descriptive analytics include year-over-
year pricing changes, month-over-month sales growth, the number of users, or the
total revenue per subscriber.
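As a minimal sketch, the descriptive metrics above (central tendency, dispersion, month-over-month growth) can be computed with Python's standard library; the revenue figures here are illustrative, not taken from the text:

```python
import statistics

# Hypothetical monthly revenue figures (illustrative data only)
monthly_revenue = [120, 135, 128, 150, 142, 160]

# Central tendency
mean = statistics.mean(monthly_revenue)
median = statistics.median(monthly_revenue)

# Dispersion
stdev = statistics.stdev(monthly_revenue)

# Month-over-month growth, one of the metrics mentioned above
mom_growth = [(b - a) / a for a, b in zip(monthly_revenue, monthly_revenue[1:])]

print(round(mean, 2), median, round(stdev, 2))
```

These summaries describe what the data looks like without drawing any conclusions beyond it, which is the defining trait of descriptive analysis.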
7. DIAGNOSTIC ANALYSIS
• Diagnostic analytics is the process of using data to determine the causes of trends
and correlations between variables. It can be viewed as a logical next step after
using descriptive analytics to identify trends. Diagnostic analysis can be done
manually, using an algorithm, or with statistical software (such as Microsoft Excel).
• There are several concepts to understand before diving into diagnostic analytics:
hypothesis testing, the difference between correlation and causation, and diagnostic
regression analysis.
8. EXPLORATORY ANALYSIS
• Exploratory data analysis (EDA) is used by data scientists to analyze and
investigate data sets and summarize their main characteristics, often employing
data visualization methods.
• Example: Using EDA, you are open to the fact that any number of people might buy
any number of different types of shoes. You visualize the data using exploratory
data analysis to find that most customers buy 1-3 different types of shoes. Sneakers,
dress shoes, and sandals seem to be the most popular ones.
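A first EDA step for the shoe example above is a simple frequency table, which is often built before any visualization; this sketch uses made-up purchase records:

```python
from collections import Counter

# Hypothetical purchase records: shoe type bought per transaction (illustrative)
purchases = ["sneakers", "sandals", "sneakers", "dress shoes",
             "sneakers", "sandals", "boots", "sneakers"]

# Frequency table: which shoe types are most popular?
counts = Counter(purchases)
print(counts.most_common())
```

In practice the same counts would then be plotted (e.g. as a bar chart) to surface the pattern visually.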
9. INFERENTIAL ANALYSIS
• Inferential statistics is the practice of using sampled data to draw conclusions or
make predictions about a larger population.
• Inferential statistics are where an “inference” is made based on a smaller sample of
data from the population. Example: You survey 100 people who go to Notre Dame
about their marital status. 10 of the 100 say they are married. You infer
that 10% of the students who go to ND are married.
10. PREDICTIVE ANALYSIS
• Predictive analytics is the process of using data to forecast future outcomes. The
process uses data analysis, machine learning, artificial intelligence, and statistical
models to find patterns that might predict future behavior.
• The goal of predictive analytics is to make predictions about future events, then use
those predictions to improve decision-making. Predictive analytics is used in a
variety of industries including finance, healthcare, marketing, and retail.
• Predictive analytics models are designed to assess historical data, discover patterns,
observe trends, and use that information to predict future trends. Popular predictive
analytics models include classification, clustering, and time series models.
11. CAUSAL ANALYSIS
• Causal research, also known as explanatory research or causal-comparative research, identifies the
extent and nature of cause-and-effect relationships between two or more variables. It's often used by
companies to determine the impact of changes in products, features, or services on critical
company metrics.
• Causal analysis is a process for identifying and addressing the causes and effects of a challenge or
problem. Instead of addressing the symptoms of a problem, causal analysis helps identify the root causes
so those symptoms become less impactful.
• There are two main methods of causal research: experiments and quasi-experiments. Experiments are
the most rigorous and valid way of establishing causality, as they involve randomly assigning the
participants to different groups or conditions, and controlling for any confounding factors that might
affect the outcome.
• For example, a company implements a new individual marketing strategy for a small group of customers
and sees a measurable increase in monthly subscriptions. After receiving identical results from several
groups, it concludes that the one-to-one marketing strategy has the intended causal effect.
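The marketing experiment above can be sketched as a difference in group means; with random assignment, that difference is the estimated causal effect. The subscription counts here are illustrative, and a real analysis would also test whether the difference is statistically significant:

```python
import statistics

# Hypothetical experiment (illustrative numbers): monthly subscriptions per
# customer, with customers randomly assigned to treatment vs. control
treatment = [12, 15, 14, 16, 13]   # received the one-to-one strategy
control = [10, 11, 9, 12, 10]      # received the standard strategy

# Randomization is what licenses reading this difference causally
effect = statistics.mean(treatment) - statistics.mean(control)
print(round(effect, 1))
```
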
12. MECHANISTIC ANALYSIS
• Mechanistic analysis is used to understand the exact changes in variables that lead to
changes in other variables. It is applied in the physical and engineering sciences, in
situations that require high precision and leave little room for error, where the only
noise in the data is measurement error.
• Example: Studies to understand a biological or behavioural process that involves an
intervention with a known mechanism of action; these studies could be performed in
normal human subjects.
13. PRESCRIPTIVE ANALYSIS
• Prescriptive analytics is a statistical method that focuses on finding the ideal way
forward or action necessary for a particular scenario, based on data. Prescriptive
analytics uses both descriptive and predictive analytics but the focus here remains
on actionable insights rather than data monitoring.
• Example: For instance, if you regularly watch shoe review videos on YouTube, the
platform's algorithm will likely analyze that data and recommend you watch more of
the same type of video or similar content you may find interesting.
14. EDITING
• Editing means to rectify, to set in order, to correct, or to establish sequence.
Editing is the process of examining the data collected in a questionnaire or interview
schedule to detect errors and omissions and to correct them where possible. When
data collection is over, a final and thorough check is made before data
processing. It is better if the collected data are verified even before the analysis
is carried out; editing is the first step in this process. Editing is done to ensure that
the collected data are accurate, consistent with other facts gathered, uniformly
entered, and as complete as possible. For example, imagine how the news would appear
if we received the newspaper unedited. Similarly, an unedited film would have no
sequence of events, meaning the story could not be understood at all.
15. TYPES OF EDITING
• Field editing- Field editing is the process of completing information recorded in abbreviated or
illegible form at the time of recording the respondent’s response. This sort of editing should be carried out
as soon as possible after the interview. In field editing, the completeness of the forms should be checked in
person. The investigator may have forgotten to record some information, or may have recorded it in
incomplete form using abbreviations; in such cases it should be completed.
• Central editing / Centralized editing- Central editing is done on return to the office after all forms or
schedules have been completed. This sort of editing is performed by a single editor or by a team of editors. The editors
are free to correct obvious errors, such as an entry in the wrong place or an entry recorded in the wrong
units. At the central level, editors must correct the various mistakes of the investigator. In case of a
gap in the answers, the editor is required to decide on the proper answer to fill the gap.
This can be done by reviewing the other information in the questionnaire. Sometimes, in spite of all efforts,
the correct answer cannot be determined; in that case it is safer to strike out the wrong answer. All such wrong
answers should be dropped by the editors.
16. CODING
• Coding is the process of organizing the data or responses into classes or categories
and assigning numerical or other symbols to responses according to the class or
category into which they fall. Hence coding is considered a classification process.
Coding is necessary for efficient analysis: it compartmentalizes many replies
effectively into a small number of classes which contain the critical
information required for analysis. In the process of coding, the study of the answers is the
first step, and the last step is the transfer of the information from the schedule to a
separate sheet called the transcription sheet.
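As a minimal sketch of coding, free-text responses can be mapped to numeric symbols via a code book, much as a coder would do on a transcription sheet; the categories and codes here are illustrative assumptions:

```python
# Hypothetical code book: one numeric code per response category (illustrative)
code_book = {"married": 1, "single": 2, "divorced": 3, "widowed": 4}

# Raw responses as collected on the schedule
responses = ["single", "married", "single", "widowed"]

# Replace each response with its numeric code for efficient analysis
coded = [code_book[r] for r in responses]
print(coded)
```

Once coded, the responses can be tabulated and analyzed numerically instead of as free text.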
17. HYPOTHESIS TESTING-CHI-SQUARE TEST
• We use a Chi-square test for hypothesis tests about whether your data is as expected.
The basic idea behind the test is to compare the observed values in your data to the
expected values that you would see if the null hypothesis is true.
• Formula for the chi-square hypothesis test: compute the chi-square statistic as
Χ² = Σ [ (O_i – E_i)² / E_i ], where O_i is the observed frequency and E_i is the
expected frequency, then compare the calculated statistic with the critical value
from the chi-square distribution to draw a conclusion.
• A chi-square test is a statistical test used to compare observed results with expected
results. The purpose of this test is to determine if a difference between observed data
and expected data is due to chance, or if it is due to a relationship between the variables
you are studying.
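The chi-square statistic above can be sketched in a few lines; the observed counts are an illustrative assumption (a die rolled 60 times, with 10 of each face expected under the null hypothesis):

```python
# Hypothetical observed frequencies for a die rolled 60 times (illustrative);
# under the null hypothesis of a fair die, each face is expected 10 times
observed = [8, 12, 9, 11, 10, 10]
expected = [10, 10, 10, 10, 10, 10]

# Chi-square statistic: sum of (O_i - E_i)^2 / E_i over all categories
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 2))
```

The resulting statistic would then be compared against the chi-square critical value for 5 degrees of freedom (about 11.07 at the 5% level); a value well below it means the differences are consistent with chance.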
19. Z TEST
• A z-test is a statistical test to determine whether two population means are different
or to compare one mean to a hypothesized value when the variances are known and
the sample size is large. A z-test is a hypothesis test for data that follows a normal
distribution.
• The formula for calculating the Z Test statistic is Z = (x̄ - µ) / (σ / √n). Here x̄ is the
sample mean, µ is the population mean, σ is the population standard deviation, and
n represents the sample size.
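The z formula above translates directly into code; the sample mean, hypothesized population mean, known σ, and n below are illustrative assumptions:

```python
import math

# Hypothetical setup (illustrative): test whether a sample mean of 52
# differs from a claimed population mean of 50, with known sigma and large n
x_bar, mu, sigma, n = 52, 50, 8, 64

# Z = (x̄ - µ) / (σ / √n), matching the formula above
z = (x_bar - mu) / (sigma / math.sqrt(n))
print(z)
```

Since |z| = 2.0 exceeds the 1.96 critical value for a two-tailed test at the 5% level, the null hypothesis that the population mean is 50 would be rejected in this sketch.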