2. Introduction
• Statistics is a way of thinking that can lead to better decisions.
• It is the science of gathering, presenting, analyzing, and interpreting
data.
• It uses mathematics and probability
• Statistics requires analytical skills and is an important part of
business education.
• The DCOVA framework (Define, Collect, Organize, Visualize, Analyze) guides your application of statistics.
• Modern-day information technology enables businesses to apply
statistics in new ways to solve business problems utilizing lots of
data and analytical tools.
A.K.Singh, IIM Nagpur
3. Introduction (Contd..)
• Business statistics provides a formal basis to:
• Summarize and visualize business data.
• Reach conclusions from business data.
• Make reliable predictions about business activities.
• Improve business processes.
4. Statistics in Business (But not limited to…)
• Operations: Supply chain performance and benchmarking
• Accounting: Auditing and cost estimation, Financing
• Economics: Local, regional, national, and international economic
performance
• Finance: Investments and portfolio management
• Human Resource Management: Compensation and Performance
Measurement
• Management Information Systems: Performance of systems
• Marketing: Market analysis and Consumer research
• International Business: International Market and demographic
analysis
5. Some Basic Definitions
Variable
• A characteristic of an item or individual.
Data
• The set of individual values associated with one or more variables.
Statistic
• A value that summarizes the data of a particular variable.
Descriptive Statistics
• The methods that primarily help summarize and present data.
6. Some Basic Definitions (Contd…)
Inferential Statistics
• Methods that use data collected from a small group to reach conclusions
about a larger group.
Population
• The whole collection of all persons, objects, or items under study.
Census
• Gathering data from the entire population.
Sample
• Gathering data on a subset of the population.
• Information about the sample is used to infer about the population.
8. Parameter vs. Statistic
• Parameter — descriptive measure of the population
• Usually represented by Greek letters
• Statistic — descriptive measure of a sample
• Usually represented by Roman letters
• $\mu$ denotes the population mean (a parameter)
• $\sigma^2$ denotes the population variance
• $\sigma$ denotes the population standard deviation
• $\bar{x}$ denotes the sample mean
• $s^2$ denotes the sample variance
• $s$ denotes the sample standard deviation
9. Process of Inferential Statistics
1. Population (parameter $\mu$)
2. Select a random sample
3. Sample (statistic $\bar{x}$)
4. Use $\bar{x}$ to estimate $\mu$
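As a minimal sketch of the four steps above, the Python below draws a simple random sample from a made-up population and uses the sample mean to estimate the population mean. The population values, the sample size of 30, and the helper name `estimate_population_mean` are illustrative assumptions, not part of the slides.

```python
import random
import statistics


def estimate_population_mean(population, sample_size, seed=0):
    """Draw a simple random sample and return (sample, sample mean)."""
    rng = random.Random(seed)                     # fixed seed so the run is repeatable
    sample = rng.sample(population, sample_size)  # step 2: select a random sample
    x_bar = statistics.mean(sample)               # step 3: compute the statistic
    return sample, x_bar                          # step 4: x_bar is the estimate of mu


# Step 1: a hypothetical population of 1,000 order values (made-up numbers).
population = [100 + (i % 50) for i in range(1000)]
mu = statistics.mean(population)  # the parameter; in practice it is unknown

sample, x_bar = estimate_population_mean(population, sample_size=30)
print(f"population mean mu = {mu:.2f}, sample mean x_bar = {x_bar:.2f}")
```

In real work the population mean is not available, which is why the sample statistic is used; it is computed here only to show how close the estimate lands.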
10. Uncertainty in Business
• Inferences about parameters are made under conditions of uncertainty
(which are always present in statistics).
• Uncertainty can be caused by:
• Randomness in the selection of a sample
• Lack of knowledge about the source of the inferences
• Changes in conditions
11. Statistics in Business
• Probability is used in statistics (will be discussed in detail later
in the course)
• To estimate the level of confidence in a confidence interval
• To calculate the p-value in hypothesis testing
12. Classifying Variables By Type
Categorical (qualitative) variables take categories as their values
(data), such as “yes, no” or “blue, brown, green” or “Easy, Normal,
Tough”.
Numerical (quantitative) variables have values (data) that represent a
counted or measured quantity.
Discrete variables arise from a counting process.
Continuous variables arise from a measuring process.
13. Examples of Types of Variables
Question Responses Variable Type
Do you have a Facebook profile?
How many WhatsApp messages have you sent in the past hour?
How long did the mobile app update take to download?
What is the colour of your eyes?
What is your weight?
In which class do you study?
In which section are you?
How do you rate the new Netflix series?
14. Levels of Data Measurement: Nominal
In nominal measurement the values just "name" the attribute
uniquely. Numbers are used only to classify or categorize.
• No ordering of the cases is implied.
• Gender
• boys vs. girls or
• males vs. females
• Religion
• Hindu
• Muslim
• Sikh
• Christian
• Jain etc.
• Employment Classification
• 1 for Educator
• 2 for Construction Worker
• 3 for Manufacturing Worker
15. Levels of Data Measurement: Ordinal
• A variable is measured on an ordinal scale if its values can be ranked.
However, the differences between the numbers are not
comparable.
• For example:
• A gold medal reflects superior performance to a silver or bronze medal in the
Olympics.
• You can’t say a gold and a bronze medal average out to a silver medal,
though.
• Position within an organization
• 1 for President
• 2 for Vice President
• 3 for Plant Manager
• 4 for Department Supervisor
• 5 for Employee
• Preference scales are typically ordinal
• How much do you like this cereal?
• Like it a lot, somewhat like it, neutral, somewhat dislike it, dislike it a lot.
16. Levels of Data Measurement: Interval
In interval measurement the distance between attributes
does have meaning.
• Numerical data typically fall into this category.
• There is no absolute zero value.
• For example:
• Measuring temperature (in degrees Celsius or Fahrenheit)
• Scales for measurement
17. Levels of Data Measurement: Ratio
• In ratio measurement there is always a meaningful reference point
(an absolute zero).
• This means that you can construct a meaningful fraction
(or ratio) with a ratio variable.
• In applied social research most "count" variables are ratio,
for example, the number of clients in the past six months.
• Height, Weight, and Volume
• Profit and Loss
18. Types of Variables (Summary)
Variables are either categorical or numerical.
• Categorical variables (defined categories) are nominal or ordinal.
• Examples: Marital Status, Political Party, Eye Color
• Ordinal examples (ordered categories): Ratings; Good, Better, Best; Low, Med, High
• Numerical variables are discrete or continuous.
• Discrete examples (counted items): Number of Children, Defects per hour
• Continuous examples (measured characteristics): Weight, Voltage
19. Sources of Data
Primary Sources: The data collector is the one using the data for
analysis:
• Data from a political survey.
• Data collected from an experiment.
• Observed data.
Secondary Sources: The person performing data analysis is not the
data collector:
• Analyzing census data.
• Examining data from print journals or data published on the
internet.
21. Organization of Categorical Data
Categorical data:
• One categorical variable: summary table
• Two or more categorical variables: contingency table
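A small sketch may help show the difference between the two tables. The records below are made-up survey data, and the field meanings (payment method, customer type) are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical survey records: (payment method, customer type). Made-up data.
records = [
    ("card", "new"), ("cash", "returning"), ("card", "returning"),
    ("card", "new"), ("upi", "new"), ("cash", "new"),
]

# Summary table: frequency of each category of ONE categorical variable.
summary_table = Counter(payment for payment, _ in records)

# Contingency table: joint frequencies of TWO categorical variables.
contingency_table = Counter(records)

for category, count in summary_table.most_common():
    share = 100 * count / len(records)
    print(f"{category:<6} {count:>3} {share:5.1f}%")

# Joint count for one cell of the contingency table.
print(contingency_table[("card", "new")])
```

A `Counter` keyed by a single value plays the role of the summary table, while a `Counter` keyed by a pair of values holds the cells of the contingency table.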
22. Organization of Numerical Data
Numerical data can be organized as:
• Ordered array
• Frequency distribution
• Cumulative distribution
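The three ways of organizing numerical data can be sketched in a few lines of Python. The delivery times and the class width of 10 are assumptions chosen for illustration; the slides do not prescribe class boundaries.

```python
from itertools import accumulate

# Hypothetical data: delivery times in minutes (made-up values).
times = [23, 35, 12, 41, 28, 19, 33, 27, 38, 15]

# Ordered array: the values sorted from smallest to largest.
ordered_array = sorted(times)

# Frequency distribution with classes [10, 20), [20, 30), [30, 40), [40, 50).
start, width, n_classes = 10, 10, 4
frequencies = [0] * n_classes
for t in times:
    frequencies[(t - start) // width] += 1

# Cumulative distribution: running percentage of values up to each class.
cumulative_pct = [100 * c / len(times) for c in accumulate(frequencies)]

for i, (f, c) in enumerate(zip(frequencies, cumulative_pct)):
    low = start + i * width
    print(f"{low}-{low + width}: frequency {f}, cumulative {c:.0f}%")
```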
23. Visualization of Categorical Data
Visualizing categorical data:
• Summary table for one variable: bar chart, pie chart, Pareto chart
• Contingency table for two variables: side-by-side bar chart
24. Visualization of Numerical Data
Numerical data, one variable:
• Ordered array: stem-and-leaf display
• Frequency distributions and cumulative distributions: histogram, polygon, ogive
25. Visualization of Numerical Data (Contd..)
Numerical data, two variables:
• Scatter plot
• Time series
26. Organizing Many Variables
• Use a pivot chart.
• It summarizes variables as a multidimensional summary table.
• It allows interactive changing of the level of summarization and
formatting of the variables.
• It lets you interactively “slice” data to summarize subsets of data
that meet specified criteria.
• It can be used to discover possible patterns and relationships in
multidimensional data that simpler tables and charts would fail to
make apparent.
27. Best Practices for Constructing Visualizations
Use the simplest possible visualization.
Include a title & label all axes.
Include a scale for each axis if the chart contains axes.
Begin the scale for a vertical axis at zero & use a constant scale.
Avoid 3D, “exploded”, and similar effects.
Use consistent colorings in charts meant to be compared.
Avoid using uncommon chart types including radar, surface, bubble, cone,
and pyramid charts.
29. Introduction
The central tendency is the extent to which the values of a numerical
variable group around a typical or central value.
The variation is the amount of dispersion or scattering away from a
central value that the values of a numerical variable show.
The shape is the pattern of the distribution of values from the lowest
value to the highest value.
30. Measures of Central Tendency
• Mean
• Average of all the values
• Affected by extreme values (Also called Outliers)
• Median
• In an ordered array, the median is the “middle” number (50% above, 50% below).
• The median position can be determined by the formula (n+1)/2, where n is the number of
values in the data set. The value at that position is the median.
• For a data set with an even number of values, the median is the average of the two middle
values.
• Less sensitive than the mean to extreme values.
• Mode
• Value that occurs most often.
• Not affected by extreme values.
• Used for either numerical or categorical data.
• There may be no mode.
• There may be several modes.
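The behaviour described above can be checked with Python's standard `statistics` module. The data values are made up, with 100 included as a deliberate outlier to show how the mean and the median react.

```python
import statistics

values = [2, 3, 3, 5, 7, 100]  # made-up data; 100 is an outlier

mean = statistics.mean(values)        # (2 + 3 + 3 + 5 + 7 + 100) / 6
median = statistics.median(values)    # even n: average of the two middle values
modes = statistics.multimode(values)  # every most-frequent value, as a list

# Median position from the (n + 1) / 2 rule; 3.5 means "between 3rd and 4th".
median_position = (len(values) + 1) / 2

# The same statistics after dropping the outlier.
trimmed = values[:-1]
mean_without_outlier = statistics.mean(trimmed)
median_without_outlier = statistics.median(trimmed)

print(mean, median, modes, median_position)
print(mean_without_outlier, median_without_outlier)
```

Removing the outlier moves the mean from 20 to 4 but the median only from 4 to 3, which is the sensitivity difference the bullets describe.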
31. Measures of Central Tendency (Contd..)
• Geometric mean: used to measure the rate of change of a variable over time.
The mean is generally used unless extreme values (outliers) exist.
The median is often used when outliers are present, since it is not sensitive to extreme
values. For example, median home prices may be reported for a
region.
In many situations it makes sense to report both the mean and the
median.
32. Measures of Central Tendency: Summary
Central Tendency
• Arithmetic mean: $\bar{X} = \dfrac{\sum_{i=1}^{n} X_i}{n}$
• Median: middle value in the ordered array
• Mode: most frequently observed value
33. Measures of Variation
Measures of variation give information on the spread,
variability, or dispersion of the data values.
It is important to look at the dispersion as well as the
central value for a better understanding of the data.
Measures of variation:
• Range
• Variance
• Standard Deviation
• Coefficient of Variation
34. Measures of Variation (Contd..)
• Range = $X_{\text{largest}} - X_{\text{smallest}}$
• Does not account for how the data are distributed.
• Sensitive to outliers.
• Sample Variance: average (approximately) of the squared deviations of values from
the mean.
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
• Sample Standard Deviation: the square root of the sample variance.
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
• Has the same units as the original data.
• Most commonly used measure of variation.
• Shows variation about the mean.
• For a population, the denominator is the number of values in the population in place of n - 1.
Dividing by n - 1 makes the sample variance an unbiased estimator of the population variance.
(Discussion of unbiased estimators is for advanced courses.)
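To make the n - 1 versus n denominators concrete, here is a short sketch that computes the sample variance by hand and compares it with Python's `statistics` module. The five data values are made up for illustration.

```python
import math
import statistics

data = [4, 8, 6, 2, 10]  # made-up values; their mean is 6

n = len(data)
mean = sum(data) / n
squared_deviations = [(x - mean) ** 2 for x in data]  # 4, 4, 0, 16, 16

# Sample variance and standard deviation: divide by n - 1.
manual_sample_variance = sum(squared_deviations) / (n - 1)
manual_sample_std = math.sqrt(manual_sample_variance)

# Population variance: divide by the number of values.
manual_population_variance = sum(squared_deviations) / n

# The standard library implements the same formulas.
sample_variance = statistics.variance(data)
sample_std = statistics.stdev(data)
population_variance = statistics.pvariance(data)

data_range = max(data) - min(data)
print(data_range, sample_variance, sample_std, population_variance)
```

`statistics.variance` and `statistics.stdev` use the n - 1 denominator, while `statistics.pvariance` and `statistics.pstdev` divide by the number of values.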
35. Measures of Variation (Contd..)
• Coefficient of Variation
• Measures relative variation.
• Always expressed as a percentage (%).
• Shows variation relative to mean.
• Can be used to compare the variability of two or more sets of data measured
in different units.
$$CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$$
36. Measures of Variation: Comparing Coefficients of Variation
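• Stock A:
• Mean price last year = $50.
• Standard deviation = $5.
$$CV_A = \frac{S_A}{\bar{X}_A} \cdot 100\% = \frac{5}{50} \cdot 100\% = 10\%$$
• Stock B:
• Mean price last year = $60.
• Standard deviation = $10.
$$CV_B = \frac{S_B}{\bar{X}_B} \cdot 100\% = \frac{10}{60} \cdot 100\% = 16.67\%$$
• Relative to its mean price, Stock B is more variable than Stock A.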
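The two calculations above can be reproduced with a one-line helper. The function name `coefficient_of_variation` is an illustrative choice; the inputs are the stock figures from the slide.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the coefficient of variation as a percentage."""
    return 100 * std_dev / mean


cv_a = coefficient_of_variation(std_dev=5, mean=50)   # Stock A
cv_b = coefficient_of_variation(std_dev=10, mean=60)  # Stock B

print(f"CV_A = {cv_a:.2f}%, CV_B = {cv_b:.2f}%")
print("Stock B is relatively more variable" if cv_b > cv_a
      else "Stock A is relatively more variable")
```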
37. Shape of a Distribution
• Describes how data are distributed.
• Two useful shape related statistics are:
• Skewness:
• Measures the extent to which data values are not symmetrical.
• Kurtosis:
• Kurtosis measures the peakedness of the curve of the distribution—that
is, how sharply the curve rises approaching the center of the distribution.
38. Shape of a Distribution (Skewness)
• Measures the extent to which data are not symmetrical. A widely used
formula for the coefficient of skewness is 3(Mean - Median) / SD (Pearson's second coefficient of skewness).
• Left-skewed: Mean < Median < Mode; skewness statistic < 0
• Symmetric: Mean = Median = Mode; skewness statistic = 0
• Right-skewed: Mode < Median < Mean; skewness statistic > 0
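The sign convention can be illustrated with a short sketch of the 3(Mean - Median)/SD formula. Using the sample standard deviation for SD is an assumption here, since the slide does not say which version is meant, and the data sets are made up.

```python
import statistics


def pearson_skewness(values):
    """Skewness coefficient 3 * (mean - median) / SD, using the sample SD."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    return 3 * (mean - median) / sd


right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]     # long tail to the right
symmetric = [1, 2, 3, 4, 5]                  # mean equals median
left_skewed = [-v for v in right_skewed]     # mirror image: tail to the left

for name, data in [("right", right_skewed), ("symmetric", symmetric),
                   ("left", left_skewed)]:
    print(f"{name:<10} skewness = {pearson_skewness(data):+.3f}")
```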
39. Shape of a Distribution -- Kurtosis
• It measures how sharply the curve rises approaching the center of the
distribution.
• Sharper peak than bell-shaped: kurtosis > 3
• Bell-shaped: kurtosis = 3
• Flatter than bell-shaped: kurtosis < 3
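The slide gives no formula for kurtosis. One common definition, assumed here, is the fourth central moment divided by the squared second central moment, which equals 3 for a normal (bell-shaped) distribution. The sketch below applies it to two made-up data sets.

```python
def kurtosis(values):
    """Moment-based kurtosis m4 / m2**2 (not the 'excess' version)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2


flat = [1, 2, 3, 4, 5]                        # evenly spread values
peaked = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]    # most values at the center

print(f"flat:   {kurtosis(flat):.2f}")    # below 3
print(f"peaked: {kurtosis(peaked):.2f}")  # above 3
```

Some software reports excess kurtosis (this value minus 3), so a reported 0 there corresponds to the bell-shaped case.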
40. Exploring Numerical Data Using Quartiles
• The five-number summary: smallest value, Q1, median, Q3, largest value.
• Constructing a boxplot.
• The general formula for the position of the P-th percentile is (P/100)*n, where n is
the number of values in the data set.
• If the result is a whole number, then it is the ranked position to use.
• If the result is a fractional half (for example, 2.5), then average the two corresponding
data values.
• The IQR is Q3 – Q1 and measures the spread in the middle 50% of the
data; for this reason it is also called the midspread.
• The IQR is a measure of variability that is not influenced by outliers or
extreme values.
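A five-number summary and the IQR can be sketched with the standard library. Note that `statistics.quantiles` with `method="exclusive"` places quartiles at positions based on n + 1 and interpolates between values, so it is a swapped-in rule rather than the (P/100)*n rule on the slide, and the two can differ on some data sets. The data values are made up.

```python
import statistics


def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest)."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    return ordered[0], q1, median, q3, ordered[-1]


data = [7, 1, 5, 3, 2, 6, 4]  # made-up values

summary = five_number_summary(data)
iqr = summary[3] - summary[1]  # Q3 - Q1: spread of the middle 50%

print("five-number summary:", summary)
print("IQR:", iqr)
```

These five values are exactly what a boxplot draws: the box spans Q1 to Q3 with a line at the median.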
41. The Empirical Rule for Normal Distribution
Vs Chebyshev’s Rule for any other distribution
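• Chebyshev’s Rule: Regardless of how the data are distributed, at least $(1 - 1/k^2) \times 100\%$
of the values will fall within k standard deviations of the mean (for k > 1).
• Empirical Rule (normal curve): about 68% of values fall within $\mu \pm \sigma$; another 13.5% on each side gives about 95% within $\mu \pm 2\sigma$; another 2.35% on each side gives about 99.7% within $\mu \pm 3\sigma$.
Range: Empirical Rule for normal curve / Chebyshev’s Rule for any distribution
• $\mu \pm \sigma$: 68% / not applicable (the rule requires k > 1)
• $\mu \pm 2\sigma$: 95% / at least 75%
• $\mu \pm 3\sigma$: 99.7% / at least 88.89%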
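The Chebyshev column of the table follows directly from the formula, as the sketch below shows. The helper names and the skewed data set are illustrative assumptions; the population standard deviation is used when checking the bound against data.

```python
import statistics


def chebyshev_lower_bound(k):
    """Minimum percentage of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's rule requires k > 1")
    return (1 - 1 / k ** 2) * 100


def share_within(values, k):
    """Actual percentage of values within k population SDs of the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    inside = sum(1 for x in values if abs(x - mean) <= k * sd)
    return 100 * inside / len(values)


skewed = [1, 2, 2, 3, 3, 3, 4, 20]  # made-up, clearly non-normal data

for k in (2, 3):
    bound = chebyshev_lower_bound(k)
    actual = share_within(skewed, k)
    print(f"k={k}: at least {bound:.2f}%, observed {actual:.1f}%")
```

The observed share is always at or above the bound, whatever the shape of the data; the bound is simply less sharp than the Empirical Rule figures that apply to bell-shaped data.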
42. Measures Of The Relationship Between Two Numerical
Variables
• Scatter plots allow you to visually examine the relationship.
• The Covariance {Cov(x,y)}
• The covariance measures the linear relationship between two numerical variables.
• Only concerned with the direction (nature) of the relationship.
• No causal effect is implied.
• Cov > 0: the variables move in the same direction; Cov < 0: they move in opposite directions; Cov = 0: there is no linear relationship (which does not by itself mean the variables are independent).
• Covariance does not show the relative strength of the relationship.
• The Coefficient of Correlation (r)
• Measures the relative strength of the linear relationship between two numerical
variables.
• Varies between -1 and +1: -1 is a perfect negative linear relationship, +1 is a perfect
positive linear relationship, and 0 means no linear relationship.
• The Coefficient of Determination ($r^2$)
• Shows the proportion of variation in y that is explained by all the x variables together.
• Varies between 0 and 1; the higher the value, the more of the variation in y is explained. A high value does not by itself establish a causal relationship.
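The three measures can be computed by hand in a few lines. This sketch uses the sample (n - 1) covariance and a single x variable, in which case the coefficient of determination is simply the square of r; the variable names and data values are made up for illustration.

```python
import math


def covariance(x, y):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Coefficient of correlation r = Cov(x, y) / (S_x * S_y)."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


ad_spend = [1, 2, 3, 4, 5]  # hypothetical x values
sales = [2, 4, 5, 4, 5]     # hypothetical y values

cov_xy = covariance(ad_spend, sales)
r = correlation(ad_spend, sales)
r_squared = r ** 2  # with one x variable, r^2 is the square of r

print(f"Cov = {cov_xy:.2f}, r = {r:.4f}, r^2 = {r_squared:.2f}")
```

The covariance of 1.5 only says the variables move together; r (about 0.77) puts that on a unit-free scale, and r squared (0.6) says 60% of the variation in y is explained by x in this made-up example.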