BS 1 and 2 30th Oct.pptx

BUSINESS STATISTICS
DATA ORGANIZATION, VISUALIZATION &
Description
By
Prof Alok Kumar Singh

Introduction
• Statistics is a way of thinking that can lead to better decisions.
• It is the science of gathering, presenting, analyzing, and interpreting
data.
• It uses mathematics and probability
• Statistics requires analytics skills and is an important part of
business education.
• The DCOVA framework guides your application of statistics.
• Modern-day information technology enables businesses to apply
statistics in new ways to solve business problems utilizing lots of
data and analytical tools.
A.K.Singh, IIM Nagpur

Introduction (Contd..)
• Business statistics provides a formal basis to:
• Summarize and visualize business data.
• Reach conclusions from business data.
• Make reliable predictions about business activities.
• Improve business processes.

Statistics in Business (But not limited to…)
• Operations: Supply chain performance and benchmarking
• Accounting: Auditing, cost estimation, and financing
• Economics: Local, regional, national, and international economic
performance
• Finance: Investments and portfolio management
• Human Resource Management: Compensation and performance
measurement
• Management Information Systems: Performance of systems
• Marketing: Market analysis and consumer research
• International Business: International market and demographic
analysis

Some Basic Definitions
Variable
• A characteristic of an item or individual.
Data
• The set of individual values associated with one or more variables.
Statistic
• A value that summarizes the data of a particular variable.
Descriptive Statistics
• The methods that primarily help summarize and present data.

Some Basic Definitions (Contd…)
Inferential Statistics
• Methods that use data collected from a small group to reach conclusions
about a larger group.
Population
• The whole collection of all persons, objects, or items under study.
Census
• Gathering data from the entire population.
Sample
• A subset of the population from which data are gathered.
• Information about the sample is used to infer about the population.

Population vs Sample
[Figure: a population and a sample drawn from it]

Parameter vs. Statistic
• Parameter: descriptive measure of the population
• Usually represented by Greek letters
• μ denotes the population mean (a parameter)
• σ² denotes the population variance
• σ denotes the population standard deviation
• Statistic: descriptive measure of a sample
• Usually represented by Roman letters
• x̄ denotes the sample mean
• s² denotes the sample variance
• s denotes the sample standard deviation

Process of Inferential Statistics
1. Population (parameter μ)
2. Select a random sample
3. Sample (statistic x̄)
4. Use x̄ to estimate μ
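The four steps of this process can be sketched in Python. This is only an illustration under stated assumptions: the population is simulated (hypothetical sales figures), and the names `population`, `sample`, `mu`, and `x_bar` are introduced here, not taken from the slides.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is repeatable

# 1. Population (parameter mu): 10,000 simulated values (hypothetical data)
population = [random.gauss(500, 80) for _ in range(10_000)]
mu = mean(population)

# 2. Select a random sample (drawn without replacement)
sample = random.sample(population, 50)

# 3. Sample (statistic x_bar)
x_bar = mean(sample)

# 4. Use x_bar to estimate mu
print(f"parameter mu = {mu:.1f}, estimate x_bar = {x_bar:.1f}")
```

In practice μ is unknown, which is exactly why x̄ is used; the simulated population is known here only so the estimate can be compared with the value it targets.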

Uncertainty in Business
• Inferences about parameters are made under conditions of uncertainty
(which are always present in statistics).
• Uncertainty can be caused by
• Randomness in the selection of a sample
• Lack of knowledge about the source of the inferences
• Change in conditions

Statistics in Business
• Probability is used in statistics (it will be discussed in detail later
in the course)
• To estimate the level of confidence in a confidence interval
• To calculate the p-value in hypothesis testing

Classifying Variables By Type
• Categorical (qualitative) variables take categories as their values
(data), such as “yes, no”, “blue, brown, green”, or “Easy, Normal,
Tough”.
• Numerical (quantitative) variables have values (data) that represent a
counted or measured quantity.
• Discrete variables arise from a counting process.
• Continuous variables arise from a measuring process.

Examples of Types of Variables
Question Responses Variable Type
Do you have a Facebook profile?
How many WhatsApp messages have you sent in the past hour?
How long did the mobile app update take to download?
What is the colour of your eyes?
What is your weight?
In which class do you study?
In which section are you?
How do you rate the new Netflix series?

Levels of Data Measurement : Nominal
In nominal measurement the values just "name" the attribute
uniquely. Numbers are used only to classify or categorize.
• No ordering of the cases is implied.
• Gender
• boys vs. girls or
• males vs. females
• Religion
• Hindu
• Muslim
• Sikh
• Christian
• Jain etc.
• Employment Classification
• 1 for Educator
• 2 for Construction Worker
• 3 for Manufacturing Worker

Levels of Data Measurement : Ordinal
• A variable is measured on an ordinal scale if its values can be ranked.
However, the differences between the ranks are not comparable.
• For example:
• A gold medal reflects superior performance to a silver or bronze medal in the
Olympics.
• You can’t say a gold and a bronze medal average out to a silver medal,
though.
• Position within an organization
• 1 for President
• 2 for Vice President
• 3 for Plant Manager
• 4 for Department Supervisor
• 5 for Employee
• Preference scales are typically ordinal
• How much do you like this cereal?
• Like it a lot, somewhat like it, neutral, somewhat dislike it, dislike it a lot.

Levels of Data Measurement : Interval
In interval measurement the distance between attributes
does have meaning.
• Numerical data typically fall into this category.
• There is no absolute (true) zero value.
• For example:
• Measuring temperature in degrees Celsius or Fahrenheit
• Scales for measurement

Levels of Data Measurement : Ratio
• In ratio measurement there is always a meaningful reference point
(an absolute zero).
• This means that you can construct a meaningful fraction
(or ratio) with a ratio variable.
• In applied social research most "count" variables are ratio,
for example, the number of clients in the past six months.
• Height, Weight, and Volume
• Profit and Loss

Types of Variables (Summary)
Variables
• Categorical (defined categories)
  Examples: Marital Status, Political Party, Eye Color
  • Nominal
  • Ordinal (ordered categories)
    Examples: ratings such as Good, Better, Best or Low, Med, High
• Numerical
  • Discrete (counted items)
    Examples: Number of Children, Defects per hour
  • Continuous (measured characteristics)
    Examples: Weight, Voltage

Sources of Data
• Primary Sources: The data collector is the one using the data for
analysis:
• Data from a political survey.
• Data collected from an experiment.
• Observed data.
• Secondary Sources: The person performing data analysis is not the
data collector:
• Analyzing census data.
• Examining data from print journals or data published on the
internet.

Organizing and Visualization
of Variables

Organization of Categorical Data
• One categorical variable: Summary Table
• Two or more categorical variables: Contingency Table
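A summary table and a contingency table can both be built by counting. The sketch below is one minimal way to do it with Python's standard library; the survey records and the variable names (`records`, `summary`, `contingency`) are hypothetical and used only for illustration.

```python
from collections import Counter

# Hypothetical survey records: (payment method, customer type)
records = [
    ("Card", "New"), ("Cash", "Returning"), ("Card", "Returning"),
    ("UPI", "New"), ("Card", "New"), ("Cash", "New"),
]

# Summary table: frequency and percentage for ONE categorical variable
summary = Counter(method for method, _ in records)
for category, count in summary.most_common():
    print(f"{category:<6}{count:>3}{100 * count / len(records):>8.1f}%")

# Contingency table: joint frequencies for TWO categorical variables
contingency = Counter(records)
rows = sorted({method for method, _ in records})
cols = sorted({kind for _, kind in records})
print(f"{'':<6}" + "".join(f"{c:>10}" for c in cols))
for r in rows:
    print(f"{r:<6}" + "".join(f"{contingency[(r, c)]:>10}" for c in cols))
```

A `Counter` returns 0 for a combination that never occurs, so empty cells of the contingency table print as 0 rather than raising an error.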

Organization of Numerical Data
Numerical Data
• Ordered Array
• Frequency Distribution
• Cumulative Distribution
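The three ways of organizing numerical data can be illustrated with a short Python sketch. The data values and the class width of 10 are assumptions chosen for the example, not values from the slides.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical data: minutes taken to serve 12 customers
data = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13]

# Ordered array: values sorted from smallest to largest
ordered = sorted(data)

# Frequency distribution: count the values falling in classes of width 10
width = 10
freq = Counter((x // width) * width for x in ordered)
classes = sorted(freq)
counts = [freq[c] for c in classes]

# Cumulative distribution: running percentage of values up to each class
cumulative = [100 * c / len(data) for c in accumulate(counts)]

for lower, n, cum in zip(classes, counts, cumulative):
    print(f"{lower} to under {lower + width}: {n:>2} {cum:6.1f}%")
```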

Visualization of Categorical Data
• Summary table for one variable: bar chart, pie chart, Pareto chart
• Contingency table for two variables: side-by-side bar chart

Visualization of Numerical Data
Numerical Data : 1 Variable
• Ordered array: stem-and-leaf display
• Frequency distributions and cumulative distributions: histogram,
polygon, ogive

Visualization of Numerical Data (Contd..)
Numerical Data : 2 Variables
• Scatter plot
• Time series plot

Organizing Many Variables
• Use a Pivot Chart
• It summarizes variables as a multidimensional summary table.
• It allows interactive changing of the level of summarization and
formatting of the variables.
• It lets you interactively “slice” data to summarize subsets of data
that meet specified criteria.
• It can be used to discover possible patterns and relationships in
multidimensional data that simpler tables and charts would fail to
make apparent.

Best Practices for Constructing Visualizations
• Use the simplest possible visualization.
• Include a title & label all axes.
• Include a scale for each axis if the chart contains axes.
• Begin the scale for a vertical axis at zero & use a constant scale.
• Avoid 3D, “exploded”, and similar effects.
• Use consistent colorings in charts meant to be compared.
• Avoid using uncommon chart types including radar, surface, bubble, cone,
and pyramid charts.

Introduction
• The central tendency is the extent to which the values of a numerical
variable group around a typical or central value.
• The variation is the amount of dispersion or scattering away from a
central value that the values of a numerical variable show.
• The shape is the pattern of the distribution of values from the lowest
value to the highest value.

Measures of Central Tendency
• Mean
• Average of all the values
• Affected by extreme values (Also called Outliers)
• Median
• In an ordered array, the median is the “middle” number (50% above, 50% below).
• Median position can be determined by formula (n+1)/2, where n is the number of
values of a given data set. The value at that given position is called median value.
• For a data set with even number of values, it will be average of the two middle
values.
• Less sensitive than the mean to extreme values.
• Mode
• Value that occurs most often.
• Not affected by extreme values.
• Used for either numerical or categorical data.
• There may be no mode.
• There may be several modes.
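The behaviour of the three measures can be checked with Python's standard `statistics` module. This is a small sketch on made-up data; the lists below are illustrative assumptions, not data from the course.

```python
from statistics import mean, median, multimode

# Hypothetical data set with one extreme value (100)
values = [1, 2, 3, 4, 100]

print(mean(values))    # 22: pulled upward by the extreme value
print(median(values))  # 3: the value at position (n + 1) / 2 = 3

# Even number of values: the median averages the two middle values
print(median([1, 2, 3, 4]))  # 2.5

# multimode returns every most-frequent value, so several modes are possible
print(multimode([1, 1, 2, 2, 3]))  # [1, 2]

# The mode also works for categorical data
print(multimode(["red", "blue", "red"]))  # ['red']
```

When every value occurs exactly once, `multimode` returns all of them, which corresponds to the "no mode" case.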

Measures of Central Tendency (Contd..)
• Geometric mean: used to measure the rate of change of a variable over time.
• The mean is generally used, unless extreme values (outliers) exist.
• The median is often used when outliers exist, since it is not sensitive
to extreme values. For example, median home prices may be reported for a
region.
• In many situations it makes sense to report both the mean and the
median.

Measures of Central Tendency: Summary
Central Tendency
• Arithmetic Mean: \( \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \)
• Median: middle value in the ordered array
• Mode: most frequently observed value

Measures of Variation
• Measures of variation give information on the spread,
variability, or dispersion of the data values.
• For a better understanding of the data, look at the dispersion as well
as the central value.
• Measures of variation: Range, Variance, Standard Deviation,
Coefficient of Variation

Measures of Variation (Contd..)
• Range = Xlargest – Xsmallest
• Does not account for how the data are distributed.
• Sensitive to outliers.
• Sample Variance: approximately the average of the squared deviations of the
values from the mean.
\[ S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \]
• Sample Standard Deviation: the square root of the sample variance.
\[ S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \]
• Has the same units as the original data.
• Most commonly used measure of variation.
• Shows variation about the mean.
• For a population, the denominator is n instead of n - 1. Dividing by n - 1
in the sample formula makes the sample variance an unbiased estimator of the
population variance. (Discussion of unbiased estimators is for advanced courses.)
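The range, variance, and standard deviation can be computed directly from their definitions. The sketch below assumes a small made-up data set; the function names `variance` and `standard_deviation` and the `sample` flag are introduced here for illustration.

```python
from math import sqrt

def variance(values, sample=True):
    """Sum of squared deviations from the mean, divided by
    n - 1 (sample) or n (population)."""
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = sum((x - x_bar) ** 2 for x in values)
    return squared_deviations / (n - 1 if sample else n)

def standard_deviation(values, sample=True):
    """Square root of the variance; same units as the data."""
    return sqrt(variance(values, sample))

# Hypothetical data: mean is 5, squared deviations sum to 32
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)                 # Xlargest - Xsmallest
print(data_range)                                  # 7
print(variance(data, sample=False))                # 32 / 8 = 4.0
print(standard_deviation(data, sample=False))      # 2.0
print(variance(data))                              # 32 / 7, about 4.57
```

The sample variance is larger than the population variance on the same values because it divides by n - 1 rather than n.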





Measures of Variation (Contd..)
• Coefficient of Variation
• Measures relative variation.
• Always expressed as a percentage (%).
• Shows variation relative to the mean.
• Can be used to compare the variability of two or more sets of data measured
in different units.
\[ CV = \left( \frac{S}{\bar{X}} \right) \cdot 100\% \]

Measures of Variation: Comparing Coefficients of Variation
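• Stock A:
• Mean price last year = $50.
• Standard deviation = $5.
\[ CV_A = \frac{S_A}{\bar{X}_A} \cdot 100\% = \frac{5}{50} \cdot 100\% = 10\% \]
• Stock B:
• Mean price last year = $60.
• Standard deviation = $10.
\[ CV_B = \frac{S_B}{\bar{X}_B} \cdot 100\% = \frac{10}{60} \cdot 100\% = 16.67\% \]
• Relative to its mean price, Stock B is more variable than Stock A.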
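The two-stock comparison can be reproduced in a few lines of Python. The function name `coefficient_of_variation` is an illustrative choice; the inputs are the figures from the slide.

```python
def coefficient_of_variation(std_dev, mean):
    """CV = (S / X-bar) * 100, expressed as a percentage."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * std_dev / mean

cv_a = coefficient_of_variation(5, 50)    # Stock A
cv_b = coefficient_of_variation(10, 60)   # Stock B

print(f"CV_A = {cv_a:.2f}%")  # 10.00%
print(f"CV_B = {cv_b:.2f}%")  # 16.67%
```

Because the CV is unit-free, the same function could compare, say, prices in dollars with weights in kilograms.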

Shape of a Distribution
• Describes how data are distributed.
• Two useful shape related statistics are:
• Skewness:
• Measures the extent to which data values are not symmetrical.
• Kurtosis:
• Kurtosis measures the peakedness of the curve of the distribution—that
is, how sharply the curve rises approaching the center of the distribution.

Shape of a Distribution (Skewness)
• Measures the extent to which data are not symmetrical. A widely used
formula, Pearson's second coefficient of skewness, is 3(Mean − Median)/SD.
Shape           Relationship            Skewness statistic
Left-Skewed     Mean < Median < Mode    < 0
Symmetric       Mean = Median = Mode    = 0
Right-Skewed    Mode < Median < Mean    > 0
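The skewness coefficient 3(Mean − Median)/SD can be sketched as follows. The sample standard deviation is assumed for SD, and the function name `pearson_skewness` and the three data sets are illustrative.

```python
from statistics import mean, median, stdev

def pearson_skewness(values):
    """3 * (mean - median) / SD, using the sample standard deviation."""
    return 3 * (mean(values) - median(values)) / stdev(values)

symmetric = [1, 2, 3, 4, 5]          # mean = median = 3
right_skewed = [1, 2, 3, 4, 100]     # long tail to the right: mean > median
left_skewed = [1, 97, 98, 99, 100]   # long tail to the left: mean < median

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(f"{name}: {pearson_skewness(data):+.2f}")
```

The sign of the result matches the direction of skew: zero for the symmetric data, positive for the right-skewed data, negative for the left-skewed data.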

Shape of a Distribution -- Kurtosis
• It measures how sharply the curve rises approaching the center of the
distribution.
• Sharper peak than bell-shaped: Kurtosis > 3
• Bell-shaped: Kurtosis = 3
• Flatter than bell-shaped: Kurtosis < 3

Exploring Numerical Data Using Quartiles
• The five-number summary: Xsmallest, Q1, Median, Q3, Xlargest.
• Constructing a boxplot.
• General formula for the ranked position of the P-th percentile is
(P/100)*(n+1), where n is the number of values in a given data set. For
P = 50 this gives the median position (n+1)/2.
• If the result is a whole number then it is the ranked position to use.
• If the result is a fractional half, then average the two corresponding
data values.
• The IQR is Q3 – Q1 and measures the spread in the middle 50% of the
data. For this reason it is also called the midspread.
• The IQR is a measure of variability that is not influenced by outliers or
extreme values.
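The quartile rules can be turned into a short Python sketch. Several things here are assumptions rather than statements from the slides: the ranked position is taken as (P/100)(n + 1), which agrees with the median position (n + 1)/2 given earlier; positions that are neither whole nor a fractional half are rounded to the nearest position; P is an integer; and the function name `percentile_value` and the data are illustrative.

```python
from fractions import Fraction
from math import floor

def percentile_value(values, p):
    """Value at the P-th percentile using ranked positions (p must be an int)."""
    ordered = sorted(values)
    n = len(ordered)
    pos = Fraction(p, 100) * (n + 1)   # exact 1-based ranked position
    if pos.denominator == 1:           # whole number: use that position
        ranks = [int(pos)]
    elif pos.denominator == 2:         # fractional half: average the two neighbours
        ranks = [floor(pos), floor(pos) + 1]
    else:                              # assumption: round to the nearest position
        ranks = [round(pos)]
    ranks = [min(max(r, 1), n) for r in ranks]  # keep positions inside 1..n
    return sum(ordered[r - 1] for r in ranks) / len(ranks)

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]   # hypothetical, n = 9

q1 = percentile_value(data, 25)       # position 2.5: (12 + 13) / 2
q2 = percentile_value(data, 50)       # position 5
q3 = percentile_value(data, 75)       # position 7.5: (18 + 21) / 2
iqr = q3 - q1
five_number_summary = (min(data), q1, q2, q3, max(data))
print(five_number_summary, iqr)
```

Statistical software often interpolates between positions instead of rounding, so library quartiles can differ slightly from this hand rule.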

The Empirical Rule for Normal Distribution
Vs Chebyshev’s Rule for any other distribution
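• Empirical Rule: for a bell-shaped (normal) distribution, approximately 68%,
95%, and 99.7% of the values fall within 1, 2, and 3 standard deviations of
the mean.
• Chebyshev’s Rule: Regardless of how the data are distributed, at least
(1 − 1/k²) × 100% of the values will fall within k standard deviations of
the mean (for k > 1).
[Figure: normal curve centered at μ. μ ± σ covers about 68% of the values;
each band between σ and 2σ from the mean adds another 13.5%, and each band
between 2σ and 3σ adds another 2.35%.]
Range     Empirical Rule (normal curve)   Chebyshev’s Rule (any distribution)
μ ± σ     approximately 68%               not applicable (requires k > 1)
μ ± 2σ    approximately 95%               at least 75%
μ ± 3σ    approximately 99.7%             at least 88.89%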
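Both columns of the comparison can be reproduced numerically. The sketch below uses `statistics.NormalDist` from the Python standard library for the normal-curve percentages; the function names `chebyshev_lower_bound` and `normal_within` are illustrative.

```python
from statistics import NormalDist

def chebyshev_lower_bound(k):
    """Minimum percentage of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's rule applies only for k > 1")
    return (1 - 1 / k**2) * 100

def normal_within(k):
    """Percentage of a normal distribution within k standard deviations."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return (standard_normal.cdf(k) - standard_normal.cdf(-k)) * 100

for k in (2, 3):
    print(f"k = {k}: normal {normal_within(k):.2f}%, "
          f"Chebyshev at least {chebyshev_lower_bound(k):.2f}%")
```

Chebyshev's bound is much lower than the normal-curve percentage because it must hold for every distribution, however irregular.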

Measures Of The Relationship Between Two Numerical
Variables
• Scatter plots allow you to visually examine the relationship.
• The Covariance {Cov(x,y)}
• The covariance measures the linear relationship between two numerical variables.
• Only concerned with the nature (direction) of the relationship.
• No causal effect is implied.
• Cov > 0: the variables tend to move in the same direction. Cov < 0: they tend
to move in opposite directions. Cov = 0: there is no linear relationship.
• It does not show the relative strength of the relationship.
• The Coefficient of Correlation (r)
• Measures the relative strength of the linear relationship between two numerical
variables.
• Varies between -1 (perfect negative linear relationship) and +1 (perfect
positive linear relationship).
• The Coefficient of Determination (r²)
• Shows the proportion of variation in y that is explained by x (or by all the
x variables together).
• Varies between 0 and 1; the higher the value, the more of the variation is
explained. As with covariance, no causal effect is implied.
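The three measures can be computed from their definitions in a few lines. This sketch assumes the sample versions (denominator n - 1); the function names and the two small data sets are illustrative.

```python
from math import sqrt

def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """r = Cov(x, y) / (S_x * S_y), always between -1 and +1."""
    s_x = sqrt(sample_covariance(x, x))  # covariance of x with itself = variance
    s_y = sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]      # moves with x
y_down = [10, 8, 6, 4, 2]    # moves against x

print(sample_covariance(x, y_up))     # positive: same direction
print(sample_covariance(x, y_down))   # negative: opposite direction
print(correlation(x, y_up))           # close to +1
print(correlation(x, y_up) ** 2)      # coefficient of determination r²
```

Rescaling y (for example from rupees to thousands of rupees) changes the covariance but leaves r unchanged, which is why r is used to judge relative strength.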
