School of Management,
Delhi Technological University
• Statistics is a science. It is a way to get information from data to
facilitate decision making or interpretation.
• Examples of Data:
− Daily wholesale prices and arrivals of a particular agricultural
produce (say wheat) in particular markets for last 3 months ;
− Marks of students in QT for last three years;
− Weather data of a city for last 30 days,
− House-wise, model-wise ownership of cars in a particular locality,
(Explain associated interpretations/ decisions).
• Variable: A characteristic of an item or individual
Categorical variables (Qualitative variables):
(have values that can be placed only in categories e.g.
Are you married: Yes/No)
Numerical variables (Quantitative variables):
(have values that represent quantities)
- Are of two types : Discrete and Continuous
• Data: The set of individual values associated with a variable
Data can be:
• quantitative or qualitative
• in grouped or ungrouped form.
Quantitative data can be subjected to arithmetic operations, unlike
qualitative data.
The field of statistics deals with measurements (quantitative or
qualitative).
Four generally used scales of measurement (from weakest to
strongest): Nominal, Ordinal, Interval and Ratio.
To describe values of a categorical variable, we use: Nominal scale and Ordinal scale
To describe values of a numerical variable, we use: Interval scale and Ratio scale
Nominal Scale: Here numbers are used simply as labels for
categories. For example, an employee may be (M) Male/ (F)
Female (even if numbers are assigned to categories, these
are arbitrary); Weakest scale because you cannot specify
any ranks across categories
Ordinal Scale: Here, data elements are ordered according
to their relative merit. Ex. A product may be ranked as 1, 2, 3
or 4 where 1 denotes worst quality and 4 the best quality.
Ordinal scale does not tell us how much better a product is
than others. It only tells that it is better.
Thus, ordinal scale is weaker in the sense that it is silent
about the amount of difference between categories.
• Interval Scale: An ordered scale in which the difference between
measurements is a meaningful quantity but which does not involve a true zero:
the value of 0 is assigned arbitrarily, and thus we cannot take the ratio of
two measurements. But we can take the ratio of intervals.
Ex: … C is 2 degrees warmer than 50 C, and so is a comparison of … C and 720 C, but the environmental conditions are totally different.
Ex: Time of day is on an interval scale. We cannot say that 10 AM is
twice as late as 5 AM. But we can say that the interval between 0 AM
and 10 AM (10 hrs) is twice as long as the interval between 0 AM and 5
AM (5 hrs). This is because 0 AM does not mean absence of any
time.
• Ratio Scale: An ordered scale in which the difference between
measurements is a meaningful quantity and which involves a true 0 point (0
on a ratio scale is an absolute 0). Strongest scale.
• If two measurements are on a ratio scale, then we can take ratios of
the measurements themselves.
Ex. Money is measured in ratio scale. A sum of Rs. 0 means no
money and is thus an absolute zero. A sum of Rs. 100 is twice as
large as Rs. 50. Other examples are height, weight, volume, area, etc.
[Note that in interval scale, the interval between two interval scale
measurements is in ratio scale (not the individual observations). ]
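The difference between interval and ratio scales can be checked numerically. The sketch below is only an illustration: the readings (40, 20 and 0 degrees Celsius) and the helper name `c_to_f` are assumptions chosen for the example. It shows that the ratio of two raw readings changes when the unit changes, while the ratio of two intervals does not.

```python
def c_to_f(celsius):
    """Convert a Celsius reading to Fahrenheit (both are interval scales)."""
    return celsius * 9 / 5 + 32

# Hypothetical readings, chosen only for illustration.
a_c, b_c, base_c = 40.0, 20.0, 0.0

# Ratio of raw readings: depends on the unit, so it has no real meaning.
ratio_c = a_c / b_c
ratio_f = c_to_f(a_c) / c_to_f(b_c)

# Ratio of intervals (differences from a common reading): same in both units.
interval_ratio_c = (a_c - base_c) / (b_c - base_c)
interval_ratio_f = (c_to_f(a_c) - c_to_f(base_c)) / (c_to_f(b_c) - c_to_f(base_c))

print("ratio of readings :", ratio_c, "vs", round(ratio_f, 3))
print("ratio of intervals:", interval_ratio_c, "vs", interval_ratio_f)
```

The two ratios of readings disagree, whereas the two ratios of intervals agree, which is what the note on interval scale measurements says.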
Primary Data (Data which you collect yourself for doing analysis)
Secondary Data (Data which is collected by someone else and you use
for doing analysis)
Sources could be:
Data distributed by an organization or individual (e.g. Centre for
Monitoring Indian Economy: www.cmie.com; CRISIL: www.crisil.com;
Nielsen: provide consumer research data to telecom and mobile media
The outcomes of a designed experiment
The responses from a survey
The results of an observational study
Data collected by ongoing business activities
Samples and Population
•The distinction between sample and population is very important in
statistics.
•A population is the group of all items of interest to an investigator
(not necessarily group of people). Also called universe. In DTU
campus, it may be population of B.Tech. students, population of
MBA students, population of faculty members, etc. Other examples,
Population of weights of cricket bats produced in a factory,
population of cows in a village, etc.
•A descriptive measure of population is called parameter e.g.
average weight of bats produced, average milk given by cows in a
village.
•A sample is a subset of units selected from a population (sampling
units vs sampled units)
•A descriptive measure of sample is called statistic e.g. average
weight of a sample of bats, average milk given by sampled cows.
• A sample is drawn from a population using a sampling method.
Non Probability Samples
Simple Random Sampling (SRS) (With or Without replacement)
Cluster Sampling, etc.
• The aim is to get a representative sample of the population so
that it leads to near accurate inferences about the population
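A minimal sketch of the probability sampling methods named above, using only Python's standard `random` module. The population of 100 unit IDs, the grouping into 10 clusters of 10, and the sample sizes are all assumptions made for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 unit IDs grouped into 10 clusters of 10.
population = list(range(1, 101))
clusters = [population[i:i + 10] for i in range(0, 100, 10)]

# SRS without replacement: no unit can appear twice in the sample.
srs_without = rng.sample(population, k=10)

# SRS with replacement: the same unit may be drawn more than once.
srs_with = rng.choices(population, k=10)

# Cluster sampling: pick whole clusters at random, keep every unit in them.
chosen_clusters = rng.sample(clusters, k=2)
cluster_sample = [unit for cluster in chosen_clusters for unit in cluster]

print("SRS without replacement:", sorted(srs_without))
print("SRS with replacement   :", sorted(srs_with))
print("Cluster sample         :", cluster_sample)
```

`random.sample` never repeats a unit, while `random.choices` can, which is the practical difference between the two SRS variants.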
Sampling frame: to be prepared before sampling.
Partial sampling frame may lead to misleading results (e.g. when you
exclude a particular group of people).
When do we prefer sampling over census approach of data
collection?
• When selecting a sample is less time consuming than selecting
every item of the population
• When selecting a sample is less costly than selecting every item of
the population
• When analyzing a sample is less cumbersome than analyzing the entire
population
• Data Cleaning: Removing outliers
• A conclusion drawn about a population based on the information
in a sample from the population is called a statistical inference.
• We use sample statistics to make inference about population
parameters.
• Conclusion about a population based on the sample statistics may
not always be correct. Therefore, we use measures of reliability
while undertaking statistical inference. Two such measures are:
– Confidence level and
– Significance level.
• Confidence level is the proportion of times an estimation
procedure will be correct. For example, if we use an estimation
procedure and produce an estimate that has a confidence level of
95% that would mean – In the long run, estimates based on this
estimation procedure will be correct 95% of the time.
• Significance level measures how frequently a conclusion drawn
about the population will be wrong in the long run. A 5%
significance level means that, in the long run, a conclusion drawn
would be wrong 5% of the time.
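The long-run meaning of a 95% confidence level can be illustrated by simulation. The sketch below rests on assumptions not in the notes: a normal population with known mean 50 and standard deviation 10, samples of size 30, and the usual interval of sample mean ± 1.96 × σ/√n. It counts how often the interval contains the true mean.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Assumed setup: normal population with known mean and standard deviation.
TRUE_MEAN, SIGMA, N, TRIALS = 50.0, 10.0, 30, 2000
Z_95 = 1.96  # standard normal critical value for a 95% interval

hits = 0
for _ in range(TRIALS):
    sample = [rng.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    centre = statistics.fmean(sample)
    half_width = Z_95 * SIGMA / N ** 0.5
    # The procedure is "correct" when the interval covers the true mean.
    if centre - half_width <= TRUE_MEAN <= centre + half_width:
        hits += 1

coverage = hits / TRIALS
print(f"share of intervals containing the true mean: {coverage:.3f}")
print(f"share of intervals missing it              : {1 - coverage:.3f}")
```

Over many repetitions the first share should settle near 0.95 and the second near 0.05, matching the confidence level and significance level described above.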
• e.g. a farmer ‘X’ has 1500 sheep. These constitute the
entire population of sheep for farmer ‘X’. If 15 sheep are
selected from this population, it will form a sample of 15
sheep from the population of 1500 sheep. Further, if these
15 sheep are selected at random, the sample would be a
simple random sample.
• Note that Sample and Population are relative to each other.
If we consider the entire district with 20,000 sheep, the 1500
sheep with farmer ‘X’ could be one sample of the district
population of sheep (though not a random sample of 1500
sheep from the district).
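The sheep example can be sketched in code to show the difference between a parameter and a statistic. The sheep weights below are simulated and purely hypothetical (the notes give no measurements); only the population size of 1500 and the sample size of 15 come from the example.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical weights (kg) for farmer X's 1500 sheep; the values are simulated.
population = [rng.gauss(60, 8) for _ in range(1500)]

# Parameter: a descriptive measure of the whole population.
population_mean = statistics.fmean(population)

# Simple random sample of 15 sheep, drawn without replacement.
sample = rng.sample(population, k=15)

# Statistic: the same measure computed on the sample only.
sample_mean = statistics.fmean(sample)

print(f"population mean (parameter): {population_mean:.2f}")
print(f"sample mean (statistic)    : {sample_mean:.2f}")
```

The statistic will usually be close to the parameter but not equal to it, and a different random sample would give a different value.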
Types of Survey Errors
• Validity of survey results must be examined. We must evaluate the
purpose of survey and for whom it is conducted.
• Inferences based on non probability samples could be seriously
• The only way to make valid statistical inference about a population is by
using a probability sample.
• Even surveys based on probability samples are subject to four types of
errors:
- Coverage error
- Nonresponse error
- Sampling error
- Measurement error
• Our aim should be to minimize these four errors.
Non-response bias, i.e. the bias introduced when we ignore the fact that
certain people may not respond to a few questions. The bias gets
introduced when such people belong more to one segment. E.g.
consider a question “Have you ever been arrested?” There may be
poor response to this question from people who have indeed been
arrested.
Examples: Use of Statistical Inference in Business Situations
•A pharmaceutical manufacturer interested in marketing a new drug may be
required to prove that the drug does not cause any side effects. The drug
may be tested on a random sample of people and the technique of
statistical inference may be used to draw conclusions about the entire
population.
•To assess the popularity of its ATMs, a bank may seek opinion of a
randomly selected sample of customers. Statistical inference can be used
to generalize the conclusions for the entire population of bank’s customers.
•A quality control engineer at a plant making bulbs needs to ensure that not
more than 3 % of the bulbs produced are defective. The engineer may
periodically collect random samples of bulbs and check their quality. Based
on the random samples, the engineer can draw conclusion about the
proportion of defective items in the entire population of bulbs.
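The bulb example can be sketched as follows. The true defect rate of 2% and the sample size of 500 are assumptions made for illustration; only the 3% limit comes from the example. The sketch computes the sample proportion of defective bulbs and compares it with the limit.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Assumed for illustration: the true defect rate, unknown to the engineer.
TRUE_DEFECT_RATE = 0.02
LIMIT = 0.03        # not more than 3% defective, as in the example
SAMPLE_SIZE = 500   # hypothetical number of bulbs inspected

# Each inspected bulb is recorded as True (defective) or False (good).
sample = [rng.random() < TRUE_DEFECT_RATE for _ in range(SAMPLE_SIZE)]

# Sample statistic used to infer the population proportion.
sample_proportion = sum(sample) / SAMPLE_SIZE
within_limit = sample_proportion <= LIMIT

print(f"sample proportion defective: {sample_proportion:.3f}")
print("within the 3% limit:", within_limit)
```

A direct comparison like this ignores sampling error; a formal inference would attach a confidence or significance level to the conclusion.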
Percentiles and Quartiles
• The Pth percentile of a group of numbers is that value below which
lie P% of the numbers in the group. The position of Pth percentile is
given by (n+1)P/100, where n is the number of data points.
• Ex: sales made by each of the 20 sales persons of a departmental
store are as follows:
• (arranged in ascending order – to be done in case data is not ordered)
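50th percentile: position (20+1)(50)/100 = 10.5, i.e. value 16
80th percentile: position (20+1)(80)/100 = 16.8, i.e. value 19.8
90th percentile: position (20+1)(90)/100 = 18.9, i.e. value 21.9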
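The position rule (n+1)P/100 can be sketched in code. Two things here are assumptions: the list `sales` is an illustrative set of 20 values chosen to be consistent with the positions and values quoted above (the original data list is not reproduced in these notes), and a fractional position is resolved by linear interpolation between the two neighbouring ordered values.

```python
def percentile(values, p):
    """Pth percentile using the position rule (n + 1) * P / 100."""
    data = sorted(values)            # the rule needs ordered data
    n = len(data)
    pos = (n + 1) * p / 100          # 1-based position in the ordered list
    k = int(pos)                     # whole part of the position
    frac = pos - k                   # fractional part of the position
    if k < 1:                        # position falls before the first value
        return data[0]
    if k >= n:                       # position falls at or after the last value
        return data[-1]
    # Interpolate between the k-th and (k+1)-th ordered values.
    return data[k - 1] + frac * (data[k] - data[k - 1])

# Assumed illustrative data (20 values), not taken from the notes.
sales = [6, 9, 10, 12, 13, 14, 14, 15, 16, 16,
         16, 17, 17, 18, 18, 19, 20, 21, 22, 24]

for p in (50, 80, 90):
    print(p, round(percentile(sales, p), 2))
```

For example, position 16.8 means 0.8 of the way from the 16th ordered value to the 17th.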
• Quartiles are special percentiles which break the distribution of data
into four groups.
• The first quartile is the 25th percentile. It is the point below which
lie one fourth of data. Also called lower quartile.
• The second quartile is the 50th percentile. It is the point below
which lie one half of data (also called median). Also called middle
quartile.
• The third quartile is the 75th percentile. It is the point below which
lie 75 % of data. Also called upper quartile.
• The difference between third and first quartile is called
interquartile range. It is a measure of spread of data.
Exercise: Interquartile range for above example is 18.75 – 13.25 = 5.5
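Quartiles and the interquartile range can be computed with Python's standard `statistics.quantiles`, whose default `exclusive` method places the quartiles at positions based on n+1, as in the rule used in these notes. The list `sales` is an assumed illustrative data set of 20 values, chosen to be consistent with the quartile values quoted in the exercise; it is not taken from the notes.

```python
import statistics

# Assumed illustrative data (20 values), not taken from the notes.
sales = [6, 9, 10, 12, 13, 14, 14, 15, 16, 16,
         16, 17, 17, 18, 18, 19, 20, 21, 22, 24]

# n=4 asks for the three cut points that split the data into four groups.
q1, q2, q3 = statistics.quantiles(sales, n=4)

# Interquartile range: spread of the middle half of the data.
iqr = q3 - q1

print("first quartile :", q1)
print("median         :", q2)
print("third quartile :", q3)
print("interquartile range:", iqr)
```

The second cut point coincides with `statistics.median(sales)`, since the second quartile is the median.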
Measures of Central Tendency
Common measures of central tendency (centre of data) : mean,
median and mode.
• Mean or Arithmetic Mean or Average:
(Sample Mean, Population Mean)