CHAPTER 1
INTRODUCTION TO STATISTICS
Expected Outcomes
Able to define basic terminologies of statistics.
Able to apply the basic steps in the statistical problem-solving
methodology for various applications.
Able to summarise and analyse data using measures of central
tendency, measures of variation and measures of position.
Able to relate the concepts of accuracy and precision of data using a game
of darts.
Able to conduct exploratory data analysis that includes numerical data
analysis and various graphical displays.
Able to plot and interpret a normal probability plot.
SZS2017
CONTENT
1.1 Statistical Terminologies
1.2 Statistical Problem Solving Methodology
1.3 Review on Descriptive Statistics
1.3.1 Measures of Central Tendency
1.3.2 Measures of Variation
1.3.2.1 Accuracy and Precision
1.3.3 Measures of Position
1.3.4 Descriptive Statistics Using Microsoft Excel
1.4 Exploratory Data Analysis
1.4.1 Outliers
1.4.2 Box Plot
1.5 Normal Probability Plot
1.1 STATISTICAL TERMINOLOGIES
Define the meaning of statistics, population,
sample, parameter, statistic, descriptive statistics
and inferential statistics.
Discuss the importance of statistics in daily lives.
1.1.1 What is Statistics?
Most people become familiar with probability and statistics through
radio, television, newspapers, and magazines. For example, the
following statements were found in newspapers:
Ten thousand parents in Malaysia have chosen StemLife as their trusted
stem cell bank.
The death rate from lung cancer was 10 times higher for smokers compared
to nonsmokers.
The average cost of a wedding is nearly RM10,000 in Malaysia.
In Malaysia, the median salary for men with a bachelor’s degree is
RM 30,000 per year, while the median salary for women with a bachelor’s
degree is RM 29,000 per year.
Globally, an estimated 500,000 children under the age of 15 live with Type
1 diabetes.
Women who eat fish once a week are 29% less likely to develop heart disease.
Statistics: the science of conducting studies to collect, organise, summarise,
analyse, present, interpret and draw conclusions from data.
Data: any values (observations or measurements) that have been collected.
Collecting and analysing data is one of the most important parts of research
methodology.
Researchers must have a basic knowledge of statistics before starting any
research or study involving data analysis.
Statistics is also used to analyse the results of surveys and as a tool in
scientific research to make decisions based on controlled experiments,
estimation, prediction, and quality control.
Basic knowledge of statistics is needed in any discipline or field of
research or study (in almost all fields of human endeavour) that involves data
analysis.
The methods of statistics allow the researchers to design a valid experiment
and finally draw a reliable conclusion or interpretation from the data they
produced and analysed.
Examples:
In sports, a statistician may keep records of the number of successful kicks a
team scored during a football season.
In public health, a doctor might be concerned with the number of children who
are infected with the H1N1 virus during a certain year.
In education, an educator might want to know if the performance of
students in the current semester is better than in the previous semester.
1.1.2 Why Do We Need Statistics?
Knowledge of statistics may help you in:
1. Describing the relationship between variables.
a. A university admission director needs to find an effective way of
selecting students. He designed a statistical study to see if there is a
significant relationship between the SPM result and the GPA achieved by
first year students at his university. If there is a strong relationship,
a high SPM result will become an important criterion for admission.
b. A management consultant wants to compare a client’s investment
return for this year with related figures from last year. He summarises
the revenue and cost data from both periods and finds the relationship
between these two variables. Based on his findings, he presents his
recommendations to his client.
A variable is a characteristic or attribute that can assume different values. These
values are data. A variable is called a random variable if its values are determined by chance.
2. Making better decisions in the face of uncertainty.
a. Suppose that a manager of Unisex Hair Stylist claimed that 90% of the
customers are satisfied with the services. If a consumer activist feels
that this is an exaggerated statement that might require legal action,
the activist can use statistical inference techniques to decide whether
or not to sue the manager. Therefore, the knowledge gained from
studying statistics can enhance the awareness towards becoming
better consumers.
b. People can make intelligent decisions about what products to purchase
based on consumer studies, about government spending based on
utilisation studies, and so on.
1.1.3 Population and Sample
Population (N)
A complete collection of
measurements, outcomes, objects or
individuals under study.
Tangible population
Finite: the total number of subjects is fixed and could be listed.
→ All computers in a room, all female students in a university, or all electrical
components manufactured in a day, etc.
Conceptual (intangible) population
All values that might possibly have been observed; the number of subjects is
unlimited.
→ Simulated data from a computer or instrument, the number of germs on the human
body, all experimental data such as all measurements of the length of a metal rod, etc.
Sample (n)
A subset of the population that is observed.
Parameter and Statistic
Parameter
A numerical value that represents a certain population characteristic.
Examples:
The average weight of students from a population of students in a university.
The percentage of defective components in a population of electrical components
manufactured in a day.
Statistic
A numerical value that represents a certain sample characteristic.
Examples:
The average weight for a sample of female students selected from all students in
a university.
The percentage of defective components in a sample of 100 electrical components.
Measurement          Parameter   Statistic
Mean (average)       μ           x̄
Variance             σ²          s²
Standard deviation   σ           s
Proportion           π           p
EXAMPLE 1.1
A travel agent claims that the average number of rooms in large hotels in
Pahang is 500 and the standard deviation is 165. A sample of seven hotels in
Genting Highlands is selected and the average number of rooms is found to be
435 with standard deviation of 15.
Based on the above example:
The population under study is all large hotels in Pahang.
The sample selected is seven large hotels in Genting Highlands.
The population under study is tangible since there are finite numbers of
large hotels in Pahang.
The characteristic (variable) is number of rooms.
The parameters are μ = 500 and σ = 165 since they describe the
population characteristics.
The statistics are x̄ = 435 and s = 15 since they describe the sample
characteristics.
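The distinction between a parameter and a statistic can be sketched in Python. This is only an illustration: the room counts below are randomly generated (hypothetical), not the real Pahang hotel data, and the variable names are assumptions made for this sketch.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: room counts for 40 large hotels (made-up values).
population = [random.randint(200, 800) for _ in range(40)]

# Parameters describe the whole population.
mu = statistics.mean(population)
sigma = statistics.pstdev(population)  # population standard deviation (divides by N)

# Statistics describe a sample drawn from that population.
sample = random.sample(population, 7)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)           # sample standard deviation (divides by n - 1)

print(f"parameters: mu = {mu:.1f}, sigma = {sigma:.1f}")
print(f"statistics: x_bar = {x_bar:.1f}, s = {s:.1f}")
```

The parameter values stay fixed for a given population, while the statistics change whenever a different sample of seven hotels is drawn.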
EXERCISE 1.1.3
The number of first year students at a residential college is 317 students. An IQ
pre-test is given to all of them in their first week. The dean of admission
collected data on 27 of them and found their mean score on the IQ pre-test was
51. The mean for the entire first year students was estimated to be
approximately 51. A subsequent computer analysis of all first year students
showed that the true mean (population mean) is 52.
Based on the above statement, answer the following questions.
a) What is the population?
b) Is the population tangible or conceptual?
c) What is the sample?
d) What is the variable of the study?
e) Which number describes a parameter?
f) Which number describes a statistic?
1.1.4 Descriptive and Inferential Statistics
Descriptive statistics
Includes the process of data collection, data organisation, data classification,
data summarisation, and data presentation obtained from the sample.
Used to describe the characteristics of the sample.
Used to determine whether the sample represents the target population by
comparing the sample statistic and the population parameter.
EXAMPLE:
Ten thousand parents in Malaysia have chosen Takaful Insurance as their
trusted life insurance agency.
Inferential statistics
Involves a process of generalisation, estimation, hypothesis testing, prediction
and determination of relationships between variables.
Used to describe, infer, estimate or approximate the characteristics of the target
population.
Used when we want to draw a conclusion about the population from the data
obtained from the sample.
EXAMPLE:
The death rate from lung cancer was 10 times higher for smokers compared to
nonsmokers.
[Figure: Overview of descriptive and inferential statistics]
EXERCISE 1.1.4
For each statement below, decide whether it involves descriptive statistics or
inferential statistics.
a) The average cost of a wedding is nearly RM10,000.
b) In Malaysia, the median salary for men with a bachelor’s degree
is RM 30,000 per year, while the median salary for women with a
bachelor’s degree is RM 29,000 per year.
c) Globally, an estimated 500,000 children under the age of 15
live with Type 1 diabetes.
d) A researcher claims that a new drug will reduce the number of
heart attacks in men over 70 years of age.
1.1.5 Role of the Computer in Statistics
Two software tools commonly used for data analysis:
1. Spreadsheets
Microsoft Excel & Lotus 1-2-3
2. Statistical Packages
AMOS, eViews, MINITAB, R, SAS, SmartPLS,
SPSS and SPlus
Data Analysis Application Tools in EXCEL
1. Graphs and charts
2. Formulas
3. Data Analysis Tools. To enable them:
File → Options → Add-Ins
→ choose Analysis ToolPak and click Go
→ tick Analysis ToolPak and click OK
→ Data → Data Analysis
→ Now we can use the Data Analysis
tools in Microsoft Excel to analyse data.
1.2 STATISTICAL PROBLEM-SOLVING METHODOLOGY
Outline the six basic steps in the statistical
problem-solving methodology.
Identify various sampling methods.
Classify types of data and levels of measurement.
[Figure: Statistical Problem-Solving Methodology]
1.2.1 Identify the Problem or Opportunity
The researchers must clearly understand and define the objective of the study
before conducting any research. Possible questions that could be asked before
starting any study are given as follows.
What are the problem and the objective of the study?
What are the possible variables that are related to the study?
Can the study goal be achieved through simple counts or measurements of
the group?
What possible treatments should be imposed on the group and what are
their responses?
Should the experiment be performed on the group?
Do the data come from a population or a sample?
If samples are needed, how large should the sample be? How
should they be taken?
Characteristics of Sample
A sample is a subset of the population.
The population is a complete group of people, companies, hospitals,
stores, universities, students, etc., that share some set of
characteristics.
A census involves the whole population, which brings a greater
likelihood of non-sampling errors.
Sampling error arises when the statistical characteristics of a
population are estimated from a subset, or sample, of that population.
The difference between the sample and population values is considered
a sampling error.
Non-sampling errors are errors that are not due to sampling. For example,
in a survey, mistakes may occur in the selection of people.
Characteristics of Sample Size
The larger the sample size, the smaller the magnitude of sampling errors
would be.
Studies using the survey method need a larger sample size since participation
in a survey is voluntary.
Studies using mail response need a much larger sample size. Normally,
the response rate is as low as 20%-30%.
The ideal sample size in a study should be large enough to serve as an
adequate representation of the population, so that the results can be
generalised to the overall population.
The optimal sample size depends on the statistical distribution used and on
the purpose of generalisation to the whole population.
Researchers may refer to Krejcie and Morgan (1970) as a guideline to
obtain an adequate sample size.
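The link between sample size and sampling error can be checked with a small simulation. This is a minimal sketch under stated assumptions: the population is simulated (10,000 hypothetical scores), and `mean_abs_sampling_error` is a helper name invented for this illustration.

```python
import random
import statistics

random.seed(2017)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 scores (simulated, not real data).
population = [random.gauss(50, 10) for _ in range(10_000)]
mu = statistics.mean(population)  # population mean (the parameter)

def mean_abs_sampling_error(n, repeats=200):
    """Average |sample mean - population mean| over repeated samples of size n."""
    errors = []
    for _ in range(repeats):
        sample = random.sample(population, n)
        errors.append(abs(statistics.mean(sample) - mu))
    return statistics.mean(errors)

small = mean_abs_sampling_error(10)
large = mean_abs_sampling_error(1000)
print(f"average sampling error, n = 10:   {small:.3f}")
print(f"average sampling error, n = 1000: {large:.3f}")
```

The sampling error here is the difference between a sample mean and the population mean; averaged over many repeated samples, it is smaller for the larger sample size.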
1.2.2 Deciding on the
Method of Data Collection
Data must be as complete as possible, accurate and relevant to the
problem in order to solve the problem.
Data could be obtained in 3 ways:
1) Data that are made available by others (internal, external, primary or
secondary data)
It is similar to historical or observed data.
The availability of the data depends on primary and secondary
sources of documents and evidence, including interviews, observation,
minutes of meetings, formal policy statements, etc.
Example: Rainfall data collected from the Malaysian Meteorological
Department is secondary data.
2) Data resulting from an experiment (experimental study):
In an experimental study, the researcher manipulates one of the
variables and studies how the manipulation influences other variables,
provided that the treatments and the subjects are assigned to groups
randomly.
Example: Blood glucose level data obtained from diabetic patients
before and after a treatment is an example of experimental data.
3) Data collected in an observational study (observation, survey,
questionnaire):
Observations vs interviews
Observation method
In qualitative research: used to study behaviours or events, the
context that surrounds them, and the relationship between the behaviour
and the event.
In quantitative research: used to collect data regarding the number of
occurrences in a specific period of time, or the duration of a very specific
behaviour or event.
The detailed descriptions or data collected in qualitative research can be
converted later to numerical data and analysed quantitatively.
The observation method can be used to study the physical environment, social
interactions, physical activities, non-verbal communication, and planned and
unplanned activities.
Example: A study on customers’ behaviour towards types of brands in a
certain shopping complex is an example of an observational study.
Interview method
The purpose of an interview in collecting data is to find out what is in or on
someone else’s mind.
Interview data can easily become biased and misleading if the interviewed
person is aware of the perspective of the interviewer.
It is very important to make sure the person being interviewed does not
hold any preconceived notions regarding the outcome of the study.
Interviews range from quite informal and completely open-ended to very
formal with the questions predetermined and asked in a standard manner.
Usually, interviews are used to gather information regarding an individual’s
experience and knowledge; his/her opinions, beliefs, and feelings, and
demographic data.
Example: An interviewer is interested in gathering information on the way
nurses organise their care in hospital wards and conducts an interview
session.
Other Methods of Data Collection
• Questionnaires and surveys (Quantitative + Qualitative).
• Opinions (Qualitative + Quantitative).
• Projective technique and psychological tests (both).
• Proxemics – Study of people’s use of space and their relationship to
culture.
• Kinesics – Study of body movement, or how people communicate
nonverbally.
• Street Ethnography – Concentrates on a person becoming a part of
the place under study.
• Narratives – Study of people’s individual life stories.
• Triangulation – The use of multiple data collection techniques
(triangulation of data permits the verification and validation of
qualitative data).
EXERCISE 1.2.2
Identify each of the following studies as being either observational or
experimental.
a) Subjects were randomly assigned to two groups, and one group was
given a herb and the other group a placebo. After 6 months, the
numbers of respiratory tract infections in each group were compared.
b) A researcher stood at a busy intersection to see if the colour of an
automobile a person drives is related to running red lights or not.
c) A researcher finds that people who are more hostile have higher
total cholesterol levels than those who are less hostile.
d) Subjects are randomly assigned to four groups. Each group is
placed on one of four special diets—a low-fat diet, a high-fish diet, a
combination of low-fat diet and high-fish diet, and a regular diet.
After 6 months, the blood pressures of the groups are compared to
see if diet has any effect on blood pressure or not.
1.2.3 Collecting the Data
(Sampling Techniques)
Sampling is the process of selecting a few members (a sample) from a population to
become the basis for estimating or predicting the prevalence of an
unknown piece of information, situation or outcome regarding the
bigger group.
i. Non-probability sampling (judgment, voluntary, convenience):
• Sample collected based on the judgment of the experimenter.
• Resulting samples might be biased.
ii. Probability sampling (random, systematic, stratified, cluster):
• The chance of each member being selected is known before the sample is picked.
• Resulting samples are unbiased.
Each data set collected from a sampling process can be classified either as
non-probability data or probability data.
Sampling Techniques
Nonprobability sampling: Judgment, Voluntary, Convenience
Others: Snowball, Quota
Probability sampling: Random, Systematic, Cluster, Stratified
Others: Multi-stage, K-Sampling, Nested
A) Nonprobability Sampling Methods
Judgment sampling
Method: Data is selected based on the opinion of one or more experts.
Example: A political campaign manager intuitively picks certain voting
districts as reliable places to measure the public opinion of his candidates.
Voluntary sampling
Method: Questions are posed to the public by publishing them over radio or
television, or via phone, short message, email, etc. The resulting sample tends
to over-represent individuals who have strong opinions.
Example: A call-in radio show asks its listeners to participate in surveys on
controversial topics such as abortion, affirmative action, gun control,
politics, etc.
Convenience sampling
Method: The data selected is an “easy sample” (also called haphazard or
accidental sampling). The researcher obtains units or people who are most
conveniently available.
Example: A surveyor stands in one location and asks passers-by the questions.
B) Probability Sampling Methods
1. Random sampling
• Each data item is numbered, and then the
sample is selected using a chance or
random method such as random
numbers.
• When a sample is chosen at random,
it is said to be an unbiased sample.
• A random sample can be selected with
or without replacement.
Example:
Suppose a lecturer wants to study the physical fitness levels of students at his/her
university. There are 5000 students enrolled at the university, and he/she wants to draw a
sample of size 100 to take a physical fitness test.
She could obtain a list of all 5000 students, number it from 1 to 5000 and then
randomly invite the 100 students corresponding to the selected numbers to participate in the study.
Generating Random Numbers
• Generating random numbers is an important step in obtaining a
random sample.
• With random numbers, each number has an equal chance of being selected.
• Random numbers can be generated using a calculator, software, or a
random number table.
• For example, suppose we have data numbered from 1 to 100 and
we want to choose a sample of only five. In R, we
can use the command “sample(1:100, 5)”. The output is
five of the numbers, listed in random order.
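For readers without R, the same idea can be sketched in Python with the standard `random` module. This is an illustrative equivalent rather than part of the original notes; the variable names are assumptions.

```python
import random

# Five distinct numbers from 1 to 100 (sampling without replacement),
# comparable to the R command sample(1:100, 5).
chosen = random.sample(range(1, 101), 5)

# Sampling WITH replacement: the same number may appear more than once.
with_replacement = random.choices(range(1, 101), k=5)

# The lecturer's example: pick 100 of the 5000 numbered students.
student_ids = random.sample(range(1, 5001), 100)

print(chosen)
print(with_replacement)
print(sorted(student_ids)[:10])  # first ten selected ID numbers
```

Because the selection is random, the printed numbers differ on every run unless a seed is set with `random.seed(...)`.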
2. Systematic sampling
• A set of data is numbered from 1 to N.
• The first item is selected randomly from
numbers 1 to k, where k = N/n and n is the
sample size.
• The subsequent items are selected at every k-th
interval to produce a sample of size n.
Example:
Suppose a lecturer wants to study the physical fitness levels of students at his/her university
and he/she wants to draw a sample of size 100 to take a physical fitness test. She obtains a list
of all 5000 students, numbers it from 1 to 5000 and randomly picks one of the first 50 students
(k = 5000/100 = 50) on the list. If the first picked number is 30, then the 30th student in the list
should be invited first. Then she should invite every 50th name on the list after this
random start (the 80th student, the 130th student and so on) to produce a sample of 100
students to participate in the study.
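The systematic procedure above can be sketched in a few lines of Python. The function name `systematic_sample` is hypothetical, and the sketch assumes N is an exact multiple of n, as in the example (5000/100 = 50).

```python
import random

def systematic_sample(N, n):
    """Return n positions from a list numbered 1..N using systematic sampling."""
    k = N // n                       # sampling interval, k = N/n
    start = random.randint(1, k)     # random start between 1 and k (inclusive)
    return [start + i * k for i in range(n)]

ids = systematic_sample(5000, 100)
print(ids[:5])   # e.g. a start of 30 gives 30, 80, 130, 180, 230
```

Only the starting point is random; every later position is fixed by the interval k.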
3. Stratified sampling
• The population is divided into groups
according to some characteristic that is
important to the study, and then the sample
is selected from each group using random or
systematic sampling.
• Members are homogeneous
(similar) within each group but
heterogeneous (dissimilar) among the groups.
Example:
Assume that, because of different lifestyles, the level of physical fitness is different
between male and female students. To account for this variation in lifestyle, the population
of students can easily be stratified into male and female students.
The random method or systematic method can be used to select the participants. For
example, she uses random sampling to choose 50 male students and the systematic method
to choose another 50 female students, or vice versa.
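A minimal Python sketch of stratified sampling follows. The student list is made up for illustration (an even male/female split is an assumption), and simple random sampling is used inside both strata for brevity.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical student list: (student_id, gender). The 50/50 split is assumed.
students = [(i, "male" if i % 2 else "female") for i in range(1, 5001)]

# Step 1: divide the population into strata by the chosen characteristic.
strata = {}
for student in students:
    strata.setdefault(student[1], []).append(student)

# Step 2: draw a random sample of 50 from EACH stratum.
sample = []
for gender, members in strata.items():
    sample.extend(random.sample(members, 50))

print(len(sample))  # 100 participants, 50 from each stratum
```

Every stratum is represented in the final sample, which is the point of stratifying before selecting.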
4. Cluster sampling
• The population is divided into groups or
clusters, then some of those clusters are
randomly selected and all members from
those selected clusters are chosen.
• Cluster sampling can reduce cost and time.
• Members within each cluster are heterogeneous,
but the clusters are homogeneous (similar to
one another).
• We can choose more than one cluster.
Example:
Assume that, because of different lifestyles, the level of physical fitness is different
between 1st year, 2nd year, 3rd year and senior students. To account for this variation in
lifestyle, the population of students can easily be clustered into four categories.
Then, she can randomly choose one or more clusters and select all students in those clusters as the
participants. For example, all 2nd year students are chosen as the participants.
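Cluster sampling can be sketched in Python in the same way. The student list and the equal cluster sizes are assumptions made for this illustration only.

```python
import random

random.seed(11)  # fixed seed so the sketch is repeatable

# Hypothetical student list: (student_id, year of study), 1250 students per year.
years = ["1st year", "2nd year", "3rd year", "senior"]
students = [(i, years[i % 4]) for i in range(1, 5001)]

# Step 1: divide the population into clusters.
clusters = {}
for student in students:
    clusters.setdefault(student[1], []).append(student)

# Step 2: randomly select one cluster (more than one could be chosen).
chosen_clusters = random.sample(sorted(clusters), 1)

# Step 3: take ALL members of the selected cluster(s).
sample = [s for name in chosen_clusters for s in clusters[name]]

print(chosen_clusters, len(sample))
```

Unlike stratified sampling, randomness enters only when choosing the clusters; inside a chosen cluster every member is included.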
Advantages and Disadvantages of Each
Sampling Technique
Judgment Sampling
When to use: When the population is too large.
Advantages:
- Fast and conclusive.
Disadvantages:
- Biased, since it is based on the opinion of one or more experts only.
Voluntary Sampling
When to use: When the members of the population are convenient to be sampled.
Advantages:
- Fast response.
- Easy to obtain larger sample sizes.
Disadvantages:
- Selection is haphazard.
- Sometimes not reliable.
- Degree of generalisability is questionable.
Convenience Sampling
When to use: When the members of the population are convenient to be sampled.
Advantages:
- Fast and easy.
- Convenient and inexpensive.
Disadvantages:
- Selection is haphazard.
- Sometimes not reliable.
- Degree of generalisability is questionable.
Random Sampling
When to use: When the members of the population are similar to one another
on important variables.
Advantages:
- Uses a table of random numbers.
- Each data item has an equal chance of being selected.
- Ensures a high degree of representativeness.
Disadvantages:
- High cost.
- Time consuming for large sample sizes.
- Tedious.
Systematic Sampling
When to use: When the members of the population are similar to one another
on important variables.
Advantages:
- Relatively easy to construct, execute, compare and understand.
- The process can be controlled.
- Good for tight-budget research.
- Ensures a high degree of representativeness.
- No need to use a table of random numbers.
Disadvantages:
- There is a risk of data manipulation.
- Not the best method if the researcher does not know the background of the
population.
- Less random than simple random sampling.
Stratified Sampling
When to use: When the population is heterogeneous and contains several
different groups, some of which are related to the topic of the study.
Advantages:
- Variety of samples.
- Ensures a high degree of representativeness of all the strata or layers in the
population.
Disadvantages:
- Time consuming.
- Tedious.
Cluster Sampling
When to use: When the population consists of units rather than individuals.
Advantages:
- Less energy and money.
- Easy and convenient.
- Saves time.
Disadvantages:
- Members of units may differ from one another, decreasing the
technique’s effectiveness.
Random Data Generation
From Normal Distribution
X ~ N(μ, σ²) or Z ~ N(0, 1)
μ is the mean
σ² is the variance
Random Data Generation
From Poisson Distribution
X ~ Po(λ), where λ is the average value
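As a rough illustration outside Excel, random data from both distributions can be generated with the Python standard library. The parameter values (μ = 50, σ = 5, λ = 3) are arbitrary choices for this sketch, and `poisson` is a helper written here because the standard library has no built-in Poisson generator; it uses Knuth's multiplication method, which is suitable for small λ.

```python
import math
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# X ~ N(mu, sigma^2): random.gauss takes the standard deviation, not the variance.
mu, sigma = 50.0, 5.0
normal_data = [random.gauss(mu, sigma) for _ in range(1000)]

def poisson(lam):
    """One draw from Po(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, random.random()
    while product > threshold:
        k += 1
        product *= random.random()
    return k

# X ~ Po(lambda) with lambda = 3 (average number of occurrences).
poisson_data = [poisson(3.0) for _ in range(1000)]

print(sum(normal_data) / len(normal_data))    # close to mu
print(sum(poisson_data) / len(poisson_data))  # close to lambda
```

The sample means will not equal μ and λ exactly; they are statistics computed from randomly generated samples.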
EXERCISE 1.2.3
In each of these statements, identify the type of sampling method used.
a) Suppose a researcher has a list of 1000 registered voters in a
community and he wants to pick a probability sample of 50 voters.
He uses a random number table to pick one of the first 20 voters
(1000/50 = 20) on the list. The table gives him the number 16, so he
selects the 16th voter on the list as the first selection. Then he
picks every 20th name after this random start (the 36th
voter, the 56th voter, etc.) until 50 voters are obtained.
b) In a consumer survey of large cities, a researcher divides a map of the
city into small blocks. Each block containing a cluster is surveyed. A
number of clusters are selected for the sample, and all the households
in a cluster are surveyed. Less energy and money are expended if an
interviewer stays within a specific area rather than traveling across
stretches of the cities.
c) Researchers or farm managers may be called in when a crop shows a certain
growing pattern or when surface differences are observed for a soil. For
example, differences may occur in soil colour which may be the result of many
factors. A researcher is called to judge a particular shade of colour to be
typical for a sample at certain sites. Then from these sites, samples are
drawn.
d) The population of university professors is divided into groups according to
their rank (instructor, assistant professor, etc.) and several are selected from
each group to make up a sample.
e) A surveyor stands outside a shop in the East Coast Mall and randomly selects
people to participate in a quiz.
f) A quality engineer wants to inspect rolls of wallpaper in order to obtain
information on the rate at which flaws in the printing are occurring. She
decides to draw a sample of 50 rolls of wallpaper from a day’s production. At
the end of each hour, for 5 consecutive hours, she takes the 10 most
recently produced rolls and counts the number of flaws on each.
MIND EXPANDING EXERCISES
1. Statistics can be applied across many disciplines and fields of
research, and in almost all fields of human endeavour. Based on this
statement, suggest reasons why statistics is important.
2. Is a large sample necessarily a good sample? Why or why not?
3. Suppose you have been hired by a radio station in Malaysia to
determine the age distribution of their listeners. Describe in detail
how you would select a sample of at least 3000 listeners. Choose the
best sampling techniques and state the reason. The sampling
techniques can be mixed or combined.
1.2.4 Classifying and Summarising
the Data
In this step, the collected data are organised properly for further study and
investigation.
Data that have been collected during the sampling process are called raw data.
The simplest way to organise raw data systematically is by using a data array.
A data array is an arrangement of data items in either ascending or
descending order (sorting).
1.2.4.1 Classifying
Identifying items with the same characteristics and arranging them into
groups or classes.
Data can be classified by type or by level of measurement.
1.2.4.2 Summarisation
Graphical and descriptive statistics (tables, charts, measures of central
tendency, measures of variation, measures of position).
Example of Raw Data
Data can be organised
by column or row
1.2.4.1 Data Classification
Data are the values that variables can assume.
A variable is a characteristic or attribute that can assume different values.
Variables whose values are determined by chance are called random
variables.
Data can be classified:
- as quantitative or qualitative (type of data)
- by how they are categorised, counted or measured (level of measurement of data)
Type of Data
Qualitative (Categorical/Attribute)
Data that refer to a classification name according to some characteristic or
attribute.
Data are classified using code numbers (1, 2, …).
Nominal Data
The values cannot be ranked.
Gender, race, citizenship, colour, etc.
Ordinal Data
The values can be ranked; a Likert scale is often used.
Feeling (dislike-like), colour (dark-bright), etc.
Quantitative (Numerical)
Data can be counted or measured.
Data can be ordered or ranked.
Discrete Data
The values can be counted and are finite.
Number of students, number of cats, number of defects, etc.
Continuous Data
The values can fall anywhere between two specified values, are obtained by
measuring, have boundaries, and are rounded to the required number of decimal
places.
Weight, age, salary, temperature, etc.
Levels of Measurement of Data
Nominal-level
Description: Classifies data into mutually exclusive (non-overlapping),
exhaustive categories in which no order or ranking can be imposed on the data.
Examples: Zip code (4, 5, 6, …), post code (25000, 25600, …), gender (female,
male), eye colour (blue, brown, green, hazel), political affiliation, religion,
nationality.
Ordinal-level
Description: Classifies data into categories that can be ranked; however,
precise differences between the ranks do not exist.
Examples: Grade (A, B, C, D, etc.), judging (first place, second place, etc.),
rating scale (poor, good, excellent), colour (light blue, …, dark blue).
Interval-level
Description: Ranks the data, and precise differences between units of measure
do exist; however, there is no meaningful zero.
Examples: IQ test, temperature, shoe size.
Ratio-level
Description: Possesses all the characteristics of interval measurement, and
there exists a true zero.
Examples: Height, weight, time, salary.
EXERCISE 1.2.4.1
1. The SuperMotor Marketing Corporation has asked you for information
about the car you drive. For each question, identify the type of data
requested as either attribute data or numeric data. When attribute data is
requested, identify the variable as either nominal or ordinal. When
numeric data is requested, identify the variable as either discrete or
continuous. Then, identify the level of measurement for each variable.
a) What is the weight of your car?
b) In what city was your car made?
c) How many people can be seated in your car?
d) What is the distance travelled from your home to your school?
e) What is the colour of your car?
f) How many cars are in your household?
g) What is the length of your car?
h) What is the normal operating temperature (in °C) of your car’s engine?
i) What petrol mileage (km/l) do you get in city driving?
j) Who made your car?
k) How many cylinders are there in your car’s engine?
l) How many kilometres have you put on your car’s current set of tyres?
2. The chart shows the number of job-related injuries for each of the
transportation industries for 1998.
a) What are the variables under study?
b) Categorise each variable either as qualitative or quantitative.
c) Categorise each quantitative variable either as discrete or
continuous.
d) Categorise each qualititative variable either as nominal or ordinal.
e) Identify the level of measurement for each variable.
Type of transportation industry    Number of job-related injuries
Railroad                           4520
Intercity bus                      5100
Subway                             6850
Trucking                           7144
Airline                            9950
EXERCISE 1.2.4.1
1.2.4.2 Data Summarisation
1) Descriptive statistics (refer Section 1.3)
Typically used to confirm conjectures about the data.
Quantitative data: measures of central tendency, measures of
variation (dispersion) and measures of position.
Qualitative data (non-numeric quality (attribute) or category):
measure the relative frequency for a particular characteristic
and calculate its percentage.
2) Graphical summary
Organise the data in some meaningful way by constructing a
frequency distribution (refer Appendix A.1) for quantitative or
qualitative data.
A frequency distribution is the organisation of raw data in
table form, using classes and frequencies.
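As a rough illustration of a frequency distribution for qualitative data, the sketch below counts each class and converts the count to a percentage. The blood-type sample and the function name `frequency_distribution` are hypothetical and serve only as an example.

```python
from collections import Counter

# Hypothetical categorical sample, used only for illustration.
blood_types = ["A", "B", "O", "O", "AB", "A", "O", "B", "O", "A"]


def frequency_distribution(values):
    """Map each class to (frequency, percentage of the whole sample)."""
    counts = Counter(values)
    total = len(values)
    # most_common() lists the classes from the highest frequency to the lowest.
    return {cls: (freq, 100 * freq / total) for cls, freq in counts.most_common()}


table = frequency_distribution(blood_types)
for cls, (freq, pct) in table.items():
    print(f"{cls:<5}{freq:>3}{pct:>8.1f}%")
```

Sorting the classes by decreasing frequency is also the ordering that a Pareto chart uses.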
Graphical Statistics
The purpose of graphs in statistics is to convey the data to the viewer in pictorial
form and to get the audience’s attention in a publication or a presentation.
Histogram Frequency Polygon Ogive Bar Chart
Pareto Chart Pie Chart Time Series Graph
Histogram, Frequency
Polygon, Ogive
Histogram
For quantitative data.
Describe grouped
frequency data
distribution.
Displays the data by using
contiguous vertical bars of
various heights to represent
the frequency of the classes.
Frequency Polygon
For quantitative data.
Describe grouped frequency
data distribution.
Displays the data by using
lines that connect points
plotted for the frequencies at
the midpoints of the classes.
The frequencies are represented
by the heights of the points.
Ogive
For quantitative data.
Represents the cumulative
frequencies for the classes in a
grouped frequency data
distribution.
Visually represents how many
values are below a certain upper
class boundary.
Distribution Shapes for Histogram
Bell-Shaped Uniform J-Shaped Reverse J-Shaped
Right Skewed Left Skewed Bimodal U-Shaped
Bar Chart, Pareto Chart,
Pie Chart
Bar Chart
For quantitative data, each bar
represents a mean value.
For qualitative data, the height or
length of each bar represents the
frequency of a category.
The bars can be vertical or
horizontal.
Pareto Chart
Used to represent a frequency
distribution for a categorical
variable.
The frequencies are displayed
by the heights of vertical bars
which are arranged in
decreasing order.
Pie Chart
A circle that is divided into
sections or wedges according
to percentage of frequencies in
each category of the
distributions.
Pie charts show the relationship
between classes in a set of data
with the whole data.
Stem and Leaf Plot, Time Series Graph
Time Series Graph
Represents data that occur over
a specific period of time.
For analysis, we look at the
trend or pattern (increasing or
decreasing) that occurs over the
time period.
Further analysis will look at the
slope or the steepness of the line
(rapid increase or decrease).
Stem and leaf plot
The leading digit is plotted as the stem and the trailing digit as the leaf to
form groups or classes.
A key indicator is used to define the stem and leaf values.
If the plot is rotated to a horizontal position, we can see the shape of the
data distribution.
For a mixture stem and leaf plot, the shape of distribution for the left side
may be seen by reflecting the plot to the right side.
We may analyse the variability of the data by looking at the spread of the
stem and leaf plot.
A stem and leaf plot is also good in showing the range, minimum,
maximum, mode, gaps, clusters, and outliers.
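A minimal sketch of how a stem and leaf plot can be built, assuming whole-number data where the tens digit (or digits) is taken as the stem and the units digit as the leaf. The function name `stem_and_leaf` is hypothetical; the data are the FIST lecturer ages used in Example 1.7.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the rows of a stem and leaf plot (stem = tens, leaf = units)."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    rows = []
    # Walk every stem between the smallest and largest so that gaps stay visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = " ".join(str(leaf) for leaf in groups.get(stem, []))
        rows.append(f"{stem} | {leaves}".rstrip())
    return rows


ages = [24, 25, 26, 27, 30, 31, 31, 32, 36, 40, 43, 44, 45]
rows = stem_and_leaf(ages)
print("Key: 2 | 4 means 24")
for row in rows:
    print(row)
```

Reading the rows sideways gives the shape of the distribution, and the first and last leaves give the minimum (24) and maximum (45).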
Selection of appropriate statistical
techniques for data summarisation
Type of data: Quantitative (ratio scale)
  Descriptive statistics: Mean, Median, Mode, Range, Standard Deviation, Interquartile range (IQR = Q3 − Q1)
  Graphical summary: Histogram, Bar Chart (bars representing means), Stem and leaf plot, Boxplot
Type of data: Symmetrical distribution
  Descriptive statistics: Mean, Median, Mode, Range, Standard Deviation
  Graphical summary: Histogram, Bar Chart (bars representing means)
Type of data: Skewed distribution
  Descriptive statistics: Median, Range, Interquartile range (IQR = Q3 − Q1)
  Graphical summary: Histogram, Stem and leaf plot, Boxplot
Type of data: Categorical (Nominal)
  Descriptive statistics: Mode, Counts, Percentage
  Graphical summary: Pie Chart, Bar Chart
Type of data: Categorical (Ordinal, Likert Scale)
  Descriptive statistics: Mode, Mean, Counts, Percentage
  Graphical summary: Pie Chart, Bar Chart
1.2.5 Presenting and
Analysing the Data
Analyse the information given by the
Descriptive statistics (refer Section 1.3)
Graphical summary (graphs and charts)
Identify whether any relationship exists among the variables under
study.
Make any relevant statistical inferences:
confidence interval, hypothesis testing, ANOVA, goodness of fit
test, contingency table, regression, correlation, etc.
BASIC INFERENTIAL STATISTICS
Statistical Analysis Characteristics
Confidence Intervals
(CHAPTER 2)
An estimated range of values that is likely to include an unknown population
parameter, 𝜃, with a specified probability (confidence level).
The interval is usually written as (𝒂, 𝒃) or 𝒂 < 𝜽 < 𝒃.
Hypothesis Testing
(CHAPTER 3)
A statement (claim or conjecture or assertion) concerning a parameter or
parameters of one or more populations.
• Statistical Analysis for one population (mean, variance, proportion)
• Statistical Analysis for two populations (mean, variance, proportion)
Analysis of Variance
(ANOVA)
(CHAPTER 4)
Statistical Analysis for three or more population means
• One-way ANOVA
• Two-way ANOVA and Post Hoc Test
Linear Regression
Analysis
(CHAPTER 5)
A statistical method that attempts to determine the strength of the relationship
between a dependent variable (y) and one or more independent variables (x).
• Simple linear regression analysis and correlation. (y vs x)
• Multiple linear regression analysis and correlation. (y vs xi)
• Model selection technique to choose a parsimonious model that best fits the data.
Statistical Analysis for
Categorical Data
(CHAPTER 6)
1. Tests concerning frequency distributions for categorical data
(Goodness of Fit)
2. Tests concerning specific probability distributions (Goodness of Fit)
3. Test the Independence of two variables (Contingency Table)
4. Test the homogeneity of proportions (Contingency Table)
ADVANCED INFERENTIAL STATISTICS
Statistical Analysis Characteristics
Experimental
Design (DOE)
Planning, conducting, analysing and interpreting controlled tests to evaluate the factors
that control the value of a parameter or group of parameters.
Example: ANOVA, Single factor experiment, Randomized Blocks, Latin Squares and
Related Design, Factorial Design, Response Surface Methodology, Nested and Split-Plot
Design
Time Series
Analysis
Modelling time series data, making inferences from them, and forecasting future
observations. Time series models are built to represent serially correlated series,
trends, or seasonal effects.
Example: Linear Time Series, Linear Stationary Models (AR, MA, ARMA), Linear
Nonstationary Models (ARIMA, SARMA), Box-Jenkins Models, Volatility Models (ARCH,
GARCH), Hybrid models
Multivariate
Analysis
A central tool whenever many variables need to be considered at the same time.
Example: Mean Vector and Covariance Matrix Estimation, MANOVA, Principal
Component Analysis, Factor Analysis, Canonical Correlation Analysis, Discriminant
Analysis, Cluster Analysis
Statistical Quality
Control (SQC)
Quality improvement through the use of modern statistical methods for quality control
Example: Variables control charts, Attribute Control Charts, Time-Weighted Control
Charts, Multivariate Control Charts
ADVANCED INFERENTIAL STATISTICS
Statistical Analysis Characteristics
Statistical
Modelling
A set of mathematical equations that relates one or more random variables and possibly
other non-random variables, concerning the generation of some sample data and
similar data from a larger population.
• Examples of statistical models: Generalised Linear Model, Dependence model,
Regression, Bayesian, Markov chain, Random effect and mixed model
• The process involves: parameter estimation, data generation, missing values,
outlier detection, simulation study, bootstrap, goodness of fit test
Data Mining A computing process of discovering patterns in large data sets involving methods at
the intersection of machine learning, statistics, and database systems.
Example: Decision Tables, Decision Trees, Classification Rules, Association Rules,
Clustering, Advanced linear model, Bayesian, Instance-based Learning
Circular Statistics A branch of statistics that involves circular data, which deal with direction or cyclic
time. Circular data are measured in degrees (0°, 360°] or radians (0, 2π].
Example: orientation of an animal, direction of wind and wave, days of the week,
compass direction, waves of sound, the human perception under various conditions,
the orientation of ridges of fingerprints, the orientation of sand grains from a beach,
the death due to a disease at various times in a year, and astronomical observations.
ADVANCED INFERENTIAL STATISTICS
Statistical Analysis Characteristics
Advanced Regression
Analysis
• Polynomial Regression: y is modelled as an nth degree polynomial in x
• Multivariate Regression: Y is a matrix with series of multivariate dependent
measurements and X is a matrix of observations on independent variables.
• Generalized Linear Model: A flexible generalization of ordinary linear
regression that allows for response variables that have error distribution
models other than a normal distribution.
• Logistic Regression: A regression model where the dependent variable is
categorical.
• Nonlinear Regression: The observational data are modeled by a function
which is a nonlinear combination of the model parameters and depends on
one or more independent variables
• Error in Variables: A regression model that accounts for measurement errors
in the independent variables.
1.2.6 Make the Decision
and Conclusion
The researchers can make decisions in order to achieve the
objective and goal of the research, and choose the option
that represents the ‘best’ solution to the problem.
The correctness of this choice depends on the analytical skill of
the researchers and the quality of the information.
1.3 REVIEWS ON
DESCRIPTIVE
STATISTICS
Summarise the data using measures of central
tendency, such as the mean, median, mode, and
midrange.
Describe the data using measures of variation, such
as the range, variance, standard deviation and
coefficient of variation.
Identify the position of a data value in a data set
using measures of position such as quartiles, deciles,
and percentiles.
Reviews on
Descriptive Statistics
Descriptive statistics is typically used to confirm conjectures
about the data.
We can summarise data using measures of central tendency,
measures of variation, and measures of position.
Some classify these types of measures as traditional
statistics.
If the measurement describes a population
characteristic, it is called a parameter.
If the measurement describes a sample characteristic,
it is called a statistic.
RULE OF THUMB FOR DECIMAL
PLACES
1. In general, the calculated parameter or statistic value should
be rounded to four (4) decimal places.
2. If the unit is given (in cm, minute, day, etc.), the value should
be rounded to that unit’s decimal places.
TIPS: Descriptive Statistics using
Scientific Calculator
Note:
The notations used in the calculator are $n$ as the sample size, $\bar{x}$ as the sample mean,
$x\sigma_n$ or $\sigma_x$ as the population standard deviation, and
$x\sigma_{n-1}$ or $s_x$ as the sample standard deviation.
Casio fx-570MS
STEP 1: Insert data → MODE, SD, insert data, M+, AC
STEP 2: Data summary
Shift 1 →
Shift 2 →
STEP 3: Clear data → Shift CLR 1
Casio fx-570ES
STEP 1: Insert data → MODE 3: STAT, 1: 1-VAR, insert data, =, AC
STEP 2: Data summary:
Shift 1 → 3: Sum →
Shift 1 → 4: Var →
STEP 3: Clear data → Shift 9
1.3.1 Measures of Central Tendency
Measures of central tendency are also called measures of
average:
1. mean,
2. median,
3. mode, and
4. midrange.
The measures of central tendency are used to describe an
entire set of observations with a single value representing the
central or middle value of the data set.
They can also roughly describe the shape of the distribution
of a data set.
Midrange (MR)
Is a rough estimate of the middle.
$\mathrm{MR} = \frac{\text{lowest value} + \text{highest value}}{2}$
EXAMPLE 1.3:
If the data set is 1, 3, 5, 7, 7, 8, then the calculated midrange is
$\mathrm{MR} = \frac{1 + 8}{2} = 4.5$.
Properties of Midrange
A rough estimate of the average.
Can be affected by one extremely high or low value (outlier).
Mean
Is the sum of the values divided by the total number of values.
Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, where $N$ is the population size.
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where $n$ is the sample size.
If the data set is 1, 3, 5, 7, 7, 8, then
‒ the calculated mean is $\mu = 5.1667$ if the data is taken from the population.
The value is a true mean or a parameter.
‒ the calculated mean is $\bar{x} = 5.1667$ if the data is taken from the sample.
The value is a sample mean or a statistic.
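For readers who prefer software to a calculator, the midrange and mean of the data set 1, 3, 5, 7, 7, 8 can be checked with a short sketch such as the one below. It uses Python's standard `statistics` module; the variable names are illustrative only.

```python
from statistics import mean

data = [1, 3, 5, 7, 7, 8]

# Midrange: average of the lowest and highest values.
midrange = (min(data) + max(data)) / 2

# Mean: sum of the values divided by the number of values.
average = mean(data)

print(f"midrange = {midrange}")
print(f"mean     = {average:.4f}")
```

The arithmetic is the same whether the data form a population or a sample; only the symbol ($\mu$ or $\bar{x}$) and the interpretation change.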
Median
Is the middle number of n ordered data (smallest to largest).
If n is odd: $\text{Median (MD)} = x_{\frac{n+1}{2}}$
If n is even: $\text{Median (MD)} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}$
If the data set is 1, 3, 5, 6, 7, then the calculated median is $\text{Median} = x_3 = 5$.
If the data set is 1, 3, 5, 7, 7, 8, then the calculated median is $\text{Median} = \frac{x_3 + x_4}{2} = \frac{5 + 7}{2} = 6$.
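The odd and even cases of the median rule can be sketched in a few lines. The function name `median_value` is hypothetical; note that Python lists are indexed from 0, so the 1-based position $x_k$ in the formulas corresponds to `x[k - 1]` in the code.

```python
def median_value(values):
    """Median of a data set using the odd/even rule for ordered data."""
    x = sorted(values)
    n = len(x)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the single middle value x_{(n+1)/2} (0-based index n // 2).
        return x[mid]
    # n even: average of x_{n/2} and x_{n/2 + 1}.
    return (x[mid - 1] + x[mid]) / 2


print(median_value([1, 3, 5, 6, 7]))
print(median_value([1, 3, 5, 7, 7, 8]))
```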
Mode
Is the most commonly occurring value in a data series.
EXAMPLE 1.4:
a) If the data set is 1, 6, 3, 7, 8, 5 then the mode does not exist.
b) If the data set is 1, 6, 3, 7, 8, 3, 5 then the mode is 3.
c) If the data set is 1, 6, 3, 7, 3, 8, 7, 5, 3, 7 then the modes are 3 and 7.
Properties of Mode
The mode is used when the most typical case is desired.
The mode can be used when the data are nominal.
The mode is not always unique.
A data set can have more than one mode, or the mode may not
exist for a data set.
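The three cases in Example 1.4 (no mode, one mode, two modes) can be reproduced with the sketch below. It assumes the convention used in the example that a data set has no mode when every value occurs only once; the function name `modes` is hypothetical.

```python
from collections import Counter


def modes(values):
    """Return all modes in ascending order, or [] when no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs once: treated here as "the mode does not exist".
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 6, 3, 7, 8, 5]))
print(modes([1, 6, 3, 7, 8, 3, 5]))
print(modes([1, 6, 3, 7, 3, 8, 7, 5, 3, 7]))
```

Because the function only counts occurrences, it also works for nominal data such as colours or brand names.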
Identify the Shapes of Data
Distribution
Symmetric: Mean = Median = Mode
Positively skewed / right-skewed: Mode < Median < Mean
Negatively skewed / left-skewed: Mean < Median < Mode
→ In reality, the median can be greater than the mode or the mean.
→ The shape of the distribution may be identified by observing the
position of the mode value.
EXAMPLE 1.3
If the data set is 1, 3, 5, 7, 7, 8, then
‒ the calculated mean is $\mu = 5.1667$ if the data is taken from the
population. The value is a true mean or a parameter.
‒ the calculated mean is $\bar{x} = 5.1667$ if the data is taken from the sample.
The value is a sample mean or a statistic.
‒ the calculated median is $\text{Median} = \frac{x_3 + x_4}{2} = 6$.
‒ the mode is 7.
‒ the shape of distribution is negatively skewed since
Mean < Median < Mode.
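The comparison of mean, median and mode used to name the shape of a distribution can be written as a simple rule. This is only a sketch of the rule of thumb on this slide; the function name `classify_shape` and the label "undetermined" for orderings the rule does not cover are assumptions.

```python
def classify_shape(mean, median, mode):
    """Name the distribution shape from the ordering of the three averages."""
    if mean == median == mode:
        return "symmetric"
    if mode < median < mean:
        return "right-skewed"
    if mean < median < mode:
        return "left-skewed"
    # Orderings outside the three textbook patterns are left unclassified.
    return "undetermined"


# Example 1.3: mean = 31/6 (about 5.1667), median = 6, mode = 7.
print(classify_shape(31 / 6, 6, 7))
```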
Properties of Mean and Median
The mean is unique, and not necessarily one of the data values.
The mean is affected by extremely high or low values; when they occur, the
mean may not be the appropriate average to use.
The mean is used in computing other statistics, such as variance.
The mean cannot be computed for an open-ended frequency distribution.
The mean varies less than the median or mode when samples are taken from
the same population and all three measures are computed for these samples.
The mean is not an appropriate average to use if the shape of distribution is
skewed.
The median is used when one must find the center or middle value of a data
set.
The median is used to determine whether a data value falls into the upper half or
lower half of the distribution.
The median is affected less than the mean by extremely high or extremely low
values.
EXAMPLE 1.5
An extreme value, say 21, is added to the data set in Example 1.3. The new
data set is 1, 3, 5, 7, 7, 8, 21. Assume that the data is taken from a sample, then
‒ the calculated mean is $\bar{x} = 7.4286$. The mean is easily affected by
outliers and may not be the appropriate average to use. This new average
value no longer represents the centre of the data set.
‒ the calculated median is $\text{Median} = x_4 = 7$. This new average value
still represents the centre of the data set.
‒ the mode is 7.
‒ the calculated midrange is $\mathrm{MR} = \frac{1 + 21}{2} = 11$. The midrange is easily
affected by outliers.
‒ the shape of distribution is positively skewed since the mean (7.4286) is
greater than the median and the mode (both 7).
An extremely high or low data value that occurs in a data set is called an outlier.
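The effect of the added value 21 can be seen by computing the same averages before and after it is included. The sketch below uses Python's standard `statistics` module; the helper name `summary` is hypothetical.

```python
from statistics import mean, median

original = [1, 3, 5, 7, 7, 8]          # data set of Example 1.3
with_outlier = original + [21]         # data set of Example 1.5


def summary(data):
    """Mean (4 d.p.), median and midrange of a data set."""
    return {
        "mean": round(mean(data), 4),
        "median": median(data),
        "midrange": (min(data) + max(data)) / 2,
    }


before = summary(original)
after = summary(with_outlier)
for name in before:
    print(f"{name:<9}{before[name]:>8} -> {after[name]}")
```

The midrange moves the most (4.5 to 11), the mean moves noticeably (5.1667 to 7.4286), and the median changes only from 6 to 7.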
EXERCISE 1.3.1
1. Determine the shape of distribution of the following
data.
a) Mean = Mode = Median = 11
b) Mean = 25, Mode = 13, Median = 17
c) Mean = 5, Mode = 73, Median = 17
d) 11.4, 11.6,12.6,12.7, 12.8, 13.3, 13.3, 13.6, 13.7,
13.8
a) symmetric b) right-skewed c) left-skewed d) Mean = 12.88, Median = 13.05, mode = 13.3, left-skewed
EXERCISE 1.3.1
2. The following set of data represents the number of hospitals
for selected countries.
123 108 195 138 115 179 119 148 147 180
146 178 189 108 193 114 179 147 108 128
164 174 128 159 193 175
a) Find the mean, median, mode, and midrange.
b) Are the average values calculated in (a) parameters or
statistics? Why?
c) What is the distribution type that describes the data?
d) What is the best measure of average of this set of data?
Why?
a) Mean = 151.3462, Median = 147.5, mode = 108, midrange = 151.5 b) statistic c) right-skewed d) median
1.3.2 Measures of Variation/Dispersion
Measures of variation or measures of dispersion are measures
that determine the spread of data values.
1. Range: the simplest measure of variation
2. Variance
3. Standard deviation
4. Coefficient of variation
Variance and standard deviation are the more meaningful and popular
measures that describe the variability of data.
Measures of variation may help researchers to describe data
more accurately.
Variance and standard deviation are used quite often in
inferential statistics.
Range (R)
Is the difference between the highest value and the lowest value in a
data set.
R = highest value − lowest value
Properties of Range
The simplest measure of variation.
Easily affected by one extremely high or low value (outlier).
EXAMPLE 1.6:
Suppose the data set is 1, 6, 3, 7, 8, 5, then the calculated range is $R = 8 - 1 = 7$.
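Variance
Is the average of the squares of the distance each value is from the mean.
Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, where $N$ is the population size.
Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, where $n$ is the sample size.
Standard Deviation
Is the square root of the variance.
Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$, where $N$ is the population size.
Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$, where $n$ is the sample size.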
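The only difference between the population and sample formulas is the divisor ($N$ versus $n - 1$). A minimal sketch that follows the formulas term by term is given below; the function names are hypothetical, and the data set 1, 6, 3, 7, 8, 5 is the one used in Example 1.6.

```python
from math import sqrt


def population_variance(x):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(x) / len(x)
    return sum((xi - mu) ** 2 for xi in x) / len(x)


def sample_variance(x):
    """Sum of squared deviations from the mean, divided by n - 1."""
    xbar = sum(x) / len(x)
    return sum((xi - xbar) ** 2 for xi in x) / (len(x) - 1)


data = [1, 6, 3, 7, 8, 5]
print(f"population: variance = {population_variance(data):.4f}, "
      f"sd = {sqrt(population_variance(data)):.4f}")
print(f"sample:     variance = {sample_variance(data):.4f}, "
      f"sd = {sqrt(sample_variance(data)):.4f}")
```

The standard library functions `statistics.pvariance`, `statistics.variance`, `statistics.pstdev` and `statistics.stdev` compute the same four quantities.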
Properties of Variance & Standard Deviation
The variance is the average of the squares of the distance each value
is from the mean.
If the data values are near the mean, the variance will be smaller.
If the data values are far from the mean, the variance will be larger.
The squared distance is used since the sum of the deviations from the
mean is always zero.
Variance is never negative.
The unit of the variance is the square of the unit of the data.
Standard deviation is the square root of the variance.
Standard deviation is a measure of the deviations of values from the
mean.
Standard deviation is never negative.
The unit of the standard deviation is the same as the unit of the data.
Coefficient of Variation (CVar)
Is the standard deviation divided by the mean.
Population CVar: $\text{CVar} = \frac{\sigma}{\mu} \times 100\%$
Sample CVar: $\text{CVar} = \frac{s}{\bar{x}} \times 100\%$
Properties of CVar
The result is expressed as a percentage.
A parameter/statistic that allows the user to compare standard deviations
when the units are different (the variables are different).
EXAMPLE 1.6
Suppose the data set is 1, 6, 3, 7, 8, 5, then
‒ the calculated range is $R = 8 - 1 = 7$.
‒ the calculated variance is $\sigma^2 = 5.6667$ and the standard deviation is $\sigma = 2.3805$
if the data is taken from the population. These values are called parameters.
‒ the calculated variance is $s^2 = 6.8$ and the standard deviation is $s = 2.6077$ if the
data is taken from the sample. These values are called statistics.
‒ the calculated sample mean is $\bar{x} = 5$. Hence the sample coefficient of variation
is $\text{CVar} = \frac{2.6077}{5} \times 100\% = 52.15\%$.
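The sample coefficient of variation in this example can be checked with the short sketch below, which uses `stdev` and `mean` from Python's standard `statistics` module. The function name `sample_cvar` is hypothetical.

```python
from statistics import mean, stdev


def sample_cvar(x):
    """Sample coefficient of variation, s / x-bar, as a percentage."""
    return stdev(x) / mean(x) * 100


data = [1, 6, 3, 7, 8, 5]
print(f"range = {max(data) - min(data)}")
print(f"CVar  = {sample_cvar(data):.2f}%")
```

Because the standard deviation and the mean share the same unit, the ratio is unit-free, which is what makes it suitable for comparing variables measured in different units.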
Why We Need Measures of Variation
• Measures of variation can be used to judge how well the
measures of average illustrate or depict the data.
• They are called measures of variation because they measure the
variability that exists in a data set.
• They can be used when the measures of central tendency do not give
any significant meaning or are not needed/practical.
EXAMPLE:
Suppose we wish to compare the performance of two groups of students
in a test. Given that the mean values are the same for both data sets,
you might conclude that these two groups of students
performed equally well in the test. However, if the data sets are
examined graphically as shown in Figure 1.10, a different conclusion
might be drawn.
Examining Data Sets Graphically
Both groups have the same total number of students.
Students are given the same test and the mean score is
calculated as 66.67 marks for each group of students.
The mean values are the same but the spread or variation of the
test scores is quite different.
The test scores for students from Group B are more consistent and
less variable.
When the mean values are equal, the larger the data range is, the
more variable the data.
Comparing Two Data Sets
A smaller standard deviation ($\sigma_1 < \sigma_2$) indicates that:
POPULATION 1 is:
Less dispersed
Less spread
Less variable (small variation)
More consistent
More precise
More accurate
Better data
POPULATION 2 is:
More dispersed
More spread
More variable (large variation)
Less consistent
Less precise
Less accurate
Worse data
The same interpretation is applicable for range and variance.
EXAMPLE 1.7
The following data represents the age (in years) of lecturers in two faculties at UMP.
FIST: 24, 25, 26, 27, 30, 31, 31, 32, 36, 40, 43, 44, 45
FKEE: 22, 25, 25, 25, 28, 33, 34, 36, 37, 40, 41, 43, 48, 51, 53
For these sample data sets, find the standard deviations. Then, identify which data set
is more consistent and less dispersed. What can you say about the variation of age for
lecturers in both faculties?
Solution:
$s_{\text{FIST}} = 7.4670$ years
$s_{\text{FKEE}} = 9.9460$ years
$s_{\text{FIST}} < s_{\text{FKEE}}$, so the FIST data is more consistent and less dispersed.
The variation of ages for lecturers in FIST is smaller and less dispersed as
compared to FKEE lecturers.
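The two sample standard deviations in Example 1.7 can be reproduced with `statistics.stdev`, which uses the $n - 1$ divisor. The sketch below is only a check of the arithmetic; the dictionary layout is an illustrative choice.

```python
from statistics import stdev

ages = {
    "FIST": [24, 25, 26, 27, 30, 31, 31, 32, 36, 40, 43, 44, 45],
    "FKEE": [22, 25, 25, 25, 28, 33, 34, 36, 37, 40, 41, 43, 48, 51, 53],
}

# Sample standard deviation of each faculty.
sd = {faculty: stdev(values) for faculty, values in ages.items()}
more_consistent = min(sd, key=sd.get)

for faculty, s in sd.items():
    print(f"s_{faculty} = {s:.4f} years")
print(f"More consistent: {more_consistent}")
```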
EXERCISE 1.3.2 (Q1 & Q2)
1. Which of the following sets of sample data is less variable?
Method A: 79 73 78 76 80 75 82 70 77
Method B: 80 85 78 79 75 73 70 60 65
2. The following sets of sample data represent the battery
lifetime (in hours) of two different brands. Which brand of
battery performs better?
A: 4.2, 6.7, 7.3, 7.5, 8.0, 8.5, 8.7, 8.8, 9.2, 9.3
B: 9.6, 9.7, 9.8, 9.9, 10.1, 10.2, 11.0, 11.0, 11.0, 11.1
1) $s_A = 3.6742$, $s_B = 7.8493$ 2) $s_A = 1.5$ hours, $s_B = 0.6$ hours
Comparing Two Data Sets with
different units/variables
If the two samples do not have the same units of measurement or the
variables are different, the variance and standard deviation for each
sample cannot be compared directly.
As an example: suppose a car dealer wants to compare the variation
between the number of cars sold in a year and the commission (in
RM) made by the salesperson. It is very clear that these two
variables have two different units.
Hence, the best way to compare the variability within these two
variables is by using the coefficient of variation.
This means that if $\text{CVar}_1 < \text{CVar}_2$, then variable 1 is less
variable than variable 2.
EXERCISE 1.3.2 (Q3)
3. The average age of the accountants at a huge company is 31
years with a standard deviation of 4 years. The average
salary of the accountants is RM 44255 per year with a
standard deviation of RM 780. Compare the variations of
age and income.
$\text{CVar}_{\text{age}} = \frac{4}{31} \times 100\% = 12.90\%$, $\text{CVar}_{\text{income}} = \frac{780}{44255} \times 100\% = 1.76\%$, so age is more variable than income.
Other Properties of Standard Deviation
Used to determine the number of data values that fall within a
specified interval in a distribution.
The values under the curve indicate the percentage of area in each
section or range of data.
It can be seen that about 95% of data values fall within 𝜇 − 2𝜎
and 𝜇 + 2𝜎.
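The "about 95%" figure can be checked numerically, assuming the curve on this slide is the normal (bell-shaped) curve. The sketch below uses `NormalDist` from Python's standard `statistics` module; the function name `share_within` is hypothetical.

```python
from statistics import NormalDist


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.2%}")
```

The result does not depend on the particular values of 𝜇 and 𝜎, which is why the standard normal distribution is enough for the check. For data that are not approximately normal, these percentages need not hold.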
1.3.2.1 Accuracy and Precision
Concept (Validity and Reliability)
Accuracy is how close a measured
value is to the ‘true’ value.
No measurement/device is
perfect (it can easily be inaccurate
and lead to false measurements).
There is still a tolerance for error.
Accuracy must be accounted for in
your results.
The bigger the difference between
the measured and the true values,
the less accurate (less valid) the
measurement.
Precision is how close the measured
values are to each other, or how consistent
your results are for the same
phenomenon over several
measurements.
Precision, as a measure of variation,
must be accounted for in your
calculations and results.
The precision of a measurement also depends on the
size of the unit used to make the
measurement. The smaller the unit,
the more precise (more reliable) the
measurement.
→ The concept is important to ensure that data collected from an
experiment or observation is good, valid, and reliable.
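One way to put numbers on the two ideas is to measure accuracy by the difference between the mean of the measurements and the true value, and precision by the standard deviation of the measurements. The sketch below follows that assumption; the function name, the true value of 50.0 and both sets of readings are hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev


def accuracy_and_precision(measurements, true_value):
    """Return (bias, spread): bias near 0 = accurate, small spread = precise."""
    bias = mean(measurements) - true_value
    spread = stdev(measurements)
    return bias, spread


true_value = 50.0                                  # hypothetical true value
centred = [49.9, 50.1, 50.0, 50.2, 49.8]           # close to the true value
shifted = [45.0, 45.1, 44.9, 45.0, 45.0]           # tightly grouped, but off target

for name, readings in (("centred", centred), ("shifted", shifted)):
    bias, spread = accuracy_and_precision(readings, true_value)
    print(f"{name}: bias = {bias:+.2f}, spread = {spread:.3f}")
```

In this made-up example the "shifted" readings are precise (small spread) but not accurate (bias of about −5), which corresponds to darts that land close together but away from the mark.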
Game of Darts
• Very accurate
(close to the mark)
measurements, but
not very precise,
since the darts are
spread out
everywhere
• Valid but not
reliable
• Precision
without
accuracy
• Very
consistent, but
not near the
mark
• Not valid but
reliable
• Inaccuracy and
imprecision
• Not valid and not
reliable
• Accurate and
precise.
• Valid and reliable
• Very good
measurement
EXERCISE 1.3.2 (Q4)
4. Identify each situation as either accurate or precise or both.
a) If you are playing football and you always hit the left goal post
instead of scoring.
b) A candy manufacturer claims that each packet contains 20 candies.
A sample of packets has 18, 21, 19, 21, 19, 20, 22 candies,
respectively. The average is 20 candies with an error of 1 candy.
c) A manufacturer claims that each chocolate packet contains 20
chocolates. A sample of packets have 17, 18, 18, 17, 18, 17, 17
chocolates, respectively.
d) In an experiment, with five trials, the end results of the five trials for
whatever is being tested are: 35 kg, 36 kg, 36 kg, 35 kg, 36 kg. The
actual value (as found in a scientific data book) is meant to be 42 kg.
e) In an experiment, with five trials, the average value is 35 kg. The
actual value (as found in a scientific data book) is meant to be 35 kg.
MIND EXPANDING EXERCISES
4. In what sense are the mean, median, mode and midrange measures
of the “centre” of a data set?
5. Which do you think has more variation: the IQ scores of 30 students
in a statistics class or the IQ scores of 30 teenagers watching a
movie? Why?
6. Explain why median and interquartile range are more appropriate
measures as compared to mean and variance for non-normal data.
7. A JDT football fan records the number on the jersey of each player
in a game. Does it make sense to calculate the mean of those
numbers? Why or why not?
MIND EXPANDING EXERCISES
8. In an analysis of the accuracy of weather forecasts, the actual high
temperatures are compared to the high temperatures predicted one day earlier
and the high temperatures predicted five days earlier. Listed below are the errors
between the predicted temperatures and the actual high temperatures for 14
consecutive days in Kuala Lumpur.
a) Do the means and medians of the errors indicate that the temperatures
predicted one day in advance are more accurate than those predicted
five days in advance, as we might expect?
b) Do the standard deviations of the errors indicate that the temperatures
predicted one day in advance are more accurate than those predicted
five days in advance, as we might expect?
Actual high − high predicted one day earlier:
2, 2, 0, 0, −3, −2, 1, −2, 8, 1, 0, −1, 0, 1
Actual high − high predicted five days earlier:
0, −3, 2, 5, −6, −9, 4, −1, 6, −2, −2, −1, 6, −4
ME.8 (solution)
Mean median sd
1.5000 1.0000 2.4152
3.8333 4.5000 2.4014
MIND EXPANDING EXERCISES
9. A data set consists of 20 values that are fairly close together. Another
value is included, but this new value is an outlier (very far away from
the other values). How is the standard deviation affected by the
outlier? No effect? A small effect? Or a large effect?
10. Suppose scores on a psychological test have a mean of 90 and a standard
deviation of 10. Meanwhile, scores on an economics test have a mean
of 55 and a standard deviation of 5. Which is relatively better: a score
of 85 on the psychological test or a score of 45 on the economics test?
11. When designing the production procedure for batteries used in heart
pacemakers, an engineer specifies that “the batteries must have a
mean life greater than 10 years, and the standard deviation of the
battery life can be ignored.” If the mean battery life is greater than 10
years, can the standard deviation be ignored? Why or why not?
1.3.3 Measures of Position
Describe where a specific data value falls within the data set or its
relative position based on percentiles, deciles and quartiles in
comparison with other data values
Describing the position of the data value (data in increasing order)
Percentiles: split data into 100 equal parts, $P_i = x_{\frac{in}{100}} = x_c$
Deciles: split data into 10 equal parts, $D_i = x_{\frac{in}{10}} = x_c$
Quartiles: split data into 4 equal parts, $Q_i = x_{\frac{in}{4}} = x_c$
$Q_i = x_{\frac{in}{4}} = x_c$, $D_i = x_{\frac{in}{10}} = x_c$, $P_i = x_{\frac{in}{100}} = x_c$
If c is not a whole number, round it up to the next whole number.
If c is a whole number, then use
$Q_i = \frac{x_c + x_{c+1}}{2}$, $D_i = \frac{x_c + x_{c+1}}{2}$, $P_i = \frac{x_c + x_{c+1}}{2}$.
EXAMPLE 1.9
A manufacturer measured the volume of a sample of 11 bottles of chemical
solvents. The results are recorded (in millilitres) as follows.
40 45 38 25 42 31 30 44 26 27 36
Show that $Q_1$ is equivalent to $P_{25}$, $Q_2$ is equivalent to $P_{50}$, $Q_3$ is equivalent to $P_{75}$,
and $D_i$ is equivalent to $P_{10i}$, where $i = 1, 2, \ldots, 9$.
The data set in increasing (ascending) order: 25 26 27 30 31 36 38 40 42 44 45
Quartiles:
$Q_1 = x_{\frac{1(11)}{4}} = x_{2.75} = x_3 = 27$
$Q_2 = x_{\frac{2(11)}{4}} = x_{5.50} = x_6 = 36$
$Q_3 = x_{\frac{3(11)}{4}} = x_{8.25} = x_9 = 42$
Percentiles:
$P_{25} = x_{\frac{25(11)}{100}} = x_{2.75} = x_3 = 27$
$P_{50} = x_{\frac{50(11)}{100}} = x_{5.50} = x_6 = 36$
$P_{75} = x_{\frac{75(11)}{100}} = x_{8.25} = x_9 = 42$
Summary: $Q_1$ is equivalent to $P_{25}$; $Q_2$ is equivalent to $P_{50}$; $Q_3$ is equivalent to $P_{75}$.
EXAMPLE 1.9
The data set in increasing (ascending) order: 25 26 27 30 31 36 38 40 42 44 45
Deciles:
$D_3 = x_{\frac{3(11)}{10}} = x_{3.3} = x_4 = 30$
$D_5 = x_{\frac{5(11)}{10}} = x_{5.5} = x_6 = 36$
$D_7 = x_{\frac{7(11)}{10}} = x_{7.7} = x_8 = 40$
Percentiles:
$P_{30} = x_{\frac{30(11)}{100}} = x_{3.3} = x_4 = 30$
$P_{50} = x_{\frac{50(11)}{100}} = x_{5.5} = x_6 = 36$
$P_{70} = x_{\frac{70(11)}{100}} = x_{7.7} = x_8 = 40$
Summary: $D_i$ is equivalent to $P_{10i}$, where $i = 1, 2, 3, 4, 5, 6, 7, 8, 9$.
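The positional rule used in this example (compute $c = in/k$, round up when $c$ is not whole, otherwise average $x_c$ and $x_{c+1}$) can be sketched as a single function covering quartiles ($k = 4$), deciles ($k = 10$) and percentiles ($k = 100$). The function name `position_value` is hypothetical. Note that statistical software often uses other interpolation rules, so library results may differ slightly from this rule.

```python
def position_value(data, i, parts):
    """i-th quartile (parts=4), decile (parts=10) or percentile (parts=100)."""
    if not 0 < i < parts:
        raise ValueError("i must lie strictly between 0 and parts")
    x = sorted(data)
    # Integer arithmetic avoids floating-point trouble when testing for "whole".
    c, remainder = divmod(i * len(x), parts)
    if remainder:
        # c is not whole: round up to position c + 1 (0-based index c).
        return x[c]
    # c is whole: average x_c and x_{c+1} (0-based indices c - 1 and c).
    return (x[c - 1] + x[c]) / 2


volumes = [40, 45, 38, 25, 42, 31, 30, 44, 26, 27, 36]
quartiles = [position_value(volumes, i, 4) for i in (1, 2, 3)]
deciles = [position_value(volumes, i, 10) for i in range(1, 10)]
print("Q1, Q2, Q3 =", quartiles)
print("D1..D9     =", deciles)
```

Because the position depends only on the ratio $i/k$, $D_i$ and $P_{10i}$ always land on the same position, as do $Q_1$ and $P_{25}$, $Q_2$ and $P_{50}$, and $Q_3$ and $P_{75}$.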
EXERCISE 1.3.3
1. Given a set of data as 9 2 1 4 3 7 5 4 6.
a) Find the value that corresponds to the 4th decile.
b) Find the value that corresponds to the 3rd quartile.
2. A teacher gives a 25-point test to ten students. The scores
are shown below.
9 22 11 14 13 3 7 15 18 16
a) Find the score that corresponds to the 20th percentile.
b) Find the score that corresponds to the 7th decile.
1) 4, 6 2) 8, 15.5
Why Do We Need Measures of Position?
Percentiles are measures of position that are often used in education and health-related fields to indicate the position of an individual in a group.
A percentile is not a percentage. The ith percentile, $P_i$, is a value such that i% of the data are less than or equal to $P_i$ and (100 − i)% are greater than or equal to $P_i$.
EXAMPLE:
If a student obtains 82 marks out of 100 in a test, his/her score is 82%. However, this gives no indication of his/her position with respect to the rest of the class. On the other hand, if his/her score corresponds to the 75th percentile, then he/she did better than 75% of the students in his/her class.
Quartiles can be used as a rough measurement of variability.
INTERQUARTILE RANGE (IQR)
defined as the difference between Q3 and Q1 (IQR = Q3 − Q1); it is the range of the middle 50% of the data.
used to identify outliers, and to measure variability in exploratory data analysis (Section 1.4).
the smaller the value of the IQR, the smaller the variation in the data.
useful for showing whether a data set is more variable (more dispersed, more spread) or more consistent.
MIND EXPANDING EXERCISES
4. In what sense are the mean, median, mode and midrange measures of the “centre” of a data set?
5. Which do you think has more variation: the IQ scores of 30 students
in a statistics class or the IQ scores of 30 teenagers watching a
movie? Why?
6. Explain why median and interquartile range are more appropriate
measures as compared to mean and variance for non-normal data.
7. A JDT football fan records the number on the jersey of each player in a game. Does it make sense to calculate the mean of those numbers? Why or why not?
1.3.4 Descriptive Statistics
Using Microsoft Excel
Interpreting Descriptive Statistics
Using Microsoft Excel (Example 1.9)
A firm is conducting a study to compare two different physical
arrangements of its assembly line. The arrangement with the smaller
variance in the number of finished units produced per day will be adopted
as the new arrangement of its assembly line.
→ $\bar{x}_1 < \bar{x}_2$: on average, Assembly Line 2 produced more finished units per day.
→ $s_1 < s_2$, $R_1 < R_2$ and $\mathrm{s.e.}_1 < \mathrm{s.e.}_2$: the arrangement of Assembly Line 1 is more consistent, less dispersed, less spread, less variable (small variation), and more precise. Therefore the arrangement of Assembly Line 1 will be adopted as the new arrangement.
→ For Assembly Line 1, the distribution of data is negatively skewed (left-skewed) since Mean < Median < Mode. The skewness value is negative too.
→ For Assembly Line 2, the distribution of data is also negatively skewed (left-skewed) since the mode is higher than both the mean and the median. The skewness value is negative too.
→ The skewness value for Assembly Line 2 is higher than that for Assembly Line 1. Hence the distribution of data from Assembly Line 2 is more skewed to the left, indicating that Assembly Line 2 produced more finished units per day.
→ For Assembly Line 1, $\bar{x}_1 \pm \text{Confidence Level} = 491.1 \pm 17.1 = (474.0,\ 508.2)$. Hence, we are 95% confident that the population mean number of finished units per day for Assembly Line 1 lies between 474 and 509 units.
→ For Assembly Line 2, $\bar{x}_2 \pm \text{Confidence Level} = 499.4 \pm 25.2 = (474.2,\ 524.6)$. Hence, we are 95% confident that the population mean number of finished units per day for Assembly Line 2 lies between 475 and 525 units.
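The interval arithmetic for the two assembly lines can be checked with a short Python sketch. The function name `interval` is illustrative only; the means and the "Confidence Level" half-widths are the values quoted above from the Excel output.

```python
def interval(mean, margin):
    """Return (lower, upper) for mean ± margin, rounded to one decimal place."""
    return (round(mean - margin, 1), round(mean + margin, 1))


line_1 = interval(491.1, 17.1)   # Assembly Line 1
line_2 = interval(499.4, 25.2)   # Assembly Line 2
print("Line 1:", line_1)
print("Line 2:", line_2)
```

The two intervals overlap almost entirely, which is one reason the comparison in this example rests on the spread of each line rather than on the means alone.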
MIND EXPANDING EXERCISES
12. A lecturer is interested to investigate the students’ performance in
statistics course based on their carry mark and the final score in
the final examination. The descriptive statistics and graph are
given below. From the analyses, comment on the students’
performance based on carry marks and final examination scores.
[Figure ME.12: descriptive statistics and graph of the carry marks and final examination scores; not reproduced in this text]
13. A study is conducted to compare the performance of male and female students in the statistics course based on their final examination scores. The data, descriptive statistics and graph of the final examination scores are presented as follows. Based on the analysis, answer the following questions:
Female
72 62 83 65 60 74 66 68 57 63 61
76 60 78 34 70 59 63 86 43 90 87
Male
58 81 86 68 70 77 54 54 72 41 33 52
70 37 67 39 74 32 8 33 27 23 54
a) State the mean and standard deviation for both groups and give your comment.
b) Based on the graph shown, give your comment.
[Figure ME.13: descriptive statistics and graph of the final examination scores; not reproduced in this text]
14. People with diabetes must monitor and control their blood glucose level. The goal is to maintain fasting plasma glucose between 90 and 130 mg/dl. The data presented below give the fasting plasma glucose for two groups, before treatment and after treatment. Answer the following questions:
a) How many data values are there in each group?
b) Give the first five data values in the ‘before’ group and the last five data values in the ‘after’ group.
c) Identify the median and mode in each group.
d) Describe the shape of the distribution of data in each group.
e) Is there any outlier in the groups?
f) What are the advantages of using a stem and leaf plot?
g) Which data set is more dispersed, and which is more consistent?
h) Based on the descriptive analysis done in Excel, why do you think that the dispersion of the two groups measured by the variance differs from the dispersion measured by the IQR?
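ME.14 Back-to-back stem and leaf plot (Key: 14|1 = 141)
Before | Stem | After
8 | 7 |
 | 8 |
6 5 | 9 |
3 | 10 |
2 | 11 |
 | 12 | 8 8
4 | 13 |
7 5 8 1 | 14 |
3 8 | 15 | 8 9
 | 16 | 3 4 0
2 2 | 17 |
 | 18 | 8
 | 19 | 5 8
0 | 20 |
 | 21 |
 | 22 | 7 6 3 1 0
 | 23 |
 | 24 |
5 | 25 |
 | 26 |
1 | 27 |
 | 28 | 3
 | 29 |
 | 30 |
 | 31 |
 | 32 |
 | 33 |
 | 34 |
9 | 35 |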
1.4 EXPLORATORY
DATA ANALYSIS
Identify outliers.
Draw and interpret a boxplot.
Exploratory Data Analysis
The purpose of exploratory data analysis is to discover any gaps or patterns in the data.
For symmetric data, the appropriate measure of central tendency is the mean, and the appropriate measure of variability is the standard deviation or variance.
For skewed data, the appropriate measure of central tendency is the median, and the appropriate measure of variability is the interquartile range (IQR).
Traditional Method | Exploratory Data Analysis
Frequency distribution | Stem and leaf plot
Histogram | Boxplot
Mean | Median
Standard deviation | Interquartile range (IQR = Q3 − Q1)
RECALL: Selection of appropriate statistical techniques for data summarisation
Type of Data | Descriptive Statistics | Graphical Summary
Quantitative (ratio scale) | Mean, Median, Mode, Range, Standard Deviation, Interquartile range (IQR = Q3 − Q1) | Histogram, Bar Chart (bar representing means), Stem and leaf plot, Boxplot
Symmetrical Distribution | Mean, Median, Mode, Range, Standard Deviation | Histogram, Bar Chart (bar representing means)
Skewed Distribution | Median, Range, Interquartile range (IQR = Q3 − Q1) | Histogram, Stem and leaf plot, Boxplot
Categorical (Nominal) | Mode, Counts, Percentage | Pie Chart, Bar Chart
Categorical (Ordinal, Likert Scale) | Mode, Mean, Counts, Percentage | Pie Chart, Bar Chart
Histogram, Stem and Leaf OR Boxplot?
Histogram
Advantages:
‒ Can graph huge data sets easily.
‒ The shape of the distribution can be easily described.
‒ The intervals of the histogram can be changed to see which gives a better description of the data.
‒ Great for comparing data.
‒ Can show trends in the data clearly.
Disadvantages:
‒ Not good for small data sets.
‒ It is difficult to simplify all the data into one scale.
Stem and Leaf
Advantages:
‒ Very easy to construct.
‒ Shows the real values of the data.
‒ Can show the range, minimum and maximum, gaps and clusters, and outliers easily.
‒ The mode may be observed.
‒ Can identify the shape of the distribution.
Disadvantages:
‒ Not good for very small or very large data sets.
‒ Not visually appealing.
‒ Does not easily indicate measures of centrality for large data sets.
Boxplot
Advantages:
‒ Good for small or large data sets.
‒ Displays the range and distribution of data along a number line.
‒ Can show outliers.
Disadvantages:
‒ The original data are not clearly shown in the boxplot.
‒ The mean and mode cannot be identified in a boxplot.
1.4.1 Outliers
An outlier is an extremely high or an extremely low data value when compared with the rest of the data values.
Outliers can arise from:
a measurement or observational error,
a writing or typing error,
a data value obtained from a subject that is not in the defined population, or
a legitimate data value that occurred by chance.
When a distribution is symmetric or normal, data values that lie more than three standard deviations from the mean can be considered suspected outliers (refer to Figure 1.11).
An outlier can strongly affect the mean and standard deviation of a variable.
Recall: Other Properties of Standard Deviation
Used to determine the number of data values that fall within a specified interval in a distribution.
The values under the curve indicate the percentage of area in each section or range of data.
It can be seen that about 95% of data values fall within 𝜇 − 2𝜎 and 𝜇 + 2𝜎.
Position of Outliers
A data value x is an outlier if it is less than the lower boundary value or exceeds the upper boundary value for the data set, where
lower boundary $= Q_1 - 1.5(Q_3 - Q_1)$
upper boundary $= Q_3 + 1.5(Q_3 - Q_1)$
EXAMPLE 1.11
The number of credits in business courses for eight job applicants is shown here:
9, 12, 15, 27, 33, 45, 63, 72.
Find the first and third quartiles for the above data. Is there any outlier in the above data?
Solution:
$Q_1 = x_{1(8)/4} = x_2 \to \dfrac{x_2 + x_3}{2} = \dfrac{12 + 15}{2} = 13.5$
$Q_3 = x_{3(8)/4} = x_6 \to \dfrac{x_6 + x_7}{2} = \dfrac{45 + 63}{2} = 54$
lower boundary: $Q_1 - 1.5(Q_3 - Q_1) = 13.5 - 1.5(54 - 13.5) = -47.25$
upper boundary: $Q_3 + 1.5(Q_3 - Q_1) = 54 + 1.5(54 - 13.5) = 114.75$
→ Since $-47.25 < x < 114.75$ for every data value, there is no outlier.
EXERCISE 1.4.1
1. Given 19 2 1 4 3 7 5 4 6. Find outliers, if any.
2. Given 19 6 2 11 4 3 7 7 5 8 6 21 12. Find outliers, if any.
Answers:
1) $Q_1 = 3$, $Q_3 = 6$; 19 is an outlier.
2) $Q_1 = 5$, $Q_3 = 11$; 21 is an outlier.
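The boundary rule from Example 1.11 can be sketched in Python as below. This is an illustrative sketch rather than the authors' own procedure: the helper names `quartile` and `find_outliers` are assumptions, and the quartiles follow the position rule of Section 1.3.3, so results may differ from software that interpolates quartiles differently.

```python
def quartile(xs, i):
    """i-th quartile of already-sorted data, using the c = i*n/4 position rule."""
    c, remainder = divmod(i * len(xs), 4)
    if remainder:                        # c not whole: round up, take x_(c+1)
        return xs[c]
    return (xs[c - 1] + xs[c]) / 2       # c whole: average x_c and x_(c+1)


def find_outliers(data):
    """Return (lower boundary, upper boundary, list of outliers)."""
    xs = sorted(data)
    q1, q3 = quartile(xs, 1), quartile(xs, 3)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return lower, upper, [x for x in xs if x < lower or x > upper]


# Credits of the eight job applicants in Example 1.11
credits = [9, 12, 15, 27, 33, 45, 63, 72]
print(find_outliers(credits))

# The two data sets of Exercise 1.4.1
print(find_outliers([19, 2, 1, 4, 3, 7, 5, 4, 6]))
print(find_outliers([19, 6, 2, 11, 4, 3, 7, 7, 5, 8, 6, 21, 12]))
```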
1.4.2 Boxplots
A boxplot (box and whiskers plot) is a graphical representation of the five-number summary of a data set, together with any outliers.
Five-number summary:
The lowest value of the data set (minimum)
The lower quartile Q1 (1st quartile or 25th percentile)
The median (2nd quartile or 50th percentile)
The upper quartile Q3 (3rd quartile or 75th percentile)
The highest value of the data set (maximum)
Plus: outliers
Types of Boxplots
[Figures: a vertical boxplot and a horizontal boxplot]
EXAMPLE 1.12
The following back-to-back stem and leaf plot represents a sample of the ages of teachers in two schools.
School A | Stem | School B
9 7 7 5 5 4 | 2 | 2
8 7 6 2 1 1 0 | 3 | 3 4 6 7
 | 4 | 0 1 3 4 5 7
7 | 5 | 1 3 4
[key: 3|4 → 34]
Given that for School B, $Q_1 = 36$, $Q_2 = 42$, $Q_3 = 47$ and there is no outlier, draw boxplots for both schools on the same x-axis. Then compare the shapes, averages, and variability of both age distributions.
Measure | School A | School B
Minimum | 24 | 22
1st quartile | $Q_1 = x_{1(14)/4} = x_{3.5} \to x_4 = 27$ | $Q_1 = 36$
2nd quartile / Median | $Q_2 = \dfrac{x_7 + x_8}{2} = 30.5$ | $Q_2 = 42$
3rd quartile | $Q_3 = x_{3(14)/4} = x_{10.5} \to x_{11} = 36$ | $Q_3 = 47$
Maximum | 38 | 54
Outliers | $Q_1 - 1.5(Q_3 - Q_1) = 27 - 1.5(36 - 27) = 13.5$ and $Q_3 + 1.5(Q_3 - Q_1) = 36 + 1.5(36 - 27) = 49.5$. Since 57 > 49.5, 57 is an outlier. | no outlier
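The table for School A can be reproduced with a short Python sketch. It is an illustration under stated assumptions: the ages are read off the stem and leaf plot of Example 1.12, the names `quartile` and `boxplot_summary` are hypothetical, and the minimum and maximum are taken as the smallest and largest values that are not outliers, which matches the values used in the example.

```python
def quartile(xs, i):
    """i-th quartile of already-sorted data, using the c = i*n/4 position rule."""
    c, remainder = divmod(i * len(xs), 4)
    if remainder:                        # c not whole: round up, take x_(c+1)
        return xs[c]
    return (xs[c - 1] + xs[c]) / 2       # c whole: average x_c and x_(c+1)


def boxplot_summary(data):
    """Five-number summary (excluding outliers from min/max) plus the outliers."""
    xs = sorted(data)
    q1, q2, q3 = (quartile(xs, i) for i in (1, 2, 3))
    lower = q1 - 1.5 * (q3 - q1)
    upper = q3 + 1.5 * (q3 - q1)
    inside = [x for x in xs if lower <= x <= upper]
    return {
        "min": inside[0],
        "Q1": q1,
        "median": q2,
        "Q3": q3,
        "max": inside[-1],
        "outliers": [x for x in xs if x < lower or x > upper],
    }


# Ages read from the stem and leaf plot (key: 3|4 -> 34)
school_a = [24, 25, 25, 27, 27, 29, 30, 31, 31, 32, 36, 37, 38, 57]
school_b = [22, 33, 34, 36, 37, 40, 41, 43, 44, 45, 47, 51, 53, 54]

print("School A:", boxplot_summary(school_a))
print("School B:", boxplot_summary(school_b))
```

These dictionaries contain exactly the values needed to draw the two boxplots on a common axis.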
Information Obtained from a Boxplot
1. If the median is near the centre of the box, the distribution is approximately symmetric.
2. If the median falls to the left of the centre of the box, the distribution is positively skewed.
3. If the median falls to the right of the centre of the box, the distribution is negatively skewed.
Suppose the median is near the centre of the box (approximately symmetric):
4. If the lines (whiskers) are about the same length, the distribution is approximately symmetric.
5. If the right line is longer than the left line, the distribution is positively skewed.
6. If the left line is longer than the right line, the distribution is negatively skewed.
If the boxplots for two or more data sets are graphed on the same axis, the distributions can be compared using their central tendency (average) and variability values.
To compare the averages, use the location of the medians.
To compare the variability, use the length of the box (the IQR).
EXAMPLE 1.12 solution
Shape:
Based on the location of the median, School A has a right-skewed distribution, where most of the teachers' ages are concentrated at the lower end (< 30 years old). However, School B has a left-skewed distribution, where most of the teachers' ages are greater than 42 years.
Average:
Based on the median values, 50% of the teachers at School A are younger than 30.5 years, whereas 50% of the teachers at School B are younger than 42 years. On average, the teachers at School B are older than the teachers at School A.
Variability:
Based on the IQR values, for School A, IQRA = 9 years, where the middle 50% of the teachers are aged between 27 and 36 years. Meanwhile, for School B, IQRB = 11 years, where the middle 50% of the teachers are aged between 36 and 47 years. Hence, the variation in teachers' ages at School B is higher than at School A (IQRA < IQRB).
Range:
Excluding the outlier, the teachers' ages at School A vary less, from a minimum of 24 years to a maximum of 38 years, compared with School B, where the ages range from a minimum of 22 years to a maximum of 54 years.
Boxplot for Special Cases
In some cases, the general guidelines given above cannot be used to interpret the boxplot.
A boxplot is not the best graphical representation of a data set if the sample size is too small.
The existence of outliers may also affect the boxplot.
Therefore, in such cases, we have to use the descriptive statistics to identify the distribution of the data set.
EXERCISE 1.4.2 (Q1)
1. Plot a boxplot for the following data. Then describe the data.
a) 3.2, 5.9, 4.3, 6.9, 4.5, 8.0, 4.7, 8.9, 5.7, 11.9
b) 5.8, 9.7, 6.7,13.4, 6.8, 14.7, 7.2, 16.4, 8.2, 28.1
Answers:
a) Min = 3.2, $Q_1$ = 4.5, $Q_2$ = 5.8, $Q_3$ = 8, Max = 11.9; no outlier; right-skewed.
b) Min = 5.8, $Q_1$ = 6.8, $Q_2$ = 8.95, $Q_3$ = 14.7, Max = 16.4; 28.1 is an outlier; right-skewed.
[Figures: 1.4.2 (Q1) solution, boxplots for data sets (a) and (b)]
EXERCISE 1.4.2 (Q2)
2. Two samples of ten springs made from steel rods supplied by two different companies were compared. The flexibility (in N/m) of each spring was recorded as follows. Compare the distributions using boxplots.
Company A: 4.2 6.7 7.3 7.5 8.0 8.5 8.7 8.8 9.2 9.3
Company B: 9.6 9.7 9.8 9.9 10.1 10.2 11.0 11.0 11.0 11.1
Comment on the flexibility of the springs supplied by the two companies.
Answers:
Company A: Min = 6.7, $Q_1$ = 7.3, $Q_2$ = 8.25, $Q_3$ = 8.8, Max = 9.3; 4.2 is an outlier; left-skewed.
Company B: Min = 9.6, $Q_1$ = 9.8, $Q_2$ = 10.15, $Q_3$ = 11.0, Max = 11.1; no outlier; right-skewed.
[Figure: 1.4.2 (Q2) solution, boxplots for Company A and Company B]
EXERCISE 1.4.2 (Q3)
3. The following table presents the viscosity (in Pascal) of a chemical substance from three (3) batches of a chemical process.
Batches | Viscosity
Batch A | 13.3 14.1 14.3 14.5 14.5 14.6 14.8 15.2 15.3 15.3
Batch B | 13.3 13.7 14.1 14.5 14.9 15.2 15.3 15.4 15.6 15.8
Batch C | 13.4 13.7 14.1 14.3 14.3 14.8 15.1 15.8 16.4 16.9
a) Complete the table below by showing all the necessary calculations.
Measures of position | Batch A | Batch B | Batch C
1st quartile | 14.30 | 14.10 | ____
Median | 14.55 | ____ | 14.55
3rd quartile | ____ | 15.40 | 15.80
Outlier | No | ____ | No
b) Draw three boxplots on the same x-axis by using the information in (a).
c) Compare the boxplots in terms of shape and variability.
Answers:
Batch A: $Q_3$ = 15.2, right-skewed; Batch B: $Q_2$ = 15.05, no outlier, left-skewed; Batch C: $Q_1$ = 14.1, right-skewed.
[Figure: 1.4.2 (Q3) solution, boxplots of viscosity for Batch A, Batch B and Batch C on a common axis from 12 to 17]
MIND EXPANDING EXERCISES
[Figure ME.15: plot of the responses for solutions A to H; not reproduced in this text]
15. An experiment was conducted to assess the potency of various constituents of orchard sprays in repelling honeybees. Individual cells of dry comb were filled with measured amounts of lime sulphur emulsion in sucrose solution. Seven different concentrations of lime sulphur, ranging from 1/100 to 1/1,562,500 in successive factors of 1/5, were used, as well as a solution containing no lime sulphur (A, B, C, D, E, F, G, H). The responses for the different solutions were obtained by releasing 100 bees into the chamber for two hours, and then measuring the decrease in volume of the solutions in the various cells. Based on the figure for ME.15, answer the following questions:
a) Which concentration has outlier(s)?
b) Group the concentration according to their shape of distribution.
c) Which concentration has the most consistent data? Why?
d) Which concentration has the most variable data? Why?
e) H is the concentration of ‘no lime sulphur’. What is the use of
concentration H?
f) What conclusion can you draw from this experiment?
1.5 NORMAL
PROBABILITY PLOT
Draw and interpret a normal probability plot.
Normal Probability Plots
A normal probability plot is the easiest way to check whether the sample distribution is normal or not.
The most plausible normal distribution is the one whose mean and standard deviation are the same as the sample mean and standard deviation.
STEP 1 : Sort the data in ascending order and denote each sorted data value as $x_i$, $i = 1, \dots, n$.
STEP 2 : Number the sorted data from 1 to n.
STEP 3 : Calculate the probability value for each $x_i$ using $p_i = \dfrac{i - 0.5}{n}$.
STEP 4 : Plot $p_i$ versus $x_i$.
If the sample points lie approximately on a straight line, the data are approximately normally distributed.
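Steps 1 to 3 can be sketched in Python as follows, using the six values from Exercise 1.5 (Q1). The function name `plotting_positions` is an assumption made for illustration. The final lines go one step beyond the slides: they convert each $p_i$ to a standard normal z-score with `statistics.NormalDist().inv_cdf`, which is the quantity many software packages plot against $x_i$ so that normal data fall on a straight line on an ordinary linear scale.

```python
from statistics import NormalDist


def plotting_positions(data):
    """Return [(x_i, p_i)] with p_i = (i - 0.5) / n for the sorted data."""
    xs = sorted(data)                                   # STEP 1: sort ascending
    n = len(xs)
    # STEP 2 and STEP 3: number the values 1..n and compute p_i
    return [(x, (i - 0.5) / n) for i, x in enumerate(xs, start=1)]


sample = [3.01, 3.35, 4.79, 5.96, 7.89, 9.15]
points = plotting_positions(sample)
for x, p in points:
    print(f"x = {x:5.2f}   p = {p:.4f}")

# Assumption beyond the slides: z-scores for plotting on a linear scale
z_scores = [NormalDist().inv_cdf(p) for _, p in points]
```

STEP 4 is then a scatter plot of the second column against the first, which can be drawn by hand or with any plotting tool.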
Testing Normality Using Software
Besides plotting manually, a normal probability plot can be obtained from software such as SPSS, Minitab or Excel. The normality of the data can also be tested using the Kolmogorov-Smirnov, Anderson-Darling or Shapiro-Wilk tests.
EXAMPLE 1.13
→ The graph of $p_i$ versus $x_i$ in the figure above is known as the normal probability plot. Since the data lie approximately on a straight line, the data are approximately normally distributed.
EXERCISE 1.5
1. A sample of size six is drawn. The sample, arranged in increasing order, is
3.01 3.35 4.79 5.96 7.89 9.15
Do these data appear to come from an approximately normal distribution?
2. The data shown represent the number of movies in America over a 14-year period.
2084 1497 1014 910 899 870 859
848 837 826 815 750 737 637
Do these data appear to come from an approximately normal distribution?
Answers: 1) yes 2) no
[Figure: 1.5 (Q1) solution, normal probability plot of $p_i$ (vertical axis, 0 to 1) versus $x_i$ (horizontal axis, 0 to 10)]
[Figure: 1.5 (Q2) solution, normal probability plot of $p_i$ (vertical axis, 0 to 1.2) versus $x_i$ (horizontal axis, 0 to 2500)]
CONCLUSION
• The applications of statistics are
many and varied. People
encounter them in everyday life,
such as in reading newspapers or
magazines, listening to the radio,
or watching television.
• By combining all of the
descriptive statistics techniques
discussed in this chapter
together, the student is now able
to collect, organize, summarize
and present data.
Thank You
NEXT: Chapter 2 Sampling Distribution and Confidence Interval