CHAPTER 1
INTRODUCTION TO STATISTICS
Expected Outcomes
 Able to define basic terminologies of statistics.
 Able to apply the basic steps in the statistical problem-solving
methodology for various applications.
 Able to summarise and analyse data using measures of central
tendency, measures of variation and measures of position.
 Able to relate the concepts of accuracy and precision of data using a game
of darts.
 Able to conduct exploratory data analysis that includes numerical data
analysis and various graphical displays.
 Able to plot and interpret a normal probability plot.
SZS2017
CONTENT
1.1 Statistical Terminologies
1.2 Statistical Problem Solving Methodology
1.3 Review on Descriptive Statistics
1.3.1 Measures of Central Tendency
1.3.2 Measures of Variation
1.3.2.1 Accuracy and Precision
1.3.3 Measures of Position
1.3.4 Descriptive Statistics Using Microsoft Excel
1.4 Exploratory Data Analysis
1.4.1 Outliers
1.4.2 Box Plot
1.5 Normal Probability Plot
1.1 STATISTICAL
TERMINOLOGIES
 Define the meaning of statistics, population,
sample, parameter, statistic, descriptive statistics
and inferential statistics.
 Discuss the importance of statistics in daily lives.
1.1.1 What is Statistics?
Most people become familiar with probability and statistics through
radio, television, newspapers, and magazines. For example, the
following statements were found in newspapers:
 Ten thousand parents in Malaysia have chosen StemLife as their trusted
stem cell bank.
 The death rate from lung cancer was 10 times higher for smokers compared
to nonsmokers.
 The average cost of a wedding is nearly RM10,000 in Malaysia.
 In Malaysia, the median salary for men with a bachelor’s degree is
RM 30,000 per year, while the median salary for women with a bachelor’s
degree is RM 29,000 per year.
 Globally, an estimated 500,000 children under the age of 15 live with Type
1 diabetes.
 Women who eat fish once a week are 29% less likely to develop heart disease.
 Statistics is the science of conducting studies to collect, organise, summarise,
analyse, present, interpret and draw conclusions from data.
 Data are any values (observations or measurements) that have been collected.
 Collection and analysis of data are the most important parts of research
methodology.
 Researchers must have a basic knowledge of statistics before starting any
research or study involving data analysis.
 Statistics is also used to analyse the results of surveys and as a tool in
scientific research to make decisions based on controlled experiments,
estimation, prediction, and quality control.
1.1.2 Why Do We Need Statistics?
 Basic knowledge of statistics is needed in any discipline or field of
research or study (in almost all fields of human endeavour) that involves data
analysis.
 The methods of statistics allow researchers to design a valid experiment
and then draw a reliable conclusion or interpretation from the data they
produce and analyse.
Examples:
In sports, a statistician may keep records of the number of successful kicks a
team scored during a football season.
In public health, a doctor might be concerned with the number of children who
are infected with the H1N1 virus during a certain year.
In education, an educator might want to know if the performance of
students in the current semester is better than in the previous semester.
Knowledge of statistics may help you in:
1. Describing the relationship between variables.
a. A university admission director needs to find an effective way of
selecting students. He designs a statistical study to see if there is a
significant relationship between SPM results and the GPA achieved by
first year students at his university. If there is a strong relationship,
a high SPM result will become an important criterion for admission.
b. A management consultant wants to compare a client’s investment
return for this year with related figures from last year. He summarises
the revenue and cost data from both periods and finds the relationship
between these two variables. Based on his findings, he presents his
recommendations to his client.
A variable is a characteristic or attribute that can assume different values. These
values are data. A variable is called a random variable if its values are determined by chance.
2. Making better decisions in the face of uncertainty.
a. Suppose that a manager of Unisex Hair Stylist claimed that 90% of the
customers are satisfied with the services. If a consumer activist feels
that this is an exaggerated statement that might require legal action,
the activist can use statistical inference techniques to decide whether
or not to sue the manager. Therefore, the knowledge gained from
studying statistics can enhance the awareness towards becoming
better consumers.
b. People can make intelligent decisions about what products to purchase
based on consumer studies, about government spending based on
utilisation studies, and so on.
1.1.3 Population and Sample
Population (N)
A complete collection of measurements, outcomes, objects or individuals under study.
• Tangible population: finite; the total number of subjects is fixed and could be listed.
→ Examples: all computers in a room, all female students in a university, or all electrical components manufactured in a day.
• Conceptual (intangible) population: all values that might possibly have been observed; it has an unlimited number of subjects.
→ Examples: simulated data from a computer or instrument, the number of germs on the human body, or all experimental data such as all measurements of the length of a metal rod.
Sample (n)
A subset of the population that is observed.
Parameter and Statistic
Parameter
A numerical value that represents a certain population characteristic.
 The average weight of students from a population of students in a university
 The percentage of defective components in a population of electrical components manufactured in a day
Statistic
A numerical value that represents a certain sample characteristic.
 The average weight for a sample of female students selected from all students in a university
 The percentage of defective components in a sample of 100 electrical components

Measurement          Parameter   Statistic
Mean (average)       μ           x̄
Variance             σ²          s²
Standard deviation   σ           s
Proportion           π           p
EXAMPLE 1.1
A travel agent claims that the average number of rooms in large hotels in
Pahang is 500 and the standard deviation is 165. A sample of seven hotels in
Genting Highlands is selected and the average number of rooms is found to be
435 with standard deviation of 15.
Based on the above example:
 The population under study is all large hotels in Pahang.
 The sample selected is seven large hotels in Genting Highlands.
 The population under study is tangible since there are finite numbers of
large hotels in Pahang.
 The characteristic (variable) is number of rooms.
 The parameters are μ = 500 and σ = 165 since they describe the
population characteristics.
 The statistics are x̄ = 435 and s = 15 since they describe the sample
characteristics.
EXERCISE 1.1.3
A residential college has 317 first-year students. An IQ
pre-test is given to all of them in their first week. The dean of admission
collected data on 27 of them and found that their mean score on the IQ pre-test was
51. The mean for all first-year students was estimated to be
approximately 51. A subsequent computer analysis of all first-year students
showed that the true mean (population mean) is 52.
Based on the above statement, answer the following questions.
a) What is the population?
b) Is the population tangible or conceptual?
c) What is the sample?
d) What is the variable of the study?
e) Which number describes a parameter?
f) Which number describes a statistic?
1.1.4 Descriptive and
Inferential Statistics
Descriptive statistics
 Includes the process of data collection, data organisation, data classification,
data summarisation, and data presentation obtained from the sample.
 Used to describe the characteristics of the sample.
 Used to determine whether the sample represents the target population by
comparing the sample statistic and the population parameter.
EXAMPLE:
Ten thousand parents in Malaysia have chosen Takaful Insurance as their
trusted life insurance agency.
Inferential statistics
 Involves a process of generalisation, estimation, hypothesis testing, prediction
and determination of relationships between variables.
 Used to describe, infer, estimate or approximate the characteristics of the target
population.
 Used when we want to draw a conclusion from the data obtained from the sample.
EXAMPLE:
The death rate from lung cancer was 10 times higher for smokers compared to
nonsmokers.
Overview of descriptive
and inferential statistics
EXERCISE 1.1.4
For each of the statements below, decide whether it describes descriptive
statistics or inferential statistics.
a) The average cost of a wedding is nearly RM10,000.
b) In Malaysia, the median salary for men with a bachelor’s degree
is RM 30,000 per year, while the median salary for women with a
bachelor’s degree is RM 29,000 per year.
c) Globally, an estimated 500,000 children under the age of 15
live with Type 1 diabetes.
d) A researcher claims that a new drug will reduce the number of
heart attacks in men over 70 years of age.
1.1.5 Role of the Computer in Statistics
Two software tools commonly used for data analysis:
1. Spreadsheets
 Microsoft Excel & Lotus 1-2-3
2. Statistical Packages
 AMOS, EViews, MINITAB, R, SAS, SmartPLS,
SPSS and S-PLUS
Data Analysis Application Tools in EXCEL
1. Graph and chart
2. Formulas
3. Data Analysis Tools:
File → Options → Add-Ins
→ choose Analysis ToolPak and click Go
→ tick Analysis ToolPak and click OK
→ Data → Data Analysis
→ Now we can use the Data Analysis
tools in Microsoft Excel to analyse data.
1.2 STATISTICAL
PROBLEM-SOLVING
METHODOLOGY
 Outline the six basic steps in the statistical
problem-solving methodology.
 Identify various sampling methods.
 Classify type of data and level of measurement.
Statistical Problem-Solving
Methodology
1.2.1 Identify the Problem or Opportunity
The researchers must clearly understand and define the objective of the study
before conducting any research. Possible questions that could be asked before
starting any study are given as follows.
 What are the problem and objective of the study?
 What are the possible variables that are related to the study?
 Can the study goal be achieved through simple counts or measurements of
the group?
 What possible treatments should be imposed on the group, and what are
the responses?
 Should the experiment be performed on the group?
 Do the data come from a population or a sample?
 If samples are needed, what sample size is appropriate? How
should they be taken?
Characteristics of Sample
 A sample is a subset of population.
 The population is a complete group of people, companies, hospitals,
stores, universities, students, etc., that share some set of
characteristics.
 A census involves the whole population, which carries a greater
likelihood of non-sampling errors.
 Sampling error arises when the statistical characteristics of a
population are estimated from a subset, or sample, of that population.
The difference between the sample value and the population value is the
sampling error.
 Non-sampling errors are errors that are not due to sampling. For example,
in a survey, mistakes may occur in the selection of people.
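As a rough illustration of sampling error, the sketch below builds a hypothetical population of 5000 simulated scores (the values are invented for this example, not real data), draws simple random samples of different sizes, and reports the difference between each sample mean and the population mean. The function name `sampling_error` is our own.

```python
import random
import statistics

def sampling_error(population, n, rng):
    """Sample mean minus population mean for a simple random sample of size n."""
    sample = rng.sample(population, n)
    return statistics.mean(sample) - statistics.mean(population)

rng = random.Random(2017)  # fixed seed so the run is repeatable
# Hypothetical population: 5000 simulated scores (not real data)
population = [rng.gauss(50, 10) for _ in range(5000)]

for n in (10, 100, 1000):
    print(n, round(sampling_error(population, n, rng), 3))

# A census (n = N) has no sampling error, although non-sampling errors
# such as recording mistakes can still occur in practice.
census_error = sampling_error(population, len(population), rng)
```

In a single run the error for a larger sample is not guaranteed to be smaller, but it tends to shrink as n grows.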
Characteristics of Sample Size
 The larger the sample size, the smaller the magnitude of sampling errors
tends to be.
 Studies using the survey method need a larger sample size since participation
is voluntary.
 Studies using mail response need a much larger sample size. Normally,
the response rate is as low as 20%-30%.
 The ideal sample size in a study should be large enough to serve as an
adequate representation of the population, so that the results can be generalised to the
overall population.
 The optimal sample size depends on the statistical distribution used and on
the purpose of generalisation to the whole population.
 Researchers may refer to Krejcie and Morgan (1970) as a guideline to
obtain an adequate sample size.
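The Krejcie and Morgan (1970) guideline can be sketched in code. The formula below is the one published with their table, s = χ²NP(1 − P) / (d²(N − 1) + χ²P(1 − P)), using the defaults behind that table: χ² = 3.841 (1 degree of freedom, 95% confidence), P = 0.5 and d = 0.05. The function name and the rounding to the nearest whole number are our own choices for this illustration.

```python
def krejcie_morgan(N, chi2=3.841, P=0.5, d=0.05):
    """Suggested sample size for a finite population of size N."""
    numerator = chi2 * N * P * (1 - P)
    denominator = d ** 2 * (N - 1) + chi2 * P * (1 - P)
    return round(numerator / denominator)

for N in (100, 1000, 5000):
    print(N, krejcie_morgan(N))
```

For populations of 100, 1000 and 5000 this gives 80, 278 and 357, which shows how slowly the required sample size grows once the population is large.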
1.2.2 Deciding on the
Method of Data Collection
 Data must be collected as completely as possible, and must be accurate and relevant to the
problem in order to solve it.
 Data could be obtained in 3 ways:
1) Data that are made available by others (internal, external, primary or
secondary data)
 It is similar to historical or observed data.
 The availability of the data depends on the primary and secondary
resources of document, evidence that includes interviews, observation
method, minutes of meeting, formal policy statement etc.
 Example: Rainfall data collected by the Malaysian Meteorological
Department are secondary data.
2) Data resulting from an experiment (experimental study):
 In an experimental study, the researcher manipulates one of the
variables and studies how the manipulation influences other variables,
provided that the treatments and the subjects are assigned to groups
randomly.
 Example: Blood glucose level data obtained from diabetic patients
before and after a treatment are an example of experimental data.
3) Data collected in an observational study (observation, survey,
questionnaire):
 Observations vs. interviews
Observation method
 In qualitative research: used to study behaviours or events, the
context that surrounds them, and the relationship between the behaviour
and the event.
 In quantitative research: used to collect data regarding the number of
occurrences in a specific period of time, or the duration of a very specific
behaviour or event.
 The detailed descriptions or data collected in qualitative research can be
converted later to numerical data and analysed quantitatively.
 The observation method can be used to study the physical environment, social
interactions, physical activities, non-verbal communication, and planned and
unplanned activities.
 Example: A study of customers’ behaviour towards types of brands in a
certain shopping complex is an example of an observational study.
Interview method
 The purpose of interview in collecting data is to find out what is in or on
someone else’s mind.
 Interview data can easily become biased and misleading if the interviewed
person is aware of the perspective of the interviewer.
 It is very important to make sure the person being interviewed does not
hold any preconceived notions regarding the outcome of the study.
 Interviews range from quite informal and completely open-ended to very
formal with the questions predetermined and asked in a standard manner.
 Usually, interviews are used to gather information regarding an individual’s
experience and knowledge; his/her opinions, beliefs, and feelings, and
demographic data.
 Example: An interviewer is interested in gathering information on the way
nurses organise their care in hospital wards and conducts an interview
session.
Other Methods of Data Collection
• Questionnaires and surveys (Quantitative + Qualitative).
• Opinions (Qualitative + Quantitative).
• Projective technique and psychological tests (both).
• Proxemics – Study of people’s use of space and its relationship to
culture.
• Kinesics – Study of body movement, or how people communicate
nonverbally.
• Street Ethnography – Concentrates on a person becoming a part of
the place under study.
• Narratives – Study of people’s individual life stories.
• Triangulation – The use of multiple data collection techniques
(triangulation of data permits the verification and validation of
qualitative data).
EXERCISE 1.2.2
Identify each of the following studies as being either observational or
experimental.
a) Subjects were randomly assigned to two groups, and one group was
given a herb and the other group a placebo. After 6 months, the
numbers of respiratory tract infections in each group were compared.
b) A researcher stood at a busy intersection to see if the colour of an
automobile a person drives is related to running red lights or not.
c) A researcher finds that people who are more hostile have higher
total cholesterol levels than those who are less hostile.
d) Subjects are randomly assigned to four groups. Each group is
placed on one of four special diets—a low-fat diet, a high-fish diet, a
combination of low-fat diet and high-fish diet, and a regular diet.
After 6 months, the blood pressures of the groups are compared to
see if diet has any effect on blood pressure or not.
1.2.3 Collecting the Data
(Sampling Techniques)
 Sampling is the process of selecting a few units from a population to
become the basis for estimating or predicting the prevalence of an
unknown piece of information, situation or outcome regarding the
bigger group.
i. Non-probability sampling (judgment, voluntary, convenience):
• The sample is collected based on the judgment of the experimenter.
• Resulting samples might be biased.
ii. Probability sampling (random, systematic, stratified, cluster):
• The chance of each unit being selected is known before the sample is picked.
• Resulting samples are unbiased.
 Each data set collected from a sampling process can be classified as either
non-probability data or probability data.
Sampling Techniques
• Nonprobability sampling: Judgment, Voluntary, Convenience
  Others: Snowball, Quota
• Probability sampling: Random, Systematic, Cluster, Stratified
  Others: Multi-stage, K-Sampling, Nested
A. Nonprobability Sampling Methods
Judgment sampling
• Data are selected based on the opinion of one or more experts.
• Example: A political campaign manager intuitively picks certain voting districts as reliable places to measure public opinion of his candidates.
Voluntary sampling
• Questions are posed to the public over radio or television, and responses are collected via phone, short message, email, etc. The resulting sample tends to over-represent individuals who have strong opinions.
• Example: A call-in radio show asks its listeners to participate in surveys on controversial topics such as abortion, affirmative action, gun control, politics, etc.
Convenience sampling
• The data selected are an “easy sample”; this is also called haphazard or accidental sampling. The researcher obtains units or people who are most conveniently available.
• Example: A surveyor stands in one location and asks passers-by the questions.
B. Probability Sampling Methods
1. Random sampling
• Each data item is numbered, and then items are selected using a chance or
random method such as random numbers.
• When a sample is chosen at random, it is said to be an unbiased sample.
• A random sample can be selected with or without replacement.
Example:
Suppose a lecturer wants to study the physical fitness levels of students at his/her
university. There are 5000 students enrolled at the university, and he/she wants to draw a
sample of size 100 to take a physical fitness test.
She could obtain a list of all 5000 students, number it from 1 to 5000, randomly select 100
numbers in that range, and then invite the students corresponding to those numbers to participate in the study.
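The lecturer's random sample can be sketched in Python using the standard `random` module. The student numbering 1 to 5000 follows the example; the seed is fixed only so that the run is repeatable, and the variable names are our own.

```python
import random

rng = random.Random(1)            # seed fixed only to make the run repeatable
student_ids = range(1, 5001)      # students numbered 1 to 5000

# Without replacement: no student can be invited twice
without_replacement = rng.sample(student_ids, 100)

# With replacement: the same student may be drawn more than once
with_replacement = rng.choices(student_ids, k=100)

print(sorted(without_replacement)[:10])
```

For inviting students to a fitness test, sampling without replacement is the natural choice, since inviting the same student twice adds no information.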
Generating Random Numbers
• Generating random numbers is an important step in obtaining a
random sample.
• Among random numbers, each number has an equal chance of being selected.
• Random numbers can be generated using a calculator, software, or a
random number table.
• For example, suppose we have data numbered from 1 to 100 and
we want to choose only five of them. In R, we
can use the command “sample(1:100, 5)”. The output is
five numbers from 1 to 100 listed in random order.
2. Systematic sampling
• A set of data is numbered from 1 to N (x₁, x₂, …, x_N).
• The first item is selected randomly from the numbers 1 to k, where k = N/n and n is the
sample size.
• The subsequent items are selected at every k-th
interval to produce n samples.
Example:
Suppose a lecturer wants to study the physical fitness levels of students at his/her university
and he/she wants to draw a sample of size 100 to take a physical fitness test. She obtains a list
of all 5000 students, numbers it from 1 to 5000 and randomly picks one of the first 50 students
(k = 5000/100 = 50) on the list. If the first picked number is 30, then the 30th student on the list
should be invited first. Then she should invite every 50th name on the list after this
random start (the 80th student, the 130th student and so on) to produce a sample of 100
students to participate in the study.
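The systematic procedure above can be sketched as follows, assuming N is an exact multiple of n so that k = N/n is a whole number (as with 5000 and 100). The function name is our own.

```python
import random

def systematic_sample(N, n, rng):
    """Pick a random start in 1..k, then every k-th item, where k = N // n."""
    k = N // n                       # sampling interval (assumes N is a multiple of n)
    start = rng.randint(1, k)        # random start among the first k items
    return [start + i * k for i in range(n)]

rng = random.Random(7)               # fixed seed so the run is repeatable
chosen = systematic_sample(5000, 100, rng)
print(chosen[:3], "...", chosen[-1])
```

If the random start happens to be 30, the list begins 30, 80, 130, matching the worked example.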
3. Stratified sampling
• The population is divided into groups
according to some characteristic that is
important to the study, and then the sample
is selected from each group using random or
systematic sampling.
• The characteristics are homogeneous
(similar) within each group but
heterogeneous (dissimilar) among the groups.
Example:
Assume that, because of different lifestyles, the level of physical fitness is different
between male and female students. To account for this variation in lifestyle, the population
of students can easily be stratified into male and female students.
The random method or systematic method can be used to select the participants. For
example, she can use random sampling to choose 50 male students and systematic sampling
to choose another 50 female students, or vice versa.
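A minimal sketch of stratified sampling for the male/female example is given below. The sampling frame is hypothetical (IDs 1 to 2600 are taken to be male students and 2601 to 5000 female students), and simple random sampling is used inside both strata for brevity.

```python
import random

def stratified_sample(strata, sizes, rng):
    """Draw a simple random sample of the requested size from every stratum."""
    return {name: rng.sample(members, sizes[name]) for name, members in strata.items()}

rng = random.Random(3)  # fixed seed so the run is repeatable
# Hypothetical sampling frame: the ID ranges are invented for illustration
strata = {
    "male": list(range(1, 2601)),
    "female": list(range(2601, 5001)),
}
sample = stratified_sample(strata, {"male": 50, "female": 50}, rng)
print({name: len(ids) for name, ids in sample.items()})
```

Because every stratum contributes a fixed number of students, both groups are guaranteed to appear in the sample, which a single simple random sample does not ensure.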
4. Cluster sampling
• The population is divided into groups or
clusters, then some of those clusters are
randomly selected and all members from
those selected clusters are chosen.
• Cluster sampling can reduce cost and time.
• Members within each cluster are heterogeneous,
while the clusters themselves are similar
(homogeneous) to one another.
• We can choose more than one cluster.
Example:
Assume that, because of different lifestyles, the level of physical fitness is different
between 1st year, 2nd year, 3rd year and senior students. To account for this variation in
lifestyle, the population of students can easily be clustered into four categories.
Then, she can randomly choose one or more clusters and select all students in those clusters as the
participants. For example, all 2nd year students are chosen as the participants.
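Cluster sampling for the year-of-study example can be sketched as below. The cluster sizes and ID ranges are hypothetical, and the function name is our own; the point is that whole clusters are selected at random and every member of a selected cluster is included.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep every member of each one."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    members = [m for name in chosen for m in clusters[name]]
    return chosen, members

rng = random.Random(5)  # fixed seed so the run is repeatable
# Hypothetical clusters of student IDs by year of study
clusters = {
    "1st year": list(range(1, 1501)),
    "2nd year": list(range(1501, 2801)),
    "3rd year": list(range(2801, 4001)),
    "senior": list(range(4001, 5001)),
}
chosen, members = cluster_sample(clusters, 2, rng)
print(chosen, len(members))
```

Unlike stratified sampling, the sample size here is not fixed in advance: it depends on which clusters happen to be drawn.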
Advantages and Disadvantages for each
Sampling Technique
Judgement Sampling
When to use: When the population is too large.
Advantages:
- Fast and conclusive.
Disadvantages:
- Biased, since it is based on the opinion of one or more experts only.
Voluntary Sampling
When to use: When the members of the population are convenient to be sampled.
Advantages:
- Fast response.
- Easy to obtain larger sample sizes.
Disadvantages:
- Sampling is too random.
- Sometimes not reliable.
- Degree of generalisability is questionable.
Convenience Sampling
When to use: When the members of the population are convenient to be sampled.
Advantages:
- Fast and easy.
- Convenient and inexpensive.
Disadvantages:
- Sampling is too random.
- Sometimes not reliable.
- Degree of generalisability is questionable.
Random Sampling
When to use: When the members of the population are similar to one another on important variables.
Advantages:
- Uses a table of random numbers.
- Each data item has an equal chance of being selected.
- Ensures a high degree of representativeness.
Disadvantages:
- High cost.
- Time consuming for large sample sizes.
- Tedious.
Systematic Sampling
When to use: When the members of the population are similar to one another on important variables.
Advantages:
- Relatively easy to construct, execute, compare and understand.
- The process can be controlled.
- Good for research on a tight budget.
- Ensures a high degree of representativeness.
- No need to use a table of random numbers.
Disadvantages:
- There is a risk of data manipulation.
- Not the best method if the researcher does not know the background of the population.
- Less random than simple random sampling.
Stratified Sampling
When to use: When the population is heterogeneous and contains several different groups, some of which are related to the topic of the study.
Advantages:
- Variety of samples.
- Ensures a high degree of representativeness of all the strata or layers in the population.
Disadvantages:
- Time consuming.
- Tedious.
Cluster Sampling
When to use: When the population consists of units rather than individuals.
Advantages:
- Less energy and money.
- Easy and convenient.
- Saves time.
Disadvantages:
- Members of units may be different from one another, decreasing the technique's effectiveness.
Random Data Generation
From Normal Distribution
X ~ N(μ, σ²) or Z ~ N(0, 1),
where μ is the mean and σ² is the variance.
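Outside a spreadsheet, normal random data can be generated with Python's standard library, as in the sketch below. The values μ = 50 and σ = 10 are illustrative assumptions. Note that `random.gauss` takes the standard deviation σ, not the variance σ².

```python
import random
import statistics

rng = random.Random(42)   # fixed seed so the run is repeatable
mu, sigma = 50, 10        # illustrative values; sigma is the standard deviation

# X ~ N(mu, sigma^2)
x = [rng.gauss(mu, sigma) for _ in range(10000)]

# Standardising gives Z = (X - mu) / sigma, which follows N(0, 1)
z = [(value - mu) / sigma for value in x]

print(round(statistics.mean(x), 2), round(statistics.variance(x), 2))
```

With 10000 values the sample mean and variance should land close to 50 and 100, though they will not match exactly.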
Random Data Generation
From Poisson Distribution
X ~ Po(λ), where λ is the average value.
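Python's standard `random` module has no Poisson generator, so the sketch below uses Knuth's multiplication method, a simple technique that is adequate for small λ. This is a swapped-in illustration, not the spreadsheet procedure from the slides; the function name and λ = 4 are our own assumptions.

```python
import math
import random
import statistics

def poisson_draw(lam, rng):
    """One Poisson(lam) value via Knuth's multiplication method (small lam only)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    # Keep multiplying uniform random numbers until the product drops below e^(-lam)
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

rng = random.Random(11)   # fixed seed so the run is repeatable
draws = [poisson_draw(4, rng) for _ in range(10000)]
print(round(statistics.mean(draws), 2))
```

For a Poisson distribution both the mean and the variance equal λ, so the printed sample mean should be near 4.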
EXERCISE 1.2.3
In each of these statements, identify the type of sampling method used.
a) Suppose a researcher has a list of 1000 registered voters in a
community and he wants to pick a probability sample of 50 voters.
He uses a random number table to pick one of the first 20 voters
(1000/50 = 20) on the list. The table gives him the number 16, so he
selects the 16th voter on the list first. Then he
picks every 20th name after this random start (the 36th
voter, the 56th voter, etc.) until 50 voters are obtained.
b) In a consumer survey of large cities, a researcher divides a map of the
city into small blocks. Each block containing a cluster is surveyed. A
number of clusters are selected for the sample, and all the households
in a cluster are surveyed. Less energy and money are expended if an
interviewer stays within a specific area rather than traveling across
stretches of the cities.
c) Researchers or farm managers may be called in when a crop shows a certain
growing pattern or when surface differences are observed for a soil. For
example, differences may occur in soil color which may be the result of many
factors. A researcher is called to judge a particular shade of colour to be
typical for a sample at certain sites. Then from these sites, samples are
drawn.
d) The population of university professors is divided into groups according to
their rank (instructor, assistant professor, etc.) and several are selected from
each group to make up a sample.
e) A surveyor stands outside a shop in the East Coast Mall and randomly selects
people to participate in a quiz.
f) A quality engineer wants to inspect rolls of wallpaper in order to obtain
information on the rate at which flaws in the printing are occurring. She
decides to draw a sample of 50 rolls of wallpaper from a day’s production. At
the end of each hour, for 5 consecutive hours, she takes the 10 most
recently produced rolls and counts the number of flaws on each.
MIND EXPANDING EXERCISES
1. Statistics can be applied across many disciplines or any fields of
research and almost in all fields in human endeavour. Based on this
statement, suggest reasons why statistics is important.
2. Is a large sample necessarily a good sample? Why or Why not?
3. Suppose you have been hired by a radio station in Malaysia to
determine the age distribution of their listeners. Describe in detail
how you would select a sample of at least 3000 listeners. Choose the
best sampling techniques and state the reason. The sampling
techniques can be mixed or combined.
1.2.4 Classifying and Summarising
the Data
 In this step, the collected data are organised properly for further study and
investigation.
 Data that have been collected during the sampling process are called raw data.
 The simplest way to organise raw data systematically is by using a data array.
A data array is an arrangement of data items in either ascending or
descending order (sorting).
1.2.4.1 Classifying
 Identifying items with the same characteristics and arranging them into
groups or classes.
 Data could be classified by type or by level of measurement.
1.2.4.2 Summarisation
 Graphical and descriptive statistics (tables, charts, measures of central
tendency, measures of variation, measures of position)
Example of Raw Data
Data can be organised
by column or row
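Turning raw data into a data array is simply a matter of sorting. A small sketch with made-up values:

```python
raw = [23, 7, 15, 42, 8, 15, 30]   # made-up raw data, in the order collected

ascending = sorted(raw)                  # data array, smallest to largest
descending = sorted(raw, reverse=True)   # data array, largest to smallest

print(ascending)    # [7, 8, 15, 15, 23, 30, 42]
print(descending)   # [42, 30, 23, 15, 15, 8, 7]
```

Once the data are ordered, the smallest and largest values, repeated values and the middle value are easy to read off.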
1.2.4.1 Data Classification
 Data are the values that variables can assume.
 A variable is a characteristic or attribute that can assume different values.
 Variables whose values are determined by chance are called random
variables.
Data can be classified:
• as quantitative or qualitative type
• by how they are categorised, counted or measured (level of measurement of data)
Type of Data
Qualitative (categorical/attributes)
 Data that refer to a classification name according to some characteristic or attribute
 Data are classified using code numbers (1, 2, …)
• Nominal data: the values cannot be ranked. Examples: gender, race, citizenship, colour, etc.
• Ordinal data: the values can be ranked, and a Likert scale may be used. Examples: feeling (dislike-like), colour (dark-bright), etc.
Quantitative (numerical)
 Data can be counted or measured
 Data can be ordered or ranked
• Discrete data: the values can be counted and are finite. Examples: number of students, number of cats, number of defects, etc.
• Continuous data: the values can fall anywhere between two specified values, are obtained by measuring, have boundaries, and are rounded to the required number of decimal places. Examples: weight, age, salary, temperature, etc.
Levels of Measurement of Data
Nominal-level
Description: Classifies data into mutually exclusive (non-overlapping), exhaustive categories in which no order or ranking can be imposed on the data.
Examples: Zip code (4, 5, 6, …), post code (25000, 25600, …), gender (female, male), eye colour (blue, brown, green, hazel), political affiliation, religion, nationality.
Ordinal-level
Description: Classifies data into categories that can be ranked; however, precise differences between the ranks do not exist.
Examples: Grade (A, B, C, D, etc.), judging (first place, second place, etc.), rating scale (poor, good, excellent), colour (light blue, …, dark blue).
Interval-level
Description: Ranks the data, and precise differences between units of measure do exist; however, there is no meaningful zero.
Examples: IQ test, temperature, shoe size.
Ratio-level
Description: Possesses all the characteristics of interval measurement, and there exists a true zero.
Examples: Height, weight, time, salary.
EXERCISE 1.2.4.1
1. The SuperMotor Marketing Corporation has asked you for information
about the car you drive. For each question, identify the type of data
requested as either attribute data or numeric data. When attribute data are
requested, identify the variable as either nominal or ordinal. When
numeric data are requested, identify the variable as either discrete or
continuous. Then, identify the level of measurement for each variable.
a) What is the weight of your car?
b) In what city was your car made?
c) How many people can be seated in your car?
d) What is the distance travelled from your home to your school?
e) What is the colour of your car?
f) How many cars are in your household?
g) What is the length of your car?
h) What is the normal operating temperature (in °C) of your car’s engine?
i) What petrol mileage (km/l) do you get in city driving?
j) Who made your car?
k) How many cylinders are there in your car’s engine?
l) How many kilometres have you put on your car’s current set of tyres?
2. The chart shows the number of job-related injuries for each of the
transportation industries for 1998.
a) What are the variables under study?
b) Categorise each variable either as qualitative or quantitative.
c) Categorise each quantitative variable either as discrete or
continuous.
d) Categorise each qualitative variable either as nominal or ordinal.
e) Identify the level of measurement for each variable.
Type of transportation industry | Number of job-related injuries
Railroad | 4520
Intercity bus | 5100
Subway | 6850
Trucking | 7144
Airline | 9950
EXERCISE 1.2.4.1
1.2.4.2 Data Summarisation
1) Descriptive statistics (refer Section 1.3)
 Typically used to confirm conjectures about the data.
 Quantitative data: measures of central tendency, measures of
variation (dispersion) and measures of position.
 Qualitative data (non-numeric quality (attribute) or category):
measure the relative frequency for a particular characteristic
and calculate its percentage.
2) Graphical summary
 Organise the data in some meaningful way by constructing a
frequency distribution (refer Appendix A.1) for quantitative or
qualitative data.
 A frequency distribution is the organisation of raw data in
table form, using classes and frequencies.
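A grouped frequency distribution can be sketched in a few lines of Python. This is only an illustration: the scores, the class width of 10, the starting value of 50 and the function name `frequency_distribution` are assumptions made here, not taken from the slides or from Appendix A.1. The sketch assumes whole-number data with no value below the starting value.

```python
from collections import Counter

def frequency_distribution(data, start, width):
    """Count how many whole-number values fall in each class of equal width."""
    # Class index 0 covers start..start+width-1, index 1 the next class, etc.
    counts = Counter((x - start) // width for x in data)
    table = []
    for k in range(max(counts) + 1):
        lower = start + k * width
        table.append((lower, lower + width - 1, counts.get(k, 0)))
    return table

# Hypothetical test scores (made up for illustration).
scores = [52, 67, 71, 58, 93, 85, 76, 64, 79, 88, 61, 73]
for lower, upper, freq in frequency_distribution(scores, start=50, width=10):
    print(f"{lower}-{upper}: {freq}")
```

Each printed row is one class with its frequency, which is the table form that a histogram or frequency polygon is drawn from.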
Graphical Statistics
The purpose of graphs in statistics is to convey the data to the viewer in pictorial
form and to get the audience’s attention in a publication or a presentation.
Histogram Frequency Polygon Ogive Bar Chart
Pareto Chart Pie Chart Time Series Graph
Histogram, Frequency
Polygon, Ogive
Histogram
 For quantitative data.
 Describe grouped
frequency data
distribution.
 Displays the data by using
contiguous vertical bars of
various heights to represent
the frequency of the classes.
Frequency Polygon
 For quantitative data.
 Describe grouped frequency
data distribution.
 Displays the data by using
lines that connect points
plotted for the frequencies at
the midpoints of the classes.
 The frequencies are represented
by the heights of the points.
Ogive
 For quantitative data.
 Represents the cumulative
frequencies for the classes in a
grouped frequency data
distribution.
 Visually represent how many
values are below a certain upper
class boundary.
Distribution Shapes for Histogram
Bell-Shaped Uniform J-Shaped Reverse J-Shaped
Right Skewed Left Skewed Bimodal U-Shaped
Bar Chart, Pareto Chart,
Pie Chart
Bar Chart
 For quantitative data, the bars
represent the mean values.
 For qualitative data, the heights or lengths
of the bars represent the
frequencies of the data.
 The bars can be vertical or
horizontal.
Pareto Chart
 Used to represent a frequency
distribution for a categorical
variable.
 The frequencies are displayed
by the heights of vertical bars
which are arranged in
decreasing order.
Pie Chart
 A circle that is divided into
sections or wedges according
to percentage of frequencies in
each category of the
distributions.
 Pie charts show the relationship
between classes in a set of data
with the whole data.
Stem and Leaf Plot, Time
series graph
Time Series Graph
 Represents data that occur over
a specific period of time.
 For analysis, we look at the
trend or pattern (increasing or
decreasing) that occurs over the
time period.
 Further analysis will look at the
slope or the steepness of the line
(rapid increase or decrease).
Stem and leaf plot
 The leading digit is plotted as the stem and the trailing digit as the leaf to
form groups or classes.
 A key indicator is used to define the stem and leaf values.
 If the plot is rotated in horizontal position, we can see the shape of the
data distribution
 For a mixture stem and leaf plot, the shape of distribution for the left side
may be seen by reflecting the plot to the right side.
 We may analyse the variability of the data by looking at the spread of the
stem and leaf plot.
 A stem and leaf plot is also good in showing the range, minimum,
maximum, mode, gaps, clusters, and outliers.
Selection of appropriate statistical techniques for data summarisation
Type of Data | Descriptive Statistics | Graphical Summary
Quantitative (ratio scale) | Mean, Median, Mode, Range, Standard Deviation, Interquartile range (IQR = Q3 − Q1) | Histogram, Bar Chart (bars representing means), Stem and leaf plot, Boxplot
Symmetrical Distribution | Mean, Median, Mode, Range, Standard Deviation | Histogram, Bar Chart (bars representing means)
Skewed Distribution | Median, Range, Interquartile range (IQR = Q3 − Q1) | Histogram, Stem and leaf plot, Boxplot
Categorical (Nominal) | Mode, Counts, Percentage | Pie Chart, Bar Chart
Categorical (Ordinal, Likert Scale) | Mode, Mean, Counts, Percentage | Pie Chart, Bar Chart
1.2.5 Presenting and
Analysing the Data
 Analyse the information given by the
 Descriptive statistics (refer topic 1.3)
 Graphical summary (graph and chart)
 Identify whether any relationship exists among the variables under
study.
 Make any relevant statistical inferences
 confidence interval, hypothesis testing, ANOVA, goodness of fit
test, contingency table, regression, correlation, etc.
BASIC INFERENTIAL STATISTICS
Statistical Analysis Characteristics
Confidence Intervals
(CHAPTER 2)
An estimated range of values which is likely to include an unknown population
parameter 𝜃, with a specified probability (confidence level).
The interval is usually written as (𝒂, 𝒃) or 𝒂 < 𝜽 < 𝒃.
Hypothesis Testing
(CHAPTER 3)
A procedure for testing a statement (claim, conjecture or assertion) concerning a
parameter or parameters of one or more populations.
• Statistical Analysis for one population (mean, variance, proportion)
• Statistical Analysis for two populations (mean, variance, proportion)
Analysis of Variance
(ANOVA)
(CHAPTER 4)
Statistical Analysis for the means of three or more populations
• One-way ANOVA
• Two-way ANOVA and Post Hoc Test
Linear Regression
Analysis
(CHAPTER 5)
A statistical measure that attempts to determine the strength of relationship
between dependent (y) and independent variables (x).
• Simple linear regression analysis and correlation. (y vs x)
• Multiple linear regression analysis and correlation. (y vs xi)
• Model selection technique to choose a parsimonious model that best fits the data.
Statistical Analysis for
Categorical Data
(CHAPTER 6)
1. Tests concerning frequency distributions for categorical data
(Goodness of Fit)
2. Tests concerning specific probability distributions (Goodness of Fit)
3. Test the Independence of two variables (Contingency Table)
4. Test the homogeneity of proportions (Contingency Table)
ADVANCED INFERENTIAL STATISTICS
Statistical Analysis Characteristics
Experimental
Design (DOE)
Planning, conducting, analysing and interpreting controlled tests to evaluate the factors
that control the value of a parameter or group of parameters.
Example: ANOVA, Single factor experiment, Randomized Blocks, Latin Squares and
Related Design, Factorial Design, Response Surface Methodology, Nested and Split-Plot
Design
Time Series
Analysis
Modelling, making inference and producing forecast time series data for future
observations. Time series models are built to represent the serially correlated series,
trends, or seasonal effects.
Example: Linear Time Series, Linear Stationary Models (AR, MA, ARMA), Linear
Nonstationary Models (ARIMA, SARMA), Box-Jenkins Models, Volatility Models (ARCH,
GARCH), Hybrid models
Multivariate
Analysis
A central tool whenever many variables need to be considered at the same time.
Example: Mean Vector and Covariance Matrix Estimation, MANOVA, Principal
Component Analysis, Factor Analysis, Canonical Correlation Analysis, Discriminant
Analysis, Cluster Analysis
Statistical Quality
Control (SQC)
Quality improvement through the use of modern statistical methods for quality control
Example: Variables control charts, Attribute Control Charts, Time-Weighted Control
Charts, Multivariate Control Charts
ADVANCED INFERENTIAL STATISTICS
Statistical Analysis Characteristics
Statistical
Modelling
A set of mathematical equations that relates one or more random variables and possibly
other non-random variables, concerning the generation of some sample data and
similar data from a larger population.
• Examples of statistical models: Generalised Linear Model, Dependence model,
Regression, Bayesian, Markov chain, Random effect and mixed model
• The process involves: parameter estimation, data generation, missing values,
outlier detection, simulation study, bootstrap, goodness of fit test
Data Mining A computing process of discovering patterns in large data sets involving methods at
the intersection of machine learning, statistics, and database systems.
Example: Decision Tables, Decision Trees, Classification Rules, Association Rules,
Clustering, Advanced linear model, Bayesian, Instance-based Learning
Circular Statistics A branch of statistics that involves circular data, which deal with direction or cyclic
time. Circular data are measured in degrees (0°, 360°] or radians (0, 2π].
Example: orientation of an animal, direction of wind and wave, days of the week,
compass direction, waves of sound, the human perception under various conditions,
the orientation of ridges of fingerprints, the orientation of sand grains from a beach,
the death due to a disease at various times in a year, and astronomical observations.
ADVANCED INFERENTIAL STATISTICS
Statistical Analysis Characteristics
Advanced Regression
Analysis
• Polynomial Regression: y is modelled as an nth degree polynomial in x
• Multivariate Regression: Y is a matrix with series of multivariate dependent
measurements and X is a matrix of observations on independent variables.
• Generalized Linear Model: A flexible generalization of ordinary linear
regression that allows for response variables that have error distribution
models other than a normal distribution.
• Logistic Regression: A regression model where the dependent variable is
categorical.
• Nonlinear Regression: The observational data are modeled by a function
which is a nonlinear combination of the model parameters and depends on
one or more independent variables
• Error in Variables: a regression model that accounts for measurement errors
in the independent variables.
1.2.6 Make the decision
and conclusion
 The researchers can make decisions in order to achieve the
objective and goal of the research and choose the option
that represents the ‘best’ solution to the problem.
 The correctness of this choice depends on the analytical skill of
the researchers and quality of the information.
1.3 REVIEWS ON
DESCRIPTIVE
STATISTICS
 Summarise the data using measures of central
tendency, such as the mean, median, mode, and
midrange.
 Describe the data using measures of variation, such
as the range, variance, standard deviation and
coefficient of variation.
 Identify the position of a data value in a data set
using measures of position such as quartiles, deciles,
and percentiles.
Reviews on
Descriptive Statistics
 Descriptive statistics is typically used to confirm conjectures
about the data.
 We can summarise data using measures of central tendency,
measures of variation, and measures of position.
 Some classify these types of measures as traditional
statistics.
 If the measurement describes a population
characteristic, it is called a parameter.
 If the measurement describes a sample characteristic,
it is called a statistic.
RULE OF THUMB FOR DECIMAL
PLACES
1. In general, the calculated parameter or statistic value should
be rounded to four (4) decimal places.
2. If the unit is given (in cm, minute, day, etc.), the value should
be rounded to that unit’s decimal places.
TIPS: Descriptive Statistics using
Scientific Calculator
Note:
The notations used in the calculator are $n$ for the sample size, $\bar{x}$ for the sample mean,
$x\sigma_n$ or $\sigma_x$ for the population standard deviation, and
$x\sigma_{n-1}$ or $s_x$ for the sample standard deviation.
Casio fx-570MS
STEP 1: Insert data → MODE, SD, insert data, M+, AC
STEP 2: Data summary
Shift 1 →
Shift 2 →
STEP 3: Clear data → Shift CLR 1
Casio fx-570ES
STEP 1: Insert data → MODE 3: STAT, 1: 1-VAR, insert data, =, AC
STEP 2: Data summary:
Shift 1 → 3: Sum →
Shift 1 → 4: Var →
STEP 3: Clear data → Shift 9
1.3.1 Measures of Central Tendency
 Measures of central tendency are also called measures of
average
1. mean
2. median
3. mode, and
4. midrange.
 The measures of central tendency are used to describe an
entire set of observations with a single value representing the
central or middle value of the data set.
 They can roughly describe the shape of the distribution of a
data set.
Midrange (MR)
Is a rough estimate of the middle.
$\mathrm{MR} = \dfrac{\text{lowest value} + \text{highest value}}{2}$
EXAMPLE 1.3:
If the data set is 1, 3, 5, 7, 7, 8, then the calculated midrange is
$\mathrm{MR} = \dfrac{1 + 8}{2} = 4.5$.
Properties of Midrange
 A rough estimate of the average.
 Can be affected by one extremely high or low value (outlier).
Mean
Is the sum of the values divided by the total number of values.
Population mean: $\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$, where $N$ = population size.
Sample mean: $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$, where $n$ = sample size.
If the data set is 1, 3, 5, 7, 7, 8, then
‒ the calculated mean is $\mu = 5.1667$ if the data is taken from the population.
The value is a true mean or a parameter.
‒ the calculated mean is $\bar{x} = 5.1667$ if the data is taken from the sample.
The value is a sample mean or a statistic.
Median
Is the middle number of n ordered data (smallest to largest).
If n is odd: $\text{Median (MD)} = x_{\frac{n+1}{2}}$
If n is even: $\text{Median (MD)} = \dfrac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}$
If the data set is 1, 3, 5, 6, 7, then the calculated median is $\text{Median} = x_3 = 5$.
If the data set is 1, 3, 5, 7, 7, 8, then the calculated median is $\text{Median} = \dfrac{x_3 + x_4}{2} = \dfrac{5 + 7}{2} = 6$.
Mode
Is the most commonly occurring value in a data series.
EXAMPLE 1.4:
a) If the data set is 1, 6, 3, 7, 8, 5, then the mode does not exist.
b) If the data set is 1, 6, 3, 7, 8, 3, 5, then the mode is 3.
c) If the data set is 1, 6, 3, 7, 3, 8, 7, 5, 3, 7, then the modes are 3 and 7.
Properties of Mode
 The mode is used when the most typical case is desired.
 The mode can be used when the data are nominal.
 The mode is not always unique.
 A data set can have more than one mode, or the mode may not
exist for a data set.
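The three cases in Example 1.4 (no mode, one mode, two modes) can be reproduced with a short Python sketch. The helper name `modes` is hypothetical, and it follows the convention used on this slide that a data set in which every value occurs only once has no mode.

```python
from collections import Counter

def modes(data):
    """Return all modes in ascending order; an empty list means no mode."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:              # every value occurs once: no mode exists
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([1, 6, 3, 7, 8, 5]))               # []
print(modes([1, 6, 3, 7, 8, 3, 5]))            # [3]
print(modes([1, 6, 3, 7, 3, 8, 7, 5, 3, 7]))   # [3, 7]
```

Because the function only counts occurrences, it also works for nominal data such as a list of eye colours.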
Identify the Shapes of Data Distribution
Symmetric: Mean = Median = Mode
Positively skewed / right-skewed: Mean > Median > Mode
Negatively skewed / left-skewed: Mean < Median < Mode
→ In reality, the median can be greater than the mode or the mean.
→ The shape of the distribution may be identified by observing the
position of the mode value.
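The usual rule of thumb (Mean = Median = Mode for a symmetric shape, Mean > Median > Mode for a right-skewed shape, Mean < Median < Mode for a left-skewed shape) can be written as a small Python sketch. The function name `distribution_shape` and the "undetermined" fallback are assumptions added here for orderings the rule does not cover.

```python
def distribution_shape(mean, median, mode):
    """Classify the shape from the ordering of mean, median and mode."""
    if mean == median == mode:
        return "symmetric"
    if mean > median > mode:
        return "right-skewed"
    if mean < median < mode:
        return "left-skewed"
    return "undetermined"   # the rule of thumb does not apply

print(distribution_shape(11, 11, 11))   # symmetric
print(distribution_shape(25, 17, 13))   # right-skewed
print(distribution_shape(5, 17, 73))    # left-skewed
```

This is a rough guide only; a histogram or stem and leaf plot remains the more reliable way to judge the shape.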
EXAMPLE 1.3
If the data set is 1, 3, 5, 7, 7, 8, then
‒ the calculated mean is $\mu = 5.1667$ if the data is taken from the
population. The value is a true mean or a parameter.
‒ the calculated mean is $\bar{x} = 5.1667$ if the data is taken from the sample.
The value is a sample mean or a statistic.
‒ the calculated median is $\text{Median} = \dfrac{x_3 + x_4}{2} = 6$.
‒ the mode is 7.
‒ the shape of distribution is negatively skewed since
Mean < Median < Mode.
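The values in Example 1.3 can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen here for illustration, and the midrange is computed by hand because the module has no midrange function.

```python
import statistics

data = [1, 3, 5, 7, 7, 8]

mean = statistics.mean(data)               # sum of values / number of values
median = statistics.median(data)           # average of the two middle values
mode = statistics.mode(data)               # most frequent value
midrange = (min(data) + max(data)) / 2     # (lowest + highest) / 2

print(round(mean, 4), median, mode, midrange)   # 5.1667 6.0 7 4.5
```

Since 5.1667 < 6 < 7, the ordering Mean < Median < Mode agrees with the negatively skewed shape stated in the example.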
Properties of Mean and Median
 The mean is unique, and not necessarily one of the data values.
 The mean is affected by extremely high or low values; when such values occur, the
mean may not be the appropriate average to use.
 The mean is used in computing other statistics, such as variance.
 The mean cannot be computed for an open-ended frequency distribution.
 The mean varies less than the median or mode when samples are taken from
the same population and all three measures are computed for these samples.
 The mean is not an appropriate average to use if the shape of distribution is
skewed.
 The median is used when one must find the center or middle value of a data
set.
 The median is used when one must determine whether the data values fall into the upper half or
lower half of the distribution.
 The median is affected less than the mean by extremely high or extremely low
values.
EXAMPLE 1.5
An extreme value, say 21, is added to the data set in Example 1.3. The new
data set is 1, 3, 5, 7, 7, 8, 21. Assume that the data is taken from a sample; then
‒ the calculated mean is $\bar{x} = 7.4286$. The mean is easily affected by
outliers and may not be the appropriate average to use. This new average
value no longer represents the centre of the data set.
‒ the calculated median is $\text{Median} = x_4 = 7$. This new average value
still represents the centre of the data set.
‒ the mode is 7.
‒ the calculated midrange is $\mathrm{MR} = \dfrac{1 + 21}{2} = 11$. The midrange is easily
affected by outliers.
‒ the shape of distribution is positively skewed since the mean (7.4286) is
greater than the median and the mode (both 7).
An extremely high or low data value that occurs in a data set is called an outlier.
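The effect of the outlier in Example 1.5 can be seen by computing the same measures before and after the value 21 is added. The helper name `summary` is an assumption for this sketch.

```python
import statistics

def summary(data):
    """Return (mean rounded to 4 d.p., median, midrange)."""
    return (round(statistics.mean(data), 4),
            statistics.median(data),
            (min(data) + max(data)) / 2)

original = [1, 3, 5, 7, 7, 8]
with_outlier = original + [21]

print(summary(original))       # (5.1667, 6.0, 4.5)
print(summary(with_outlier))   # (7.4286, 7, 11.0)
```

The mean and the midrange move noticeably (by about 2.26 and 6.5), while the median shifts only from 6 to 7, which is why the median is preferred when outliers are present.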
EXERCISE 1.3.1
1. Determine the shape of distribution of the following
data.
a) Mean = Mode = Median = 11
b) Mean = 25, Mode = 13, Median = 17
c) Mean = 5, Mode = 73, Median = 17
d) 11.4, 11.6,12.6,12.7, 12.8, 13.3, 13.3, 13.6, 13.7,
13.8
Answers: a) symmetric b) right-skewed c) left-skewed d) Mean = 12.88, Median = 13.05, Mode = 13.3, left-skewed
EXERCISE 1.3.1
2. The following set of data represents the number of hospitals
for selected countries.
123 108 195 138 115 179 119 148 147 180
146 178 189 108 193 114 179 147 108 128
164 174 128 159 193 175
a) Find the mean, median, mode, and midrange.
b) Are the average values calculated in (a) parameters or
statistics? Why?
c) What is the distribution type that describes the data?
d) What is the best measure of average of this set of data?
Why?
Answers: a) Mean = 151.3462, Median = 147.5, Mode = 108, Midrange = 151.5 b) statistic c) right-skewed d) median
1.3.2 Measures of Variation/Dispersion
 Measures of variation or measures of dispersion are measures
that determine the spread of data values.
1. Range: the simplest measure of variation
2. Variance, and
3. Standard deviation.
4. Coefficient of Variation
 Measures of variation may help researchers to describe data
more accurately.
 Variance and standard deviation are used quite often in
inferential statistics.
more meaningful and popular
measures that describe the
variability of data
Range (R)
Is the difference between the highest value and the lowest value in a
data set.
R = highest value − lowest value
Properties of Range
 The simplest measure of variation.
 Easily affected by one extremely high or low value (outlier).
EXAMPLE 1.6:
Suppose the data set is 1, 6, 3, 7, 8, 5, then the calculated range is $R = 8 - 1 = 7$.
Variance
Is the average of the squares of the distance each value is from the mean.
Population variance: $\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, where $N$ = population size.
Sample variance: $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, where $n$ = sample size.
Standard Deviation
Is the square root of the variance.
Population standard deviation: $\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$, where $N$ = population size.
Sample standard deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$, where $n$ = sample size.
Properties of Variance & Standard Deviation
 The variance is the average of the squares of the distance each value
is from the mean.
 If the data values are near the mean, the variance will be smaller.
 If the data values are far from the mean, the variance will be larger.
 The squared distance is used since the sum of the signed distances from the mean is
always zero.
 Variance is never negative (it is zero only when all the values are equal).
 The unit of the variance is the square of the unit of the data.
 Standard deviation is the square root of the variance.
 Standard deviation is a measure of the deviations of values from the
mean.
 Standard deviation is never negative.
 The unit of the standard deviation is the same as the unit of the data.
Coefficient of Variation
Is the standard deviation divided by the mean.
Population CVar: $\text{CVar} = \dfrac{\sigma}{\mu} \times 100\%$
Sample CVar: $\text{CVar} = \dfrac{s}{\bar{x}} \times 100\%$
Properties of CVar
 The result is expressed as a percentage.
 A parameter/statistic that allows the user to compare the standard deviations
when the units are different (the variables are different).
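The population and sample versions of the variance, standard deviation and coefficient of variation differ only in the divisor (N versus n − 1). A minimal Python sketch using the standard `statistics` module and the data set 1, 6, 3, 7, 8, 5 from Example 1.6 shows both side by side; the variable names are illustrative.

```python
import statistics

data = [1, 6, 3, 7, 8, 5]

pop_var = statistics.pvariance(data)    # divides by N
samp_var = statistics.variance(data)    # divides by n - 1
pop_sd = statistics.pstdev(data)        # square root of pop_var
samp_sd = statistics.stdev(data)        # square root of samp_var
cvar_sample = samp_sd / statistics.mean(data) * 100   # in percent

print(round(pop_var, 4), round(pop_sd, 4))     # 5.6667 2.3805
print(round(samp_var, 4), round(samp_sd, 4))   # 6.8 2.6077
print(round(cvar_sample, 2))                   # 52.15
```

The sample values are always slightly larger than the population values for the same data, because the sum of squared deviations is divided by the smaller number n − 1.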
EXAMPLE 1.6
Suppose the data set is 1, 6, 3, 7, 8, 5, then
‒ the calculated range is $R = 8 - 1 = 7$.
‒ the calculated variance is $\sigma^2 = 5.6667$ and the standard deviation is $\sigma = 2.3805$
if the data is taken from the population. These values are called parameters.
‒ the calculated variance is $s^2 = 6.8$ and the standard deviation is $s = 2.6077$ if the
data is taken from the sample. These values are called statistics.
‒ the calculated sample mean is $\bar{x} = 5$. Hence the sample coefficient of variation
is $\text{CVar} = \dfrac{2.6077}{5} \times 100\% = 52.15\%$.
Why We Need Measures of Variation
• Measures of variation help to judge how well the
measures of average illustrate or depict the data.
• They are called measures of variation because they measure the
variability that exists in a data set.
• They can be used when the measures of central tendency do not give
any significant meaning or are not needed/practical.
EXAMPLE:
Suppose we wish to compare the performance of two groups of students
in a test, and the mean values are the same for both data sets.
You might then conclude that the two groups of students
performed equally well in the test. However, if the data sets are
examined graphically as shown in Figure 1.10, a different conclusion
might be drawn.
Examining Data Sets Graphically
 Both groups have the same total number of students.
 The students are given the same test and the mean score is
calculated as 66.67 marks for each group of students.
 The mean values are the same but the spread or variation of the
test scores is quite different.
 The test scores for students from Group B are more consistent and
less variable.
 When the mean values are equal, the larger the data range is, the
more variable the data.
Comparing Two Data Sets
A smaller standard deviation, $\sigma_1 < \sigma_2$, indicates that:
POPULATION 1 is
 Less dispersed
 Less spread
 Less variable (small variation)
 More consistent
 More precise
 More accurate
 Better data
POPULATION 2 is
 More dispersed
 More spread
 More variable (large variation)
 Less consistent
 Less precise
 Less accurate
 Worse data
The same interpretation is applicable for ranges and variances.
EXAMPLE 1.7
The following data represents the age (in years) of lecturers in two faculties at UMP.
FIST: 24, 25, 26, 27, 30, 31, 31, 32, 36, 40, 43, 44, 45
FKEE: 22, 25, 25, 25, 28, 33, 34, 36, 37, 40, 41, 43, 48, 51, 53
For these sample data sets, find the standard deviations. Then, identify which data set
is more consistent and less dispersed. What can you say about the variation of age for
lecturers in both faculties?
Solution:
$s_{\text{FIST}} = 7.4670$ years
$s_{\text{FKEE}} = 9.9460$ years
$s_{\text{FIST}} < s_{\text{FKEE}}$, so the FIST data is more consistent and less dispersed.
The variation of ages for lecturers in FIST is smaller and less dispersed
compared to FKEE lecturers.
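The two sample standard deviations in Example 1.7 can be reproduced with `statistics.stdev`, which uses the n − 1 divisor. The variable names below are illustrative.

```python
import statistics

fist = [24, 25, 26, 27, 30, 31, 31, 32, 36, 40, 43, 44, 45]
fkee = [22, 25, 25, 25, 28, 33, 34, 36, 37, 40, 41, 43, 48, 51, 53]

s_fist = statistics.stdev(fist)   # sample standard deviation, FIST
s_fkee = statistics.stdev(fkee)   # sample standard deviation, FKEE

print(f"{s_fist:.4f} {s_fkee:.4f}")   # 7.4670 9.9460
print("FIST is more consistent" if s_fist < s_fkee else "FKEE is more consistent")
```

Both samples measure the same variable in the same unit (age in years), so the standard deviations can be compared directly.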
1. Which of the following set of sample data is less variable?
Method A: 79 73 78 76 80 75 82 70 77
Method B: 80 85 78 79 75 73 70 60 65
2. The following set of sample data represents the battery
lifetime (in hours) from two different brands. Which brand of
battery performs better?
A: 4.2, 6.7, 7.3, 7.5, 8.0,8.5, 8.7, 8.8, 9.2, 9.3
B: 9.6, 9.7, 9.8, 9.9, 10.1, 10.2, 11.0, 11.0, 11.0, 11.1
EXERCISE 1.3.2 (Q1&Q2)
Answers: 1) $s_A = 3.6742 < s_B = 7.8493$ 2) $s_A = 1.5$ hours $> s_B = 0.6$ hours
Comparing Two Data Sets with
different units/variable
 If the two samples do not have the same units of measurement or the
variables are different, the variance and standard deviation for each
sample cannot be compared directly.
 As an example: suppose a car dealer wants to compare the variation
between the number of sales of car for a year and the commission (in
RM) made by the salesperson. It is very clear that these two
variables have two different units.
 Hence, the best way to compare the variability within these two
variables is by using the coefficient of variation.
 This means that if $(\text{CVar})_1 < (\text{CVar})_2$, then variable one is less
variable than variable two.
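A comparison of this kind can be sketched in Python as follows. The summary figures are hypothetical, made up here to match the car dealer scenario (number of cars sold versus commission in RM), and the function name `cvar` is an assumption.

```python
def cvar(sd, mean):
    """Coefficient of variation as a percentage."""
    return sd / mean * 100

# Hypothetical summary statistics (illustration only).
sales = cvar(sd=5, mean=87)             # number of cars sold
commission = cvar(sd=773, mean=5225)    # commission in RM

print(f"{sales:.2f}% {commission:.2f}%")   # 5.75% 14.79%
print("sales" if sales < commission else "commission", "is less variable")
```

Because CVar is unit-free, the two percentages can be compared even though one variable is a count and the other is an amount of money.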
3. The average age of the accountants at a huge company is 31
years with a standard deviation of 4 years. The average
salary of the accountants is RM 44255 per year with a
standard deviation of RM 780. Compare the variations of
age and income.
EXERCISE 1.3.2 (Q3)
Answer: $\text{CVar}_{\text{age}} = 12.90\% > \text{CVar}_{\text{income}} = 1.76\%$
Other Properties of Standard Deviation
 Used to determine the number of data values that fall within a
specified interval in a distribution.
 The values under the curve indicate the percentage of area in each
section or range of data.
 For a bell-shaped (normal) distribution, about 95% of data values fall within 𝜇 − 2𝜎
and 𝜇 + 2𝜎.
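The 95% figure can be checked numerically for a normal distribution using `NormalDist` from Python's standard `statistics` module. This sketch assumes the data follow a normal (bell-shaped) curve; for other shapes the percentages differ.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, standard deviation 1

# Share of values within k standard deviations of the mean.
for k in (1, 2, 3):
    share = z.cdf(k) - z.cdf(-k)
    print(f"within {k} standard deviation(s): {share:.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```

The same shares hold for any normal distribution, since the interval from 𝜇 − k𝜎 to 𝜇 + k𝜎 always corresponds to −k to k on the standard scale.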
1.3.2.1 Accuracy and Precision
Concept (Validity and Reliability)
 Accuracy is how close a measured
value is to the ‘true’ value.
 No measurement/device is
perfect (can easily be inaccurate
and lead to false measurements).
There is still a tolerance for error.
 Accuracy must be accounted for in
your results.
 The bigger the difference between
the measured and the true values,
the less accurate (less valid) the
measurement.
 Precision is how close the measured
values are to each other, or how consistent
your results are for the same
phenomenon over several
measurements.
 Precision as a measure of variation
must be accounted for in your
calculations and results.
 The precision of a measurement is the
size of unit used to make a
measurement. The smaller the unit,
the more precise (more reliable) the
measurement.
→ The concept is important to ensure that data collected from an
experiment or observation is good, valid, and reliable.
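One simple way to put numbers on the two ideas is to treat the distance between the mean measurement and the true value as (in)accuracy, and the sample standard deviation as (im)precision. The sketch below is an illustration under that assumption; the function name `accuracy_and_precision` and the measurements are hypothetical.

```python
import statistics

def accuracy_and_precision(measurements, true_value):
    """Return (bias, spread): |mean - true value| and sample standard deviation."""
    bias = abs(statistics.mean(measurements) - true_value)   # smaller = more accurate
    spread = statistics.stdev(measurements)                  # smaller = more precise
    return bias, spread

# Hypothetical repeated measurements (cm) of an object assumed to be 10.0 cm long.
bias, spread = accuracy_and_precision([12.1, 12.0, 12.2, 12.1, 12.1], true_value=10.0)
print(round(bias, 2), round(spread, 2))
```

Here the spread is small (the readings agree closely) but the bias is large, so the measurements are precise without being accurate.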
Game of Darts
• Very accurate
(close to the mark)
measurements, but
not very precise,
since the darts are
spread out
everywhere
• Valid but not
reliable
• Precision
without
accuracy
• Very
consistent, but
not near the
mark
• Not valid but
reliable
• Inaccuracy and
imprecision
• Not valid and not
reliable
• Accurate and
precise.
• Valid and reliable
• Very good
measurement
EXERCISE 1.3.2 (Q4)
4. Identify each situation as either accurate or precise or both.
a) If you are playing football and you always hit the left goal post
instead of scoring.
b) A candy manufacturer claims that each packet contains 20 candies.
A sample of packets has 18, 21, 19, 21, 19, 20, 22 candies,
respectively. The average is 20 candies with an error of 1 candy.
c) A manufacturer claims that each chocolate packet contains 20
chocolates. A sample of packets have 17, 18, 18, 17, 18, 17, 17
chocolates, respectively.
d) In an experiment, with five trials, the end results of the five trials for
whatever is being tested are: 35 kg, 36 kg, 36 kg, 35 kg, 36 kg. The
actual value (as found in a scientific data book) is meant to be 42 kg.
e) In an experiment, with five trials, the average value is 35 kg. The
actual value (as found in a scientific data book) is meant to be 35 kg.
MIND EXPANDING EXERCISES
4. In what sense are the mean, median, mode and midrange measures
of the “centre” of a data set?
5. Which do you think has more variation: the IQ scores of 30 students
in a statistics class or the IQ scores of 30 teenagers watching a
movie? Why?
6. Explain why median and interquartile range are more appropriate
measures as compared to mean and variance for non-normal data.
7. A JDT football fan records the number on the jersey of each player
in a game. Does it make sense to calculate the mean of those
numbers? Why or why not?
MIND EXPANDING EXERCISES
8. In an analysis of the accuracy of weather forecasts, the actual high
temperatures are compared to the high temperatures predicted one day earlier
and the temperatures predicted five days earlier. Listed below are the errors
between the predicted temperatures and the actual high temperatures for 14
consecutive days in Kuala Lumpur.
a) Do the means and medians of the errors indicate that the temperatures
predicted one day in advance are more accurate than those predicted
five days in advance, as we might expect?
b) Do the standard deviations of the errors indicate that the temperatures
predicted one day in advance are more accurate than those predicted
five days in advance, as we might expect?
Actual high ‒ high predicted one day earlier:
2, 2, 0, 0, −3, −2, 1, −2, 8, 1, 0, −1, 0, 1
Actual high ‒ high predicted five days earlier:
0, −3, 2, 5, −6, −9, 4, −1, 6, −2, −2, −1, 6, −4
ME.8 (solution)
Mean median sd
1.5000 1.0000 2.4152
3.8333 4.5000 2.4014
MIND EXPANDING EXERCISES
9. A data set consists of 20 values that are fairly close together. Another
value is included, but this new value is an outlier (very far away from
the other values). How is the standard deviation affected by the
outlier? No effect? A small effect? Or a large effect?
10. Suppose scores on a psychological test have a mean of 90 and a standard
deviation of 10. Meanwhile, scores on an economics test have a mean
of 55 and a standard deviation of 5. Which is relatively better: a score
of 85 on the psychological test or a score of 45 on the economics test?
11. When designing the production procedure for batteries used in heart
pacemakers, an engineer specifies that “the batteries must have a
mean life greater than 10 years, and the standard deviation of the
battery life can be ignored.” If the mean battery life is greater than 10
years, can the standard deviation be ignored? Why or why not?
1.3.3 Measures of Position
Describe where a specific data value falls within the data set or its
relative position based on percentiles, deciles and quartiles in
comparison with other data values
Describing the position of the data value (increasing order)
Here $n$ is the number of data values and $x_c$ is the value at position $c$ in the ordered data.
Percentiles: split data into 100 equal parts, $P_i = x_{\frac{in}{100}} = x_c$
Deciles: split data into 10 equal parts, $D_i = x_{\frac{in}{10}} = x_c$
Quartiles: split data into 4 equal parts, $Q_i = x_{\frac{in}{4}} = x_c$
 If c is not a whole number, round it up to the next whole number.
 If c is a whole number, then use
$Q_i = \dfrac{x_c + x_{c+1}}{2}$, $D_i = \dfrac{x_c + x_{c+1}}{2}$, $P_i = \dfrac{x_c + x_{c+1}}{2}$.
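The position rule (compute c = in/4, in/10 or in/100; round c up if it is not a whole number, otherwise average the values at positions c and c + 1) can be sketched as one Python function. The name `position_value` and its `parts` argument are assumptions made here; the function expects 0 < i < parts, and the test data are the solvent volumes used in Example 1.9.

```python
def position_value(data, i, parts):
    """i-th quartile (parts=4), decile (parts=10) or percentile (parts=100)."""
    x = sorted(data)                            # increasing order
    c, remainder = divmod(i * len(x), parts)    # c = i*n/parts, split into whole part and rest
    if remainder == 0:
        # c is a whole number: average the values at positions c and c + 1
        return (x[c - 1] + x[c]) / 2            # positions are 1-based, list is 0-based
    # c is not whole: round up to position c + 1, which is list index c
    return x[c]

volumes = [40, 45, 38, 25, 42, 31, 30, 44, 26, 27, 36]
print([position_value(volumes, i, 4) for i in (1, 2, 3)])      # [27, 36, 42]
print([position_value(volumes, i, 10) for i in (3, 5, 7)])     # [30, 36, 40]
print(position_value(volumes, 25, 100))                        # 27
```

Note that other textbooks and software use different conventions for quartiles and percentiles, so their results may differ slightly from this rule.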
EXAMPLE 1.9
A manufacturer measured the volume of a sample of 11 bottles of chemical
solvents. The results are recorded (in millilitres) as follows.
40 45 38 25 42 31 30 44 26 27 36
Show that $Q_1$ is equivalent to $P_{25}$, $Q_2$ is equivalent to $P_{50}$, $Q_3$ is equivalent to $P_{75}$, and $D_i$
is equivalent to $P_{i(10)}$, where $i = 1, 2, \ldots, 9$.
The dataset in increasing (ascending) order: 25 26 27 30 31 36 38 40 42 44 45
Quartiles:
$Q_1 = x_{\frac{1(11)}{4}} = x_{2.75} \rightarrow x_3 = 27$
$Q_2 = x_{\frac{2(11)}{4}} = x_{5.50} \rightarrow x_6 = 36$
$Q_3 = x_{\frac{3(11)}{4}} = x_{8.25} \rightarrow x_9 = 42$
Percentiles:
$P_{25} = x_{\frac{25(11)}{100}} = x_{2.75} \rightarrow x_3 = 27$
$P_{50} = x_{\frac{50(11)}{100}} = x_{5.50} \rightarrow x_6 = 36$
$P_{75} = x_{\frac{75(11)}{100}} = x_{8.25} \rightarrow x_9 = 42$
Summary: $Q_1$ is equivalent to $P_{25}$; $Q_2$ is equivalent to $P_{50}$; $Q_3$ is equivalent to $P_{75}$.
EXAMPLE 1.9 (continued)
The dataset in increasing (ascending) order: 25 26 27 30 31 36 38 40 42 44 45
Deciles | Percentiles
$D_3 = x_{3(11)/10} = x_{3.3} \to x_4 = 30$ | $P_{30} = x_{30(11)/100} = x_{3.3} \to x_4 = 30$
$D_5 = x_{5(11)/10} = x_{5.5} \to x_6 = 36$ | $P_{50} = x_{50(11)/100} = x_{5.5} \to x_6 = 36$
$D_7 = x_{7(11)/10} = x_{7.7} \to x_8 = 40$ | $P_{70} = x_{70(11)/100} = x_{7.7} \to x_8 = 40$
Summary: $D_i$ is equivalent to $P_{10i}$, where $i = 1, 2, 3, 4, 5, 6, 7, 8, 9$.
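The equivalences in Example 1.9 can also be checked for every index at once. The sketch below assumes the position rule of Section 1.3.3; the helper name `position_value` is hypothetical and not taken from the slides.

```python
def position_value(data, i, parts):
    """i-th quartile/decile/percentile with c = i*n/parts (rule of Section 1.3.3)."""
    xs = sorted(data)
    whole, remainder = divmod(i * len(xs), parts)
    if remainder == 0:
        return (xs[whole - 1] + xs[whole]) / 2  # whole c: average x_c and x_(c+1)
    return xs[whole]                            # otherwise round c up


volumes = [40, 45, 38, 25, 42, 31, 30, 44, 26, 27, 36]  # Example 1.9 data (ml)

# Q_i should equal P_(25i) for i = 1, 2, 3
quartiles_match = all(
    position_value(volumes, i, 4) == position_value(volumes, 25 * i, 100)
    for i in (1, 2, 3)
)
# D_i should equal P_(10i) for i = 1, ..., 9
deciles_match = all(
    position_value(volumes, i, 10) == position_value(volumes, 10 * i, 100)
    for i in range(1, 10)
)
print(quartiles_match, deciles_match)  # prints: True True
```

The match is expected for any data set, because $i \cdot n / 4 = 25i \cdot n / 100$ and $i \cdot n / 10 = 10i \cdot n / 100$ give the same position c.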
EXERCISE 1.3.3
1. Given the data set 9 2 1 4 3 7 5 4 6.
a) Find the value corresponding to the 4th decile.
b) Find the value corresponding to the 3rd quartile.
2. A teacher gives a 25-point test to ten students. The scores
are shown below.
9 22 11 14 13 3 7 15 18 16
a) Find the score corresponding to the 20th percentile.
b) Find the score corresponding to the 7th decile.
Answers: 1) 4, 6 2) 8, 15.5
Why Do We Need Measures of Position?
 Percentiles are measures of position that are often used in
education and health-related fields to indicate the position
of an individual in a group.
 A percentile is not a percentage. The ith percentile, $P_i$, is a
value such that i% of the data are less than or equal to $P_i$ and
(100 − i)% are greater than or equal to $P_i$.
EXAMPLE:
If a student obtains 82 marks out of 100 in a test, his/her score
is 82%. However, this gives no indication of his/her position
relative to the rest of the class. On the other hand, if the score
corresponds to the 75th percentile, then he/she did as well as or
better than 75% of the students in the class.
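As a rough illustration of this definition, the sketch below computes the 75th percentile of the Example 1.9 data and counts the share of values on each side of it. The helper `percentile` is a hypothetical name, and it assumes the position rule of Section 1.3.3. With only 11 values, the shares can only be "at least" 75% and 25%, not exactly those figures.

```python
def percentile(data, i):
    """i-th percentile with c = i*n/100 (rule of Section 1.3.3)."""
    xs = sorted(data)
    whole, remainder = divmod(i * len(xs), 100)
    if remainder == 0:
        return (xs[whole - 1] + xs[whole]) / 2
    return xs[whole]


values = [25, 26, 27, 30, 31, 36, 38, 40, 42, 44, 45]  # Example 1.9 data

p75 = percentile(values, 75)
share_at_or_below = sum(x <= p75 for x in values) / len(values)
share_at_or_above = sum(x >= p75 for x in values) / len(values)
print(p75, round(share_at_or_below, 3), round(share_at_or_above, 3))
```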
Why Do We Need Measures of Position?
Quartiles can be used as a rough measure of variability.
INTERQUARTILE RANGE (IQR)
 defined as the difference between the third and first quartiles,
IQR = Q3 − Q1, and is the range of the middle 50% of the data.
 used to identify outliers and to measure variability in
exploratory data analysis (Section 1.4).
 the smaller the value of the IQR, the smaller the variation in the
data.
 useful for showing the variability of a data set, that is, whether the
data are more variable (more dispersed, more spread out) or more consistent.
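A short sketch of the IQR calculation, applied to the Example 1.9 data. The function names `quartile` and `iqr` are assumptions for illustration, and the quartiles follow the position rule of Section 1.3.3.

```python
def quartile(data, i):
    """i-th quartile with c = i*n/4 (rule of Section 1.3.3)."""
    xs = sorted(data)
    whole, remainder = divmod(i * len(xs), 4)
    if remainder == 0:
        return (xs[whole - 1] + xs[whole]) / 2
    return xs[whole]


def iqr(data):
    """Interquartile range: Q3 - Q1, the spread of the middle 50% of the data."""
    return quartile(data, 3) - quartile(data, 1)


volumes = [25, 26, 27, 30, 31, 36, 38, 40, 42, 44, 45]  # Example 1.9 data (ml)
print(quartile(volumes, 1), quartile(volumes, 3), iqr(volumes))  # prints: 27 42 15
```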
MIND EXPANDING EXERCISES
4. In what sense are the mean, median, mode and midrange measures
of the “centre” of a data set?
5. Which do you think has more variation: the IQ scores of 30 students
in a statistics class or the IQ scores of 30 teenagers watching a
movie? Why?
6. Explain why median and interquartile range are more appropriate
measures as compared to mean and variance for non-normal data.
7. A JDT football fan records the number on the jersey of each player
in a game. Does it make sense to calculate the mean of those
numbers? Why or why not?
1.3.4 Descriptive Statistics Using Microsoft Excel
Interpreting Descriptive Statistics
Using Microsoft Excel (Example 1.9)
A firm is conducting a study to compare two different physical
arrangements of its assembly line. The arrangement with the smaller
variance in the number of finished units produced per day will be adopted
as the new arrangement of its assembly line.
→ $\bar{x}_1 < \bar{x}_2$: on average, Assembly Line 2 produced more
finished units per day.
→ $s_1 < s_2$, $R_1 < R_2$ and $\text{s.e.}(\bar{x}_1) < \text{s.e.}(\bar{x}_2)$: the arrangement
of Assembly Line 1 is more consistent (less dispersed,
less spread, less variable) and more
precise. Therefore the arrangement of Assembly
Line 1 will be adopted as the new arrangement.
→ For Assembly Line 1, the distribution of data is
negatively skewed (left-skewed) since
Mean < Median < Mode. The skewness value is
negative too.
→ For Assembly Line 2, the distribution of data is also
negatively skewed (left-skewed) since the mode is
higher than both the mean and the median. The
skewness value is negative too.
→ The skewness value for Assembly Line 2 is higher
than that for Assembly Line 1. Hence the distribution of
data from Assembly Line 2 is more skewed to the
left, indicating that Assembly Line 2 produced more
finished units per day.
→ For Assembly Line 1,
$\bar{x}_1 \pm \text{Confidence Level} = 491.1 \pm 17.1 = (474, 508.2)$.
Hence, we are 95% confident that the population
mean number of finished units per day for Assembly
Line 1 lies between 474 and 509 units.
→ For Assembly Line 2,
$\bar{x}_2 \pm \text{Confidence Level} = 499.4 \pm 25.2 = (474.2, 524.6)$.
Hence, we are 95% confident that the population
mean number of finished units per day for Assembly
Line 2 lies between 475 and 525 units.
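The interval arithmetic above is simply the sample mean plus and minus the "Confidence Level" value reported by Excel. A minimal sketch, with the hypothetical helper name `interval` and the figures quoted in the slides:

```python
def interval(mean, margin):
    """Return (mean - margin, mean + margin), rounded to one decimal place."""
    return (round(mean - margin, 1), round(mean + margin, 1))


line1 = interval(491.1, 17.1)  # Assembly Line 1: mean and Excel 'Confidence Level'
line2 = interval(499.4, 25.2)  # Assembly Line 2
print(line1, line2)  # prints: (474.0, 508.2) (474.2, 524.6)
```

The slides then quote the limits as whole numbers of units.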
MIND EXPANDING EXERCISES
12. A lecturer wishes to investigate students’ performance in a
statistics course based on their carry marks and their scores in
the final examination. The descriptive statistics and graph are
given below. From the analyses, comment on the students’
performance based on carry marks and final examination scores.
[Figure ME.12: descriptive statistics and graph, not reproduced in this transcript]
13. A study is conducted to compare the performance of male and female
students in the final examination of the statistics course. The
data, descriptive statistics and graph of the final examination scores
are presented as follows. Based on the analysis, answer the following
questions:
Female
72 62 83 65 60 74 66 68 57 63 61
76 60 78 34 70 59 63 86 43 90 87
Male
58 81 86 68 70 77 54 54 72 41 33 52
70 37 67 39 74 32 8 33 27 23 54
a) State the mean and standard deviation for both groups and give your
comment.
b) Based on the graph shown, give your comment.
[Figure ME.13: descriptive statistics and graph, not reproduced in this transcript]
14. People with diabetes must monitor and control their blood glucose level. The
goal is to maintain fasting plasma glucose between 90 and 130 mg/dl. The
data presented below give the fasting plasma glucose for two groups, before
treatment and after treatment. Answer the following questions:
a) How many data values are there in each group?
b) Give the first five data values in the ‘before’ group and the last five data
values in the ‘after’ group.
c) Identify the median and mode in each group.
d) Describe the shape of the distribution of data in each group.
e) Is there any outlier in the groups?
f) What are the advantages of using a stem and leaf plot?
g) Which group's data are more dispersed, and which are more consistent?
h) Based on the descriptive analysis done in Excel, why do you think the
comparison of dispersion between the two groups based on the variance
differs from that based on the IQR?
Figure ME.14: back-to-back stem and leaf plot (Key: 14|1 = 141)
Before | Stem | After
8 | 7 |
 | 8 |
6 5 | 9 |
3 | 10 |
2 | 11 |
 | 12 | 8 8
4 | 13 |
7 5 8 1 | 14 |
3 8 | 15 | 8 9
 | 16 | 3 4 0
2 2 | 17 |
 | 18 | 8
 | 19 | 5 8
0 | 20 |
 | 21 |
 | 22 | 7 6 3 1 0
 | 23 |
 | 24 |
5 | 25 |
 | 26 |
1 | 27 |
 | 28 | 3
 | 29 |
 | 30 |
 | 31 |
 | 32 |
 | 33 |
 | 34 |
9 | 35 |
1.4 EXPLORATORY DATA ANALYSIS
 Identify outliers.
 Draw and interpret a boxplot.
Exploratory Data Analysis
 The purpose of exploratory data analysis is to discover any gaps or
patterns in the data.
 For symmetric data, the appropriate measure of central tendency
is the mean, and the appropriate measure of variability is the standard
deviation or variance.
 For skewed data, the appropriate measure of central tendency is the
median, and the appropriate measure of variability is the interquartile
range (IQR).
Traditional Method | Exploratory Data Analysis
Frequency distribution | Stem and leaf plot
Histogram | Boxplot
Mean | Median
Standard deviation | Interquartile range (IQR = Q3 − Q1)
RECALL: Selection of appropriate statistical techniques for data summarisation
Type of Data | Descriptive Statistics | Graphical Summary
Quantitative (ratio scale) | Mean, Median, Mode, Range, Standard Deviation, Interquartile range (IQR = Q3 − Q1) | Histogram, Bar Chart (bar representing means), Stem and leaf plot, Boxplot
Symmetrical Distribution | Mean, Median, Mode, Range, Standard Deviation | Histogram, Bar Chart (bar representing means)
Skewed Distribution | Median, Range, Interquartile range (IQR = Q3 − Q1) | Histogram, Stem and leaf plot, Boxplot
Categorical (Nominal) | Mode, Counts, Percentage | Pie Chart, Bar Chart
Categorical (Ordinal, Likert Scale) | Mode, Mean, Counts, Percentage | Pie Chart, Bar Chart
Histogram, Stem and Leaf OR Boxplot?
Histogram
Advantages:
‒ Can graph huge data sets easily.
‒ The shape of the distribution can be easily described.
‒ The intervals of the histogram can be changed to see which gives a better description of the data.
‒ Great for comparing data.
‒ Can show trends in the data clearly.
Disadvantages:
‒ Not good for small data sets.
‒ It is difficult to simplify all the data into one scale.
Stem and Leaf
Advantages:
‒ Very easy to construct.
‒ Shows the real values of the data.
‒ Can show the range, minimum and maximum, gaps and clusters, and outliers easily.
‒ The mode may be observed.
‒ Can identify the shape of the distribution.
Disadvantages:
‒ Not good for small or very large data sets.
‒ Not visually appealing.
‒ Does not easily indicate measures of centrality for large data sets.
Boxplot
Advantages:
‒ Good for small or large data sets.
‒ Displays the range and distribution of data along a number line.
‒ Can show outliers.
Disadvantages:
‒ The original data are not clearly shown in the boxplot.
‒ The mean and mode cannot be identified in a boxplot.
1.4.1 Outliers
 An outlier is an extremely high or extremely low data value when
compared with the rest of the data values.
 Outliers can arise from:
 measurement or observational error,
 a writing or typing error,
 a data value obtained from a subject that is not in the defined
population, or
 a legitimate data value that occurred by chance.
 When a distribution is symmetric or normal, data values that lie
more than three standard deviations from the mean can be considered
suspected outliers (refer to Figure 1.11).
 An outlier can strongly affect the mean and standard deviation of a
variable.
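The three-standard-deviation guideline, and the effect of an outlier on the mean and standard deviation, can be sketched as follows. The data are hypothetical (20 values close together plus one distant value), the function name `three_sigma_suspects` is an assumption, and the sample standard deviation (divisor n − 1) from Python's `statistics` module is assumed.

```python
from statistics import mean, stdev


def three_sigma_suspects(data):
    """Return values lying more than three standard deviations from the mean."""
    m, s = mean(data), stdev(data)
    return [x for x in data if abs(x - m) > 3 * s]


values = [48, 49, 50, 51, 52] * 4 + [120]  # hypothetical: 20 close values + 1 far value

suspects = three_sigma_suspects(values)
clean = [x for x in values if x not in suspects]

print("suspected outliers:", suspects)
print("with outlier:    mean = %.2f, sd = %.2f" % (mean(values), stdev(values)))
print("without outlier: mean = %.2f, sd = %.2f" % (mean(clean), stdev(clean)))
```

In this made-up sample the single distant value raises the standard deviation roughly tenfold, which is why the guideline is only suggested for symmetric or normal data.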
Recall: Other Properties of Standard Deviation
 Used to determine the number of data values that fall within a
specified interval in a distribution.
 The values under the curve indicate the percentage of area in each
section or range of data.
 It can be seen that about 95% of data values fall within 𝜇 − 2𝜎
and 𝜇 + 2𝜎.
Position of Outliers
A data value x is an outlier if it is less than the lower boundary value or
exceeds the upper boundary value for the data set, where
lower boundary = $Q_1 - 1.5(Q_3 - Q_1)$ and upper boundary = $Q_3 + 1.5(Q_3 - Q_1)$.
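The boundary check can be sketched in Python using the credit data of Example 1.11 and the first data set of Exercise 1.4.1. The function names are assumptions for illustration; the boundaries use Q1 − 1.5(Q3 − Q1) and Q3 + 1.5(Q3 − Q1), with quartiles from the position rule of Section 1.3.3.

```python
def quartile(data, i):
    """i-th quartile with c = i*n/4 (rule of Section 1.3.3)."""
    xs = sorted(data)
    whole, remainder = divmod(i * len(xs), 4)
    if remainder == 0:
        return (xs[whole - 1] + xs[whole]) / 2
    return xs[whole]


def outlier_boundaries(data):
    """Return (lower boundary, upper boundary) based on 1.5 times the IQR."""
    q1, q3 = quartile(data, 1), quartile(data, 3)
    spread = 1.5 * (q3 - q1)
    return q1 - spread, q3 + spread


def find_outliers(data):
    """Return the values outside the lower and upper boundaries."""
    low, high = outlier_boundaries(data)
    return [x for x in data if x < low or x > high]


credits = [9, 12, 15, 27, 33, 45, 63, 72]  # Example 1.11 data
print(outlier_boundaries(credits))          # prints: (-47.25, 114.75)
print(find_outliers(credits))               # prints: []
print(find_outliers([19, 2, 1, 4, 3, 7, 5, 4, 6]))  # prints: [19]
```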
EXAMPLE 1.11
The number of credits in business courses for eight job applicants is
shown here:
9, 12, 15, 27, 33, 45, 63, 72.
Find the first and third quartiles for the above data. Is there any
outlier in the above data?
$Q_1 = x_{1(8)/4} = x_2$ (whole number), so $Q_1 = \frac{x_2 + x_3}{2} = \frac{12 + 15}{2} = 13.5$
$Q_3 = x_{3(8)/4} = x_6$ (whole number), so $Q_3 = \frac{x_6 + x_7}{2} = \frac{45 + 63}{2} = 54$
Lower boundary: $Q_1 - 1.5(Q_3 - Q_1) = 13.5 - 1.5(54 - 13.5) = -47.25$
Upper boundary: $Q_3 + 1.5(Q_3 - Q_1) = 54 + 1.5(54 - 13.5) = 114.75$
→ Since $-47.25 < x < 114.75$ for every data value x, there is no outlier.
EXERCISE 1.4.1
1. Given the data 19 2 1 4 3 7 5 4 6, find the outliers, if any.
2. Given the data 19 6 2 11 4 3 7 7 5 8 6 21 12, find the
outliers, if any.
Answers:
1) $Q_1 = 3$, $Q_3 = 6$; 19 is an outlier.
2) $Q_1 = 5$, $Q_3 = 11$; 21 is an outlier.
1.4.2 Boxplots
A boxplot (box-and-whisker plot) is a graphical representation of the
five-number summary of a data set, together with any outliers.
Five-number summary:
 The lowest value of the data set (minimum)
 The lower quartile Q1 (1st quartile or 25th percentile)
 The median (2nd quartile or 50th percentile)
 The upper quartile Q3 (3rd quartile or 75th percentile)
 The highest value of the data set (maximum)
Plus:
 Outliers
Types of Boxplots
A vertical boxplot
A horizontal boxplot
EXAMPLE 1.12
The following back-to-back stem and leaf plot represents a sample of the ages of
teachers in two schools.
School A | Stem | School B
9 7 7 5 5 4 | 2 | 2
8 7 6 2 1 1 0 | 3 | 3 4 6 7
 | 4 | 0 1 3 4 5 7
7 | 5 | 1 3 4
[key: 3|4 → 34]
Given that for School B, $Q_1 = 36$, $Q_2 = 42$, $Q_3 = 47$ and there is no outlier, draw boxplots
for both schools on the same x-axis. Then compare the shapes, averages, and variability of
both age distributions.
Measure | School A | School B
Minimum | 24 | 22
1st quartile | $Q_1 = x_{1(14)/4} = x_{3.5} \to x_4 = 27$ | $Q_1 = 36$
2nd quartile / Median | $Q_2 = \frac{x_7 + x_8}{2} = 30.5$ | $Q_2 = 42$
3rd quartile | $Q_3 = x_{3(14)/4} = x_{10.5} \to x_{11} = 36$ | $Q_3 = 47$
Maximum | 38 | 54
Outliers | $Q_1 - 1.5(Q_3 - Q_1) = 27 - 1.5(36 - 27) = 13.5$; $Q_3 + 1.5(Q_3 - Q_1) = 36 + 1.5(36 - 27) = 49.5$. Since 57 > 49.5, 57 is an outlier. | no outlier
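The table for Example 1.12 can be reproduced with a short sketch. The ages are read off the stem and leaf plot; the function names are hypothetical. Following the worked table, the minimum and maximum are taken as the smallest and largest values that are not outliers, which is an assumption about how the whisker ends are chosen.

```python
def quartile(xs, i):
    """i-th quartile of sorted values xs, with c = i*n/4 (rule of Section 1.3.3)."""
    whole, remainder = divmod(i * len(xs), 4)
    if remainder == 0:
        return (xs[whole - 1] + xs[whole]) / 2
    return xs[whole]


def boxplot_summary(data):
    """Five-number summary plus outliers, as used to draw a boxplot."""
    xs = sorted(data)
    q1, q2, q3 = (quartile(xs, i) for i in (1, 2, 3))
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    inside = [x for x in xs if low <= x <= high]  # values that are not outliers
    return {
        "min": inside[0], "Q1": q1, "median": q2, "Q3": q3, "max": inside[-1],
        "outliers": [x for x in xs if x < low or x > high],
    }


school_a = [24, 25, 25, 27, 27, 29, 30, 31, 31, 32, 36, 37, 38, 57]
school_b = [22, 33, 34, 36, 37, 40, 41, 43, 44, 45, 47, 51, 53, 54]

summary_a = boxplot_summary(school_a)
summary_b = boxplot_summary(school_b)
print(summary_a)
print(summary_b)
```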
Information Obtained from a Boxplot
1. If the median is near the centre of the box, the distribution is approximately
symmetric.
2. If the median falls to the left of the centre of the box, the distribution is positively
skewed.
3. If the median falls to the right of the centre of the box, the distribution is
negatively skewed.
 Suppose the median is near the centre of the box (approximately symmetric):
4. If the lines are about the same length, the distribution is approximately
symmetric.
5. If the right line is longer than the left line, the distribution is positively skewed.
6. If the left line is longer than the right line, the distribution is negatively skewed.
 If the boxplots for two or more data sets are graphed on the same axis, the
distributions can be compared using their central tendency (average) and
variability values.
 To compare the averages, use the locations of the medians.
 To compare the variability, use the lengths of the IQRs.
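Rules 1 to 3 can be sketched as a small function. The slides do not say how close to the centre counts as "near", so the tolerance `tol` (a fraction of the box length) is an arbitrary assumption, as is the function name. The whisker rules 4 to 6 are not covered here.

```python
def shape_from_box(q1, median, q3, tol=0.02):
    """Classify the shape from the position of the median within the box.

    tol is a hypothetical tolerance: the median counts as 'near the centre'
    if it is within tol * (box length) of the midpoint of Q1 and Q3.
    """
    centre = (q1 + q3) / 2
    offset = (median - centre) / (q3 - q1)  # negative: median left of centre
    if abs(offset) <= tol:
        return "approximately symmetric"
    return "positively skewed" if offset < 0 else "negatively skewed"


print(shape_from_box(27, 30.5, 36))  # School A in Example 1.12
print(shape_from_box(36, 42, 47))    # School B in Example 1.12
```

With this tolerance the function agrees with the shapes stated in the Example 1.12 solution; a larger tolerance would classify School B as approximately symmetric, so the choice matters for borderline cases.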
EXAMPLE 1.12 solution
Shape:
Based on the location of the median, School A has a right-skewed distribution, with most
of the teachers’ ages concentrated at the lower end (< 30 years old). School B, however,
has a left-skewed distribution, with most of the teachers older than 42 years.
Average:
Based on the median, 50% of the teachers at School A are younger than 30.5 years,
whereas 50% of the teachers at School B are younger than 42 years. On average, teachers
at School B are older than teachers at School A.
Variability:
Based on the IQR, for School A, IQRA = 9 years, with the middle 50% of the teachers
aged between 27 and 36 years. For School B, IQRB = 11 years, with the middle
50% of the teachers aged between 36 and 47 years. Hence, the variation in teachers’ ages at
School B is higher than at School A (IQRA < IQRB).
Range:
Excluding the outlier, teachers’ ages at School A vary less, from a minimum of 24 years to
a maximum of 38 years, compared with School B, where ages range from a minimum of
22 years to a maximum of 54 years.
Boxplots for Special Cases
 In some cases, the general guidelines given above cannot be used to interpret the
boxplot.
 A boxplot is not the best graphical representation of a data set if the sample
size is too small.
 The existence of outliers may also affect the boxplot.
 Therefore, in such cases, descriptive statistics have to be used to identify the
distribution of the data set.
EXERCISE 1.4.2 (Q1)
1. Plot a boxplot for the following data. Then describe the data.
a) 3.2, 5.9, 4.3, 6.9, 4.5, 8.0, 4.7, 8.9, 5.7, 11.9
b) 5.8, 9.7, 6.7, 13.4, 6.8, 14.7, 7.2, 16.4, 8.2, 28.1
1.4.2 (Q1) solution
a) Min = 3.2, $Q_1$ = 4.5, $Q_2$ = 5.8, $Q_3$ = 8, no outlier, Max = 11.9, right-skewed
b) Min = 5.8, $Q_1$ = 6.8, $Q_2$ = 8.95, $Q_3$ = 14.7, 28.1 is an outlier, Max = 16.4, right-skewed
EXERCISE 1.4.2 (Q2)
2. Two samples of ten springs made from the steel rods supplied by
two different companies were compared. The flexibility (in N/m)
of each spring was recorded as follows. Compare
the distributions using boxplots.
Company A: 4.2 6.7 7.3 7.5 8.0 8.5 8.7 8.8 9.2 9.3
Company B: 9.6 9.7 9.8 9.9 10.1 10.2 11.0 11.0 11.0 11.1
Comment on the flexibility of the springs supplied by the two
companies.
1.4.2 (Q2) solution
Company A: Min = 6.7, $Q_1$ = 7.3, $Q_2$ = 8.25, $Q_3$ = 8.8, 4.2 is an outlier, Max = 9.3, left-skewed
Company B: Min = 9.6, $Q_1$ = 9.8, $Q_2$ = 10.15, $Q_3$ = 11.0, no outlier, Max = 11.1, right-skewed
EXERCISE 1.4.2 (Q3)
3. The following table presents the viscosity (in Pascal) of a chemical substance from
three (3) batches of a chemical process.
Batches | Viscosity
Batch A | 13.3 14.1 14.3 14.5 14.5 14.6 14.8 15.2 15.3 15.3
Batch B | 13.3 13.7 14.1 14.5 14.9 15.2 15.3 15.4 15.6 15.8
Batch C | 13.4 13.7 14.1 14.3 14.3 14.8 15.1 15.8 16.4 16.9
a) Complete the table below by showing all the necessary calculations.
Measures of position | Batch A | Batch B | Batch C
1st quartile | 14.30 | 14.10 | ____
Median | 14.55 | ____ | 14.55
3rd quartile | ____ | 15.40 | 15.80
Outlier | No | ____ | No
b) Draw three boxplots on the same x-axis by using the information in (a).
c) Compare the boxplots in terms of shape and variability.
1.4.2 (Q3) solution
Batch A: $Q_3$ = 15.2, right-skewed; Batch B: $Q_2$ = 15.05, no outlier, left-skewed; Batch C: $Q_1$ = 14.1, right-skewed
[Figure: boxplots for Batch A, Batch B and Batch C on a common viscosity axis from 12 to 17]
MIND EXPANDING EXERCISES
15. An experiment was conducted to assess the potency of various constituents of
orchard sprays in repelling honeybees. Individual cells of dry comb were filled
with measured amounts of lime sulphur emulsion in sucrose solution. Seven
different concentrations of lime sulphur, ranging from 1/100
to 1/1,562,500 in successive factors of 1/5, were used, as well as a solution
containing no lime sulphur (A, B, C, D, E, F, G, H). The responses for the
different solutions were obtained by releasing 100 bees into the chamber for
two hours, and then measuring the decrease in volume of the solutions in the
various cells. Based on Figure ME.15, answer the following questions:
a) Which concentrations have outlier(s)?
b) Group the concentrations according to the shape of their distributions.
c) Which concentration has the most consistent data? Why?
d) Which concentration has the most variable data? Why?
e) H is the concentration with no lime sulphur. What is the use of
concentration H?
f) What conclusion can you draw from this experiment?
[Figure ME.15: not reproduced in this transcript]
1.5 NORMAL PROBABILITY PLOT
 Draw and interpret a normal probability plot.
Normal Probability Plots
 The easiest way to check whether the sample distribution is normal or not.
 The most plausible normal distribution is the one whose mean and standard deviation
are the same as the sample mean and standard deviation.
STEP 1 : Sort the data in ascending order and denote the sorted values as $x_i$, $i = 1, \ldots, n$.
STEP 2 : Number the sorted data from 1 to n.
STEP 3 : Calculate the probability value for each $x_i$ using $p_i = \frac{i - 0.5}{n}$.
STEP 4 : Plot $p_i$ versus $x_i$.
If the sample points lie approximately on a straight line,
the data are approximately normally distributed.
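Steps 1 to 3 can be sketched in Python, here with the six values from Exercise 1.5 (Q1). The function name `probability_plot_points` is an assumption; the sketch only produces the coordinates and leaves the plotting of Step 4 to whichever tool is preferred.

```python
def probability_plot_points(data):
    """Return (x_i, p_i) pairs with p_i = (i - 0.5) / n for the sorted data."""
    xs = sorted(data)  # Step 1: ascending order
    n = len(xs)
    # Steps 2 and 3: number the values 1..n and compute p_i
    return [(x, (i - 0.5) / n) for i, x in enumerate(xs, start=1)]


sample = [3.01, 3.35, 4.79, 5.96, 7.89, 9.15]  # Exercise 1.5 (Q1) data

points = probability_plot_points(sample)
for x, p in points:
    print(f"{x:5.2f}  {p:.4f}")
```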
Testing Normality Using Software
Besides plotting manually, a normal probability plot can be obtained from software
such as SPSS, Minitab and Excel. The normality of the data can also be tested
using the Kolmogorov-Smirnov, Anderson-Darling or Shapiro-Wilk tests.
EXAMPLE 1.13
→ The graph of $p_i$ versus $x_i$ in the
figure above is known as the
normal probability plot. Since the
data lie approximately on a
straight line, the data are approximately
normally distributed.
EXERCISE 1.5
1. A sample of size six is drawn. The sample, arranged in
increasing order, is
3.01 3.35 4.79 5.96 7.89 9.15
Do these data appear to come from an approximately normal
distribution?
2. The data shown represent the number of movies in America over a
14-year period.
2084 1497 1014 910 899 870 859
848 837 826 815 750 737 637
Do these data appear to come from an approximately normal
distribution?
Answers: 1) yes 2) no
1.5 (Q1) solution
[Figure: normal probability plot for Q1, with probability values from 0 to 1 against data values from 0 to 10]
1.5 (Q2) solution
[Figure: normal probability plot for Q2, Pi (0 to 1.2) versus xi (0 to 2500)]
CONCLUSION
• The applications of statistics are many and varied. People encounter them in everyday life, such as in reading newspapers or magazines, listening to the radio, or watching television.
• By combining all of the descriptive statistics techniques discussed in this chapter, the student is now able to collect, organize, summarize and present data.
Thank You
NEXT: Chapter 2 Sampling Distribution and Confidence Interval
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024APNIC
 
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一Fs
 
Networking in the Penumbra presented by Geoff Huston at NZNOG
Networking in the Penumbra presented by Geoff Huston at NZNOGNetworking in the Penumbra presented by Geoff Huston at NZNOG
Networking in the Penumbra presented by Geoff Huston at NZNOGAPNIC
 
Russian Call girls in Dubai +971563133746 Dubai Call girls
Russian  Call girls in Dubai +971563133746 Dubai  Call girlsRussian  Call girls in Dubai +971563133746 Dubai  Call girls
Russian Call girls in Dubai +971563133746 Dubai Call girlsstephieert
 
VIP Call Girls Pune Madhuri 8617697112 Independent Escort Service Pune
VIP Call Girls Pune Madhuri 8617697112 Independent Escort Service PuneVIP Call Girls Pune Madhuri 8617697112 Independent Escort Service Pune
VIP Call Girls Pune Madhuri 8617697112 Independent Escort Service PuneCall girls in Ahmedabad High profile
 
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxAWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxellan12
 
VIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
VIP Kolkata Call Girls Salt Lake 8250192130 Available With RoomVIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
VIP Kolkata Call Girls Salt Lake 8250192130 Available With Roomgirls4nights
 
Russian Call Girls Thane Swara 8617697112 Independent Escort Service Thane
Russian Call Girls Thane Swara 8617697112 Independent Escort Service ThaneRussian Call Girls Thane Swara 8617697112 Independent Escort Service Thane
Russian Call Girls Thane Swara 8617697112 Independent Escort Service ThaneCall girls in Ahmedabad High profile
 
Russian Call Girls in Kolkata Ishita 🤌 8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Ishita 🤌  8250192130 🚀 Vip Call Girls KolkataRussian Call Girls in Kolkata Ishita 🤌  8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Ishita 🤌 8250192130 🚀 Vip Call Girls Kolkataanamikaraghav4
 
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...SofiyaSharma5
 
Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
象限策略:Google Workspace 与 Microsoft 365 对业务的影响 .pdf
象限策略:Google Workspace 与 Microsoft 365 对业务的影响 .pdf象限策略:Google Workspace 与 Microsoft 365 对业务的影响 .pdf
象限策略:Google Workspace 与 Microsoft 365 对业务的影响 .pdfkeithzhangding
 
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With RoomVIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Roomishabajaj13
 
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663Call Girls Mumbai
 

Recently uploaded (20)

A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
 
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
 
Call Girls in East Of Kailash 9711199171 Delhi Enjoy Call Girls With Our Escorts
Call Girls in East Of Kailash 9711199171 Delhi Enjoy Call Girls With Our EscortsCall Girls in East Of Kailash 9711199171 Delhi Enjoy Call Girls With Our Escorts
Call Girls in East Of Kailash 9711199171 Delhi Enjoy Call Girls With Our Escorts
 
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
 
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
 
Networking in the Penumbra presented by Geoff Huston at NZNOG
Networking in the Penumbra presented by Geoff Huston at NZNOGNetworking in the Penumbra presented by Geoff Huston at NZNOG
Networking in the Penumbra presented by Geoff Huston at NZNOG
 
Russian Call girls in Dubai +971563133746 Dubai Call girls
Russian  Call girls in Dubai +971563133746 Dubai  Call girlsRussian  Call girls in Dubai +971563133746 Dubai  Call girls
Russian Call girls in Dubai +971563133746 Dubai Call girls
 
VIP Call Girls Pune Madhuri 8617697112 Independent Escort Service Pune
VIP Call Girls Pune Madhuri 8617697112 Independent Escort Service PuneVIP Call Girls Pune Madhuri 8617697112 Independent Escort Service Pune
VIP Call Girls Pune Madhuri 8617697112 Independent Escort Service Pune
 
Vip Call Girls Aerocity ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Aerocity ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Aerocity ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Aerocity ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxAWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
 
VIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
VIP Kolkata Call Girls Salt Lake 8250192130 Available With RoomVIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
VIP Kolkata Call Girls Salt Lake 8250192130 Available With Room
 
Russian Call Girls Thane Swara 8617697112 Independent Escort Service Thane
Russian Call Girls Thane Swara 8617697112 Independent Escort Service ThaneRussian Call Girls Thane Swara 8617697112 Independent Escort Service Thane
Russian Call Girls Thane Swara 8617697112 Independent Escort Service Thane
 
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 22 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 
Russian Call Girls in Kolkata Ishita 🤌 8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Ishita 🤌  8250192130 🚀 Vip Call Girls KolkataRussian Call Girls in Kolkata Ishita 🤌  8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Ishita 🤌 8250192130 🚀 Vip Call Girls Kolkata
 
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
 
Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝
 
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
 
象限策略:Google Workspace 与 Microsoft 365 对业务的影响 .pdf
象限策略:Google Workspace 与 Microsoft 365 对业务的影响 .pdf象限策略:Google Workspace 与 Microsoft 365 对业务的影响 .pdf
象限策略:Google Workspace 与 Microsoft 365 对业务的影响 .pdf
 
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With RoomVIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Room
 
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
 

Introduction to Statistics Chapter

  • 1. CHAPTER 1 INTRODUCTION TO STATISTICS
Expected Outcomes
• Able to define basic terminologies of statistics.
• Able to apply the basic steps in the statistical problem-solving methodology for various applications.
• Able to summarise and analyse data using measures of central tendency, measures of variation and measures of position.
• Able to relate the concepts of accuracy and precision of data using a game of darts.
• Able to conduct exploratory data analysis that includes numerical data analysis and various graphical displays.
• Able to plot and interpret a normal probability plot.
CONTENT
1.1 Statistical Terminologies
1.2 Statistical Problem Solving Methodology
1.3 Review on Descriptive Statistics
1.3.1 Measures of Central Tendency
1.3.2 Measures of Variation
1.3.2.1 Accuracy and Precision
1.3.3 Measures of Position
1.3.4 Descriptive Statistics Using Microsoft Excel
1.4 Exploratory Data Analysis
1.4.1 Outliers
1.4.2 Box Plot
1.5 Normal Probability Plot
1.1 STATISTICAL TERMINOLOGIES
• Define the meaning of statistics, population, sample, parameter, statistic, descriptive statistics and inferential statistics.
• Discuss the importance of statistics in daily life.
1.1.1 What is Statistics?
Most people become familiar with probability and statistics through radio, television, newspapers, and magazines. For example, the following statements were found in newspapers:
• Ten thousand parents in Malaysia have chosen StemLife as their trusted stem cell bank.
• The death rate from lung cancer was 10 times higher for smokers compared to nonsmokers.
• The average cost of a wedding is nearly RM10,000 in Malaysia.
• In Malaysia, the median salary for men with a bachelor’s degree is RM 30,000 per year, while the median salary for women with a bachelor’s degree is RM 29,000 per year.
• Globally, an estimated 500,000 children under the age of 15 live with Type 1 diabetes.
• Women who eat fish once a week are 29% less likely to develop heart disease.
  • 2. What is Statistics?
• Statistics is the science of conducting studies to collect, organise, summarise, analyse, present, interpret and draw conclusions from data.
• Data are any values (observations or measurements) that have been collected.
• Collection and analysis of data are the most important parts of research methodology.
• Researchers must have a basic knowledge of statistics before starting any research or study involving data analysis.
• Statistics is also used to analyse the results of surveys and as a tool in scientific research to make decisions based on controlled experiments, estimation, prediction, and quality control.
1.1.2 Why Do We Need Statistics?
• Basic knowledge of statistics is needed in any discipline or field of research (in almost all fields of human endeavour) that involves data analysis.
• The methods of statistics allow researchers to design a valid experiment and draw a reliable conclusion or interpretation from the data they have produced and analysed.
Examples:
• In sports, a statistician may keep records of the number of successful kicks a team scored during a football season.
• In public health, a doctor might be concerned with the number of children who are infected with the H1N1 virus during a certain year.
• In education, an educator might want to know whether the performance of students in the current semester is better than in the previous semester.
Knowledge of statistics may help you in:
1. Describing the relationship between variables.
a. A university admissions director needs to find an effective way of selecting students. He designs a statistical study to see if there is a significant relationship between SPM results and the GPA achieved by first-year students at his university. If there is a strong relationship, a high SPM result will become an important criterion for admission.
b. A management consultant wants to compare a client’s investment return for this year with related figures from last year.
He summarises the revenue and cost data from both periods and finds the relationship between these two variables. Based on his findings, he presents his recommendations to his client.
A variable is a characteristic or attribute that can assume different values. These values are data. A variable is called a random variable if its values are determined by chance.
Knowledge of statistics may help you in:
2. Making better decisions in the face of uncertainty.
a. Suppose that the manager of Unisex Hair Stylist claims that 90% of the customers are satisfied with the services. If a consumer activist feels that this is an exaggerated statement that might require legal action, the activist can use statistical inference techniques to decide whether or not to sue the manager. In this way, the knowledge gained from studying statistics can help people become better consumers.
b. People can make intelligent decisions about what products to purchase based on consumer studies, about government spending based on utilisation studies, and so on.
  • 3. 1.1.3 Population and Sample
Population (N): A complete collection of measurements, outcomes, objects or individuals under study.
• Tangible: finite; the total number of subjects is fixed and could be listed. → All computers in a room, all female students in a university, all electrical components manufactured in a day, etc.
• Conceptual (intangible): all values that might possibly have been observed; the number of subjects is unlimited. → Simulated data from a computer or instrument, the number of germs on a human body, all experimental data such as all measurements of the length of a metal rod, etc.
Sample (n): A subset of the population that is observed.
Parameter and Statistic
Parameter: A numerical value that represents a certain population characteristic.
• The average weight of students from a population of students in a university.
• The percentage of defective components in a population of electrical components manufactured in a day.
Statistic: A numerical value that represents a certain sample characteristic.
• The average weight for a sample of female students selected from all students in a university.
• The percentage of defective components in a sample of 100 electrical components.
Measurement           Parameter   Statistic
Mean (average)        μ           x̄
Variance              σ²          s²
Standard deviation    σ           s
Proportion            π           p
EXAMPLE 1.1
A travel agent claims that the average number of rooms in large hotels in Pahang is 500 and the standard deviation is 165. A sample of seven hotels in Genting Highlands is selected and the average number of rooms is found to be 435 with a standard deviation of 15. Based on the above example:
• The population under study is all large hotels in Pahang.
• The sample selected is seven large hotels in Genting Highlands.
• The population under study is tangible since there is a finite number of large hotels in Pahang.
• The characteristic (variable) is the number of rooms.
• The parameters are μ = 500 and σ = 165 since they describe the population characteristics.
• The statistics are x̄ = 435 and s = 15 since they describe the sample characteristics.
EXERCISE 1.1.3
The number of first-year students at a residential college is 317. An IQ pre-test is given to all of them in their first week. The dean of admission collected data on 27 of them and found that their mean score on the IQ pre-test was 51. The mean for all first-year students was therefore estimated to be approximately 51. A subsequent computer analysis of all first-year students showed that the true mean (population mean) is 52. Based on the above statement, answer the following questions.
a) What is the population?
b) Is the population tangible or conceptual?
c) What is the sample?
d) What is the variable of the study?
e) Which number describes a parameter?
f) Which number describes a statistic?
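The distinction between a parameter and a statistic in Section 1.1.3 can be illustrated numerically. The Python sketch below is only an illustration under stated assumptions: the scores are simulated (they are not data from this chapter), and the names `population`, `sample`, `mu`, `sigma`, `x_bar` and `s` are chosen here for convenience. It computes μ and σ from a whole population of 317 values and x̄ and s from a sample of 27 of them.

```python
import random
import statistics

# Hypothetical population: simulated pre-test scores for 317 students.
# These values are made up for illustration only.
rng = random.Random(2017)  # fixed seed only so the sketch is repeatable
population = [rng.gauss(52, 8) for _ in range(317)]

# Parameters describe the whole population (N = 317).
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)  # population SD, divides by N

# Statistics describe a sample (n = 27) drawn from that population.
sample = rng.sample(population, 27)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)           # sample SD, divides by n - 1

print(f"parameters: mu    = {mu:.2f}, sigma = {sigma:.2f}")
print(f"statistics: x_bar = {x_bar:.2f}, s     = {s:.2f}")
```

The sample values will generally differ a little from the population values; that difference is the sampling error discussed in Section 1.2.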
  • 4. 1.1.4 Descriptive and Inferential Statistics
Descriptive statistics
• Includes the process of data collection, data organisation, data classification, data summarisation, and data presentation for data obtained from the sample.
• Used to describe the characteristics of the sample.
• Used to determine whether the sample represents the target population by comparing the sample statistic with the population parameter.
Inferential statistics
• Involves a process of generalisation, estimation, hypothesis testing, prediction and determination of relationships between variables.
• Used to describe, infer, estimate or approximate the characteristics of the target population.
• Used when we want to draw a conclusion from the data obtained from the sample.
EXAMPLE: Ten thousand parents in Malaysia have chosen Takaful Insurance as their trusted life insurance agency.
EXAMPLE: The death rate from lung cancer was 10 times higher for smokers compared to nonsmokers.
[Figure: Overview of descriptive and inferential statistics]
EXERCISE 1.1.4
Decide whether each of the statements below describes descriptive statistics or inferential statistics.
a) The average cost of a wedding is nearly RM10,000.
b) In Malaysia, the median salary for men with a bachelor’s degree is RM 30,000 per year, while the median salary for women with a bachelor’s degree is RM 29,000 per year.
c) Globally, an estimated 500,000 children under the age of 15 live with Type 1 diabetes.
d) A researcher claims that a new drug will reduce the number of heart attacks in men over 70 years of age.
1.1.5 Role of the Computer in Statistics
Two kinds of software tools are commonly used for data analysis:
1. Spreadsheets: Microsoft Excel and Lotus 1-2-3
2. Statistical packages: AMOS, eViews, MINITAB, R, SAS, SmartPLS, SPSS and SPlus
  • 5. Data Analysis Application Tools in EXCEL
1. Graphs and charts
2. Formulas
3. Data Analysis Tools: File → Options → Add-Ins → Analysis ToolPak → OK → Data → Data Analysis
• Choose Analysis ToolPak and click Go.
• Tick Analysis ToolPak and click OK.
→ Now we can use the Data Analysis application in Microsoft Excel to analyse data.
  • 6. 1.2 STATISTICAL PROBLEM-SOLVING METHODOLOGY
• Outline the six basic steps in the statistical problem-solving methodology.
• Identify various sampling methods.
• Classify types of data and levels of measurement.
[Figure: Statistical Problem-Solving Methodology]
1.2.1 Identify the Problem or Opportunity
Researchers must clearly understand and define the objective of the study before conducting any research. Possible questions that could be asked before starting any study are as follows.
• What are the problem and the objective of the study?
• What are the possible variables that are related to the study?
• Can the study goal be achieved through simple counts or measurements of the group?
• What possible treatments should be imposed on the group, and what are their responses?
• Should the experiment be performed on the group?
• Do the data come from a population or a sample?
• If samples are needed, how large a sample size is appropriate? How should the samples be taken?
  • 7. Characteristics of Sample
• A sample is a subset of a population.
• The population is a complete group of people, companies, hospitals, stores, universities, students, etc., that share some set of characteristics.
• A census involves the whole population, which carries a greater likelihood of non-sampling errors.
• Sampling error arises when the statistical characteristics of a population are estimated from a subset, or sample, of that population. The difference between the sample value and the population value is the sampling error.
• Non-sampling errors are errors that are not due to sampling. For example, in a survey, mistakes may occur in the selection of people.
Characteristics of Sample Size
• The larger the sample size, the smaller the magnitude of the sampling error.
• Studies using the survey method need a larger sample size since participation in a survey is voluntary.
• Studies using mail response need a much larger sample size. Normally, the response rate is as low as 20%-30%.
• The ideal sample size in a study should be large enough to represent the population adequately, so that the results can be generalised to the whole population.
• The optimal sample size depends on the statistical distribution used and on the purpose of generalising to the whole population.
• Researchers may refer to Krejcie and Morgan (1970) as a guideline for obtaining an adequate sample size.
1.2.2 Deciding on the Method of Data Collection
• Data must be as complete as possible, accurate and relevant to the problem in order to solve it.
• Data can be obtained in 3 ways:
1) Data that are made available by others (internal, external, primary or secondary data)
• This is similar to historical or observed data.
• The availability of the data depends on primary and secondary sources of documents and evidence, which include interviews, observation, minutes of meetings, formal policy statements, etc.
• Example: Rainfall data collected from the Malaysian Meteorological Department are secondary data.
2) Data resulting from an experiment (experimental study):
• In an experimental study, the researcher manipulates one of the variables and studies how the manipulation influences other variables, provided that the treatments and the subjects are assigned to groups randomly.
• Example: Blood glucose levels obtained from diabetic patients before and after a treatment are experimental data.
3) Data collected in an observational study (observation, survey, questionnaire):
• Observations vs interviews
  • 8. Observation method
• In qualitative research: used to study behaviours or events, the context that surrounds them, and the link between the behaviour and the event.
• In quantitative research: used to collect data on the number of occurrences in a specific period of time, or the duration of a very specific behaviour or event.
• The detailed descriptions or data collected in qualitative research can later be converted to numerical data and analysed quantitatively.
• The observation method can be used to study the physical environment, social interactions, physical activities, non-verbal communication, and planned and unplanned activities.
• Example: A study of customers’ behaviour towards types of brands in a certain shopping complex is an observational study.
Interview method
• The purpose of an interview in collecting data is to find out what is in or on someone else’s mind.
• Interview data can easily become biased and misleading if the person being interviewed is aware of the perspective of the interviewer.
• It is very important to make sure the person being interviewed does not hold any preconceived notions regarding the outcome of the study.
• Interviews range from quite informal and completely open-ended to very formal, with the questions predetermined and asked in a standard manner.
• Usually, interviews are used to gather information about an individual’s experience and knowledge, his/her opinions, beliefs and feelings, and demographic data.
• Example: An interviewer who is interested in the way nurses organise their care in hospital wards conducts an interview session with them.
Other Methods of Data Collection
• Questionnaires and surveys (quantitative + qualitative).
• Opinions (qualitative + quantitative).
• Projective techniques and psychological tests (both).
• Proxemics: study of people’s use of space and its relationship to culture.
• Kinesics: study of body movement, or how people communicate nonverbally.
• Street ethnography: concentrates on a person becoming a part of the place under study.
• Narratives: study of people’s individual life stories.
• Triangulation: the use of multiple data collection techniques (triangulation of data permits the verification and validation of qualitative data).
EXERCISE 1.2.2
Identify each of the following studies as being either observational or experimental.
a) Subjects were randomly assigned to two groups; one group was given a herb and the other group a placebo. After 6 months, the numbers of respiratory tract infections in each group were compared.
b) A researcher stood at a busy intersection to see if the colour of the automobile a person drives is related to running red lights.
c) A researcher finds that people who are more hostile have higher total cholesterol levels than those who are less hostile.
d) Subjects are randomly assigned to four groups. Each group is placed on one of four special diets: a low-fat diet, a high-fish diet, a combination of a low-fat diet and a high-fish diet, and a regular diet. After 6 months, the blood pressures of the groups are compared to see if diet has any effect on blood pressure.
  • 9. 1.2.3 Collecting the Data (Sampling Techniques)
• Sampling is the process of selecting a few units from a population to become the basis for estimating or predicting the prevalence of an unknown piece of information, situation or outcome regarding the bigger group.
i. Non-probability sampling (judgment, voluntary, convenience):
• The sample is collected based on the judgment of the experimenter.
• The resulting samples might be biased.
ii. Probability sampling (random, systematic, stratified, cluster):
• The chance of each unit being selected is known before the sample is picked.
• The resulting samples are unbiased.
• Each data set collected through a sampling process can be classified as either non-probability data or probability data.
Sampling Techniques
• Nonprobability sampling: Judgment, Voluntary, Convenience, Others (Snowball, Quota)
• Probability sampling: Random, Systematic, Cluster, Stratified, Others (Multi-stage, K-Sampling, Nested)
A) Nonprobability Sampling Methods
Judgment sampling
  Method: Data are selected based on the opinion of one or more experts.
  Example: A political campaign manager intuitively picks certain voting districts as reliable places to measure public opinion of his candidates.
Voluntary sampling
  Method: Questions are posed to the public by publishing them over radio or television, with responses via phone, short message, email, etc. The resulting sample tends to over-represent individuals who have strong opinions.
  Example: A call-in radio show asks its listeners to participate in surveys on controversial topics such as abortion, affirmative action, gun control, politics, etc.
Convenience sampling
  Method: The data selected are an "easy sample" (haphazard or accidental sampling). The researcher obtains the units or people who are most conveniently available.
  Example: A surveyor stands in one location and asks passers-by the questions.
B) Probability Sampling Methods
1. Random sampling
• Each data unit is numbered, and then units are selected using a chance or random method such as random numbers.
• When a sample is chosen at random, it is said to be an unbiased sample.
• A random sample can be selected with or without replacement.
Example: Suppose a lecturer wants to study the physical fitness levels of students at his/her university. There are 5000 students enrolled at the university, and the lecturer wants to draw a sample of size 100 to take a physical fitness test. The lecturer could obtain a list of all 5000 students, number it from 1 to 5000, draw 100 random numbers in that range and invite the students corresponding to those numbers to participate in the study.
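The random sampling example above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the variable names and the fixed seed are assumptions made here, and the standard-library calls `random.sample` (without replacement) and `random.choices` (with replacement) stand in for whatever random number source is actually used.

```python
import random

N = 5000  # students on the enrolment list, numbered 1..N
n = 100   # desired sample size

rng = random.Random(42)  # fixed seed only so the sketch is repeatable

# Sampling without replacement: no student can be invited twice.
without_replacement = rng.sample(range(1, N + 1), n)

# Sampling with replacement: the same number may come up more than once.
with_replacement = rng.choices(range(1, N + 1), k=n)

# Show the first few selected student numbers in list order.
print(sorted(without_replacement)[:10])
```

For a fitness study, sampling without replacement is the natural choice, since inviting the same student twice adds no information.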
  • 10. Generating Random Numbers
• Generating random numbers is an important step in obtaining a random sample.
• Among random numbers, each number has an equal chance of being selected.
• Random numbers can be generated using a calculator, software, or a random number table.
• For example, suppose we have data numbered from 1 to 100 and we want to choose only five of them. In R, we can use the command "sample(1:100, 5)". The output is five numbers chosen at random.
B) Probability Sampling Methods
2. Systematic sampling
• A set of data x_1, x_2, ..., x_N is numbered from 1 to N.
• The first unit is selected randomly from the numbers 1 to k, where k = N/n and n is the sample size.
• Subsequent units are selected at every k-th interval to produce n samples.
Example: Suppose a lecturer wants to study the physical fitness levels of students at his/her university and wants to draw a sample of size 100 to take a physical fitness test. The lecturer obtains a list of all 5000 students, numbers it from 1 to 5000 and randomly picks one of the first 50 students (k = 5000/100 = 50) on the list. If the first number picked is 30, the 30th student on the list is invited first. The lecturer then invites every 50th name on the list after this starting point (the 80th student, the 130th student and so on) to produce a sample of 100 students to participate in the study.
B) Probability Sampling Methods
3. Stratified sampling
• The population is divided into groups according to some characteristic that is important to the study, and then the sample is selected from each group using random or systematic sampling.
• The characteristics are homogeneous (similar) within each group but heterogeneous (dissimilar) among the groups.
Example: Assume that, because of different lifestyles, the level of physical fitness is different between male and female students.
To account for this variation in lifestyle, the population of student can easily be stratified into male and female students. The random method or systematic method can be used to select the participants. As an example, she use random sample to choose 50 male students and use systematic method to choose another 50 female students or otherwise. SZS2017 B) Probability Data Samples 4. Cluster sampling • The population is divided into groups or clusters, then some of those clusters are randomly selected and all members from those selected clusters are chosen. • Cluster sampling can reduce cost and time. • Each cluster has heterogeneous characteristic but has homogeneous characteristic among the clusters. • We can choose more than one cluster. Example: Assume that, because of different lifestyles, the level of physical fitness is different between 1st year, 2nd year, 3rd year and senior students. To account for this variation in lifestyle, the population of student can easily be clustered into four categories. Then, she can choose any clusters and chose all students in that clusters as the participants. For example, all 2nd year students are chosen as the participants. SZS2017
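The three designs above can be compared side by side in a short Python sketch. The population below is hypothetical: gender and year of study are assigned at random purely for illustration, and all names are assumptions rather than part of the example.

```python
import random

rng = random.Random(2017)  # fixed seed only for repeatability

N, n = 5000, 100
k = N // n  # sampling interval, k = N/n = 50

# Systematic sampling: random start in 1..k, then every k-th student.
start = rng.randint(1, k)
systematic = list(range(start, N + 1, k))

# Hypothetical population: (student_id, gender, year). The labels are
# invented for illustration and are not data from the example.
population = [
    (i, rng.choice(["male", "female"]), rng.choice([1, 2, 3, 4]))
    for i in range(1, N + 1)
]

# Stratified sampling: split by gender, then draw 50 at random per stratum.
strata = {}
for student in population:
    strata.setdefault(student[1], []).append(student)
stratified = []
for members in strata.values():
    stratified.extend(rng.sample(members, 50))

# Cluster sampling: group by year of study, pick one cluster at random,
# and take every student in it.
clusters = {}
for student in population:
    clusters.setdefault(student[2], []).append(student)
chosen_year = rng.choice(sorted(clusters))
cluster_sample = clusters[chosen_year]

print(systematic[:3], len(stratified), chosen_year, len(cluster_sample))
```

Note that the cluster sample size is not fixed in advance: it equals the size of whichever cluster is drawn.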
  • 11. Advantages and Disadvantages for each Sampling Technique

Judgement Sampling
When to use: When the population is too large.
Advantages: Fast and conclusive.
Disadvantages: Biased, since it is based on the opinion of one or more experts only.

Voluntary Sampling
When to use: When the members of the population are convenient to be sampled.
Advantages: Fast response. Easy to obtain larger sample sizes.
Disadvantages: Samplings are too random. Sometimes not reliable. Degree of generalisability is questionable.

Convenience Sampling
When to use: When the members of the population are convenient to be sampled.
Advantages: Fast and easy. Convenient and inexpensive.
Disadvantages: Samplings are too random. Sometimes not reliable. Degree of generalisability is questionable.

Random Sampling
When to use: When the members of the population are similar to one another on important variables.
Advantages: Uses a table of random numbers. Each data item has an equal chance of being selected. Ensures a high degree of representativeness.
Disadvantages: High cost. Time consuming for large sample sizes. Tedious.

Systematic Sampling
When to use: When the members of the population are similar to one another on important variables.
Advantages: Relatively easy to construct, execute, compare and understand. The process can be controlled. Good for research on a tight budget. Ensures a high degree of representativeness. No need to use a table of random numbers.
Disadvantages: There is a risk of data manipulation. Not the best method if the researcher does not know the background of the population. Less random than simple random sampling.

Stratified Sampling
When to use: When the population is heterogeneous and contains several different groups, some of which are related to the topic of the study.
Advantages: Variety of samples. Ensures a high degree of representativeness of all the strata or layers in the population.
Disadvantages: Time consuming. Tedious.

Cluster Sampling
When to use: When the population consists of units rather than individuals.
Advantages: Less energy and money. Easy and convenient. Saves time.
Disadvantages: Members of units may differ from one another, decreasing the technique's effectiveness.

Random Data Generation From Normal Distribution
$X \sim N(\mu, \sigma^2)$ or $Z \sim N(0, 1)$, where $\mu$ is the mean and $\sigma^2$ is the variance.
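As a minimal sketch of generating data from a normal distribution, the Python standard library can be used as below. The values of mu, sigma and n are hypothetical choices made only for illustration.

```python
import random
import statistics

rng = random.Random(2017)  # fixed seed only for repeatability

mu, sigma = 50.0, 10.0  # hypothetical mean and standard deviation
n = 1000

# X ~ N(mu, sigma^2): gauss() takes the standard deviation, not the variance.
x = [rng.gauss(mu, sigma) for _ in range(n)]

# Z ~ N(0, 1): standardise each generated value.
z = [(value - mu) / sigma for value in x]

print(round(statistics.mean(x), 2), round(statistics.stdev(x), 2))
```

The sample mean and standard deviation of the generated values should be close to, but not exactly equal to, the chosen mu and sigma.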
  • 12. Random Data Generation From Poisson Distribution
$X \sim Po(\lambda)$, where $\lambda$ is the average value.

EXERCISE 1.2.3
In each of these statements, identify the type of sampling method used.
a) Suppose a researcher has a list of 1000 registered voters in a community and he wants to pick a probability sample of size 50. He uses a random number table to pick one of the first 20 voters (1000/50 = 20) on the list. The table gives him the number 16, so he selects the 16th voter on the list first. Then he picks every 20th name after this random start (the 36th voter, the 56th voter, etc.) until 50 samples are obtained.
b) In a consumer survey of large cities, a researcher divides a map of the city into small blocks. Each block containing a cluster is surveyed. A number of clusters are selected for the sample, and all the households in a cluster are surveyed. Less energy and money are expended if an interviewer stays within a specific area rather than travelling across stretches of the cities.
c) Researchers or farm managers may be called in when a crop shows a certain growing pattern or when surface differences are observed for a soil. For example, differences may occur in soil colour, which may be the result of many factors. A researcher is called to judge a particular shade of colour to be typical for a sample at certain sites. Samples are then drawn from these sites.
d) The population of university professors is divided into groups according to their rank (instructor, assistant professor, etc.) and several are selected from each group to make up a sample.
e) A surveyor stands outside a shop in the East Coast Mall and randomly selects people to participate in a quiz.
f) A quality engineer wants to inspect rolls of wallpaper in order to obtain information on the rate at which flaws in the printing are occurring. She decides to draw a sample of 50 rolls of wallpaper from a day's production. At the end of each hour, for 5 consecutive hours, she takes the 10 most recently produced rolls and counts the number of flaws on each.

MIND EXPANDING EXERCISES
1. Statistics can be applied across many disciplines and fields of research, and in almost all fields of human endeavour. Based on this statement, suggest reasons why statistics is important.
2. Is a large sample necessarily a good sample? Why or why not?
3. Suppose you have been hired by a radio station in Malaysia to determine the age distribution of its listeners. Describe in detail how you would select a sample of at least 3000 listeners. Choose the best sampling technique(s) and state your reasons. Sampling techniques may be mixed or combined.
  • 13. 1.2.4 Classifying and Summarising the Data
• In this step, the collected data are organised properly for further study and investigation.
• Data that have been collected during the sampling process are called raw data.
• The simplest way to organise raw data systematically is by using a data array. A data array is an arrangement of data items in either ascending or descending order (sorting).
1.2.4.1 Classifying
• Identify items with the same characteristics and arrange them into groups or classes.
• Data can be classified by type or by level of measurement.
1.2.4.2 Summarisation
• Graphical and descriptive statistics (tables, charts, measures of central tendency, measures of variation, measures of position).

Example of Raw Data
Data can be organised by column or by row.

1.2.4.1 Data Classification
• Data are the values that variables can assume.
• A variable is a characteristic or attribute that can assume different values.
• Variables whose values are determined by chance are called random variables.
Data can be classified:
• by how they are categorised, counted or measured (level of measurement of the data);
• as quantitative or qualitative type.

Type of Data
Qualitative (Categorical/Attribute)
• Data that refer to classification names according to some characteristic or attribute.
• Data are classified using code numbers (1, 2, …).
Nominal Data: The values cannot be ranked. Examples: gender, race, citizenship, colour, etc.
Ordinal Data: The values can be ranked, and a Likert scale is used. Examples: feeling (dislike-like), colour (dark-bright), etc.
Quantitative (Numerical)
• Data can be counted or measured.
• Data can be ordered or ranked.
Discrete Data: The values can be counted and are finite. Examples: number of students, number of cats, number of defects, etc.
Continuous Data: The values can lie anywhere between two specified values, are obtained by measuring, have boundaries, and are rounded to the required number of decimal places. Examples: weight, age, salary, temperature, etc.
  • 14. Levels of Measurement of Data

Nominal-level
Description: Classifies data into mutually exclusive (non-overlapping), exhaustive categories in which no order or ranking can be imposed on the data.
Examples: Zip code (4, 5, 6, …), post code (25000, 25600, …), gender (female, male), eye colour (blue, brown, green, hazel), political affiliation, religion, nationality.

Ordinal-level
Description: Classifies data into categories that can be ranked; however, precise differences between the ranks do not exist.
Examples: Grade (A, B, C, D, etc.), judging (first place, second place, etc.), rating scale (poor, good, excellent), colour (light blue, …, dark blue).

Interval-level
Description: Ranks the data, and precise differences between units of measure do exist; however, there is no meaningful zero.
Examples: IQ test, temperature, shoe size.

Ratio-level
Description: Possesses all the characteristics of interval measurement, and there exists a true zero.
Examples: Height, weight, time, salary.

EXERCISE 1.2.4.1
1. The SuperMotor Marketing Corporation has asked you for information about the car you drive. For each question, identify the type of data requested as either attribute data or numeric data. When attribute data are requested, identify the variable as either nominal or ordinal. When numeric data are requested, identify the variable as either discrete or continuous. Then, identify the level of measurement for each variable.
a) What is the weight of your car?
b) In what city was your car made?
c) How many people can be seated in your car?
d) What is the distance travelled from your home to your school?
e) What is the colour of your car?
f) How many cars are in your household?
g) What is the length of your car?
h) What is the normal operating temperature (in °C) of your car's engine?
i) What petrol mileage (km/l) do you get in city driving?
j) Who made your car?
k) How many cylinders are there in your car's engine?
l) How many kilometres have you put on your car's current set of tyres?

EXERCISE 1.2.4.1
2. The chart shows the number of job-related injuries for each of the transportation industries for 1998.
Type of transportation industry: Number of job-related injuries
Railroad: 4520
Intercity bus: 5100
Subway: 6850
Trucking: 7144
Airline: 9950
a) What are the variables under study?
b) Categorise each variable as either qualitative or quantitative.
c) Categorise each quantitative variable as either discrete or continuous.
d) Categorise each qualitative variable as either nominal or ordinal.
e) Identify the level of measurement for each variable.

1.2.4.2 Data Summarisation
a) Descriptive statistics (refer to Section 1.3)
• Typically used to confirm conjectures about the data.
• Quantitative data: measures of central tendency, measures of variation (dispersion) and measures of position.
• Qualitative data (non-numeric quality (attribute) or category): measure the relative frequency of a particular characteristic and calculate its percentage.
b) Graphical summary
• Organise the data in some meaningful way by constructing a frequency distribution (refer to Appendix A.1) for quantitative or qualitative data.
• A frequency distribution is the organisation of raw data in table form, using classes and frequencies.
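A frequency distribution for qualitative data, with relative frequencies and percentages, might be built as in the sketch below. The raw data are hypothetical (invented blood types), and the table layout is only one possible choice.

```python
from collections import Counter

# Hypothetical raw data: blood types of 10 people (invented for illustration).
raw = ["A", "O", "B", "O", "AB", "O", "A", "B", "O", "A"]

counts = Counter(raw)
total = sum(counts.values())

# Each row: class, frequency, relative frequency, percentage.
table = [
    (category, freq, freq / total, 100 * freq / total)
    for category, freq in counts.most_common()
]

for category, freq, rel, pct in table:
    print(f"{category:<4}{freq:>4}{rel:>8.2f}{pct:>8.1f}%")
```

Sorting the classes by decreasing frequency, as most_common() does, is also the ordering a Pareto chart would use.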
  • 15. Graphical Statistics
The purpose of graphs in statistics is to convey the data to the viewer in pictorial form and to get the audience's attention in a publication or a presentation.
Types of graph: Histogram, Frequency Polygon, Ogive, Bar Chart, Pareto Chart, Pie Chart, Time Series Graph.

Histogram, Frequency Polygon, Ogive
Histogram
• For quantitative data.
• Describes a grouped frequency distribution.
• Displays the data by using contiguous vertical bars of various heights to represent the frequencies of the classes.
Frequency Polygon
• For quantitative data.
• Describes a grouped frequency distribution.
• Displays the data by using lines that connect points plotted for the frequencies at the midpoints of the classes.
• The frequencies are represented by the heights of the points.
Ogive
• For quantitative data.
• Represents the cumulative frequencies for the classes in a grouped frequency distribution.
• Visually represents how many values are below a certain upper class boundary.

Distribution Shapes for Histogram
Bell-Shaped, Uniform, J-Shaped, Reverse J-Shaped, Right Skewed, Left Skewed, Bimodal, U-Shaped.

Bar Chart, Pareto Chart, Pie Chart
Bar Chart
• For quantitative data, the bars represent the mean values.
• For qualitative data, the heights or lengths of the bars represent the frequencies of the data.
• The bars can be vertical or horizontal.
Pareto Chart
• Used to represent a frequency distribution for a categorical variable.
• The frequencies are displayed by the heights of vertical bars, which are arranged in decreasing order.
Pie Chart
• A circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution.
• Pie charts show the relationship between each class in a set of data and the whole.
  • 16. Stem and Leaf Plot, Time Series Graph
Time Series Graph
• Represents data that occur over a specific period of time.
• For analysis, we look at the trend or pattern (increasing or decreasing) that occurs over the time period.
• Further analysis will look at the slope or steepness of the line (rapid increase or decrease).
Stem and Leaf Plot
• The leading digit is plotted as the stem and the trailing digit as the leaf to form groups or classes.
• A key is used to define the stem and leaf values.
• If the plot is rotated to a horizontal position, we can see the shape of the data distribution.
• For a mixture stem and leaf plot, the shape of the distribution for the left side may be seen by reflecting the plot to the right side.
• We may analyse the variability of the data by looking at the spread of the stem and leaf plot.
• A stem and leaf plot is also good at showing the range, minimum, maximum, mode, gaps, clusters, and outliers.

Selection of appropriate statistical techniques for data summarisation

Quantitative (ratio scale)
Descriptive statistics: Mean, Median, Mode, Range, Standard Deviation, Interquartile Range (IQR = Q3 - Q1)
Graphical summary: Histogram, Bar Chart (bars representing means), Stem and Leaf Plot, Boxplot

Symmetrical Distribution
Descriptive statistics: Mean, Median, Mode, Range, Standard Deviation
Graphical summary: Histogram, Bar Chart (bars representing means)

Skewed Distribution
Descriptive statistics: Median, Range, Interquartile Range (IQR = Q3 - Q1)
Graphical summary: Histogram, Stem and Leaf Plot, Boxplot

Categorical (Nominal)
Descriptive statistics: Mode, Counts, Percentage
Graphical summary: Pie Chart, Bar Chart

Categorical (Ordinal, Likert Scale)
Descriptive statistics: Mode, Mean, Counts, Percentage
Graphical summary: Pie Chart, Bar Chart

1.2.5 Presenting and Analysing the Data
• Analyse the information given by:
  • the descriptive statistics (refer to Section 1.3);
  • the graphical summary (graphs and charts).
• Identify whether any relationship exists among the variables under study.
• Make any relevant statistical inferences: confidence intervals, hypothesis testing, ANOVA, goodness of fit tests, contingency tables, regression, correlation, etc.

BASIC INFERENTIAL STATISTICS

Confidence Intervals (CHAPTER 2)
An estimated range of values which is likely to include an unknown population parameter, $\theta$, with a specified probability (confidence level). The interval is usually written as $(a, b)$ or $a < \theta < b$.

Hypothesis Testing (CHAPTER 3)
Tests a statement (claim, conjecture or assertion) concerning a parameter or parameters of one or more populations.
• Statistical analysis for one population (mean, variance, proportion)
• Statistical analysis for two populations (mean, variance, proportion)

Analysis of Variance (ANOVA) (CHAPTER 4)
Statistical analysis for the means of three or more populations.
• One-way ANOVA
• Two-way ANOVA and Post Hoc Test

Linear Regression Analysis (CHAPTER 5)
A statistical measure that attempts to determine the strength of the relationship between a dependent variable (y) and independent variables (x).
• Simple linear regression analysis and correlation (y vs x)
• Multiple linear regression analysis and correlation (y vs xi)
• Model selection techniques to choose a parsimonious model that best fits the data

Statistical Analysis for Categorical Data (CHAPTER 6)
1. Tests concerning frequency distributions for categorical data (Goodness of Fit)
2. Tests concerning specific probability distributions (Goodness of Fit)
3. Tests of the independence of two variables (Contingency Table)
4. Tests of the homogeneity of proportions (Contingency Table)
  • 17. ADVANCED INFERENTIAL STATISTICS

Experimental Design (DOE)
Planning, conducting, analysing and interpreting controlled tests to evaluate the factors that control the value of a parameter or group of parameters.
Example: ANOVA, Single Factor Experiment, Randomized Blocks, Latin Squares and Related Designs, Factorial Design, Response Surface Methodology, Nested and Split-Plot Designs

Time Series Analysis
Modelling, making inferences about, and producing forecasts of time series data for future observations. Time series models are built to represent serially correlated series, trends, or seasonal effects.
Example: Linear Time Series, Linear Stationary Models (AR, MA, ARMA), Linear Nonstationary Models (ARIMA, SARMA), Box-Jenkins Models, Volatility Models (ARCH, GARCH), Hybrid Models

Multivariate Analysis
A central tool whenever many variables need to be considered at the same time.
Example: Mean Vector and Covariance Matrix Estimation, MANOVA, Principal Component Analysis, Factor Analysis, Canonical Correlation Analysis, Discriminant Analysis, Cluster Analysis

Statistical Quality Control (SQC)
Quality improvement through the use of modern statistical methods for quality control.
Example: Variables Control Charts, Attribute Control Charts, Time-Weighted Control Charts, Multivariate Control Charts

Statistical Modelling
Mathematical equations that relate one or more random variables, and possibly other non-random variables, describing the generation of sample data and of similar data from a larger population.
• Examples of statistical models: Generalised Linear Model, Dependence Model, Regression, Bayesian, Markov Chain, Random Effect and Mixed Model
• The process involves: parameter estimation, data generation, missing values, outlier detection, simulation study, bootstrap, goodness of fit test

Data Mining
A computing process of discovering patterns in large data sets, involving methods at the intersection of machine learning, statistics, and database systems.
Example: Decision Tables, Decision Trees, Classification Rules, Association Rules, Clustering, Advanced Linear Models, Bayesian, Instance-based Learning

Circular Statistics
A branch of statistics that involves circular data, which deal with direction or cyclic time. Circular data are measured in degrees (0°, 360°] or radians (0, 2π].
Example: orientation of an animal, direction of wind and waves, days of the week, compass direction, waves of sound, human perception under various conditions, the orientation of ridges of fingerprints, the orientation of sand grains from a beach, deaths due to a disease at various times in a year, and astronomical observations.

Advanced Regression Analysis
• Polynomial Regression: y is modelled as an nth degree polynomial in x.
• Multivariate Regression: Y is a matrix with a series of multivariate dependent measurements and X is a matrix of observations on independent variables.
• Generalized Linear Model: A flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution.
• Logistic Regression: A regression model where the dependent variable is categorical.
• Nonlinear Regression: The observational data are modelled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables.
• Error in Variables: A regression model that accounts for measurement errors in the independent variables.

1.2.6 Make the decision and conclusion
• The researchers can make decisions in order to achieve the objective and goal of the research, and choose the option which represents the 'best' solution to the problem.
• The correctness of this choice depends on the analytical skill of the researchers and the quality of the information.
  • 18. 1.3 REVIEW ON DESCRIPTIVE STATISTICS
• Summarise the data using measures of central tendency, such as the mean, median, mode, and midrange.
• Describe the data using measures of variation, such as the range, variance, standard deviation and coefficient of variation.
• Identify the position of a data value in a data set using measures of position such as quartiles, deciles, and percentiles.

Review on Descriptive Statistics
• Descriptive statistics are typically used to confirm conjectures about the data.
• We can summarise data using measures of central tendency, measures of variation, and measures of position.
• Some classify these types of measures as traditional statistics.
• If the measurement describes a population characteristic, it is called a parameter.
• If the measurement describes a sample characteristic, it is called a statistic.

RULE OF THUMB FOR DECIMAL PLACES
1. In general, the calculated parameter or statistic value should be rounded to four (4) decimal places.
2. If the unit is given (in cm, minute, day, etc.), the value should be rounded to that unit's decimal places.

TIPS: Descriptive Statistics using Scientific Calculator
Note: The notations used in the calculator are n as the sample size, $\bar{x}$ as the sample mean, $x\sigma_n$ or $\sigma_x$ as the population standard deviation, and $x\sigma_{n-1}$ or $s_x$ as the sample standard deviation.
Casio fx-570MS
STEP 1: Insert data → MODE, SD, insert data, M+, AC
STEP 2: Data summary → Shift 1, then Shift 2
STEP 3: Clear data → Shift CLR 1
Casio fx-570ES
STEP 1: Insert data → MODE 3: STAT, 1: 1-VAR, insert data, =, AC
STEP 2: Data summary → Shift 1, 3: Sum, then Shift 1, 4: Var
STEP 3: Clear data → Shift 9
  • 19. 1.3.1 Measures of Central Tendency
• Measures of central tendency are also called measures of average:
1. mean
2. median
3. mode, and
4. midrange.
• The measures of central tendency are used to describe an entire set of observations with a single value representing the central or middle value of the data set. They can also roughly describe the shape of the distribution of a data set.

Midrange (MR)
Is a rough estimate of the middle.
$MR = \frac{\text{lowest value} + \text{highest value}}{2}$
EXAMPLE 1.3: If the data set is 1, 3, 5, 7, 7, 8, then the calculated midrange is $MR = \frac{1 + 8}{2} = 4.5$.
Properties of Midrange
• A rough estimate of the average.
• Can be affected by one extremely high or low value (outlier).

Mean
Is the sum of the values divided by the total number of values.
Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, N = population size
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, n = sample size
If the data set is 1, 3, 5, 7, 7, 8, then
‒ the calculated mean is $\mu = 5.1667$ if the data are taken from the population. The value is a true mean or a parameter.
‒ the calculated mean is $\bar{x} = 5.1667$ if the data are taken from a sample. The value is a sample mean or a statistic.
  • 20. Median
Is the middle number of n ordered data (smallest to largest).
If n is odd: $\text{Median (MD)} = x_{\frac{n+1}{2}}$
If n is even: $\text{Median (MD)} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}$
If the data set is 1, 3, 5, 6, 7, then the calculated median is $\text{Median} = x_3 = 5$.
If the data set is 1, 3, 5, 7, 7, 8, then the calculated median is $\text{Median} = \frac{x_3 + x_4}{2} = \frac{5 + 7}{2} = 6$.

Mode
Is the most commonly occurring value in a data series.
EXAMPLE 1.4:
a) If the data set is 1, 6, 3, 7, 8, 5, then the mode does not exist.
b) If the data set is 1, 6, 3, 7, 8, 3, 5, then the mode is 3.
c) If the data set is 1, 6, 3, 7, 3, 8, 7, 5, 3, 7, then the modes are 3 and 7.
Properties of Mode
• The mode is used when the most typical case is desired.
• The mode can be used when the data are nominal.
• The mode is not always unique.
• A data set can have more than one mode, or the mode may not exist for a data set.

Identify the Shapes of Data Distribution
Symmetric: Mean = Median = Mode
Positively skewed / right-skewed: Mode < Median < Mean
Negatively skewed / left-skewed: Mean < Median < Mode
→ In reality, the median can be greater than the mode or the mean.
→ The shape of the distribution may be identified by observing the position of the mode value.

EXAMPLE 1.3
If the data set is 1, 3, 5, 7, 7, 8, then
‒ the calculated mean is $\mu = 5.1667$ if the data are taken from the population. The value is a true mean or a parameter.
‒ the calculated mean is $\bar{x} = 5.1667$ if the data are taken from a sample. The value is a sample mean or a statistic.
‒ the calculated median is $\text{Median} = \frac{x_3 + x_4}{2} = 6$.
‒ the mode is 7.
‒ the shape of the distribution is negatively skewed since Mean < Median < Mode.
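The values in Example 1.3 can be checked with Python's statistics module, as a complement to the calculator steps. The shape() helper is a hypothetical illustration of the ordering rule above and assumes a single mode.

```python
import statistics

data = [1, 3, 5, 7, 7, 8]  # data set from Example 1.3

mean = statistics.mean(data)
median = statistics.median(data)
modes = statistics.multimode(data)  # a list, since the mode need not be unique
midrange = (min(data) + max(data)) / 2


def shape(mean, median, mode):
    """Rough shape check from the ordering of mean, median and mode."""
    if mean < median < mode:
        return "negatively skewed"
    if mode < median < mean:
        return "positively skewed"
    if mean == median == mode:
        return "symmetric"
    return "undetermined"


print(round(mean, 4), median, modes, midrange, shape(mean, median, modes[0]))
```

Note that multimode() returns every value when all values are tied, a case in which the slides say the mode does not exist.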
  • 21. Properties of Mean and Median
• The mean is unique, and not necessarily one of the data values.
• The mean is affected by extremely high or low values; when these occur, the mean may not be the appropriate average to use.
• The mean is used in computing other statistics, such as the variance.
• The mean cannot be computed for an open-ended frequency distribution.
• The mean varies less than the median or mode when samples are taken from the same population and all three measures are computed for these samples.
• The mean is not an appropriate average to use if the shape of the distribution is skewed.
• The median is used when one must find the centre or middle value of a data set.
• The median is used when one must determine whether the data values fall into the upper half or lower half of the distribution.
• The median is affected less than the mean by extremely high or extremely low values.

EXAMPLE 1.5
An extreme value, say 21, is added to the data set in Example 1.3. The new data set is 1, 3, 5, 7, 7, 8, 21. Assume that the data are taken from a sample, then
‒ the calculated mean is $\bar{x} = 7.4286$. The mean is easily affected by outliers and may not be the appropriate average to use. This new average value no longer represents the centre of the data set.
‒ the calculated median is $\text{Median} = x_4 = 7$. This new average value still represents the centre of the data set.
‒ the mode is 7.
‒ the calculated midrange is $MR = \frac{1 + 21}{2} = 11$. The midrange is easily affected by outliers.
‒ the shape of the distribution is positively skewed: the extreme value pulls the mean (7.4286) above the median and the mode (both 7).
An extremely high or low data value that occurs in a data set is called an outlier.

EXERCISE 1.3.1
1. Determine the shape of the distribution of the following data.
a) Mean = Mode = Median = 11
b) Mean = 25, Mode = 13, Median = 17
c) Mean = 5, Mode = 73, Median = 17
d) 11.4, 11.6, 12.6, 12.7, 12.8, 13.3, 13.3, 13.6, 13.7, 13.8
Answers: a) symmetric b) right-skewed c) left-skewed d) Mean = 12.88, Median = 13.05, Mode = 13.3, left-skewed
  • 22. EXERCISE 1.3.1
2. The following set of data represents the number of hospitals for selected countries.
123 108 195 138 115 179 119 148 147 180 146 178 189
108 193 114 179 147 108 128 164 174 128 159 193 175
a) Find the mean, median, mode, and midrange.
b) Are the average values calculated in (a) parameters or statistics? Why?
c) What is the distribution type that describes the data?
d) What is the best measure of average for this set of data? Why?
Answers: a) Mean = 151.3462, Median = 147.5, Mode = 108 b) statistic c) right-skewed d) median

1.3.2 Measures of Variation/Dispersion
• Measures of variation or measures of dispersion are measures that determine the spread of data values:
1. Range: the simplest measure of variation
2. Variance
3. Standard deviation
4. Coefficient of variation
• Measures of variation may help researchers to describe data more accurately.
• Variance and standard deviation are used quite often in inferential statistics. They are the more meaningful and popular measures that describe the variability of data.

Range (R)
Is the difference between the highest value and the lowest value in a data set.
R = highest value - lowest value
Properties of Range
• The simplest measure of variation.
• Easily affected by one extremely high or low value (outlier).
EXAMPLE 1.6: Suppose the data set is 1, 6, 3, 7, 8, 5, then the calculated range is $R = 8 - 1 = 7$.

Variance
Is the average of the squares of the distance each value is from the mean.
Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, N = population size
Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, n = sample size

Standard Deviation
Is the square root of the variance.
Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$, N = population size
Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$, n = sample size
  • 23. Properties of Variance & Standard Deviation
• The variance is the average of the squares of the distance of each value from the mean.
• If the data values are near the mean, the variance will be smaller.
• If the data values are far from the mean, the variance will be larger.
• The squared distance is used because the sum of the (signed) distances from the mean is always zero.
• The variance is never negative.
• The variance is expressed in the square of the unit of the data, so it cannot be read directly in the original unit.
• The standard deviation is the square root of the variance.
• The standard deviation measures the deviation of the values from the mean.
• The standard deviation is never negative.
• The unit of the standard deviation is the same as the unit of the data.

Coefficient of Variation
The standard deviation divided by the mean.
Population CVar: $\text{CVar} = \dfrac{\sigma}{\mu} \times 100\%$
Sample CVar: $\text{CVar} = \dfrac{s}{\bar{x}} \times 100\%$
Properties of CVar
• The result is expressed as a percentage.
• A parameter/statistic that allows the user to compare standard deviations when the units are different (the variables are different).

RECALL: Descriptive Statistics using Scientific Calculator
Note: The notations used in the calculator are $n$ for the sample size, $\bar{x}$ for the sample mean, $x\sigma_n$ or $\sigma_x$ for the population standard deviation, and $x\sigma_{n-1}$ or $s_x$ for the sample standard deviation.
Casio fx-570MS
STEP 1: Insert data → MODE, SD, insert data, M+, AC
STEP 2: Data summary → Shift 1 → Shift 2 →
STEP 3: Clear data → Shift CLR 1
Casio fx-570ES
STEP 1: Insert data → MODE 3: STAT, 1: 1-VAR, insert data, =, AC
STEP 2: Data summary → Shift 1 → 3: Sum → Shift 1 → 4: Var →
STEP 3: Clear data → Shift 9

EXAMPLE 1.6
Suppose the data set is 1, 6, 3, 7, 8, 5. Then
‒ the range is $R = 8 - 1 = 7$.
‒ the variance is $\sigma^2 = 5.6667$ and the standard deviation is $\sigma = 2.3805$ if the data are taken from the population. These values are called parameters.
‒ the variance is $s^2 = 6.8$ and the standard deviation is $s = 2.6077$ if the data are taken from a sample. These values are called statistics.
‒ the sample mean is $\bar{x} = 5$. Hence the sample coefficient of variation is $\text{CVar} = \dfrac{2.6077}{5} \times 100\% = 52.15\%$.
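The figures in Example 1.6 can be cross-checked in software as well as on a calculator. The following is a minimal sketch using Python's standard `statistics` module; the function name `spread_summary` and its dictionary keys are illustrative choices, not part of the module.

```python
import statistics

def spread_summary(data):
    """Range, population/sample variance and SD, and sample CVar (in %)."""
    mean = statistics.mean(data)
    return {
        "range": max(data) - min(data),
        "pop_variance": statistics.pvariance(data),    # divides by N
        "pop_sd": statistics.pstdev(data),
        "sample_variance": statistics.variance(data),  # divides by n - 1
        "sample_sd": statistics.stdev(data),
        "sample_cvar_pct": statistics.stdev(data) / mean * 100,
    }

summary = spread_summary([1, 6, 3, 7, 8, 5])
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The only difference between the population and sample versions is the divisor ($N$ versus $n - 1$), which is why the sample values are slightly larger for the same data.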
  • 24. Why We Need Measures of Variation
• Measures of variation help judge how well the measures of average represent the data.
• They measure the variability that exists in a data set.
• They can be used when the measures of central tendency are not meaningful, not needed, or not practical.
EXAMPLE: Suppose we wish to compare the performance of two groups of students in a test, and the mean scores of the two groups are the same. From the means alone, you might conclude that the two groups performed equally well. However, if the data sets are examined graphically as shown in Figure 1.10, a different conclusion might be drawn.

Examining Data Sets Graphically
• Both groups have the same number of students.
• The students sat the same test, and the mean score is 66.67 marks for each group.
• The mean values are the same, but the spread (variation) of the test scores is quite different.
• The test scores of the students in Group B are more consistent and less variable.
• When the mean values are equal, the larger the range of the data, the more variable the data.

Comparing Two Data Sets
If $\sigma_1 < \sigma_2$, the smaller standard deviation indicates that:
POPULATION 1 is: less dispersed; less spread; less variable (small variation); more consistent; more precise; more accurate; better data.
POPULATION 2 is: more dispersed; more spread; more variable (large variation); less consistent; less precise; less accurate; worse data.
The same interpretation applies to the range and the variance.

EXAMPLE 1.7
The following data represent the age (in years) of lecturers in two faculties at UMP.
FIST: 24, 25, 26, 27, 30, 31, 31, 32, 36, 40, 43, 44, 45
FKEE: 22, 25, 25, 25, 28, 33, 34, 36, 37, 40, 41, 43, 48, 51, 53
For these sample data sets, find the standard deviations. Then identify which data set is more consistent and less dispersed. What can you say about the variation of age for lecturers in both faculties?
Solution:
$s_{FIST} = 7.4670$ years
$s_{FKEE} = 9.9460$ years
$s_{FIST} < s_{FKEE}$, so the FIST data are more consistent and less dispersed. The variation in the ages of lecturers in FIST is smaller than that of the FKEE lecturers.
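The comparison in Example 1.7 can be reproduced with a short script. This is a sketch only: `more_consistent` is a hypothetical helper name, and it simply picks the sample with the smallest sample standard deviation.

```python
import statistics

fist = [24, 25, 26, 27, 30, 31, 31, 32, 36, 40, 43, 44, 45]
fkee = [22, 25, 25, 25, 28, 33, 34, 36, 37, 40, 41, 43, 48, 51, 53]

def more_consistent(samples):
    """Return the label with the smallest sample SD, plus all the SDs."""
    sds = {label: statistics.stdev(values) for label, values in samples.items()}
    return min(sds, key=sds.get), sds

label, sds = more_consistent({"FIST": fist, "FKEE": fkee})
for name, sd in sds.items():
    print(f"s_{name} = {sd:.4f} years")
print("More consistent:", label)
```

Comparing standard deviations directly like this is only appropriate because both samples measure the same variable in the same unit (age in years).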
  • 25. EXERCISE 1.3.2 (Q1&Q2)
1. Which of the following sets of sample data is less variable?
Method A: 79 73 78 76 80 75 82 70 77
Method B: 80 85 78 79 75 73 70 60 65
2. The following sets of sample data represent the battery lifetime (in hours) for two different brands. Which brand of battery performs better?
A: 4.2, 6.7, 7.3, 7.5, 8.0, 8.5, 8.7, 8.8, 9.2, 9.3
B: 9.6, 9.7, 9.8, 9.9, 10.1, 10.2, 11.0, 11.0, 11.0, 11.1
Answers: 1) $s_A = 3.6742$, $s_B = 7.8493$ 2) $s_A = 1.5$ hours, $s_B = 0.6$ hours

Comparing Two Data Sets with Different Units/Variables
• If two samples do not have the same unit of measurement, or the variables are different, the variance and standard deviation of the samples cannot be compared directly.
• As an example, suppose a car dealer wants to compare the variation in the number of cars sold in a year with the variation in the commission (in RM) earned by the salespeople. These two variables clearly have different units.
• Hence, the best way to compare the variability of these two variables is the coefficient of variation.
• If $\text{CVar}_1 < \text{CVar}_2$, then variable one is less variable than variable two.

EXERCISE 1.3.2 (Q3)
3. The average age of the accountants at a large company is 31 years with a standard deviation of 4 years. The average salary of the accountants is RM 44255 per year with a standard deviation of RM 780. Compare the variations of age and income.
Answers: $\text{CVar}_{age} = \dfrac{4}{31} \times 100\% = 12.90\%$, $\text{CVar}_{income} = \dfrac{780}{44255} \times 100\% = 1.76\%$, so age is relatively more variable than income.

Other Properties of Standard Deviation
• Used to determine the number of data values that fall within a specified interval in a distribution.
• The values under the curve indicate the percentage of area in each section or range of data.
• About 95% of the data values fall within $\mu - 2\sigma$ and $\mu + 2\sigma$.
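The "about 95% within two standard deviations" statement can be checked numerically for a normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `share_within` is an illustrative assumption, and the result applies to normally distributed data only.

```python
from statistics import NormalDist

def share_within(k, mu=0.0, sigma=1.0):
    """Proportion of a normal distribution lying within mu +/- k*sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.2%}")
```

The proportion depends only on $k$, not on the particular values of $\mu$ and $\sigma$, which is why the same percentages can be quoted for any normal data set.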
  • 26. 1.3.2.1 Accuracy and Precision Concept (Validity and Reliability)
• Accuracy is how close a measured value is to the 'true' value.
• No measurement or device is perfect; it can easily be inaccurate and lead to false measurements, so there is always a tolerance for error.
• Accuracy must be accounted for in your results.
• The bigger the difference between the measured and the true values, the less accurate (less valid) the measurement.
• Precision is how close the measured values are to each other, or how consistent your results are for the same phenomenon over several measurements.
• Precision, as a measure of variation, must be accounted for in your calculations and results.
• The precision of a measurement also depends on the size of the unit used to make the measurement. The smaller the unit, the more precise (more reliable) the measurement.
→ The concept is important to ensure that data collected from an experiment or observation are good, valid, and reliable.

Game of Darts
• Accuracy without precision: the measurements are close to the mark on average, but the darts are spread out everywhere. Valid but not reliable.
• Precision without accuracy: very consistent, but not near the mark. Not valid but reliable.
• Inaccuracy and imprecision: not valid and not reliable.
• Accurate and precise: valid and reliable. Very good measurement.

EXERCISE 1.3.2 (Q4)
4. Identify each situation as accurate, precise, or both.
a) You are playing football and you always hit the left goal post instead of scoring.
b) A candy manufacturer claims that each packet contains 20 candies. A sample of packets has 18, 21, 19, 21, 19, 20, 22 candies, respectively. The average is 20 candies with an error of 1 candy.
c) A manufacturer claims that each chocolate packet contains 20 chocolates. A sample of packets has 17, 18, 18, 17, 18, 17, 17 chocolates, respectively.
d) In an experiment with five trials, the results of the five trials are 35 kg, 36 kg, 36 kg, 35 kg, 36 kg. The actual value (as found in a scientific data book) is 42 kg.
e) In an experiment with five trials, the average value is 35 kg. The actual value (as found in a scientific data book) is 35 kg.

MIND EXPANDING EXERCISES
4. In what sense are the mean, median, mode and midrange measures of the "centre" of a data set?
5. Which do you think has more variation: the IQ scores of 30 students in a statistics class or the IQ scores of 30 teenagers watching a movie? Why?
6. Explain why the median and interquartile range are more appropriate measures than the mean and variance for non-normal data.
7. A JDT football fan records the number on the jersey of each player in a game. Does it make sense to calculate the mean of those numbers? Why or why not?
  • 27. MIND EXPANDING EXERCISES
8. In an analysis of the accuracy of weather forecasts, the actual high temperatures are compared with the high temperatures predicted one day earlier and the high temperatures predicted five days earlier. Listed below are the errors between the predicted temperatures and the actual high temperatures for 14 consecutive days in Kuala Lumpur.
Actual high - high predicted one day earlier: 2 2 0 0 -3 -2 1 -2 8 1 0 -1 0 1
Actual high - high predicted five days earlier: 0 -3 2 5 -6 -9 4 -1 6 -2 -2 -1 6 -4
a) Do the means and medians of the errors indicate that the temperatures predicted one day in advance are more accurate than those predicted five days in advance, as we might expect?
b) Do the standard deviations of the errors indicate that the temperatures predicted one day in advance are more accurate than those predicted five days in advance, as we might expect?

ME.8 (solution)
Mean | Median | SD (rows in the same order as the data above)
1.5000 | 1.0000 | 2.4152
3.8333 | 4.5000 | 2.4014

MIND EXPANDING EXERCISES
9. A data set consists of 20 values that are fairly close together. Another value is included, but this new value is an outlier (very far away from the other values). How is the standard deviation affected by the outlier? No effect? A small effect? Or a large effect?
10. Suppose scores on a psychology test have a mean of 90 and a standard deviation of 10, while scores on an economics test have a mean of 55 and a standard deviation of 5. Which is relatively better: a score of 85 on the psychology test or a score of 45 on the economics test?
11. When designing the production procedure for batteries used in heart pacemakers, an engineer specifies that "the batteries must have a mean life greater than 10 years, and the standard deviation of the battery life can be ignored." If the mean battery life is greater than 10 years, can the standard deviation be ignored? Why or why not?
1.3.3 Measures of Position
Measures of position describe where a specific data value falls within the data set, that is, its position relative to the other data values, based on percentiles, deciles and quartiles. The data must first be arranged in increasing order.
Quartiles split the data into 4 equal parts: $Q_i = x_c$, where $c = \dfrac{in}{4}$.
Deciles split the data into 10 equal parts: $D_i = x_c$, where $c = \dfrac{in}{10}$.
Percentiles split the data into 100 equal parts: $P_i = x_c$, where $c = \dfrac{in}{100}$.
  • 28. • If $c$ is not a whole number, round it up to the next whole number.
• If $c$ is a whole number, then use $Q_i = \dfrac{x_c + x_{c+1}}{2}$, $D_i = \dfrac{x_c + x_{c+1}}{2}$, $P_i = \dfrac{x_c + x_{c+1}}{2}$.

EXAMPLE 1.9
A manufacturer measured the volume of a sample of 11 bottles of chemical solvent. The results (in millilitres) are as follows.
40 45 38 25 42 31 30 44 26 27 36
Show that $Q_1$ is equivalent to $P_{25}$, $Q_2$ is equivalent to $P_{50}$, $Q_3$ is equivalent to $P_{75}$, and $D_i$ is equivalent to $P_{10i}$, where $i = 1, 2, \ldots, 9$.
The data set in increasing (ascending) order: 25 26 27 30 31 36 38 40 42 44 45
Quartiles | Percentiles
$Q_1 = x_{1(11)/4} = x_{2.75} \to x_3 = 27$ | $P_{25} = x_{25(11)/100} = x_{2.75} \to x_3 = 27$
$Q_2 = x_{2(11)/4} = x_{5.50} \to x_6 = 36$ | $P_{50} = x_{50(11)/100} = x_{5.50} \to x_6 = 36$
$Q_3 = x_{3(11)/4} = x_{8.25} \to x_9 = 42$ | $P_{75} = x_{75(11)/100} = x_{8.25} \to x_9 = 42$
Summary: $Q_1$ is equivalent to $P_{25}$; $Q_2$ is equivalent to $P_{50}$; $Q_3$ is equivalent to $P_{75}$.
Deciles | Percentiles
$D_3 = x_{3(11)/10} = x_{3.3} \to x_4 = 30$ | $P_{30} = x_{30(11)/100} = x_{3.3} \to x_4 = 30$
$D_5 = x_{5(11)/10} = x_{5.5} \to x_6 = 36$ | $P_{50} = x_{50(11)/100} = x_{5.5} \to x_6 = 36$
$D_7 = x_{7(11)/10} = x_{7.7} \to x_8 = 40$ | $P_{70} = x_{70(11)/100} = x_{7.7} \to x_8 = 40$
Summary: $D_i$ is equivalent to $P_{10i}$, where $i = 1, 2, 3, 4, 5, 6, 7, 8, 9$.

EXERCISE 1.3.3
1. Given the data set 9 2 1 4 3 7 5 4 6.
a) Find the value that corresponds to the 4th decile.
b) Find the value that corresponds to the 3rd quartile.
2. A teacher gives a 25-point test to ten students. The scores are shown below.
9 22 11 14 13 3 7 15 18 16
a) Find the score that corresponds to the 20th percentile.
b) Find the score that corresponds to the 7th decile.
Answers: 1) 4, 6 2) 8, 15.5
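The position rule used in Example 1.9 (round $c$ up when it is not a whole number, average $x_c$ and $x_{c+1}$ when it is) can be sketched in Python as follows. The function name `position_value` and its `parts` argument are illustrative assumptions; note that statistical software such as Excel often uses a different interpolation rule, so its quartiles may not match this hand method.

```python
def position_value(data, i, parts):
    """Value at the i-th of `parts` divisions (4: quartile, 10: decile, 100: percentile)."""
    if not 0 < i < parts:
        raise ValueError("i must satisfy 0 < i < parts")
    x = sorted(data)
    n = len(x)
    if (i * n) % parts == 0:
        # c = i*n/parts is a whole number: average x_c and x_(c+1)
        c = i * n // parts
        return (x[c - 1] + x[c]) / 2
    # c is not a whole number: round up to the next whole number
    c = (i * n + parts - 1) // parts
    return x[c - 1]

bottles = [40, 45, 38, 25, 42, 31, 30, 44, 26, 27, 36]
quartiles = [position_value(bottles, i, 4) for i in (1, 2, 3)]
percentiles = [position_value(bottles, p, 100) for p in (25, 50, 75)]
print(quartiles, percentiles)
```

Integer arithmetic is used to decide whether $c$ is whole, which avoids floating-point surprises when $in$ is divided by 4, 10 or 100.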
  • 29. Why We Need Measures of Position
• Percentiles are measures of position that are often used in education and health-related fields to indicate the position of an individual in a group.
• A percentile is not a percentage. The $i$th percentile, $P_i$, is a value such that $i$% of the data are less than or equal to $P_i$ and $(100 - i)$% are greater than or equal to $P_i$.
EXAMPLE: If a student obtains 82 marks out of 100 in a test, his or her score is 82%. However, this gives no indication of the student's position relative to the rest of the class. On the other hand, if the score corresponds to the 75th percentile, then the student did better than 75% of the students in the class.

Quartiles can be used as a rough measurement of variability.
INTERQUARTILE RANGE (IQR)
• defined as the difference between the third and first quartiles, $IQR = Q_3 - Q_1$; it is the range of the middle 50% of the data.
• used to identify outliers and to measure variability in exploratory data analysis (Section 1.4).
• the smaller the value of the IQR, the smaller the variation in the data.
• useful for showing whether a data set is more variable (more dispersed, more spread) or more consistent.

1.3.4 Descriptive Statistics Using Microsoft Excel
  • 30. Interpreting Descriptive Statistics Using Microsoft Excel (Example 1.9)
A firm is conducting a study to compare two different physical arrangements of its assembly line. The arrangement with the smaller variance in the number of finished units produced per day will be adopted as the new arrangement of its assembly line.
→ $\bar{x}_1 < \bar{x}_2$: on average, Assembly Line 2 produced more finished units per day.
→ $s_1 < s_2$, $R_1 < R_2$ and $\text{s.e.}_1 < \text{s.e.}_2$: the arrangement of Assembly Line 1 is more consistent, less dispersed, less spread, less variable (small variation), and more precise. Therefore the arrangement of Assembly Line 1 will be adopted as the new arrangement.
→ For Assembly Line 1, the distribution of the data is negatively skewed (left-skewed) since Mean < Median < Mode. The skewness value is negative too.
→ For Assembly Line 2, the distribution of the data is also negatively skewed (left-skewed) since the mode is higher than the mean and the median. The skewness value is negative too.
→ The magnitude of the skewness value for Assembly Line 2 is higher than that for Assembly Line 1. Hence the distribution of the data from Assembly Line 2 is more skewed to the left, indicating that Assembly Line 2 produced more finished units per day.
→ For Assembly Line 1, $\bar{x}_1 \pm \text{Confidence Level} = 491.1 \pm 17.1 = (474, 508.2)$. Hence, we are 95% confident that the population mean number of finished units per day for Assembly Line 1 lies between 474 and 509 units.
→ For Assembly Line 2, $\bar{x}_2 \pm \text{Confidence Level} = 499.4 \pm 25.2 = (474.2, 524.6)$. Hence, we are 95% confident that the population mean number of finished units per day for Assembly Line 2 lies between 475 and 525 units.

MIND EXPANDING EXERCISES
12. A lecturer wants to investigate the students' performance in a statistics course based on their carry marks and their scores in the final examination. The descriptive statistics and graph are given below. From the analyses, comment on the students' performance based on carry marks and final examination scores.
[Figure ME.12: descriptive statistics and graph]
  • 31. MIND EXPANDING EXERCISES
13. A study is conducted to compare the performance of male and female students in the final examination of the statistics course. The data, descriptive statistics and graph of the final examination scores are presented as follows. Based on the analysis, answer the following questions:
Female: 72 62 83 65 60 74 66 68 57 63 61 76 60 78 34 70 59 63 86 43 90 87
Male: 58 81 86 68 70 77 54 54 72 41 33 52 70 37 67 39 74 32 8 33 27 23 54
a) State the mean and standard deviation for both groups and give your comment.
b) Based on the graph shown, give your comment.
[Figure ME.13: descriptive statistics and graph]

MIND EXPANDING EXERCISES
14. People with diabetes must monitor and control their blood glucose level. The goal is to maintain fasting plasma glucose between 90 and 130 mg/dl. The data presented below give the fasting plasma glucose for two groups, before treatment and after treatment. Answer the following questions:
a) How many data values are in each group?
b) Give the first five data values in the 'before' group and the last five data values in the 'after' group.
c) Identify the median and mode in each group.
d) Describe the shape of the distribution of data in each group.
e) Is there any outlier in the groups?
f) What are the advantages of using a stem and leaf plot?
g) Which data set is more dispersed (less consistent)?
h) Based on the descriptive analysis done in Excel, why do you think that the dispersion of the two groups measured by the variance is different from the dispersion measured by the IQR?
ME.14 (back-to-back stem and leaf plot: Before | Stem | After; Key: 14|1 = 141)
8 7 8 6 5 9 3 10 2 11 12 8 8 4 13 7 5 8 1 14 3 8 15 8 9 16 3 4 0 2 2 17 18 8 19 5 8 0 20 21 22 7 6 3 1 0 23 24 5 25 26 1 27 28 3 29 30 31 32 33 34 9 35
  • 32. 1.4 EXPLORATORY DATA ANALYSIS
• Identify outliers.
• Draw and interpret a boxplot.

Exploratory Data Analysis
• The purpose of exploratory data analysis is to discover any gaps or patterns in the data.
• For symmetric data, the appropriate measure of central tendency is the mean, and the appropriate measure of variability is the standard deviation or variance.
• For skewed data, the appropriate measure of central tendency is the median, and the appropriate measure of variability is the interquartile range (IQR).
Traditional Method | Exploratory Data Analysis
Frequency distribution | Stem and leaf plot
Histogram | Boxplot
Mean | Median
Standard deviation | Interquartile range (IQR = Q3 - Q1)

RECALL: Selection of appropriate statistical techniques for data summarisation
Type of Data | Descriptive Statistics | Graphical Summary
Quantitative (ratio scale) | Mean, Median, Mode, Range, Standard Deviation, Interquartile range (IQR = Q3 - Q1) | Histogram, Bar Chart (bars representing means), Stem and leaf plot, Boxplot
Symmetrical Distribution | Mean, Median, Mode, Range, Standard Deviation | Histogram, Bar Chart (bars representing means)
Skewed Distribution | Median, Range, Interquartile range (IQR = Q3 - Q1) | Histogram, Stem and leaf plot, Boxplot
Categorical (Nominal) | Mode, Counts, Percentage | Pie Chart, Bar Chart
Categorical (Ordinal, Likert Scale) | Mode, Mean, Counts, Percentage | Pie Chart, Bar Chart

Histogram, Stem and Leaf OR Boxplot?
Histogram
Advantages:
‒ Can graph huge data sets easily.
‒ The shape of the distribution can be described easily.
‒ The intervals of the histogram can be changed to see which gives a better description of the data.
‒ Great for comparing data.
‒ Can show trends in the data clearly.
Disadvantages:
‒ Not good for small data sets.
‒ It is difficult to simplify all the data into one scale.
Stem and Leaf
Advantages:
‒ Very easy to construct.
‒ Shows the real values of the data.
‒ Can show the range, minimum and maximum, gaps and clusters, and outliers easily.
‒ The mode may be observed.
‒ Can identify the shape of the distribution.
Disadvantages:
‒ Not good for very small or very large data sets.
‒ Not visually appealing.
‒ Does not easily indicate measures of centrality for large data sets.
Boxplot
Advantages:
‒ Good for small or large data sets.
‒ Displays the range and distribution of the data along a number line.
‒ Can show outliers.
Disadvantages:
‒ The original data are not clearly shown in the boxplot.
‒ The mean and mode cannot be identified in a boxplot.
  • 33. 1.4.1 Outliers
• An outlier is an extremely high or extremely low data value compared with the rest of the data values.
• Outliers can arise from:
‒ measurement or observational error,
‒ a writing or typing error,
‒ a data value obtained from a subject that is not in the defined population, or
‒ a legitimate data value that occurred by chance.
• When a distribution is symmetric or normal, data values that are more than three standard deviations from the mean can be considered suspected outliers (refer to Figure 1.11).
• An outlier can strongly affect the mean and standard deviation of a variable.

Recall: Other Properties of Standard Deviation
• Used to determine the number of data values that fall within a specified interval in a distribution.
• The values under the curve indicate the percentage of area in each section or range of data.
• About 95% of the data values fall within $\mu - 2\sigma$ and $\mu + 2\sigma$.

Position of Outliers
A data value $x$ is an outlier if it is less than the lower boundary or exceeds the upper boundary of the data set, where
lower boundary $= Q_1 - 1.5(Q_3 - Q_1)$ and upper boundary $= Q_3 + 1.5(Q_3 - Q_1)$.

EXAMPLE 1.11
The number of credits in business courses for eight job applicants is shown here: 9, 12, 15, 27, 33, 45, 63, 72. Find the first and third quartiles for the data. Is there any outlier in the data?
$Q_1 = x_{1(8)/4} = x_2 \to \dfrac{x_2 + x_3}{2} = \dfrac{12 + 15}{2} = 13.5$
$Q_3 = x_{3(8)/4} = x_6 \to \dfrac{x_6 + x_7}{2} = \dfrac{45 + 63}{2} = 54$
lower boundary: $Q_1 - 1.5(Q_3 - Q_1) = 13.5 - 1.5(54 - 13.5) = -47.25$
upper boundary: $Q_3 + 1.5(Q_3 - Q_1) = 54 + 1.5(54 - 13.5) = 114.75$
→ Since $-47.25 < x < 114.75$ for every data value, there is no outlier.
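The boundary check in Example 1.11 can be automated. The following is a minimal sketch that assumes the same hand rule for quartiles as Section 1.3.3; the names `quartile` and `iqr_outliers` are hypothetical helpers introduced for illustration.

```python
def quartile(sorted_x, i):
    """i-th quartile by the rule: round c = i*n/4 up, or average x_c and x_(c+1)."""
    n = len(sorted_x)
    if (i * n) % 4 == 0:
        c = i * n // 4
        return (sorted_x[c - 1] + sorted_x[c]) / 2
    c = (i * n + 3) // 4  # ceiling of i*n/4
    return sorted_x[c - 1]

def iqr_outliers(data):
    """Return (lower boundary, upper boundary, list of outliers)."""
    x = sorted(data)
    q1, q3 = quartile(x, 1), quartile(x, 3)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return lower, upper, [v for v in x if v < lower or v > upper]

credits = [9, 12, 15, 27, 33, 45, 63, 72]
lower, upper, outliers = iqr_outliers(credits)
print(lower, upper, outliers)
```

Any value returned in the third element lies outside the interval between the two boundaries; an empty list means no outlier was detected by this rule.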
  • 34. EXERCISE 1.4.1
1. Given 19 2 1 4 3 7 5 4 6. Find the outliers, if any.
2. Given 19 6 2 11 4 3 7 7 5 8 6 21 12. Find the outliers, if any.
Answers: 1) $Q_1 = 3$, $Q_3 = 6$; 19 is an outlier 2) $Q_1 = 5$, $Q_3 = 11$; 21 is an outlier

1.4.2 Boxplots
A boxplot (box and whiskers plot) is a graphical representation of the five-number summary of a data set, together with any outliers.
Five-number summary + outliers:
• The lowest value of the data set (minimum)
• The lower quartile $Q_1$ (1st quartile or 25th percentile)
• The median (2nd quartile or 50th percentile)
• The upper quartile $Q_3$ (3rd quartile or 75th percentile)
• The highest value of the data set (maximum)
• Outliers
  • 35. Types of Boxplots
[Figures: a vertical boxplot and a horizontal boxplot]

EXAMPLE 1.12
The following back-to-back stem and leaf plot represents a sample of the ages of teachers in two schools [key: 3|4 → 34].
School A | Stem | School B
9 7 7 5 5 4 | 2 | 2
8 7 6 2 1 1 0 | 3 | 3 4 6 7
 | 4 | 0 1 3 4 5 7
7 | 5 | 1 3 4
Given that for School B, $Q_1 = 36$, $Q_2 = 42$, $Q_3 = 47$ and there is no outlier. Draw boxplots for both schools on the same x-axis. Then compare the shapes, averages, and variability of both age distributions.
Measure | School A | School B
Minimum | 24 | 22
1st quartile | $Q_1 = x_{1(14)/4} = x_{3.5} \to x_4 = 27$ | $Q_1 = 36$
2nd quartile / Median | $Q_2 = \dfrac{x_7 + x_8}{2} = 30.5$ | $Q_2 = 42$
3rd quartile | $Q_3 = x_{3(14)/4} = x_{10.5} \to x_{11} = 36$ | $Q_3 = 47$
Maximum | 38 | 54
Outliers | $Q_1 - 1.5(Q_3 - Q_1) = 27 - 1.5(36 - 27) = 13.5$; $Q_3 + 1.5(Q_3 - Q_1) = 36 + 1.5(36 - 27) = 49.5$. Since 57 > 49.5, 57 is an outlier. | no outlier

Information Obtained from a Boxplot
1. If the median is near the centre of the box, the distribution is approximately symmetric.
2. If the median falls to the left of the centre of the box, the distribution is positively skewed.
3. If the median falls to the right of the centre of the box, the distribution is negatively skewed.
• Suppose the median is near the centre of the box (approximately symmetric):
4. If the lines are about the same length, the distribution is approximately symmetric.
5. If the right line is longer than the left line, the distribution is positively skewed.
6. If the left line is longer than the right line, the distribution is negatively skewed.
• If the boxplots for two or more data sets are graphed on the same axis, the distributions can be compared using their central tendency (average) and variability.
• To compare the averages, use the location of the medians.
• To compare the variability, use the length of the IQR.
  • 36. EXAMPLE 1.12 solution
Shape: Based on the location of the median, School A has a right-skewed distribution, where most of the teachers' ages are concentrated at the lower end (< 30 years old). School B, however, has a left-skewed distribution, where most of the teachers are older than 42 years.
Average: Based on the median, 50% of the teachers at School A are younger than 30.5 years, whereas 50% of the teachers at School B are younger than 42 years. On average, the teachers at School B are older than the teachers at School A.
Variability: Based on the IQR, for School A, $IQR_A = 9$ years, where the middle 50% of the teachers are aged between 27 and 36 years. Meanwhile, for School B, $IQR_B = 11$ years, where the middle 50% of the teachers are aged between 36 and 47 years. Hence, the variation of the teachers' ages at School B is higher than at School A ($IQR_A < IQR_B$).
Range: Excluding the outlier, the teachers' ages at School A vary less, from a minimum of 24 years to a maximum of 38 years, compared with School B, where the ages range from a minimum of 22 years to a maximum of 54 years.

Boxplot for Special Case
• In some cases, the general guidelines given above cannot be used to interpret the boxplot.
• A boxplot is not the best graphical representation of a data set if the sample size is too small.
• The existence of outliers may also affect the boxplot.
• Therefore, in such cases, we have to use the descriptive statistics to identify the distribution of the data set.
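The numbers behind a boxplot, as worked by hand in Example 1.12, can also be computed directly, which is useful when the plot alone is hard to read. The sketch below assumes the hand rule for quartiles from Section 1.3.3 and reports the minimum and maximum after excluding outliers; `quartile` and `boxplot_summary` are hypothetical helper names.

```python
def quartile(sorted_x, i):
    """i-th quartile by the rule: round c = i*n/4 up, or average x_c and x_(c+1)."""
    n = len(sorted_x)
    if (i * n) % 4 == 0:
        c = i * n // 4
        return (sorted_x[c - 1] + sorted_x[c]) / 2
    c = (i * n + 3) // 4  # ceiling of i*n/4
    return sorted_x[c - 1]

def boxplot_summary(data):
    """Five-number summary (min/max exclude outliers) plus the outliers."""
    x = sorted(data)
    q1, q2, q3 = (quartile(x, i) for i in (1, 2, 3))
    lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    inside = [v for v in x if lower <= v <= upper]
    return {"min": inside[0], "q1": q1, "median": q2, "q3": q3,
            "max": inside[-1], "iqr": q3 - q1,
            "outliers": [v for v in x if v < lower or v > upper]}

school_a = [24, 25, 25, 27, 27, 29, 30, 31, 31, 32, 36, 37, 38, 57]
school_b = [22, 33, 34, 36, 37, 40, 41, 43, 44, 45, 47, 51, 53, 54]
summary_a = boxplot_summary(school_a)
summary_b = boxplot_summary(school_b)
print(summary_a)
print(summary_b)
```

Comparing `median` and `iqr` across the two dictionaries corresponds to comparing the location of the medians and the lengths of the boxes on the plot.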
  • 37. EXERCISE 1.4.2 (Q1)
1. Plot a boxplot for the following data. Then describe the data.
a) 3.2, 5.9, 4.3, 6.9, 4.5, 8.0, 4.7, 8.9, 5.7, 11.9
b) 5.8, 9.7, 6.7, 13.4, 6.8, 14.7, 7.2, 16.4, 8.2, 28.1
1.4.2 (Q1) solution
a) Min = 3.2, $Q_1 = 4.5$, $Q_2 = 5.8$, $Q_3 = 8$, no outlier, Max = 11.9, right-skewed
b) Min = 5.8, $Q_1 = 6.8$, $Q_2 = 8.95$, $Q_3 = 14.7$, 28.1 is an outlier, Max = 16.4, right-skewed

EXERCISE 1.4.2 (Q2)
2. Two samples of ten springs made from steel rods supplied by two different companies were compared. The flexibility (in N/m) of each spring was recorded as follows. Compare the distributions using boxplots.
Company A: 4.2 6.7 7.3 7.5 8.0 8.5 8.7 8.8 9.2 9.3
Company B: 9.6 9.7 9.8 9.9 10.1 10.2 11.0 11.0 11.0 11.1
Comment on the flexibility of the springs supplied by the two companies.
1.4.2 (Q2) solution
Company A: Min = 6.7, $Q_1 = 7.3$, $Q_2 = 8.25$, $Q_3 = 8.8$, 4.2 is an outlier, Max = 9.3, left-skewed
Company B: Min = 9.6, $Q_1 = 9.8$, $Q_2 = 10.15$, $Q_3 = 11.0$, no outlier, Max = 11.1, right-skewed
  • 38. EXERCISE 1.4.2 (Q3)
3. The following table presents the viscosity (in Pascal) of a chemical substance from three (3) batches of a chemical process.
Batch | Viscosity
Batch A | 13.3 14.1 14.3 14.5 14.5 14.6 14.8 15.2 15.3 15.3
Batch B | 13.3 13.7 14.1 14.5 14.9 15.2 15.3 15.4 15.6 15.8
Batch C | 13.4 13.7 14.1 14.3 14.3 14.8 15.1 15.8 16.4 16.9
a) Complete the table below by showing all the necessary calculations.
Measure of position | Batch A | Batch B | Batch C
1st quartile | 14.30 | 14.10 | ____
Median | 14.55 | ____ | 14.55
3rd quartile | ____ | 15.40 | 15.80
Outlier | No | ____ | No
b) Draw three boxplots on the same x-axis by using the information in (a).
c) Compare the boxplots in terms of shape and variability.
1.4.2 (Q3) solution
Batch A: $Q_3 = 15.2$, right-skewed; Batch B: $Q_2 = 15.05$, no outlier, left-skewed; Batch C: $Q_1 = 14.1$, right-skewed
[Figure: boxplots of Batch A, Batch B and Batch C on a common axis from 12 to 17]

MIND EXPANDING EXERCISES
15. An experiment was conducted to assess the potency of various constituents of orchard sprays in repelling honeybees. Individual cells of dry comb were filled with measured amounts of lime sulphur emulsion in sucrose solution. Seven different concentrations of lime sulphur, ranging from 1/100 to 1/1,562,500 in successive factors of 1/5, were used, as well as a solution containing no lime sulphur (A, B, C, D, E, F, G, H). The responses for the different solutions were obtained by releasing 100 bees into the chamber for two hours and then measuring the decrease in volume of the solutions in the various cells. Based on the figure below, answer the following questions:
[Figure ME.15]
a) Which concentration has outlier(s)?
b) Group the concentrations according to the shape of their distributions.
c) Which concentration has the most consistent data? Why?
d) Which concentration has the most variable data? Why?
e) H is the concentration of 'no lime sulphur'. What is the use of concentration H?
f) What conclusion can you draw from this experiment?
  • 39. 1.5 NORMAL PROBABILITY PLOT
• Draw and interpret a normal probability plot.

Normal Probability Plots
• The easiest way to check whether the sample distribution is normal or not.
• The most plausible normal distribution is the one whose mean and standard deviation are the same as the sample mean and standard deviation.
STEP 1: Sort the data in ascending order and denote the sorted data as $x_i$, $i = 1, \ldots, n$.
STEP 2: Number the sorted data from 1 to $n$.
STEP 3: Calculate the probability value for each $x_i$ using $p_i = \dfrac{i - 0.5}{n}$.
STEP 4: Plot $p_i$ versus $x_i$. If the sample points lie approximately on a straight line, the data are approximately normally distributed.

Testing Normality using Software
Instead of plotting manually, the plot can be obtained from software such as SPSS, Minitab, or Excel. The normality of the data can also be tested using the Kolmogorov-Smirnov, Anderson-Darling or Shapiro-Wilk tests.

EXAMPLE 1.13
[Figure: normal probability plot, $p_i$ versus $x_i$]
→ The graph of $p_i$ versus $x_i$ in the figure above is known as the normal probability plot. Since the data lie approximately on a straight line, the data are approximately normally distributed.
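Steps 1 to 3 above can be sketched in Python as follows, using the six-value sample from Exercise 1.5 as input. The function name `normal_plot_points` is hypothetical. The extra $z_i$ column is an addition not described in the steps above: it converts each $p_i$ to a standard normal quantile with `statistics.NormalDist().inv_cdf`, a common variant in which $x_i$ is plotted against $z_i$ on ordinary linear axes.

```python
from statistics import NormalDist

def normal_plot_points(data):
    """Return (x_i, p_i, z_i) for each sorted value, with p_i = (i - 0.5) / n."""
    x = sorted(data)                      # STEP 1: sort ascending
    n = len(x)
    points = []
    for i, value in enumerate(x, start=1):  # STEP 2: number from 1 to n
        p = (i - 0.5) / n                   # STEP 3: probability value
        z = NormalDist().inv_cdf(p)         # assumption: z-score variant of the plot
        points.append((value, p, z))
    return points

sample = [3.01, 3.35, 4.79, 5.96, 7.89, 9.15]
points = normal_plot_points(sample)
for value, p, z in points:
    print(f"x = {value:5.2f}  p = {p:.4f}  z = {z:+.4f}")
```

Step 4 is then a scatter plot of the second (or third) column against the first, for example in Excel or any plotting library; judging whether the points are "approximately on a straight line" remains a visual decision.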
  • 40. EXERCISE 1.5
1. A sample of size six is drawn. The sample, arranged in increasing order, is
3.01 3.35 4.79 5.96 7.89 9.15
Do these data appear to come from an approximately normal distribution?
2. The data shown represent the number of movies in America for a 14-year period.
2084 1497 1014 910 899 870 859 848 837 826 815 750 737 637
Do these data appear to come from an approximately normal distribution?
Answers: 1) yes 2) no
1.5 (Q1) solution
[Figure: normal probability plot, $p_i$ (0 to 1) versus $x_i$ (0 to 10)]
1.5 (Q2) solution
[Figure: normal probability plot, $p_i$ (0 to 1.2) versus $x_i$ (0 to 2500)]

CONCLUSION
• The applications of statistics are many and varied. People encounter them in everyday life, such as in reading newspapers or magazines, listening to the radio, or watching television.
• By combining all of the descriptive statistics techniques discussed in this chapter, the student is now able to collect, organize, summarize and present data.
Thank You
NEXT: Chapter 2 Sampling Distribution and Confidence Interval