PART 1
INTRODUCTION
TO DATA ANALYSIS
1.1 BASIC DEFINITIONS OF STATISTICS
Statistics is the science of data. In practice,
it is both the science and the art of collecting,
organizing, summarizing, and analyzing data,
and of drawing conclusions based on them.
CHAPTER 1: INTRODUCTION TO DATA ANALYSIS 2
Types of Statistics
• Descriptive Statistics consists of methods for organizing, displaying, and describing data by using tables, graphs, and summary measures.
• Inferential Statistics consists of methods that use sample results to help make decisions or predictions about a population.
Basic Terms
• Population or Target Population: A population consists of all elements (individuals, items, or objects) whose characteristics are being studied. The population that is being studied is also called the target population.
• Sample: A portion of the population selected for study.
Examples of Population and Sample:
1. Weight of fish caught by all participants in a fishing competition.
2. Credit card debts of 100 families selected from a city.
3. Amount spent on prescription medicine by 200 senior citizens in a large city.
4. A researcher wants to know about the monthly expenses of government servants in Malaysia.
5. A researcher wants to know about the monthly expenses of government servants in Malaysia.
Answers (in order): POPULATION, SAMPLE, SAMPLE, POPULATION, SAMPLE
• Parameter: A numerical value that represents a certain population characteristic.
  Example: The average height of all students in a university.
• Statistic: A numerical value that represents a certain sample characteristic.
  Example: The average height for a sample of female students in a university.
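The difference between a parameter and a statistic can be illustrated with a minimal Python sketch. The population below is hypothetical (simulated heights in cm, not data from these slides); the point is only that the parameter is computed from every element, while the statistic is computed from a sample and usually differs slightly from it.

```python
import random
import statistics

# Hypothetical population: simulated heights (cm) of every student in a university.
random.seed(0)
population = [round(random.gauss(165, 8), 1) for _ in range(2000)]

# Parameter: a numerical value computed from the whole population.
population_mean = statistics.mean(population)

# Statistic: the same measure computed from a sample of that population.
sample = random.sample(population, 50)
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

Drawing a different sample changes the statistic, but the parameter stays fixed for a given population.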
Element or Member: An element or member of a sample or population is a specific subject or object (for example, a person, firm, item, state, or country) about which information is collected.
Variable: A variable is a characteristic under study that assumes different values for different elements.
Observation or Measurement: The value of a variable for an element is called an observation or measurement.
Data Set: A data set is a collection of observations on one or more variables.
Types of Variables
• A variable that can be measured numerically is called a quantitative variable.
• The data collected on a quantitative variable are called quantitative data.
• A variable whose values are countable is called a discrete variable.
• A variable that can assume any numerical value over a certain interval or intervals is called a continuous variable.
• A variable that cannot assume a numerical value but can be classified into two or more non-numeric categories is called a qualitative or categorical variable.
• The data collected on such a variable are called qualitative data.
Levels of Measurement
NOMINAL
• Classifies data into mutually exclusive (nonoverlapping), exhaustive categories in which no order or ranking can be imposed on the data.
• Example: Gender (male, female)
ORDINAL
• Classifies data into categories that can be ranked; however, precise differences between the ranks do not exist.
• Example: Rating scale (poor, good, excellent)
INTERVAL
• Ranks data.
• Precise differences between units of measure do exist; however, there is no meaningful zero.
• Example: IQ, Temperature
RATIO
• Possesses all the characteristics of interval measurement, and there exists a true zero.
• Example: Height, Weight
PRACTICE
Determine whether each of the following is qualitative data (nominal/ordinal) or quantitative data (continuous/discrete):
1. Colours of cars
2. Children's weight
3. Favourite sport
4. Distance to school
5. Types of tree
6. Temperature
7. Student grade
8. Shoe size
9. Number of people
10. Time
11. Number of books
12. Performance of staff
13. Gender
14. Satisfaction with bank service
15. Religion
16. Colours of iris
1.2 DATA DESCRIPTION
Frequency Distribution
• Data collected in original form is called raw data.
• A frequency distribution is the organization of raw data in table form using classes and frequencies.
• There are two types of frequency distributions:
  - Ungrouped frequency distributions
  - Grouped frequency distributions
Ungrouped Frequency Distribution
Suppose we have the following test scores of 15 students:
56, 60, 62, 62, 58, 60, 62, 62, 58, 56, 58, 60, 58, 62, 64
Construct an ungrouped frequency distribution.
Test score   Frequency, f   Tally   Relative frequency   Percentage
56           2              ||      2/15 = 0.1333        13.33%
58           4              ||||    4/15 = 0.2667        26.67%
60           3              |||     3/15 = 0.2000        20.00%
62           5              |||||   5/15 = 0.3333        33.33%
64           1              |       1/15 = 0.0667        6.67%
Total        Σf = 15
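The same table can be reproduced programmatically. The following is a minimal Python sketch using the 15 test scores above; the variable names (`scores`, `table`) are illustrative choices, not part of the original slides.

```python
from collections import Counter

scores = [56, 60, 62, 62, 58, 60, 62, 62, 58, 56, 58, 60, 58, 62, 64]

counts = Counter(scores)  # frequency of each distinct score
n = len(scores)           # total number of observations

# One row per distinct score: (score, frequency, relative frequency, percentage).
table = [(score, f, f / n, 100 * f / n) for score, f in sorted(counts.items())]

print(f"{'Score':>5} {'f':>3} {'Rel. freq.':>10} {'%':>7}")
for score, f, rel, pct in table:
    print(f"{score:>5} {f:>3} {rel:>10.4f} {pct:>7.2f}")
print(f"Total f = {sum(f for _, f, _, _ in table)}")
```

The relative frequencies sum to 1 and the percentages to 100, which is a quick check that no observation was missed.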
Grouped Frequency Distribution
Construct a grouped frequency distribution using 7 classes.
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114
Solution:
Step 1: i. Determine the classes. The number of classes in a frequency distribution table usually varies from 5 to 20.
ii. Find the highest and lowest values: Highest = 134, Lowest = 100.
iii. Find the range: R = Highest – Lowest = 134 – 100 = 34.
iv. Find the width: R / number of classes = 34/7 ≈ 4.9, rounded up to 5.
v. Select a starting point (usually the lowest value or any convenient number less than the lowest value); add the width to get the lower limits.
vi. Find the upper class limits.
vii. Find the class boundaries.
Step 2: Tally the data.
Step 3: Find the frequencies and complete the frequency distribution.
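Step 1 can be sketched in Python as follows. This is only an illustration under the assumptions stated in the comments: whole-number data, the lowest value as the starting point, and a width obtained by rounding up. The variable names are hypothetical.

```python
import math

data = [
    112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
    110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
    107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
    116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
    120, 113, 120, 117, 105, 110, 118, 112, 114, 114,
]
num_classes = 7

lowest, highest = min(data), max(data)
data_range = highest - lowest                # 134 - 100 = 34
width = math.ceil(data_range / num_classes)  # 34/7 is about 4.9, rounded up to 5

# Assumption: start at the lowest value and add the width to get each lower limit.
lower_limits = [lowest + i * width for i in range(num_classes)]
# Assumption: whole-number data, so each upper limit is one unit below the next lower limit.
upper_limits = [lo + width - 1 for lo in lower_limits]
# Boundaries sit half a unit outside the limits, closing the gaps between classes.
boundaries = [(lo - 0.5, hi + 0.5) for lo, hi in zip(lower_limits, upper_limits)]

for lo, hi, (lb, ub) in zip(lower_limits, upper_limits, boundaries):
    print(f"{lo} - {hi}   boundaries {lb} - {ub}")
```

If the range divided exactly by the number of classes, the last class would stop just short of the highest value, so a slightly larger width or a lower starting point would be needed in that case.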
Class limits   Frequency, f   Relative frequency   Tally                     Cumulative frequency, F   Class boundaries
100 – 104      2              0.04                 ||                        2                         99.5 – 104.5
105 – 109      8              0.16                 ||||| |||                 10                        104.5 – 109.5
110 – 114      18             0.36                 ||||| ||||| ||||| |||     28                        109.5 – 114.5
115 – 119      13             0.26                 ||||| ||||| |||           41                        114.5 – 119.5
120 – 124      7              0.14                 ||||| ||                  48                        119.5 – 124.5
125 – 129      1              0.02                 |                         49                        124.5 – 129.5
130 – 134      1              0.02                 |                         50                        129.5 – 134.5
Total          Σf = 50
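Steps 2 and 3 (tallying and completing the table) can be checked with a short Python sketch. It assumes the seven class limits from the table above; `class_limits` and the other names are illustrative.

```python
from itertools import accumulate

data = [
    112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
    110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
    107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
    116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
    120, 113, 120, 117, 105, 110, 118, 112, 114, 114,
]
class_limits = [(100, 104), (105, 109), (110, 114), (115, 119),
                (120, 124), (125, 129), (130, 134)]

n = len(data)
# Frequency: how many observations fall within each pair of class limits.
frequencies = [sum(lo <= x <= hi for x in data) for lo, hi in class_limits]
# Relative frequency: each frequency divided by the total number of observations.
relative = [f / n for f in frequencies]
# Cumulative frequency: running total of the frequencies.
cumulative = list(accumulate(frequencies))

for (lo, hi), f, rel, cum in zip(class_limits, frequencies, relative, cumulative):
    print(f"{lo} - {hi}   f={f:<3} rel={rel:.2f}   F={cum}")
```

The last cumulative frequency should equal the total number of observations, which confirms that every value landed in exactly one class.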
Measures of Central Tendency
• A measure of central tendency is a descriptive statistic that describes the average, or typical, value of a set of scores.
• The mean, also called the arithmetic mean, is the most frequently used measure of central tendency.
• The median is the value of the middle term in a data set that has been ranked in increasing order.
• The mode is the value that occurs with the highest frequency in a data set.
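Mean
  Ungrouped data: $\bar{x} = \frac{\sum x}{n}$
  Grouped data: $\bar{x} = \frac{\sum mf}{\sum f}$, where $m$ is the class midpoint and $f$ is the class frequency
Median (ungrouped data)
i. Arrange the data in ascending or descending order.
ii. If the number of observations, $n$, is odd, the median is the observation at position $\frac{n+1}{2}$.
    If the number of observations, $n$, is even, the median is the average of the observations at positions $\frac{n}{2}$ and $\frac{n}{2} + 1$.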
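The median rule for ungrouped data can be written as a small Python function. This is a minimal sketch of the odd/even position rule; the function name `median_ungrouped` and the four-value list are illustrative, while the 15 test scores are the ones used earlier in this chapter.

```python
def median_ungrouped(values):
    """Median of ungrouped data using the odd/even position rule."""
    ordered = sorted(values)  # step i: arrange the data in ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the observation at (1-based) position (n + 1) / 2.
        return ordered[mid]
    # Even n: average of the observations at positions n/2 and n/2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2


test_scores = [56, 60, 62, 62, 58, 60, 62, 62, 58, 56, 58, 60, 58, 62, 64]
print(median_ungrouped(test_scores))   # n = 15 (odd): the 8th ordered value
print(median_ungrouped([3, 1, 4, 2]))  # n = 4 (even): average of the 2nd and 3rd values
```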
Mode (ungrouped data): The value that occurs most frequently in a set of data.
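The grouped-data formulas for the median and mode did not survive in this copy of the slides. For reference, one common textbook form is given below; the notation is an assumption and may differ from the symbols used in the original deck.

```latex
% Assumed notation (not taken from the original slides):
%   L_m    lower boundary of the median class
%   n      total frequency, \sum f
%   F      cumulative frequency of the class before the median class
%   f_m    frequency of the median class
%   c      class width
\[
  \text{Median} = L_m + \left( \frac{\frac{n}{2} - F}{f_m} \right) c
\]
%   L_{mo} lower boundary of the modal class
%   d_1    modal-class frequency minus the frequency of the class before it
%   d_2    modal-class frequency minus the frequency of the class after it
\[
  \text{Mode} = L_{mo} + \left( \frac{d_1}{d_1 + d_2} \right) c
\]
```

Both formulas interpolate within a class, so they give estimates rather than exact values of the median and mode of the underlying raw data.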
Example of Measures of Central Tendency (Ungrouped Data)
The data represent the number of days off per year for a sample of individuals selected from nine different countries. Find the mean, median, and mode.
20, 26, 40, 36, 23, 42, 35, 24, 30
Solution:
i. Mean: $\bar{x} = \frac{\sum x}{n} = \frac{276}{9} \approx 30.67$
ii. Median:
Step 1: Arrange the data in order: 20, 23, 24, 26, 30, 35, 36, 40, 42
Step 2: Median position = (n + 1)/2 = (9 + 1)/2 = 5th
Median = 30
iii. Mode: No value occurs more than once, so this data set has no mode.
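The three results for this example can be checked with Python's standard library. The helper `modes` below is an illustrative function (not a library call) that returns an empty list when no value repeats, matching the "no mode" convention used in this example.

```python
from collections import Counter
from statistics import mean, median

days_off = [20, 26, 40, 36, 23, 42, 35, 24, 30]


def modes(values):
    """Return the most frequent value(s), or an empty list if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == top)


print(round(mean(days_off), 2))  # 30.67
print(median(days_off))          # 30
print(modes(days_off))           # []
```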
Example of Measures of Central Tendency (Grouped Data)
The frequency distribution represents the number of orders received each day during the past 50 days at the office of a mail-order company.

Class limits   Frequency, f   Midpoint, m   m × f       Cumulative frequency, F   Class boundaries
10 – 12        4              11            44          4                         9.5 – 12.5
13 – 15        12             14            168         16                        12.5 – 15.5
16 – 18        20             17            340         36                        15.5 – 18.5
19 – 21        14             20            280         50                        18.5 – 21.5
Total          Σf = 50                      Σmf = 832

i. Mean: $\bar{x} = \frac{\sum mf}{\sum f} = \frac{832}{50} = 16.64$
ii. Median: Median position = 50/2 = 25th, so the median class is 16 – 18.
iii. Mode: The modal class is 16 – 18.
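The grouped mean, and estimates of the median and mode within the class 16 – 18, can be sketched in Python. The mean follows directly from the midpoint and frequency columns. The median and mode lines assume the common interpolation formulas (lower boundary plus a fraction of the class width); the slides' own formulas were not recoverable, so treat those two results as one standard convention rather than the deck's stated answer.

```python
classes = [(10, 12), (13, 15), (16, 18), (19, 21)]  # class limits
freqs = [4, 12, 20, 14]                             # frequency, f
n = sum(freqs)                                      # 50
width = 3                                           # e.g. 9.5 to 12.5

# Mean: sum of (midpoint x frequency) divided by total frequency.
midpoints = [(lo + hi) / 2 for lo, hi in classes]
grouped_mean = sum(m * f for m, f in zip(midpoints, freqs)) / n

# Median class: first class whose cumulative frequency reaches n/2.
cum_before = 0
for median_idx, f in enumerate(freqs):
    if cum_before + f >= n / 2:
        break
    cum_before += f
# Assumed formula: L + ((n/2 - F) / f_m) * c, with L the lower class boundary.
lower_boundary = classes[median_idx][0] - 0.5
grouped_median = lower_boundary + (n / 2 - cum_before) / freqs[median_idx] * width

# Modal class: the class with the highest frequency.
modal_idx = freqs.index(max(freqs))
f_prev = freqs[modal_idx - 1] if modal_idx > 0 else 0
f_next = freqs[modal_idx + 1] if modal_idx < len(freqs) - 1 else 0
d1 = freqs[modal_idx] - f_prev
d2 = freqs[modal_idx] - f_next
# Assumed formula: L + (d1 / (d1 + d2)) * c.
grouped_mode = classes[modal_idx][0] - 0.5 + d1 / (d1 + d2) * width

print(f"mean   = {grouped_mean:.2f}")
print(f"median = {grouped_median:.2f}")
print(f"mode   = {grouped_mode:.2f}")
```

Both estimates fall inside the boundaries 15.5 – 18.5, consistent with the median class and modal class identified above.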
Relationships Among the Mean, Median, and Mode
Positively skewed or right-skewed
For a histogram and a frequency distribution curve skewed to the right, the value of the mean is the largest, that of the mode is the smallest, and the value of the median lies between these two. Notice that the mode always occurs at the peak point. The value of the mean is the largest in this case because it is sensitive to outliers that occur in the right tail. These outliers pull the mean to the right.
Symmetric
For a symmetric histogram and frequency distribution curve with one peak, the values of the mean, median, and mode are identical, and they lie at the center of the distribution.
Negatively skewed or left-skewed
If a histogram and a frequency distribution curve are skewed to the left, the value of the mean is the smallest and that of the mode is the largest, with the value of the median lying between these two. In this case, the outliers in the left tail pull the mean to the left.
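The ordering of the three measures can be seen on a small example. The data below are hypothetical values chosen to have a long right tail, and the left-skewed set is simply their mirror image; this illustrates the typical pattern described above and is not data from the slides.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed data: most values are small, with a few large outliers.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]
# Mirror image of the same data, which moves the long tail to the left.
left_skewed = [16 - x for x in right_skewed]

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(f"{name}: mode={mode(data)}, median={median(data)}, mean={mean(data)}")
```

In the right-skewed set the outliers 9 and 15 pull the mean above the median, while the mode stays at the peak; mirroring the data reverses the order.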
PRACTICE
1. The administration in a large city wanted to know the distribution of vehicles owned by households in that city.
A sample of 40 randomly selected households from this city produced the following data on the number of
vehicles owned.
5 1 1 2 0 1 1 2 1 1
1 3 3 0 2 5 1 2 3 4
2 1 2 2 1 2 2 1 1 1
4 2 1 1 2 1 1 4 1 3
Construct an ungrouped frequency distribution table for these data.
2. The following table gives the frequency distribution of ages for all 50 employees of a company.
a. Find the class boundaries and class midpoints.
b. Do all classes have the same width? If yes, what is that width?
c. Prepare the relative frequency and percentage distribution columns.
d. What percentage of the employees of this company are age 43 or younger?
3. The following data give the numbers of car thefts that occurred in a city during the past 12 days.
6 3 7 11 4 3 8 7 2 6 9 15
Find the mean, median, and mode.
4. The number of days that students were missing from school due to sickness in one year was recorded.
Find the mean, median, and mode.

Introduction_to_Data_Analysis_Part_1.pptx
