Unit - II FDS.pdf

Department of Computer Science Engineering
CS3352-Foundations of Data Science
Unit - II: Describing Data
Data:
● A collection of actual observations or scores in a survey or an experiment
Types of Data and Variable:
● Data can be descriptive (characteristics) or numerical (numbers). Let us take
a look at some of the most prevalent types of data.
● There are two types of data:
1. Qualitative Data
2. Quantitative Data

Types of Data

(A) Qualitative Data:
● There are no numbers in qualitative data, so it cannot be measured. It is also called Categorical
Data because the data can be sorted by category rather than by number.
● Qualitative data deals with characteristics and descriptions that are difficult to measure but
may be subjectively observed, e.g., smells, tastes, textures, attractiveness, color, etc. They may
include favorite foods, favorite holiday destinations, religions, pictures, symbols, colors, and
so on.
● These data are described by some characteristic, for example, gender, blood group, etc. This data
can provide answers to questions such as: “How did it occur?” or “Why did this occur?”
● In general, qualitative data can be divided into two types:
1. Nominal data
2. Ordinal data

1. Nominal Data:
● This type of data is used for naming variables and has no numerical value
● Nominal data is a collection of values (non-numeric) that do not have a natural order.
● For example, it is not possible to state that 'Green' is greater than 'Blue', so we cannot compare
one color to another, and so the color of a thing is a nominal data type
● Examples of Nominal Data:
○ Colors: (Brown, Red, etc.)
○ Taste: (Sour, Sweet, Salty, etc.)
○ Languages: (Hindi, English, Marathi, Gujarati, Tamil, Telugu, etc.)

2. Ordinal Data:
● Ordinal data is defined as qualitative data whose values are ordered.
● In this type of data, a natural ordering occurs while maintaining class values. In other words,
ordinal data is data that is sorted by its scale position. Ordinal numbers cannot be used for
arithmetic because they only display sequence.
● For example, we can easily sort the clothing brands' sizes according to their name tags in the
order of small < medium < large.
● Examples of Ordinal Data:
○ Economic status: (low, medium, high)
○ Letter grades: (A, B, C, D, E, etc.)
○ Rank in a competition: (First, Second, Third)

(B) Quantitative Data:
● Quantitative data is made up of numbers and of things that can be measured objectively, e.g.,
area, volume, height, width, length, weight, speed, humidity, temperature, prices, year, etc.
● Quantitative data is always represented by numbers that indicate either how much or how many.
● In general, quantitative data can be divided into two types:
1. Discrete data
2. Continuous data

1. Discrete Data:
● Discrete data is counted and can take only certain separate values.
● Discrete values are finite or countable and are usually non-negative integers.
● The number of pupils, the number of children, the shoe size, and so on are all examples of
discrete data.
● Examples of Discrete Data:
○ When we roll one die, we obtain 1, 2, 3, 4, 5, or 6 as discrete data.
○ The total number of students enrolled in a class is discrete data
○ The number of children in your household is discrete data

2. Continuous Data:
● Continuous data is measured, and its value can be anything within a range.
● Continuous data is a set of numbers that can have any decimal or fractional value. Height,
weight, length, time, temperature are all instances of continuous data.
● For example, the height of a person may be precisely 5.78 feet. We can measure someone's
height in meters, centimeters, millimeters, and so on, so height is continuous data.
● Examples of continuous data:
○ Newborn babies' body weight
○ A freezer temperature
○ Wind speed
● Continuous data can be further classified as measured on an interval scale or a ratio scale.

(i) Interval Scale
● An interval scale is a scale of measurement whose values do not have a natural (true) zero.
● An interval scale has order, and the difference between two values is meaningful. However, you
cannot form meaningful ratios of these values; for example, a room at 20 °C is not twice as hot as
a room at 10 °C.
● Temperature in Celsius or Fahrenheit, pH, and credit score are examples of interval variables.

(ii) Ratio Scale:
● A ratio scale is a scale of measurement whose values have a natural (true) zero.
● Something measured on a ratio scale has the same properties as something measured on an
interval scale, except that ratio scaling has an absolute zero point. In
other words, a ratio variable has all of the attributes of an interval variable, plus a clear
definition of 0.0: when the variable equals 0.0, none of the measured quantity is present.
● An example is temperature measured in kelvin. No value below 0 K is
possible; it is absolute zero.
● Another example is weight; 0 kg indicates a complete absence of weight.

Question : Indicate whether each of the following terms is qualitative (because it’s a word, letter, or
numerical code representing a class or category); or quantitative (because it’s a number representing an
amount or a count).
1) age
2) family size
3) academic major
4) IQ score
5) net worth (dollars)
6) third-place finish
7) gender
8) temperature

Answer :
1) age (quantitative) (Discrete if counted in whole units such as completed years; continuous/ratio if
measured as the exact amount of time that has passed since birth.)
2) family size (quantitative/Discrete)
3) academic major (qualitative/Nominal)
4) IQ score (quantitative/Continuous/Interval)
5) net worth (dollars) (quantitative/Continuous/Interval)
6) third-place finish (qualitative/Ordinal)
7) gender (qualitative/Nominal)
8) temperature (quantitative/Continuous) (temperature in Celsius or Fahrenheit is at an interval scale because
zero is not the lowest possible temperature. In the Kelvin scale, a ratio scale, zero represents a total lack of
thermal energy.)

Frequency Distributions for Quantitative Data:
● A frequency distribution is a collection of observations produced by sorting observations into
classes and showing their frequency (f ) of occurrence in each class.
● Frequency distribution is used to organize the collected data in table form.
● It summarizes the data and allows quick visual interpretation of the data.
● For Example: The following are the scores of 10 students in the G.K. quiz released by Mr. Chris
15, 17, 20, 15, 20, 17, 17, 14, 14, 20. Let's represent this data in frequency distribution and find
out the number of students who got the same marks.
● The frequency distribution makes the given information easy to understand: it shows at a glance
how many students obtained the same marks. Here, two students scored 14, two scored 15, three
scored 17, and three scored 20.
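The tally for the quiz example can be sketched in Python. This is only an illustration using the standard-library Counter class; the variable names are assumptions, not part of the original notes.

```python
from collections import Counter

# Quiz scores of the 10 students, taken from the example above.
scores = [15, 17, 20, 15, 20, 17, 17, 14, 14, 20]

# Count how many times each score occurs.
frequency = Counter(scores)

# Print the frequency table in ascending order of score.
print("Score  Frequency")
for score in sorted(frequency):
    print(f"{score:>5}  {frequency[score]:>9}")
```

The frequencies add up to 10, the number of students, which is a quick check that no observation was missed.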

Types of Frequency Distributions:
1) Grouped Frequency Distribution:
● To arrange a large number of observations, we use a grouped frequency
distribution table. We form class intervals and tally the frequency of the data
that fall in each class interval.
● For Example: Marks obtained by 20 students in the test are as follows. 5, 10, 20,
15, 5, 20, 20, 15, 15, 15, 10, 10, 10, 20, 15, 5, 18, 18, 18, 18. To arrange the data in a
grouped table, we have to make class intervals.
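A minimal sketch of the grouping step for these marks follows. The class intervals (1-5, 6-10, 11-15, 16-20) are an assumption chosen for illustration; the notes do not state which intervals were used.

```python
# Marks of the 20 students, taken from the example above.
marks = [5, 10, 20, 15, 5, 20, 20, 15, 15, 15,
         10, 10, 10, 20, 15, 5, 18, 18, 18, 18]

# Hypothetical class intervals with inclusive bounds, each of width 5.
intervals = [(1, 5), (6, 10), (11, 15), (16, 20)]

# Tally how many marks fall inside each class interval.
grouped = {
    f"{low}-{high}": sum(low <= mark <= high for mark in marks)
    for low, high in intervals
}

for label, count in grouped.items():
    print(f"{label:>6}  {count}")
```

Each mark falls in exactly one interval, so the class frequencies again sum to the number of observations (20).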

2) Ungrouped Frequency Distribution:
● In an ungrouped frequency distribution, we do not form class intervals; we record the
exact frequency of each individual value.
● For Example: Marks obtained by 20 students in the test are as follows. 5, 10, 20, 15, 5,
20, 20, 15, 15, 15, 10, 10, 10, 20, 15, 5, 18, 18, 18, 18. To arrange the data in an ungrouped
frequency distribution table, we write the frequency of each individual value.

3) Relative Frequency Distribution:
● Relative frequency distributions show the frequency of each class as a part or fraction of the
total frequency for the entire distribution.
● To convert a frequency distribution into a relative frequency distribution, divide the
frequency for each class by the total frequency for the entire distribution.
● For instance, to obtain the proportion of 0.06 for the class 130–139, divide the frequency of 3
for that class by the total frequency of 53 (3/53 ≈ 0.06).
● Repeat this process until a proportion has been calculated for each class.
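The conversion to relative frequencies can be sketched as follows. Because the table for the 130–139 example is not reproduced in these notes, the sketch reuses the marks of the 20 students from the earlier example; only the single 3/53 division is taken from the text above.

```python
from collections import Counter

# Marks of the 20 students from the earlier example (not the 53-score table).
marks = [5, 10, 20, 15, 5, 20, 20, 15, 15, 15,
         10, 10, 10, 20, 15, 5, 18, 18, 18, 18]

frequency = Counter(marks)
total = sum(frequency.values())

# Divide each frequency by the total frequency, rounded to two decimals.
relative = {
    value: round(count / total, 2)
    for value, count in sorted(frequency.items())
}
print(relative)

# The single division described in the text: frequency 3 out of a total of 53.
class_130_139 = round(3 / 53, 2)
print(class_130_139)
```

Apart from rounding error, the relative frequencies of a distribution always sum to 1.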

4) Cumulative Frequency Distributions:
● Cumulative frequency distributions show the total number of observations in each class and in all
lower-ranked classes.
● To convert a frequency distribution into a cumulative frequency distribution, add the frequency of each
class to the sum of the frequencies of all classes ranked below it. This gives the cumulative frequency
for that class. Begin with the lowest-ranked class in the frequency distribution and work upward,
finding the cumulative frequencies in ascending order.
● Cumulative percentages are often referred to as percentile ranks.
● The percentile rank of a score indicates the percentage of scores in the entire distribution
with similar or smaller values than that score.
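A short sketch of the cumulative step, again reusing the marks of the 20 students from the earlier example as illustrative data. It uses itertools.accumulate for the running total; the variable names are assumptions.

```python
from collections import Counter
from itertools import accumulate

# Marks of the 20 students from the earlier example.
marks = [5, 10, 20, 15, 5, 20, 20, 15, 15, 15,
         10, 10, 10, 20, 15, 5, 18, 18, 18, 18]

frequency = Counter(marks)
values = sorted(frequency)                 # lowest-ranked class first
counts = [frequency[v] for v in values]

# Running total: each class plus all classes ranked below it.
cumulative = list(accumulate(counts))
total = cumulative[-1]

# Cumulative percent, i.e. the approximate percentile rank of each value.
cumulative_percent = [100 * c / total for c in cumulative]

for v, f, cf, cp in zip(values, counts, cumulative, cumulative_percent):
    print(f"{v:>3}  f={f}  cum f={cf:>2}  cum %={cp:.0f}")
```

The last cumulative frequency equals the total number of observations, and the last cumulative percent is 100.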

Frequency Distributions for Qualitative Data:
Nominal Qualitative Data
● Frequency distributions for qualitative data are easy to construct. Simply determine the
frequency with which observations occupy each class.
● For example:
○ In this Facebook profile survey, the frequency distribution reveals that Yes
replies are approximately twice as prevalent as No replies.

Ordinal Qualitative Data
● When qualitative data have an ordinal level of measurement because observations can be
ordered from least to most, that order should be preserved in the frequency table.
● For example:
○ Here, military ranks are listed in descending order from general to lieutenant.
○ If measurement is ordinal because observations can be ordered from least to
most, cumulative frequencies (and cumulative percentages) can be used.

Question : Construct a frequency distribution for ungrouped data.
Students in a theater arts appreciation class rated a classic film on a 10-point scale, ranging from 1 (poor)
to 10 (excellent), as follows:
Answer :

Question : Construct a frequency distribution for grouped data. The IQ scores for a group of 35 high
school dropouts are as follows:
Answer : Calculating the class width (assuming 10 desired classes)

Question : GRE scores for a group of graduate school applicants are distributed as follows:
1) Convert to a relative frequency distribution. When calculating proportions,
round numbers to two digits to the right of the decimal point.
2) Convert to a cumulative frequency distribution.
3) Convert to a cumulative percent frequency distribution.
Answer 1) :

Answer 2) & 3) :

Question : Movie ratings reflect ordinal measurement because they can be ordered from most to least
restrictive: NC-17, R, PG-13, PG, and G. The ratings of some films shown recently in San Francisco are
as follows:
(a) Construct a frequency distribution.
(b) Convert to relative frequencies, expressed as percentages.
(c) Construct a cumulative frequency distribution.
(d) Find the approximate percentile rank for those films with a PG rating.
Answer :

Graphs for Quantitative Data:
● Histograms
○ Equal units along the horizontal axis (the X axis, or abscissa) reflect the various
class intervals of the frequency distribution.
○ Equal units along the vertical axis (the Y axis, or ordinate) reflect increases in
frequency. (The units along the vertical axis do not have to be the same width as
those along the horizontal axis.)
○ The body of the histogram consists of a series of bars whose heights reflect the
frequencies for the various classes.

● For example:

● Frequency Polygon
○ An important variation on a histogram is the frequency polygon, or line graph.
○ Frequency polygons may be constructed directly from frequency distributions.
● For example:

Question : The following frequency distribution shows the annual incomes in dollars for a group of college
graduates.
(a) Construct a histogram.
(b) Construct a frequency polygon.
Answer :

● Stem and Leaf Displays
○ Another technique for summarizing quantitative data is a stem and leaf display.
○ A stem and leaf display presents quantitative data in a graphical format,
similar to a histogram, to assist in visualizing the shape of a distribution.
● For example:

Question : Construct a stem and leaf display for the following IQ scores obtained from a group of
four-year-old children.
Answer :

Graphs for Qualitative Data:
● Bar graph
○ Generally used for qualitative data.
○ Gaps are placed between adjacent bars of bar graphs to emphasize the discontinuous
nature of qualitative data.
○ A bar graph also can be used with quantitative data to emphasize the discontinuous
nature of a discrete variable, such as the number of children in a family.
● For example:

Typical Distribution Curve Shapes:
● Whether expressed as a histogram, a frequency polygon, or a stem and leaf display, an important
characteristic of a frequency distribution is its shape.

● Normal: Any distribution that approximates the normal shape, a symmetric, bell-shaped curve with a single peak.
● Bimodal: Any distribution that approximates the bimodal shape, a curve with two distinct peaks.
● Positively Skewed Distribution: A distribution that includes a few extreme observations in the
positive direction (to the right of the majority of observations).
● Negatively Skewed Distribution: A distribution that includes a few extreme observations in the
negative direction (to the left of the majority of observations).

Describing Data with Averages:
● Mode:
○ The mode reflects the value of the most frequently occurring score.
○ For example: Four years is the modal term, since the greatest number of presidents, 7,
served this term.

Question : Determine the mode for the following retirement ages: 60, 63, 45, 63, 65,
70, 55, 63, 60, 65, 63.
mode = 63
Question : The owner of a new car conducts six gas mileage tests and obtains the
following results, expressed in miles per gallon: 26.3, 28.7, 27.4, 26.6, 27.4, 26.9.
Find the mode for these data.
mode = 27.4
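Both answers above can be checked with a short sketch. It uses statistics.multimode (available from Python 3.8), which returns every most-frequent value and so also handles bimodal data; the variable names are assumptions.

```python
from statistics import multimode

# Data from the two questions above.
retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
gas_mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

# multimode returns a list with every value that occurs most often.
age_modes = multimode(retirement_ages)
mileage_modes = multimode(gas_mileage)

print(age_modes)      # [63]
print(mileage_modes)  # [27.4]
```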

● Median:
● The median reflects the middle value when observations are ordered from least to
most.
○ The value of the median always reflects the value of middle-ranked scores, not
the position of these scores among the set of ordered scores.
● When you have an odd number of data points, the median is the value in the middle
of your data set.
● With an even number of data points, there are two values in the middle, so the
median is their mean.

→ Odd-numbered data set:
Step 1: Order your values from low to high.
Step 2: Locate the median
Middle Position = (n+1)/2 = (11+1)/2 = 6
So, Median = 6th element = 72

→ Even-numbered data set:
Step 1: Order your values from low to high.
Step 2: Locate the median.
Middle position = (n+1)/2 = (10+1)/2 = 5.5
So, Median = (5th element + 6th element)/2 = (72+76)/2 = 74

Question : Find the median for the following retirement ages: 60, 63, 45, 63, 65, 70,
55, 63, 60, 65, 63.
median = 63
Question : Find the median for the following gas mileage tests: 26.3, 28.7, 27.4,
26.6, 27.4, 26.9.
median = 27.15 (halfway between 26.9 and 27.4)
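The two-step procedure (order the values, then locate the middle position (n+1)/2) can be sketched as below. The function name median_by_position is hypothetical; the result is compared with the standard-library statistics.median as a cross-check.

```python
from statistics import median

def median_by_position(values):
    """Median found by ordering the values and locating the middle position."""
    ordered = sorted(values)                 # Step 1: order from low to high
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the middle position (n + 1) / 2 is a whole number (1-based).
        return ordered[(n + 1) // 2 - 1]
    # Even n: the middle position falls between the two central values.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
gas_mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

age_median = median_by_position(retirement_ages)
mileage_median = median_by_position(gas_mileage)

print(age_median)
print(round(mileage_median, 2))
print(age_median == median(retirement_ages))
```

For the retirement ages (n = 11) the middle position is 6; for the gas mileage tests (n = 6) the median is the mean of the 3rd and 4th ordered values.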

● Mean
○ The mean is found by adding all scores and then dividing by the number of scores.
○ Statisticians distinguish between two types of means—the population mean and the
sample mean—depending on whether the data are viewed as a population (a complete
set of scores) or as a sample (a subset of scores).
○ The mean reflects the values of all scores, not just those that are middle ranked (as with
the median), or those that occur most frequently (as with the mode).

“Sample mean (X-bar) equals the sum of the values of all scores in the sample (the sum of the
variable X) divided by the sample size n.” In symbols: $\bar{X} = \frac{\sum X}{n}$
“Population mean (μ) equals the sum of all scores in the population (sum of the variable X)
divided by the population size N.” In symbols: $\mu = \frac{\sum X}{N}$
Question : Find the mean for the following retirement ages: 60, 63, 45, 63, 65, 70,
55, 63, 60, 65, 63.
mean = 61.09
Question : Find the mean for the following gas mileage tests: 26.3, 28.7, 27.4, 26.6,
27.4, 26.9.
mean = 27.22
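As a quick check of the two answers above, the mean can be computed directly from its definition (sum of the scores divided by the number of scores). The variable names are assumptions.

```python
# Data from the two questions above.
retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
gas_mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

# Mean = sum of all scores divided by the number of scores.
mean_age = sum(retirement_ages) / len(retirement_ages)
mean_mileage = sum(gas_mileage) / len(gas_mileage)

print(round(mean_age, 2))      # 61.09
print(round(mean_mileage, 2))  # 27.22
```

The standard-library function statistics.mean gives the same values.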

Interpretation of the differences between Mean and Median
● Differences between the values of the mean and median signal the presence of a skewed
distribution.
● If the mean exceeds the median, as it does for the infant death rates, the underlying
distribution is positively skewed because of one or more scores with relatively large
values, such as the very high infant death rates for a number of countries, especially Sierra
Leone.
● On the other hand, if the median exceeds the mean, the underlying distribution is
negatively skewed because of one or more scores with relatively small values.
● In the given example, the median infant death rate of 7 describes the middle-ranked rate,
whereas the mean infant death rate of 30.00 describes the balance point for all rates.
*Rates per 1000 live births.

Averages with Qualitative Data:
● The mode can always be used with all qualitative data.
● If qualitative data can be ordered from least to most because the level of measurement is
ordinal, the median also can be used.
○ It’s easiest to determine the median class for ordered qualitative data by using relative
frequencies. Cumulate the relative frequencies, working up from the bottom of the
distribution, until the cumulative percentage first equals or exceeds 50 percent.
● In this example, since it includes a cumulative percent of 50, captain is the median rank of
officers in the U.S. Army.

Question : College students were surveyed about where they would most like to spend their spring break:
Daytona Beach (DB), Cancun, Mexico (C), South Padre Island (SP), Lake Havasu (LH), or other (O). The
results were as follows:
● Find the mode and, if possible, the median.
Answer :
● mode = DB (Daytona Beach)
● Impossible to find the median when qualitative data are unordered, with only nominal
measurement.

Measures of variability:
● Measures of variability describe the amount by which scores are dispersed or scattered in a distribution.
● They indicate how far away the data points tend to fall from the center.
● Low variability is ideal because it means that you can better predict information about the
population based on sample data.
● High variability means that the values are less consistent, so it is harder to make predictions.
● There are several measures of variability, including
○ the range,
○ the interquartile range,
○ the variance, and,
○ most important, the standard deviation

● For distribution A with the least (zero) variability, all seven scores have the same value (10).
● For distribution B with intermediate variability, the values of scores vary slightly (one 9 and one 11), and
● For distribution C with most variability, they vary even more (one 7, two 9s, two 11s, and one 13).

Range:
● range is the difference between the largest and smallest scores.
● For Example: Suppose we have 8 data points from Sample A.
Data (minutes) 72 110 134 190 238 287 305 324
The highest value (H) is 324 and the lowest (L) is 72.
R = H – L
R = 324 – 72 = 252
The range of your data is 252 minutes.
● Because only two numbers are used to find it, the range is strongly influenced by outliers
and gives no information about the distribution of the values in between. It is best used in
combination with other measures.
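The range calculation for Sample A can be sketched in two lines of Python; the variable names are assumptions.

```python
# Sample A data (minutes), taken from the example above.
sample_a = [72, 110, 134, 190, 238, 287, 305, 324]

# R = H - L: the highest value minus the lowest value.
highest, lowest = max(sample_a), min(sample_a)
data_range = highest - lowest

print(f"R = {highest} - {lowest} = {data_range} minutes")  # R = 324 - 72 = 252 minutes
```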

● Distribution A, the least variable, has the smallest range of 0 (from 10 to 10);
● distribution B, with intermediate variability, has an intermediate range of 2 (from 9 to 11);
● distribution C, the most variable, has the largest range of 6 (from 7 to 13).

Interquartile Range (IQR):
● The interquartile range gives the spread of the middle of the distribution.
● The interquartile range is the difference between the third quartile (Q3) and the first quartile (Q1): IQR = Q3 – Q1.
● Interquartile range (IQR), is simply the range for the middle 50 percent of the scores.
● The interquartile range is an especially useful measure of variability for skewed distributions.
● The IQR is also useful for datasets with outliers. Because it’s based on the middle half of the
distribution, it’s less influenced by extreme values.
[Figure: interquartile range in a boxplot]

Step 3: Find Q1 and Q3.
Q1 is the median of the first half, so here Q1 = 57, and
Q3 is the median of the second half, so here Q3 = 81.
Step 4: Calculate the interquartile range.
IQR = Q3 – Q1 = 81 – 57 = 24
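The median-of-halves method can be sketched as follows. The data set is hypothetical: the original scores are not reproduced in these notes, so the values below were chosen only so that Q1 = 57 and Q3 = 81, as in the steps above. The function name quartiles_by_halves is also an assumption, and for an odd number of values the sketch assumes the overall median is left out of both halves (other conventions exist).

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: with an odd n, the overall median belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(upper)

# Hypothetical scores, NOT the original data from the worked example.
scores = [50, 54, 57, 62, 68, 70, 76, 81, 85, 90]

q1, q3 = quartiles_by_halves(scores)
iqr = q3 - q1
print(q1, q3, iqr)  # 57 81 24
```

Library routines such as statistics.quantiles interpolate between values and may give slightly different quartiles than this median-of-halves convention.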

Outliers:
● Outliers are one or more very extreme scores in a dataset.
● An outlier is a data point that lies abnormally far away from other values in a dataset.
● For Example:
○ Someone like Elon Musk, who has a net worth in the billions of dollars, would be
considered an outlier in terms of net worth.
○ Any freedivers who can hold their breath for 10 minutes or longer would be
considered outliers because they can hold their breath much longer than 165
seconds.
Formula to find outliers:
[Q1 – 1.5 * IQR, Q3 + 1.5 * IQR]
A value that does not fall within the above range is considered an outlier.
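A minimal sketch of the outlier rule follows. The breath-hold times are hypothetical values invented for illustration, and the quartiles are computed with the median-of-halves convention on an even-sized data set; other quartile conventions can shift the limits slightly.

```python
from statistics import median

# Hypothetical breath-hold times in seconds (illustrative only).
times = [30, 45, 50, 55, 60, 65, 70, 75, 80, 600]

ordered = sorted(times)
half = len(ordered) // 2          # even number of values: two equal halves
q1 = median(ordered[:half])       # median of the lower half
q3 = median(ordered[half:])       # median of the upper half
iqr = q3 - q1

# Limits of the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Any value outside the limits is flagged as an outlier.
outliers = [t for t in ordered if t < lower_limit or t > upper_limit]

print(lower_limit, upper_limit)   # 12.5 112.5
print(outliers)                   # [600]
```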

Variance:
● The variance is the average of the squared deviations from the mean. A deviation
from the mean is how far a score lies from the mean, so the variance measures how far
the numbers in the dataset are spread around the mean.
● The variance is the square of the standard deviation.

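● Formulas:
Population variance: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
Sample variance: $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$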

Standard Deviation:
● The standard deviation is the square root of the variance.
● A low standard deviation indicates that the data points lie close to the mean.

Example: You grow 20 crystals from a solution and measure the length of each crystal in millimeters. Here is your
data: 9, 2, 5, 4, 12, 7, 8, 11, 9, 3, 7, 4, 12, 5, 4, 10, 9, 6, 9, 4. Calculate the sample standard deviation of the length of
the crystals.
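The calculation can be sketched step by step as follows. The sketch assumes the usual sample formula, which divides the sum of squared deviations by n - 1 rather than n; the variable names are assumptions, and statistics.stdev is used only as a cross-check.

```python
from math import sqrt
from statistics import mean, stdev

# Crystal lengths in millimeters, taken from the example above.
crystals = [9, 2, 5, 4, 12, 7, 8, 11, 9, 3,
            7, 4, 12, 5, 4, 10, 9, 6, 9, 4]

n = len(crystals)
x_bar = mean(crystals)                           # sample mean

# Squared deviation of each length from the mean.
squared_deviations = [(x - x_bar) ** 2 for x in crystals]

# Sample variance divides by n - 1 (assumed sample formula).
sample_variance = sum(squared_deviations) / (n - 1)
sample_sd = sqrt(sample_variance)

print(f"mean = {x_bar}")
print(f"sample variance = {sample_variance:.2f}")    # 9.37
print(f"sample standard deviation = {sample_sd:.2f}")  # 3.06

# Cross-check against the standard library.
print(abs(sample_sd - stdev(crystals)) < 1e-9)
```

Here the mean is 7 mm and the sum of squared deviations is 178, so the sample variance is 178/19 and the sample standard deviation is about 3.06 mm.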

