2. Recommended and Reference Books
Topic: Introduction to Statistical
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
3. Outline
• What Is Data?
• Categories Of Data
• What Is Statistics?
• Basic Terminologies In Statistics
• Sampling Techniques
• Types Of Statistics
• Descriptive Statistics
• Statistical Description of Data
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
4. What Is Data
• Data refers to facts and statistics collected together for reference or
analysis.
• Data can be collected, measured and analyzed. It can also be
visualized by using statistical models and graphs.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
5. Categories Of Data…
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
6. Categories Of Data…
• Qualitative Data: Qualitative data deals with characteristics and descriptors
that can’t be easily measured, but can be observed subjectively.
• such as smells, tastes, textures, attractiveness, and color.
• In statistics, qualitative data—sometimes referred to as categorical data—is
data that can be arranged into categories based on physical traits, gender,
colors or anything that does not have a number associated with it.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
7. Categories Of Data…
Nominal Data: Data with no inherent order or ranking such as gender
or race.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
8. Categories Of Data…
• Ordinal Data: Data with an ordered series of information is called ordinal
data.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
9. Categories Of Data…
• Quantitative Data: Quantitative data deals with numbers and things you
can measure objectively. This is further divided into two:
• Discrete Data: Also known as categorical data, it can hold a finite number
of possible values.
• Example: Number of students in a class.
• Continuous Data: Data that can hold an infinite number of possible values.
• Example: Weight of a person.
• So these were the different categories of data. The upcoming sections will
focus on the basic Statistics concepts, so buckle up and get ready to do
some math.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
10. Methods of Data Collection
In an observational study, a researcher observes and measures characteristics
of interest of part of a population.
In an experiment, a treatment is applied to part of a population, and
responses are observed.
A simulation is the use of a mathematical or physical model to reproduce the
conditions of a situation or process.
A survey is an investigation of one or more characteristics of a population.
A census is a measurement of an entire population.
A sampling is a measurement of part of a population.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
11. Basics of Statistics
Definition: Science of collection, presentation, analysis, and reasonable interpretation of data.
• Statistics presents a rigorous scientific method for gaining insight into data.
• suppose we measure the weight of 100 patients in a study.
• With so many measurements, simply looking at the data fails to provide an informative account.
• However statistics can give an instant overall picture of data based on graphical presentation or
numerical summarization irrespective to the number of data points.
• Besides data summarization, another important task of statistics is to make inference and predict
relations of variables.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
12. What is Statistics
• Statistics is an area of applied mathematics concerned with data collection,
analysis, interpretation, and presentation.
• This area of mathematics deals with understanding how data can be used to
solve complex problems.
• Here are a couple of example problems that can be solved by using statistics:
• Your company has created a new drug that may cure cancer. How would you
conduct a test to confirm the drug’s effectiveness?
• You and a friend are at a baseball game, and out of the blue, he offers you a bet
that neither team will hit a home run in that game. Should you take the bet?
• The latest sales data have just come in, and your boss wants you to prepare a
report for management on places where the company could improve its
business. What should you look for? What should you not look for?
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
13. Basic Terminologies In Statistics
• The two most important terminologies in statistics are population
and sample.
• Population: A collection or set of individuals or objects or events
whose properties are to be analyzed
• Sample: A subset of the population is called ‘Sample’. A well-chosen
sample will contain most of the information about a particular
population parameter.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
14. Basic Terminologies In Statistics..
• Now you must be wondering how can one choose a sample that best
represents the entire population.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
15. A Taxonomy of Statistics
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
16. Sampling Techniques…
• Sampling is a statistical method that deals with the selection of individual
observations within a population.
• It is performed to infer statistical knowledge about a population.
• Consider a scenario wherein you’re asked to perform a survey about the
eating habits of teenagers in the US. There are over 42 million teens in the
US at present and this number is growing as you read this blog.
• Is it possible to survey each of these 42 million individuals about their
health? Obviously not! That’s why sampling is used.
• It is a method wherein a sample of the population is studied in order to
draw inference about the entire population.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
17. Sampling Techniques…
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
18. Sampling Techniques…
• Probability Sampling: This is a sampling technique in which samples
from a large population are chosen using the theory of probability.
There are three types of probability sampling:
• Random Sampling: In this method, each member of the population
has an equal chance of being selected in the sample.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
19. Sampling Techniques…
• Systematic Sampling: In Systematic sampling, every nth record is
chosen from the population to be a part of the sample. Refer the
below figure to better understand how Systematic sampling works.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
20. Sampling Techniques…
• Stratified Sampling: In Stratified sampling, a stratum is used to form
samples from a large population. A stratum is a subset of the
population that shares at least one common characteristic. After this,
the random sampling method is used to select a sufficient number of
subjects from each stratum.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
21. Types Of Statistics…
• There are two well-defined types of statistics:
• Descriptive Statistics
• Inferential Statistics
• Descriptive Statistics is a method used to describe and understand
the features of a specific data set by giving short summaries about
the sample and measures of the data.
• Descriptive Statistics is mainly focused upon the main characteristics
of data. It provides a graphical summary of the data.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
22. Types Of Statistics…
• Suppose you want to gift all your classmate’s t-shirts.
• To study the average shirt size of students in a classroom, in descriptive statistics
you would record the shirt size of all students in the class and then you would find
out the maximum, minimum and average shirt size of the class.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
23. Types Of Statistics…
• Inferential statistics makes inferences and predictions about a population
based on a sample of data taken from the population in question.
• Inferential statistics generalizes a large dataset and applies probability to
draw a conclusion.
• It allows us to infer data parameters based on a statistical model using
sample data.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
24. Types Of Statistics…
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
25. Types Of Statistics…
• So, if we consider the same example of finding the average shirt size
of students in a class, in Inferential Statistics, you will take a sample
set of the class, which is basically a few people from the entire class.
• You already have had grouped the class into large, medium and small.
In this method, you basically build a statistical model and expand it
for the entire population in the class.
• So that was a brief understanding of Descriptive and Inferential
Statistics.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
26. Statistical Description of Data
• Statistics describes a numeric set of data by its
• Center
• Variability
• Shape
• Statistics describes a categorical set of data by
• Frequency, percentage or proportion of each category
• Measures of the center
• A measure of spread
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
27. Measures of Central Tendency
• Measures of the center are statistical measures that represent the
summary of a dataset. There are three main measures of center:
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
28. Measures of Central Tendency…
• Mean: The arithmetic mean (or, simply, mean) is computed by
summing all the observations in the sample and dividing the sum
by the number of
observations.
• For a sample of five household incomes, 6000, 10,000, 10,000,
14000, 50,000 the sample mean is,
X =
6000 + 10000 + 10000 + 14000 + 50000
5
= 18000
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
29. Measures of Central Tendency…
• Median: In a list ranked from smallest measurement to the
highest, the median is the middle value
• In our example of five household incomes, first we rank the
measurements
• 6,000 10,000 10,000 14,000 50,000
• Sample Median is 10,000
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
30. Measures of Central Tendency…
• Mode:
• In nominal data:
• The value which occurs with the greatest frequency
• 6,000 10,000 10,000 14,000 50,000
• Sample Mode is 10,000
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
31. Measures of spread
• A measure of spread, sometimes also called a measure of dispersion,
is used to describe the variability in a sample or population.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
32. Measures of spread…
• Range (present highest and lowest value in a distribution. The
difference between these values is the range)
• Range = Max(𝑥_𝑖) – Min(𝑥_𝑖)
• Variance: It describes how much a random variable differs from its
expected value. It entails
• Standard deviation (the square root of the variance)
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
33. Sample Variance
2 i=1
n
i
2
s =
(x - x )
n -1
S = standard deviation
(square root of variance)
Measures of spread…
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
34. Measures of spread…
• Variance of 5, 7, 3? Mean is (5+7+3)/3 = 5 and the variance is
4
1
3
)
5
7
(
)
5
3
(
)
5
5
( 2
2
2
Standard Deviation: Square root of the variance. The standard deviation
of the above example is 2.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
35. Examples
7 7
7 7 7
7
7 8
7 7 7
6 3 2
7 8 13
9
Mean = 7
SD=0
Mean = 7
SD=0.63
Mean = 7
SD=4.04
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
36. Examples…
Problem statement: Daenerys has 20 Dragons. They have the
numbers 9, 2, 5, 4, 12, 7, 8, 11, 9, 3, 7, 4, 12, 5, 4, 10, 9, 6, 9,
4. Work out the Standard Deviation.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
37. Examples…
For a Normal distribution approximately,
a) 68% of the measurements fall within one standard deviation around the
mean
b) 95% of the measurements fall within two standard deviations around the
mean
c) 99.7% of the measurements fall within three standard deviations around
the mean
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
38. Suppose the reaction time of a particular drug has a Normal distribution
with a mean of 10 minutes and a standard deviation of 2 minutes
Approximately,
a) 68% of the subjects taking the drug will have reaction time between
8 and 12 minutes
b) 95% of the subjects taking the drug will have reaction tome
between 6 and 14 minutes
c) 99.7% of the subjects taking the drug will have reaction tome
between 4 and 16 minutes
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
39. Examples…
• Quartile: Quartiles tell us about the spread of a data set by breaking
the data set into quarters, just like the median breaks it in half.
• To better understand how quartile and the IQR are calculated, let’s
look at an example.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
40. Examples…
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
41. Examples…
• The first quartile (Q1) lies between the 25th and 26th observation.
• The second quartile (Q2) lies between the 50th and 51st observation.
• The third quartile (Q3) lies between the 75th and 76th observation.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
42. Examples…
• Inter Quartile Range (IQR): It is the measure of variability, based on
dividing a data set into quartiles. The interquartile range is equal to
Q3 minus Q1, i.e. IQR = Q3 – Q1
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
43. Frequency Distribution
Age 1 2 3 4 5 6
Frequency 5 3 7 5 4 2
Age Group 1-2 3-4 5-6
Frequency 8 12 6
Frequency Distribution of Age
Grouped Frequency Distribution of Age:
Consider a data set of 26 children of ages 1-6 years. Then the frequency distribution of
variable ‘age’ can be tabulated as follows:
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
44. Frequency Distribution…
Age Group 1-2 3-4 5-6
Frequency 8 12 6
Cumulative Frequency 8 20 26
Age 1 2 3 4 5 6
Frequency 5 3 7 5 4 2
Cumulative Frequency 5 8 15 20 24 26
Cumulative frequency of data in previous page
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
45. Graphical Data
• Two types of statistical presentation of data - graphical and numerical.
• Graphical Presentation: We look for the overall pattern and for striking
deviations from that pattern.
• Over all pattern usually described by shape, center, and spread of the data.
• An individual value that falls outside the overall pattern is called an outlier.
• Bar diagram and Pie charts are used for categorical variables.
• Histogram, stem and leaf and Box-plot are used for numerical variable.
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
46. Graphical Data
Bar Diagram: Lists the categories and presents the percent or count of individuals who fall in
each category.
Treatment
Group
Frequency Proportion Percent
(%)
1 15 (15/60)=0.25 25.0
2 25 (25/60)=0.333 41.7
3 20 (20/60)=0.417 33.3
Total 60 1.00 100
0 20 40 60 80 100 120
1
2
3
Total
1 2 3 Total
Percent (%) 25 41.7 33.3 100
Proportion 0 0 0 1
Frequency 15 25 20 60
Chart Title
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
47. Graphical Data
Pie Chart: Lists the categories and presents the percent or count of individuals who fall in each
category.
Treatment
Group
Frequency Proportion Percent
(%)
1 15 (15/60)=0.25 25.0
2 25 (25/60)=0.333 41.7
3 20 (20/60)=0.417 33.3
Total 60 1.00 100
Frequency
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
48. Summary
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical
49. Thanks
Course: Data Science Course Code: CS-325 -- Instructor: Sana Ullah Khan, Lecturer. Institute of Computing, KUST -- Email: sana.ullah@kust.edu.pk
Topic: Introduction to Statistical