Statistics is the science of collecting, organizing, analyzing, and interpreting data. It involves tools and methods for collecting data, organizing data into tables and graphs, and drawing conclusions from the data through statistical inference. There are two main areas of statistics: descriptive statistics, which involves describing and presenting sample data, and inferential statistics, which involves using sample data to make inferences about the population from which the sample was drawn. Key terms in statistics include population, sample, parameter, statistic, variable, and data. There are different types of variables such as qualitative vs. quantitative, and nominal vs. ordinal vs. discrete vs. continuous. Probability and statistics are related in that probability provides the theoretical basis for statistical inference when making statements about populations.
2. What Is (Are?) Statistics?
• Statistics (a discipline) is a science of dealing with data
• It consists of tools and methods to
– collect data, organize data, and interpret the information or draw
conclusion from data
Note: Statistics (plural) sometimes refers to particular calculations
made from data. For instance, the mean, median, and percentage are
statistics, since they are numbers calculated from a set of sample data
collected.
3. Basic Terms
• Population: A collection, or set, of individuals or objects
or events whose properties are to be analyzed
• Sample: A subset of the population
• Parameter: A numerical value summarizing all the data
of an entire population, for instance, a population mean
• Statistic: A numerical value summarizing the sample
data, for instance, a sample mean
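To make the parameter/statistic distinction concrete, here is a minimal sketch. The population values, the sample size of 30, and the seed are all made up for illustration; only the standard-library `random` and `statistics` modules are used.

```python
import random
import statistics

# Hypothetical population: 1000 made-up height values (illustrative only).
population = [150 + (i * 7) % 41 for i in range(1000)]

# Parameter: a numerical summary of the ENTIRE population.
parameter = statistics.mean(population)

# Statistic: the same summary computed from a SAMPLE (a subset).
rng = random.Random(0)                 # fixed seed, an assumption for repeatability
sample = rng.sample(population, 30)    # 30 elements drawn without replacement
statistic = statistics.mean(sample)

print("population mean (parameter):", parameter)
print("sample mean (statistic):   ", statistic)
```

The sample mean will usually be close to, but not equal to, the population mean; a different sample gives a different statistic, while the parameter stays fixed.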
4. Two Areas of Statistics
Two areas of statistics:
• Descriptive Statistics: collection, presentation,
and description of sample data
• Inferential Statistics: making decisions and
drawing conclusions about populations
5. What is an Observational Unit?
• The person or thing on which a variable
is observed or measured is called the
observational/experimental unit, or simply a
case
– Example: a student in the class
6. What Are Data?
• Data can be numbers, record names, or
other labels recorded for the observational
unit
• Not all data represented by numbers are
numerical data
– e.g., 1 = male, 2 = female where 1 and 2 are
the indicators of gender
7. Data Tables
• The following data table clearly shows the
context of the data presented:
• Notice that this data table tells us the
variables (column) and observational units
(row) for these data.
8. What is a Variable?
• Variables are characteristics recorded
about each individual or thing
• Each variable should have a name that
identifies what has been measured
9. What is Statistics Really About?
• Statistics is about variation
• Different observational units may have
different data values for a variable
• Statistics helps us to deal with variation in
order to make sense of data
10. Two kinds of Variables
• Qualitative, or Attribute, or Categorical Variable
– A variable that identifies a category for each case, for example, gender.
Note: Arithmetic operations, such as addition and averaging, are not
meaningful for data resulting from a qualitative variable
• Quantitative, or Numerical Variable
– A variable that records measurements or amounts of something and must have
measuring units, for example, height measured in inches.
Note: Arithmetic operations, such as addition and averaging, are meaningful for data
resulting from a quantitative variable
11. Subdividing Variables Further
• Qualitative and quantitative variables may be
further subdivided:
Variable
• Qualitative
– Nominal
Examples: blood type, zip code, gender, race, political party, etc.
– Ordinal
Examples: socioeconomic status (“low income”, “middle income”, “high income”),
education level (“high school”, “BS”, “MS”, “PhD”),
income level (“less than 50K”, “50K-100K”, “over 100K”),
satisfaction rating (“extremely dislike”, “dislike”, “neutral”,
“like”, “extremely like”)
• Quantitative
– Discrete
Examples: number of students present, number of red marbles in a jar
– Continuous
Examples: height of students in class, weight of students in class,
time it takes to get to school, distance traveled between classes
12. Key Definitions
• Nominal Variable: A qualitative variable that categorizes (or describes, or names) an
element of a population, for example, color of a car purchased.
• Ordinal Variable: A qualitative variable that incorporates an ordered position, or
ranking, for instance, the variable Age recorded in three possible categories:
young, middle, and old.
• Discrete Variable: A quantitative variable that can assume a countable number of
values. That is, the values are the counts, for example, number of cars owned. So, a
discrete variable can assume values corresponding to integer values along a number
line.
• Continuous Variable: A quantitative variable whose values are measurements, such as height
or weight. The precision of the values recorded for the variable depends on the
measuring scale used. For example, a weight recorded as 120 lb may actually be
120.1 lb, 120.14 lb, or 120.143 lb if a more accurate scale is used for
measuring. Therefore, a continuous variable can assume any value in an interval along a
number line, including every possible value between any two values.
13. Important Reminders!
In many cases, a discrete and continuous variable may
be distinguished by determining whether the variables
are related to a count or a measurement
Discrete variables are usually associated with counting
Continuous variables are usually associated with measurements
14. Example
• Example: In a student evaluation of instruction at most
universities, one question asks students to evaluate the
statement “The instructor was generally interested in teaching”
on the following scale:
1 = Disagree Strongly;
2 = Disagree;
3 = Neutral;
4 = Agree;
5 = Agree Strongly.
• Question: Is interest in teaching categorical or quantitative?
15. Example (cont.)
• Question: Is interest in teaching categorical or
quantitative?
• There is an order to these ratings, but adding or
subtracting two ratings has no meaning.
• We conclude that variables like interest in teaching are
categorical and are ordinal variables.
Just because a variable’s values are numbers, don’t assume that it is quantitative.
16. Data Collection
• First problem a statistician faces: how to obtain
the data
• Usually the data are sample data collected from a
portion of the population. It is important to obtain good
or representative sample data
• Statistical inferences about the population are made based
on statistics obtained from the sample data collected
17. Process of Data Collection
1. Define the objectives of the survey or experiment
– Example: The problem statement is to assign the new input data
point to one of the two classes (i.e., A or B)
2. Define the variable and population of interest
– Example: two classes of data, namely class A (squares) and Class B
(triangles), K number of Nearest Neighbors
3. Define the data-collection and data-measuring schemes. This
includes sampling procedures, sample size, and the data-measuring
device (questionnaire, scale, ruler, etc.)
4. Determine the appropriate descriptive or inferential data-analysis
techniques
18. KNN Algorithm Example
• Two classes of data
– A (squares) and B (triangles)
• The problem statement
– Assign the new input data point to one of the two classes by using the
KNN algorithm
• The first step is to define the value of ‘K’.
– But what does the ‘K’ in the KNN algorithm stand for?
– ‘K’ stands for the number of Nearest Neighbors
19. KNN Algorithm Example
• Define the value of ‘K’ as 3.
– The algorithm will consider the three neighbors that are the closest to the
new data point in order to decide the class of this new data point
• The closeness between the data points is calculated by
using measures such as Euclidean or Manhattan distance
• At ‘K’ = 3, the neighbors include two squares and one triangle.
– If I were to classify the new data point based on ‘K’ = 3, then it would
be assigned to Class A (squares), the majority class among its neighbors.
20. KNN Algorithm Example
• What if the ‘K’ value is set to 7?
– Here, I’m basically telling my algorithm to look for the seven nearest
neighbors and classify the new data point into the class it is most
similar to.
• At ‘K’ = 7, the neighbors include 3 squares and 4 triangles.
– If I were to classify the new data point based on ‘K’ = 7, then it would
be assigned to Class B (triangles) since the majority of its neighbors
were of class B.
22. Considerations while implementing KNN
• Here we measure the distance between P1 and
P2 by using the Euclidean distance
measure.
• The coordinates for P1 and P2 are (1,4)
and (5,1) respectively.
• The Euclidean distance can be
calculated:
d(P1, P2) = sqrt((5 - 1)^2 + (1 - 4)^2) = sqrt(16 + 9) = 5
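As a quick check of the distance between P1 = (1,4) and P2 = (5,1), the sketch below computes both the Euclidean and the Manhattan distance mentioned earlier. The function names are illustrative choices, not taken from the slides.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan_distance(p, q):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

p1 = (1, 4)
p2 = (5, 1)

d_euclid = euclidean_distance(p1, p2)     # sqrt(4^2 + 3^2) = sqrt(25) = 5.0
d_manhattan = manhattan_distance(p1, p2)  # 4 + 3 = 7

print(d_euclid, d_manhattan)
```

The two measures generally give different values for the same pair of points, so the choice of metric can change which neighbors count as "nearest".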
23. Steps in KNN
• Step 1: Calculate Euclidean Distance
• Step 2: Get Nearest Neighbors
• Step 3: Make Predictions
https://www.askpython.com/python/examples/k-nearest-neighbors-from-scratch
24. Steps in KNN using Numpy
• Step 1. Figure out an appropriate distance metric to calculate the
distance between the data points.
• Step 2. Store the distance in an array and sort it according to the
ascending order of their distances (preserving the index i.e. can use
NumPy argsort method).
• Step 3. Select the first K elements in the sorted list.
• Step 4. Perform the majority Voting and the class with the maximum
number of occurrences will be assigned as the new class for the data
point to be classified.
https://www.askpython.com/python/examples/k-nearest-neighbors-from-scratch
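The four steps above can be sketched with NumPy roughly as follows. This is a minimal illustration, not the linked tutorial's code: the function name `knn_predict` and the toy coordinates are assumptions, with the points chosen so that K = 3 and K = 7 reproduce the square/triangle counts from the earlier slides.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 1: Euclidean distance from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices that sort the distances in ascending order.
    order = np.argsort(distances)
    # Step 3: labels of the first k (closest) points.
    nearest_labels = [y_train[i] for i in order[:k]]
    # Step 4: majority vote.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: class "A" (squares) and class "B" (triangles).
X = [(1, 0), (0, 1), (3, 0), (5, 5),          # class A
     (1, 1), (2, 0), (0, 2), (2, 1), (6, 6)]  # class B
y = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]
new_point = (0, 0)

print(knn_predict(X, y, new_point, k=3))  # neighbors: 2 A, 1 B
print(knn_predict(X, y, new_point, k=7))  # neighbors: 3 A, 4 B
```

An odd K with two classes avoids tied votes; with an even K or more classes, a tie-breaking rule would have to be chosen.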
25. Methods Used to Collect Data
Data can be collected by performing an experiment, a
survey, or a census:
Experiment: The investigator controls or modifies the
environment and observes the effect on the variable under
study
Census: A 100% survey. Every element of the population is
listed. Seldom used: difficult and time-consuming to compile,
and expensive.
Survey: Data are obtained by sampling some of the
population of interest. The investigator does not modify the
environment.
27. Biased Sampling
Biased Sampling Method: A sampling method that produces data
which systematically differ from the sampled population
Unbiased Sampling Method: A sampling method that is not biased
Sampling methods that often result in biased samples:
• Volunteer sample: sample collected from those elements
of the population which chose to contribute the needed
information on their own initiative
• Convenience sample: sample selected from elements of
a population that are easily accessible
28. Sample Design
Sampling Frame: A list of the elements belonging to the
population from which the sample will be drawn
Note: It is important that the sampling frame be representative
of the population
Sample Design: The process of selecting sample elements
from the sampling frame
Note: There are many different types of sample designs.
Usually they all fit into two categories: judgment
samples and probability samples.
29. Two types of sample designs
Probability Samples: Samples in which the elements to
be selected are drawn on the basis of probability. Each
element in a population has a certain probability of being
selected as part of the sample.
Judgment Samples: Samples that are selected on the
basis of being “typical”
– Items are selected that are representative of the
population. The validity of the results from a
judgment sample reflects the soundness of the
collector’s judgment.
31. Random Sampling
• A sample selected in such a way that every element in the population
has an equal probability of being chosen
• In a simple random sample, every possible sample of size n also has an
equal chance of being selected
• Random samples are obtained either by sampling
– with replacement from a finite population or without replacement from an infinite
population
Inherent in the concept of randomness: the next result (or occurrence) is not
predictable
Notes:
Proper procedure for selecting a random sample: use a random number
generator or a table of random numbers
32. Example
Example: An employer is interested in the time it takes
each employee to commute to work each morning. A
random sample of 35 employees will be selected and
their commuting time will be recorded.
1. There are 2712 employees
2. Each employee is numbered: 0001, 0002, 0003, etc., up
to 2712
3. Using four-digit random numbers, a sample is
identified: 1315, 0987, 1125, etc.
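A random number generator can stand in for the table of random numbers. The sketch below draws 35 distinct employee numbers from 1 to 2712 using Python's standard `random` module; the seed is an arbitrary assumption added only so the run is repeatable, and the resulting IDs will not match the 1315, 0987, 1125 shown above.

```python
import random

N_EMPLOYEES = 2712   # employees are numbered 0001 .. 2712
SAMPLE_SIZE = 35

rng = random.Random(42)  # arbitrary seed (assumption), for repeatability only

# Draw 35 distinct employee numbers; every employee is equally likely.
sample_ids = sorted(rng.sample(range(1, N_EMPLOYEES + 1), SAMPLE_SIZE))

# Format as four-digit labels, matching the numbering in the example.
labels = [f"{i:04d}" for i in sample_ids]
print(labels)
```

`random.sample` draws without replacement, so no employee can be selected twice.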
33. Systematic Sampling
A sample in which every kth item of the sampling frame is
selected, starting from the first element which is randomly
selected from the first k elements
Note: The systematic technique is easy to execute. However, it has
some inherent dangers when the sampling frame is repetitive or
cyclical in nature. In these situations, the results may not
approximate a simple random sample.
34. Example
Suppose you want to obtain a systematic sample
of 8 houses from a street of 120 houses.
• First, since 120/8=15, choose a random starting point
between 1 and 15. Let’s say, 11.
• Then, choose every 15th house after the 11th house.
The houses selected are
11, 26, 41, 56, 71, 86, 101, and 116.
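The house example can be reproduced in a few lines. This is a sketch under the assumption that the frame size is an exact multiple of the sample size, as it is here (120/8 = 15); the function name and its parameters are illustrative.

```python
import random

def systematic_sample(frame_size, sample_size, start=None, rng=None):
    """Select every k-th item, beginning at a random start within the first k."""
    k = frame_size // sample_size          # sampling interval, 120 // 8 = 15
    if start is None:
        rng = rng or random.Random()
        start = rng.randint(1, k)          # random start between 1 and k inclusive
    return [start + i * k for i in range(sample_size)]

# Fixing the start at 11 reproduces the list in the example.
houses = systematic_sample(120, 8, start=11)
print(houses)  # [11, 26, 41, 56, 71, 86, 101, 116]
```

Leaving `start` unset picks the starting house at random, which is what the procedure calls for in practice.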
35. Stratified Sampling
Steps:
• A sample obtained by stratifying or grouping the sampling
frame
• Then selecting a fixed number of items from each of the
strata/groups by means of a simple random sampling
technique
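A minimal sketch of the two steps, assuming the sampling frame has already been grouped into strata. The strata names, their members, and the fixed count of 3 per stratum are hypothetical.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=0):
    """Take a simple random sample of fixed size from every stratum."""
    rng = random.Random(seed)  # seed is an assumption, for repeatability
    return {
        name: rng.sample(members, n_per_stratum)
        for name, members in strata.items()
    }

# Hypothetical sampling frame, already grouped by class year.
strata = {
    "freshman":  [f"F{i}" for i in range(1, 41)],
    "sophomore": [f"S{i}" for i in range(1, 31)],
    "junior":    [f"J{i}" for i in range(1, 21)],
}

sample = stratified_sample(strata, n_per_stratum=3)
for name, chosen in sample.items():
    print(name, chosen)
```

Every stratum contributes the same number of items here, regardless of how large the stratum is.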
36. Proportional (or Quota)
Sampling
Steps:
1. A sample obtained by stratifying the sampling frame
2. Then selecting a number of items in proportion to the
size of each stratum (or by quota) from each stratum by
means of a simple random sampling technique
37. Example
Suppose that a company has 180 staff, broken down as follows:
Male, full time 90
Male, part time 18
Female, full time 9
Female, part time 63
We are asked to take a proportional sample of 40 staff, stratified according to the
above categories.
• The first step is to calculate the percentage of staff in each group:
% male, full time = (90/180) x 100 = 0.5 x 100 = 50
% male, part time = (18/180) x 100 = 0.1 x 100 = 10
% female, full time = (9/180) x 100 = 0.05 x 100 = 5
% female, part time = (63/180) x 100 = 0.35 x 100 = 35
• This tells us that of our sample of 40, 50% should be male, full time. 10% should be
male, part time. 5% should be female, full time. 35% should be female, part time.
Therefore,
50% of 40 is 20.
10% of 40 is 4.
5% of 40 is 2.
35% of 40 is 14.
We need to select 20 full time males, 4 part time males, 2 full time females,
and 14 part time females.
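The allocation arithmetic for this example can be sketched as below. The dictionary keys simply mirror the four staff categories; the rounding step is an assumption that happens to be harmless here because every share works out to a whole number.

```python
# Stratum sizes from the staff example (180 staff in total).
strata_sizes = {
    "Male, full time": 90,
    "Male, part time": 18,
    "Female, full time": 9,
    "Female, part time": 63,
}
sample_size = 40
total = sum(strata_sizes.values())  # 180

# Each stratum's share of the sample is proportional to its share of the staff.
allocation = {
    name: round(sample_size * size / total)
    for name, size in strata_sizes.items()
}

for name, n in allocation.items():
    print(f"{name}: {n}")
# Male, full time: 20
# Male, part time: 4
# Female, full time: 2
# Female, part time: 14
```

In general the rounded shares may not add up exactly to the target sample size, in which case one or two strata need a manual adjustment.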
38. Cluster Sampling
Steps:
1. A sample obtained by dividing the sampling frame into
clusters first
2. Then randomly selecting some clusters
3. Finally, the sample will include either all elements or a simple
random sample of some of the elements in each of the
clusters selected
Note: The difference between stratified and cluster sampling:
All strata are represented in the sample, but only a subset of clusters
is in the sample.
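The three steps can be sketched as follows, assuming the frame is already divided into non-overlapping clusters. The cluster names (city blocks), their members, and the function signature are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, n_per_cluster=None, seed=0):
    """Randomly pick clusters, then take all or some elements from each."""
    rng = random.Random(seed)  # seed is an assumption, for repeatability
    # Step 2: randomly select some of the clusters.
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = {}
    for name in chosen:
        members = clusters[name]
        if n_per_cluster is None:
            sample[name] = list(members)                         # all elements
        else:
            sample[name] = rng.sample(members, n_per_cluster)    # SRS within cluster
    return sample

# Step 1: hypothetical frame divided into clusters (city blocks of households).
clusters = {
    f"block_{b}": [f"block_{b}_house_{h}" for h in range(1, 11)]
    for b in range(1, 7)
}

one_stage = cluster_sample(clusters, n_clusters=2)                   # every household
two_stage = cluster_sample(clusters, n_clusters=2, n_per_cluster=4)  # 4 per block
print(sorted(one_stage), sorted(two_stage))
```

Unlike the stratified case, only the selected clusters appear in the result; the other blocks contribute nothing.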
39. Example
Step 1: Define your population
As with other forms of sampling,
you must first begin by clearly
defining the population you wish
to study.
40. Example
Step 2: Divide your population into clusters
The quality of your clusters and how well they
represent the larger population determines
the validity of your results.
• Each cluster’s population should be as
diverse as possible. You want every potential
characteristic of the entire population to be
represented in each cluster.
• Each cluster should have a similar
distribution of characteristics as the
distribution of the population as a whole.
• Taken together, the clusters should cover the
entire population.
• There should not be any overlap between clusters
(i.e. the same people or units do not appear
in more than one cluster).
41. Example
Step 3: Randomly select clusters to use as
your sample
• If each cluster is itself a mini-representation
of the larger population, randomly selecting
and sampling from the clusters allows you to
imitate simple random sampling, which in
turn supports the validity of your results.
• Conversely, if the clusters are not
representative, then random sampling will
allow you to gather data on a diverse array of
clusters, which should still provide you with
an overview of the population as a whole
42. Example
Step 4: Collect data from the sample
• You then conduct your study and collect data
from every unit in the selected clusters.
43. Probability & Statistics
• Probability is the science of making statements
about what will occur when samples are drawn
from a known population.
• Statistics is the science of organizing sample
data and making inferences about the unknown
population from which the sample is drawn.
Probability is a vehicle of statistics: it lets us justify the accuracy of statistical
inferences from sample data to a population by quantifying
their chance of occurring. That is, we want to know the chance that a
similar result will occur if the study is repeated many more times.
44. Comparison of Probability & Statistics
Probability: Properties of the population are
assumed known. Answer questions about
the sample based on these properties.
Statistics: Use information in the sample to
draw a conclusion about the population
45. Example
Probability example: A jar of M&M’s contains 100 candy pieces, of which 15
are red. A handful of 10 is selected.
Probability question:
What is the probability that 3 of the 10 selected are red?
Statistics example: A handful of 10 M&M’s is selected from a jar
containing 1000 candy pieces. Three M&M’s in
the handful are red.
Statistics question:
What is the proportion of red M&M’s in the entire jar?
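Both questions can be worked through numerically. The sketch below assumes the handful is drawn without replacement, so the probability question follows the hypergeometric formula; for the statistics question it shows only the simplest point estimate, the sample proportion. The function name is illustrative.

```python
from math import comb

def prob_red(k, jar_size=100, red_in_jar=15, handful=10):
    """P(exactly k red in the handful), drawing without replacement."""
    ways_red = comb(red_in_jar, k)                         # choose the red pieces
    ways_other = comb(jar_size - red_in_jar, handful - k)  # choose the non-red pieces
    return ways_red * ways_other / comb(jar_size, handful)

# Probability question: the population (jar) is known, ask about the sample.
p_three_red = prob_red(3)
print("P(3 of 10 are red):", p_three_red)

# Statistics question: only the sample is known, estimate the population.
red_in_handful = 3
handful_size = 10
p_hat = red_in_handful / handful_size  # sample proportion as a point estimate
print("estimated proportion of red in the jar:", p_hat)
```

The first calculation runs from population to sample; the second runs from sample to population, which is why the estimate carries uncertainty that the probability calculation does not.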
Editor's Notes
A discrete variable is a variable whose value is obtained by counting. Examples: number of students present. number of red marbles in a jar.
A continuous variable is a variable whose value is obtained by measuring.
In the above image, we have two classes of data, namely class A (squares) and Class B (triangles)
The problem statement is to assign the new input data point to one of the two classes by using the KNN algorithm
The first step in the KNN algorithm is to define the value of ‘K’. But what does the ‘K’ in the KNN algorithm stand for?
‘K’ stands for the number of Nearest Neighbors and hence the name K Nearest Neighbors (KNN).
In the above image, I’ve defined the value of ‘K’ as 3. This means that the algorithm will consider the three neighbors that are the closest to the new data point in order to decide the class of this new data point.
The closeness between the data points is calculated by using measures such as Euclidean and Manhattan distance, which I’ll be explaining below.
At ‘K’ = 3, the neighbors include two squares and 1 triangle. So, if I were to classify the new data point based on ‘K’ = 3, then it would be assigned to Class A (squares).
But what if the ‘K’ value is set to 7? Here, I’m basically telling my algorithm to look for the seven nearest neighbors and classify the new data point into the class it is most similar to.
At ‘K’ = 7, the neighbors include three squares and four triangles. So, if I were to classify the new data point based on ‘K’ = 7, then it would be assigned to Class B (triangles) since the majority of its neighbors were of class B.
In practice, there’s a lot more to consider while implementing the KNN algorithm. This will be discussed in the demo section of the blog.
Earlier I mentioned that KNN uses Euclidean distance as a measure to check the distance between a new data point and its neighbors, let’s see how.
It is as simple as that! KNN makes use of simple measure in order to solve complex problems, this is one of the reasons why KNN is such a commonly used algorithm.
Judgmental sampling, also called purposive sampling or authoritative sampling, is a non-probability sampling technique in which the sample members are chosen only on the basis of the researcher’s knowledge and judgment. Because the researcher’s knowledge is instrumental in creating the sample, the results can be accurate when that judgment is sound, although, unlike a probability sample, a judgment sample does not allow the margin of error to be quantified.
The process of selecting a sample using judgmental sampling involves the researchers carefully picking and choosing each item to be a part of the sample. The researcher’s knowledge is primary in this sampling process as the members of the sample are not randomly chosen.