Sampling and Data Analysis Essentials

Sampling and Data
Dr. Rana - FGCU

What Is (Are?) Statistics?
• Statistics (a discipline) is a science of dealing with data
• It consists of tools and methods to
– collect data, organize data, and interpret the information or draw
conclusions from data
Note: Statistics (plural) sometimes refers to particular calculations
made from data. For instance, the mean, median, percentage, etc. are
statistics, since they are numbers calculated from a set of sample data
collected.

Basic Terms
• Population: A collection, or set, of individuals or objects
or events whose properties are to be analyzed
• Sample: A subset of the population
• Parameter: A numerical value summarizing all the data
of an entire population, for instance, a population mean
• Statistic: A numerical value summarizing the sample
data, for instance, a sample mean
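
The difference between a parameter and a statistic can be illustrated with a small sketch. The population below is synthetic (randomly generated commute times, a purely hypothetical data set), and the sample size of 50 is an arbitrary choice for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: commute times (in minutes) for 1,000 people.
population = [random.uniform(5, 60) for _ in range(1000)]

# Parameter: a numerical value summarizing the entire population.
population_mean = statistics.mean(population)

# Statistic: a numerical value summarizing a sample (a subset of the population).
sample = random.sample(population, 50)
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two values are usually close but not identical, and a different sample would give a different statistic while the parameter stays fixed.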

Two Areas of Statistics
Two areas of statistics:
• Descriptive Statistics: collection, presentation,
and description of sample data
• Inferential Statistics: making decisions and
drawing conclusions about populations

What is an Observational Unit?
• The person or thing on which a variable
is observed or measured, such as a student
in the class, is called the
observational/experimental unit, or simply a
case

What Are Data?
• Data can be numbers, record names, or
other labels recorded for the observational
unit
• Not all data represented by numbers are
numerical data
– e.g., 1 = male, 2 = female where 1 and 2 are
the indicators of gender

Data Tables
• The following data table clearly shows the
context of the data presented:
• Notice that this data table tells us the
variables (columns) and observational units
(rows) for these data.

What is a Variable?
• Variables are characteristics recorded
about each individual or thing
• Each variable should have a name that
identifies what has been measured

What is Statistics Really About?
• Statistics is about variation
• Different observational units may have
different data values for a variable
• Statistics helps us to deal with variation in
order to make sense of data

Two kinds of Variables
• Qualitative, or Attribute, or Categorical Variable
– A variable that identifies a category for each case, for example, gender.
Note: Arithmetic operations, such as addition and averaging, are not
meaningful for data resulting from a qualitative variable
• Quantitative, or Numerical Variable
– A variable that records measurements or amounts of something and must have
measuring units, for example, height measured in inches.
Note: Arithmetic operations such as addition and averaging, are meaningful for data
resulting from a quantitative variable

Subdividing Variables Further
• Qualitative and quantitative variables may be
further subdivided:
Variable
• Qualitative
– Nominal, e.g., blood type, zip code, gender, race, political party, etc.
– Ordinal, e.g., socioeconomic status (“low income”, “middle income”, “high income”),
education level (“high school”, “BS”, “MS”, “PhD”),
income level (“less than 50K”, “50K-100K”, “over 100K”),
satisfaction rating (“extremely dislike”, “dislike”, “neutral”, “like”, “extremely like”)
• Quantitative
– Discrete, e.g., number of students present, number of red marbles in a jar
– Continuous, e.g., height of students in class, weight of students in class,
time it takes to get to school, distance traveled between classes

Key Definitions
• Nominal Variable: A qualitative variable that categorizes (or describes, or names) an
element of a population, for example, the color of a car purchased.
• Ordinal Variable: A qualitative variable that incorporates an ordered position, or
ranking, for instance, the variable Age recorded in three possible categories: young,
middle, and old.
• Discrete Variable: A quantitative variable that can assume a countable number of
values. That is, the values are counts, for example, the number of cars owned. So, a
discrete variable can assume values corresponding to integer values along a number
line.
• Continuous Variable: A quantitative variable whose values are measurements, such as
height or weight. The precision of the values recorded for the variable depends on the
measuring scale used. For example, a weight recorded as 120 lb may actually be
120.1 lb, 120.14 lb, or 120.143 lb if a more precise scale is used for
measuring. Therefore, a continuous variable can assume any value in an interval along a
number line, including every possible value between any two values.

Important Reminders!
• In many cases, discrete and continuous variables can
be distinguished by determining whether the variable
is related to a count or a measurement
– Discrete variables are usually associated with counting
– Continuous variables are usually associated with measurements

Example
• Example: In the student evaluation of instruction at most
universities, one question asks students to evaluate the
statement “The instructor was generally interested in teaching”
on the following scale:
1 = Disagree Strongly;
2 = Disagree;
3 = Neutral;
4 = Agree;
5 = Agree Strongly.
• Question: Is interest in teaching categorical or quantitative?

Example (cont.)
• Question: Is interest in teaching categorical or
quantitative?
• There is an order to these ratings, but adding or
subtracting two ratings has no meaning.
• We conclude that variables like interest in teaching are
categorical and, more specifically, ordinal.
Just because a variable’s values are numbers, don’t assume that it is quantitative.

Data Collection
• First problem a statistician faces: how to obtain
the data
• Usually the data are sample data collected from a
portion of the population. It is important to obtain good
or representative sample data
• Statistical Inferences to the population are made based
on statistics obtained from the sample data collected

Process of Data Collection
1. Define the objectives of the survey or experiment
– Example: The problem statement is to assign the new input data
point to one of the two classes (i.e., A or B)
2. Define the variable and population of interest
– Example: two classes of data, namely Class A (squares) and Class B
(triangles), and K, the number of nearest neighbors
3. Define the data-collection and data-measuring schemes. This
includes sampling procedures, sample size, and the data-measuring
device (questionnaire, scale, ruler, etc.)
4. Determine the appropriate descriptive or inferential data-analysis
techniques

KNN Algorithm Example
• Two classes of data
– A (squares) and B (triangles)
• The problem statement
– Assign the new input data point to one of the two classes by using the
KNN algorithm
• The first step is to define the value of ‘K’.
– But what does the ‘K’ in the KNN algorithm stand for?
– ‘K’ stands for the number of Nearest Neighbors

• Suppose the value of ‘K’ is defined as 3.
– The algorithm will consider the three neighbors that are closest to the
new data point in order to decide the class of this new data point
• The closeness between data points is calculated
using measures such as Euclidean or Manhattan distance
• At ‘K’ = 3, the neighbors include 2 squares and 1 triangle.
– If I were to classify the new data point based on ‘K’ = 3, then it would
be assigned to Class A (squares).

• What if the ‘K’ value is set to 7?
– Here, I’m telling the algorithm to look for the seven nearest
neighbors and assign the new data point to the class that is most
common among them.
• At ‘K’ = 7, the neighbors include 3 squares and 4 triangles.
– If I were to classify the new data point based on ‘K’ = 7, then it would
be assigned to Class B (triangles), since the majority of its neighbors
are of Class B.
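
The voting step described above can be sketched in a few lines. The neighbor lists simply restate the counts from the slides (2 squares and 1 triangle at K = 3, 3 squares and 4 triangles at K = 7); the function name `majority_class` is a hypothetical helper, not part of any library.

```python
from collections import Counter


def majority_class(neighbor_labels):
    """Return the most common class label among the nearest neighbors."""
    return Counter(neighbor_labels).most_common(1)[0][0]


# Class A = squares, Class B = triangles (counts taken from the example).
k3_neighbors = ["A", "A", "B"]
k7_neighbors = ["A"] * 3 + ["B"] * 4

print(majority_class(k3_neighbors))  # A
print(majority_class(k7_neighbors))  # B
```

The same new data point is classified differently depending on K, which is why the choice of K matters.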

Considerations While Implementing KNN
• Consider the image; here we measure the distance
between P1 and P2 using the Euclidean distance
measure.
• The coordinates of P1 and P2 are (1,4)
and (5,1), respectively.
• The Euclidean distance can be
calculated as:
$d(P_1, P_2) = \sqrt{(5 - 1)^2 + (1 - 4)^2} = \sqrt{16 + 9} = \sqrt{25} = 5$
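
The distance between the two points given above, P1 = (1, 4) and P2 = (5, 1), can be checked with a short sketch. It uses `math.dist` from the Python standard library (available from Python 3.8); the Manhattan distance is included only for comparison, since it is the other measure mentioned earlier.

```python
import math

p1 = (1, 4)
p2 = (5, 1)

# Euclidean distance: square root of the sum of squared coordinate differences.
euclidean = math.dist(p1, p2)

# Manhattan distance: sum of absolute coordinate differences.
manhattan = sum(abs(a - b) for a, b in zip(p1, p2))

print(euclidean)  # 5.0
print(manhattan)  # 7
```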

Steps in KNN
• Step 1: Calculate Euclidean Distance
• Step 2: Get Nearest Neighbors
• Step 3: Make Predictions
https://www.askpython.com/python/examples/k-nearest-neighbors-from-scratch

Steps in KNN using Numpy
• Step 1. Figure out an appropriate distance metric to calculate the
distance between the data points.
• Step 2. Store the distance in an array and sort it according to the
ascending order of their distances (preserving the index i.e. can use
NumPy argsort method).
• Step 3. Select the first K elements in the sorted list.
• Step 4. Perform the majority Voting and the class with the maximum
number of occurrences will be assigned as the new class for the data
point to be classified.
https://www.askpython.com/python/examples/k-nearest-neighbors-from-scratch
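
The four steps above might be sketched with NumPy as follows. This is a minimal illustration, not the implementation from the linked article: the function name `knn_predict` and the training points are assumptions made up for the example, and Euclidean distance is chosen as the metric.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices that would sort the distances in ascending order.
    order = np.argsort(distances)
    # Step 3: labels of the first k elements of the sorted list.
    nearest_labels = y_train[order[:k]]
    # Step 4: majority vote among those labels.
    return Counter(nearest_labels.tolist()).most_common(1)[0][0]


# Hypothetical training data: Class A near (1, 1), Class B near (6, 6).
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [7.0, 6.5], [6.5, 7.0]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])

print(knn_predict(X_train, y_train, np.array([1.2, 1.4]), k=3))  # A
print(knn_predict(X_train, y_train, np.array([6.4, 6.6]), k=3))  # B
```

Using `argsort` rather than sorting the distances directly keeps the link between each distance and the label of the point it came from.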

Methods Used to Collect Data
Data can be collected by performing an experiment, a
survey, or a census:
Experiment: The investigator controls or modifies the
environment and observes the effect on the variable under
study
Census: A 100% survey. Every element of the population is
listed. Seldom used: difficult and time-consuming to compile,
and expensive.
Survey: Data are obtained by sampling some of the
population of interest. The investigator does not modify the
environment.

Biased Sampling
Biased Sampling Method: A sampling method that produces data
that systematically differ from the sampled population
An unbiased sampling method is one that does not produce such
systematic differences
Sampling methods that often result in biased samples:
• Volunteer sample: sample collected from those elements
of the population which chose to contribute the needed
information on their own initiative
• Convenience sample: sample selected from elements of
a population that are easily accessible

Sampling Frame: A list of the elements belonging to the
population from which the sample will be drawn
Note: It is important that the sampling frame be representative
of the population
Sample Design: The process of selecting sample elements
from the sampling frame
Note: There are many different types of sample designs.
Usually they all fit into two categories: judgment
samples and probability samples.

Two types of sample designs
Probability Samples: Samples in which the elements to
be selected are drawn on the basis of probability. Each
element in a population has a certain probability of being
selected as part of the sample.
Judgment Samples: Samples that are selected on the
basis of being “typical”
– Items are selected that are representative of the
population. The validity of the results from a
judgment sample reflects the soundness of the
collector’s judgment.

Probability Sampling
Probability sampling includes:
• random sampling
• systematic sampling
• stratified sampling
• proportional sampling
• cluster sampling

Random Sampling
• A sample selected in such a way that every element in the population
has an equal probability of being chosen
• Equivalently, all samples of size n have an equal chance of being
selected
• Random samples are obtained either by sampling
– with replacement from a finite population, or
– without replacement from an infinite population
• Inherent in the concept of randomness: the next result (or occurrence) is not
predictable
Note: The proper procedure for selecting a random sample is to use a random number
generator or a table of random numbers

Example
• Example: An employer is interested in the time it takes
each employee to commute to work each morning. A
random sample of 35 employees will be selected and
their commuting times will be recorded.
1. There are 2712 employees
2. Each employee is numbered: 0001, 0002, 0003, etc., up
to 2712
3. Using four-digit random numbers, a sample is
identified: 1315, 0987, 1125, etc.
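
A random number generator can stand in for the table of random numbers in this example. The sketch below draws 35 of the 2712 employee numbers; it samples without replacement so that no employee is picked twice, which is an assumption of this sketch, and the seed is fixed only to make the run repeatable.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

N_EMPLOYEES = 2712
SAMPLE_SIZE = 35

# Employees are numbered 0001 through 2712.
employee_ids = range(1, N_EMPLOYEES + 1)

# Every employee has the same chance of being selected.
sample_ids = random.sample(employee_ids, SAMPLE_SIZE)

# Show the selected numbers as four-digit IDs.
print([f"{i:04d}" for i in sorted(sample_ids)])
```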

Systematic Sampling
A sample in which every kth item of the sampling frame is
selected, starting from the first element, which is randomly
selected from the first k elements
Note: The systematic technique is easy to execute. However, it has
some inherent dangers when the sampling frame is repetitive or
cyclical in nature. In these situations, the results may not
approximate a simple random sample.

Example
Suppose you want to obtain a systematic sample
of 8 houses from a street of 120 houses.
• First, since 120/8 = 15, choose a random starting point
between 1 and 15. Let’s say 11.
• Then, choose every 15th house starting from the 11th house.
The houses selected are
11, 26, 41, 56, 71, 86, 101, and 116.
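
The house example can be reproduced with a small sketch. The function name `systematic_sample` is hypothetical, and the sketch assumes the population size divides evenly by the sample size, as it does here (120/8 = 15).

```python
import random


def systematic_sample(population_size, sample_size, start=None):
    """Select every kth item, beginning at a random start among the first k."""
    k = population_size // sample_size  # sampling interval
    if start is None:
        start = random.randint(1, k)  # random start between 1 and k, inclusive
    return [start + i * k for i in range(sample_size)]


# Reproduce the example: 8 houses from 120, starting at house 11.
print(systematic_sample(120, 8, start=11))
# [11, 26, 41, 56, 71, 86, 101, 116]
```

Leaving `start` unset picks the starting house at random, which is what the procedure calls for in practice.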

Stratified Sampling
Steps:
• A sample obtained by stratifying or grouping the sampling
frame
• Then selecting a fixed number of items from each of the
strata/groups by means of a simple random sampling
technique
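
A minimal sketch of these two steps is shown below. The sampling frame (40 students labeled by class year) and the helper name `stratified_sample` are hypothetical, chosen only to illustrate grouping followed by a simple random sample of fixed size from each group.

```python
import random


def stratified_sample(frame, stratum_of, per_stratum, seed=None):
    """Group the frame into strata, then draw a fixed number from each stratum."""
    rng = random.Random(seed)
    strata = {}
    for item in frame:
        strata.setdefault(stratum_of(item), []).append(item)
    # Simple random sample (without replacement) within every stratum.
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}


# Hypothetical frame: 40 students, 10 in each class year.
years = ["freshman", "sophomore", "junior", "senior"] * 10
frame = [(f"student{i:02d}", year) for i, year in enumerate(years)]

sample = stratified_sample(frame, stratum_of=lambda s: s[1], per_stratum=3, seed=1)
for year, members in sample.items():
    print(year, [name for name, _ in members])
```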

Proportional (or Quota)
Sampling
Steps:
1. A sample is obtained by stratifying the sampling frame
2. Then a number of items in proportion to the
size of each stratum (or by quota) is selected from each stratum by
means of a simple random sampling technique

Example
Suppose that a company has 180 staff, made up of:
Male, full time 90
Male, part time 18
Female, full time 9
Female, part time 63
We are asked to take a proportional sample of 40 staff, stratified according to the
above categories.
• The first step is to calculate the percentage of staff in each group:
% male, full time = (90/180) x 100 = 0.5 x 100 = 50
% male, part time = (18/180) x 100 = 0.1 x 100 = 10
% female, full time = (9/180) x 100 = 0.05 x 100 = 5
% female, part time = (63/180) x 100 = 0.35 x 100 = 35
• This tells us that of our sample of 40, 50% should be male, full time. 10% should be
male, part time. 5% should be female, full time. 35% should be female, part time.
Therefore,
50% of 40 is 20.
10% of 40 is 4.
5% of 40 is 2.
35% of 40 is 14.
We need to select 20 full-time males, 4 part-time males, 2 full-time females,
and 14 part-time females.
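
The allocation arithmetic in this example can be checked with a short sketch. The group counts and the sample size of 40 come from the example; the use of `round` is an assumption that happens to be harmless here because every share works out to a whole number, whereas in general the rounded counts may need adjusting to add up to the sample size.

```python
staff_counts = {
    "Male, full time": 90,
    "Male, part time": 18,
    "Female, full time": 9,
    "Female, part time": 63,
}
SAMPLE_SIZE = 40

total_staff = sum(staff_counts.values())  # 180

# Each group's share of the sample matches its share of the staff.
allocation = {
    group: round(SAMPLE_SIZE * count / total_staff)
    for group, count in staff_counts.items()
}

print(allocation)
# {'Male, full time': 20, 'Male, part time': 4, 'Female, full time': 2, 'Female, part time': 14}
print(sum(allocation.values()))  # 40
```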

Cluster Sampling
Steps:
1. The sampling frame is first divided into
clusters
2. Then some clusters are randomly selected
3. Finally, the sample will include either all elements or a simple
random sample of some of the elements in each of the
clusters selected
Note: The difference between stratified and cluster sampling:
all strata are represented in the sample, but only a subset of clusters
is in the sample.
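
The sketch below illustrates the variant in which all elements of the selected clusters are included. The clusters (six schools with five students each) and the helper name `cluster_sample` are hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and keep every element in each of them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}


# Hypothetical frame already divided into clusters: 6 schools, 5 students each.
clusters = {f"school_{c}": [f"{c}{i}" for i in range(1, 6)] for c in "ABCDEF"}

selected = cluster_sample(clusters, n_clusters=2, seed=7)
for school, students in selected.items():
    print(school, students)
```

Only two of the six clusters appear in the sample, in contrast to stratified sampling, where every stratum contributes elements.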

Example
Step 1: Define your population
As with other forms of sampling,
you must first begin by clearly
defining the population you wish
to study.

Example
Step 2: Divide your population into clusters
The quality of your clusters and how well they
represent the larger population determines
the validity of your results.
• Each cluster’s population should be as
diverse as possible. You want every potential
characteristic of the entire population to be
represented in each cluster.
• Each cluster should have a similar
distribution of characteristics as the
distribution of the population as a whole.
• Taken together, the clusters should cover the
entire population.
• There should not be any overlap between clusters
(i.e. the same people or units do not appear
in more than one cluster).

Example
Step 3: Randomly select clusters to use as
your sample
• If each cluster is itself a mini-representation
of the larger population, randomly selecting
and sampling from the clusters allows you to
imitate simple random sampling, which in
turn supports the validity of your results.
• Conversely, if the clusters are not
representative, then random sampling will
allow you to gather data on a diverse array of
clusters, which should still provide you with
an overview of the population as a whole

Example
Step 4: Collect data from the sample
• You then conduct your study and collect data
from every unit in the selected clusters.

Probability & Statistics
• Probability is the science of making statements
about what will occur when samples are drawn
from a known population.
• Statistics is the science of organizing sample
data and making inferences about the unknown
population from which the sample is drawn.
Probability is a vehicle of statistics: it lets the accuracy of a statistical
inference from sample data to a population be justified by
its chance of occurring. That is, we want to know the chance that a
similar result would occur if the study were repeated many more times.

Comparison of Probability & Statistics
Probability: Properties of the population are
assumed known. Answer questions about
the sample based on these properties.
Statistics: Use information in the sample to
draw a conclusion about the population

Example
• Probability example: A jar of M&M’s contains 100 candy pieces, 15 of
which are red. A handful of 10 is selected.
Probability question:
What is the probability that 3 of the 10 selected are red?
• Statistics example: A handful of 10 M&M’s is selected from a jar
containing 1000 candy pieces. Three M&M’s in
the handful are red.
Statistics question:
What is the proportion of red M&M’s in the entire jar?
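
Both questions can be explored with a short sketch. For the probability question, the sketch assumes the handful is drawn at random without replacement, so the count of red candies follows a hypergeometric distribution. For the statistics question, it uses the sample proportion as a point estimate of the unknown proportion in the jar. The function name `hypergeom_pmf` is a hypothetical helper built on `math.comb` (Python 3.8+).

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(exactly k successes in n draws without replacement
    from N items, of which K are successes)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


# Probability question: the jar is known (100 pieces, 15 red); 10 are drawn.
p_three_red = hypergeom_pmf(3, N=100, K=15, n=10)
print(f"P(3 of 10 are red) = {p_three_red:.4f}")

# Statistics question: the jar is unknown; 3 of the 10 drawn are red.
sample_proportion = 3 / 10
print(sample_proportion)  # 0.3, a point estimate of the proportion in the jar
```

The first calculation reasons from a known population to a sample, and the second reasons from a sample back to an unknown population.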
