Sampling and Data
Dr. Rana - FGCU
What Is (Are?) Statistics?
• Statistics (the discipline) is the science of dealing with data
• It consists of tools and methods to
– collect data, organize data, and interpret the information or draw
conclusions from data
Note: Statistics (plural) sometimes refers to particular calculations
made from data. For instance, the mean, median, percentage, etc. are
statistics, since they are numbers calculated from a set of sample
data.
Basic Terms
• Population: A collection, or set, of individuals or objects
or events whose properties are to be analyzed
• Sample: A subset of the population
• Parameter: A numerical value summarizing all the data
of an entire population, for instance, a population mean
• Statistic: A numerical value summarizing the sample
data, for instance, a sample mean
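The difference between a parameter and a statistic can be sketched in a few lines of Python. The population below (commute times in minutes) is made up purely for illustration, and the seed and sizes are arbitrary assumptions.

```python
import random
import statistics

# Hypothetical population: commute times (minutes) for 1,000 people.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = [rng.uniform(5, 60) for _ in range(1000)]

# Parameter: summarizes the entire population.
population_mean = statistics.mean(population)

# Statistic: summarizes only a sample (a subset of the population).
sample = rng.sample(population, 50)
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two numbers are usually close but not equal; that gap is what inferential statistics has to account for.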
Two Areas of Statistics
Two areas of statistics:
• Descriptive Statistics: collection, presentation,
and description of sample data
• Inferential Statistics: making decisions and
drawing conclusions about populations
What is an Observational Unit?
• The person or thing on which a variable
is observed or measured,
– such as a student in the class, is called the
observational/experimental unit or simply a
case
What Are Data?
• Data can be numbers, record names, or
other labels recorded for the observational
unit
• Not all data represented by numbers are
numerical data
– e.g., 1 = male, 2 = female where 1 and 2 are
the indicators of gender
Data Tables
• The following data table clearly shows the
context of the data presented:
[Table: data table shown on the original slide]
• Notice that this data table tells us the
variables (columns) and observational units
(rows) for these data.
What is a Variable?
• Variables are characteristics recorded
about each individual or thing
• Each variable should have a name that
identifies what has been measured
What is Statistics Really About?
• Statistics is about variation
• Different observational units may have
different data values for a variable
• Statistics helps us to deal with variation in
order to make sense of data
Two kinds of Variables
• Qualitative, or Attribute, or Categorical Variable
– A variable that identifies a category for each case, for example, gender.
Note: Arithmetic operations, such as addition and averaging, are not
meaningful for data resulting from a qualitative variable
• Quantitative, or Numerical Variable
– A variable that records measurements or amounts of something and must have
measuring units, for example, height measured in inches.
Note: Arithmetic operations, such as addition and averaging, are meaningful for data
resulting from a quantitative variable
Subdividing Variables Further
• Qualitative and quantitative variables may be
further subdivided:
Variable
• Qualitative
– Nominal
e.g., blood type, zip code, gender, race, political party, etc.
– Ordinal
e.g., socioeconomic status (“low income”, “middle income”, “high income”),
education level (“high school”, “BS”, “MS”, “PhD”),
income level (“less than 50K”, “50K-100K”, “over 100K”),
satisfaction rating (“extremely dislike”, “dislike”, “neutral”,
“like”, “extremely like”)
• Quantitative
– Discrete
e.g., number of students present, number of red marbles in a jar
– Continuous
e.g., height of students in class, weight of students in class,
time it takes to get to school, distance traveled between classes
Key Definitions
• Nominal Variable: A qualitative variable that categorizes (or describes, or names) an
element of a population, for example, the color of a car purchased.
• Ordinal Variable: A qualitative variable that incorporates an ordered position, or
ranking, for instance, the variable Age recorded with three possible categories:
young, middle, and old.
• Discrete Variable: A quantitative variable that can assume a countable number of
values. Typically the values are counts, for example, the number of cars owned, so a
discrete variable assumes values corresponding to isolated points (such as the
integers) along a number line.
• Continuous Variable: A quantitative variable whose values are measurements, such as
height or weight. The precision of the recorded values depends on the
measuring scale used: a weight recorded as 120 lb may actually be
120.1 lb, 120.14 lb, or 120.143 lb if a more precise scale is used for
measuring. A continuous variable can therefore assume any value in an interval
along a number line, including every possible value between any two values.
Important Reminders!
• In many cases, a discrete and a continuous variable may
be distinguished by determining whether the variable
is related to a count or a measurement
– Discrete variables are usually associated with counting
– Continuous variables are usually associated with measurements
Example
• Example: In a student evaluation of instruction at most
universities, one question asks students to evaluate the
statement “The instructor was generally interested in teaching”
on the following scale:
1 = Disagree Strongly;
2 = Disagree;
3 = Neutral;
4 = Agree;
5 = Agree Strongly.
• Question: Is interest in teaching categorical or quantitative?
Example (cont.)
• Question: Is interest in teaching categorical or
quantitative?
• There is an order to these ratings, but adding or subtracting
two ratings has no meaning.
• We conclude that variables like interest in teaching are
categorical and, more specifically, ordinal.
Just because a variable’s values are numbers, don’t assume that it’s quantitative.
Data Collection
• First problem a statistician faces: how to obtain
the data
• Usually the data are sample data collected from a
portion of the population. It is important to obtain good
or representative sample data
• Statistical inferences about the population are made based
on statistics obtained from the sample data collected
Process of Data Collection
1. Define the objectives of the survey or experiment
– Example: The problem statement is to assign the new input data
point to one of the two classes (i.e., A or B)
2. Define the variable and population of interest
– Example: two classes of data, namely class A (squares) and Class B
(triangles), K number of Nearest Neighbors
3. Define the data-collection and data-measuring schemes. This
includes sampling procedures, sample size, and the data-measuring
device (questionnaire, scale, ruler, etc.)
4. Determine the appropriate descriptive or inferential data-analysis
techniques
KNN Algorithm Example
• Two classes of data
– A (squares) and B (triangles)
• The problem statement
– Assign the new input data point to one of the two classes by using the
KNN algorithm
• The first step is to define the value of ‘K’.
– But what does the ‘K’ in the KNN algorithm stand for?
– ‘K’ stands for the number of Nearest Neighbors
KNN Algorithm Example
• Suppose the value of ‘K’ is defined as 3.
– The algorithm will consider the three neighbors that are closest to the
new data point in order to decide the class of this new data point
• The closeness between the data points is calculated by
using measures such as Euclidean or Manhattan distance
• At ‘K’ = 3, the neighbors include two squares and one triangle.
– If I were to classify the new data point based on ‘K’ = 3, then it would
be assigned to Class A (squares).
KNN Algorithm Example
• What if the ‘K’ value is set to 7?
– Here, I’m basically telling my algorithm to look for the seven nearest
neighbors and classify the new data point into the class it is most
similar to.
• At ‘K’ = 7, the neighbors include 3 squares and 4 triangles.
– If I were to classify the new data point based on ‘K’ = 7, then it would
be assigned to Class B (triangles), since the majority of its neighbors
are of class B.
Considerations when
implementing KNN
• In the figure, we measure the
distance between P1 and P2
using the Euclidean distance
measure.
• The coordinates of P1 and P2 are (1,4)
and (5,1), respectively.
• The Euclidean distance is
calculated as:
d(P1, P2) = sqrt((5 − 1)^2 + (1 − 4)^2) = sqrt(16 + 9) = 5
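As a quick check of the distance between P1 = (1,4) and P2 = (5,1), here is a minimal sketch of the two distance measures mentioned earlier. The function names `euclidean` and `manhattan` are illustrative choices, not part of any library.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

p1, p2 = (1, 4), (5, 1)
print(euclidean(p1, p2))  # 5.0
print(manhattan(p1, p2))  # 7
```

The two measures can rank neighbors differently, so the choice of metric is part of the model.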
Steps in KNN
• Step 1: Calculate Euclidean Distance
• Step 2: Get Nearest Neighbors
• Step 3: Make Predictions
https://www.askpython.com/python/examples/k-nearest-neighbors-from-scratch
Steps in KNN using Numpy
• Step 1. Choose an appropriate distance metric to calculate the
distance between the data points.
• Step 2. Store the distances in an array and sort them in
ascending order while preserving the original indices (for example, with the
NumPy argsort method).
• Step 3. Select the first K elements of the sorted list.
• Step 4. Perform majority voting: the class with the maximum
number of occurrences among the K neighbors is assigned to the data
point being classified.
https://www.askpython.com/python/examples/k-nearest-neighbors-from-scratch
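The four steps above can be sketched in plain Python. This is not the code from the linked article: it uses only the standard library, with `sorted` over indices playing the role that NumPy's argsort plays in Step 2. The function name `knn_predict` and the toy points are assumptions, chosen so that K = 3 and K = 7 behave like the squares-and-triangles example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from the query to every training point.
    distances = [math.dist(p, query) for p in train_points]
    # Step 2: indices ordered by ascending distance (what argsort would return).
    order = sorted(range(len(distances)), key=distances.__getitem__)
    # Step 3: keep the indices of the k closest points.
    nearest = order[:k]
    # Step 4: majority vote over the labels of those neighbors.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: class A (squares) and class B (triangles).
points = [(1, 2), (3, 3), (2, 0), (4, 4), (5, 2), (5, 3), (0, 5)]
labels = ["A", "B", "A", "B", "B", "B", "A"]
query = (2, 2)

print(knn_predict(points, labels, query, 3))  # A  (two A, one B among the 3 nearest)
print(knn_predict(points, labels, query, 7))  # B  (three A, four B among all 7)
```

With two classes, an odd K avoids tied votes.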
Methods Used to Collect Data
Data can be collected by performing an experiment, a
survey, or a census:
Experiment: The investigator controls or modifies the
environment and observes the effect on the variable under
study
Census: A 100% survey. Every element of the population is
listed. Seldom used: difficult and time-consuming to compile,
and expensive.
Survey: Data are obtained by sampling some of the
population of interest. The investigator does not modify the
environment.
Biased Sampling
Biased Sampling Method: A sampling method that produces data
which systematically differ from the sampled population
Unbiased Sampling Method: A sampling method that is not biased,
i.e., one that does not produce such systematic differences
Sampling methods that often result in biased samples:
• Volunteer sample: sample collected from those elements
of the population which chose to contribute the needed
information on their own initiative
• Convenience sample: sample selected from elements of
a population that are easily accessible
Sampling Frame: A list of the elements belonging to the
population from which the sample will be drawn
Note: It is important that the sampling frame be representative
of the population
Sample Design: The process of selecting sample elements
from the sampling frame
Note: There are many different types of sample designs.
Usually they all fit into two categories: judgment
samples and probability samples.
Two types of sample designs
Probability Samples: Samples in which the elements to
be selected are drawn on the basis of probability. Each
element in a population has a certain probability of being
selected as part of the sample.
Judgment Samples: Samples that are selected on the
basis of being “typical”
– Items are selected that are representative of the
population. The validity of the results from a
judgment sample reflects the soundness of the
collector’s judgment.
Probability Sampling
Probability sampling includes
random sampling
systematic sampling
stratified sampling
proportional sampling
cluster sampling
Random Sampling
• A sample selected in such a way that every element in the population
has an equal probability of being chosen
• For a simple random sample, all samples of size n also have an equal
chance of being selected
• Random samples are obtained by sampling either
– with replacement from a finite population or without replacement from an infinite
population
Notes:
• Inherent in the concept of randomness: the next result (or occurrence) is not
predictable
• Proper procedure for selecting a random sample: use a random number
generator or a table of random numbers
Example
• Example: An employer is interested in the time it takes
each employee to commute to work each morning. A
random sample of 35 employees will be selected and
their commuting times will be recorded.
1. There are 2712 employees
2. Each employee is numbered: 0001, 0002, 0003, etc., up
to 2712
3. Using four-digit random numbers, a sample is
identified: 1315, 0987, 1125, etc.
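A random number generator can take the place of the table of random numbers. The sketch below draws 35 distinct employee numbers from 1 to 2712. The seed is an arbitrary assumption, used only to make the run repeatable.

```python
import random

N_EMPLOYEES = 2712   # from the example
SAMPLE_SIZE = 35     # from the example

rng = random.Random(2024)  # arbitrary seed, for repeatability only
# random.sample draws without replacement, so no employee is picked twice.
chosen = rng.sample(range(1, N_EMPLOYEES + 1), SAMPLE_SIZE)

# Show the IDs as four-digit labels, e.g. 0987.
print([f"{i:04d}" for i in sorted(chosen)])
```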
Systematic Sampling
A sample in which every kth item of the sampling frame is
selected, starting from a first element that is randomly
selected from the first k elements
Note: The systematic technique is easy to execute. However, it has
some inherent dangers when the sampling frame is repetitive or
cyclical in nature. In these situations, the results may not
approximate a simple random sample.
Example
Suppose you want to obtain a systematic sample
of 8 houses from a street of 120 houses.
• First, since 120/8 = 15, choose a random starting point
between 1 and 15. Let’s say, 11.
• Then, choose every 15th house after the 11th house.
The houses selected are
11, 26, 41, 56, 71, 86, 101, and 116.
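The house example can be reproduced with a short sketch. The function name `systematic_sample` and its parameters are illustrative assumptions, and it assumes the frame size is a multiple of the sample size, as it is here.

```python
import random

def systematic_sample(frame_size, sample_size, start=None, rng=random):
    """Return 1-based positions: a random start in 1..k, then every k-th item."""
    k = frame_size // sample_size      # sampling interval (15 in the example)
    if start is None:
        start = rng.randint(1, k)      # random start among the first k elements
    return [start + i * k for i in range(sample_size)]

print(systematic_sample(120, 8, start=11))
# [11, 26, 41, 56, 71, 86, 101, 116]
```

Calling `systematic_sample(120, 8)` without `start` picks the starting house at random, which is what the method requires in practice.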
Stratified Sampling
Steps:
1. Stratify (group) the sampling
frame
2. Select a fixed number of items from each of the
strata/groups by means of a simple random sampling
technique
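A minimal sketch of stratified sampling, assuming a made-up frame of 200 students grouped by class year. The helper `stratified_sample` and the data are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(frame, stratum_of, per_stratum, rng):
    """Draw `per_stratum` items at random from every stratum of `frame`."""
    strata = defaultdict(list)
    for item in frame:
        strata[stratum_of(item)].append(item)   # group the frame into strata
    # Simple random sample (without replacement) inside each stratum.
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

# Hypothetical frame: (student_id, class_year) pairs.
years = ["freshman", "sophomore", "junior", "senior"]
frame = [(i, years[i % 4]) for i in range(200)]

picked = stratified_sample(frame, lambda s: s[1], 5, random.Random(1))
for year, students in picked.items():
    print(year, [sid for sid, _ in students])
```

Every stratum contributes the same number of items, regardless of its size.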
Proportional (or Quota)
Sampling
Steps:
1. Stratify the sampling frame
2. Select a number of items from each stratum in proportion to the
size of the stratum (or by quota) by
means of a simple random sampling technique
Example
Suppose that a company has 180 staff, made up as follows:
Male, full time 90
Male, part time 18
Female, full time 9
Female, part time 63
We are asked to take a proportional sample of 40 staff, stratified according to the
above categories.
• The first step is to calculate the percentage of staff in each group:
% male, full time = (90/180) x 100 = 0.5 x 100 = 50
% male, part time = (18/180) x 100 = 0.1 x 100 = 10
% female, full time = (9/180) x 100 = 0.05 x 100 = 5
% female, part time = (63/180) x 100 = 0.35 x 100 = 35
• This tells us that of our sample of 40, 50% should be male, full time. 10% should be
male, part time. 5% should be female, full time. 35% should be female, part time.
Therefore,
50% of 40 is 20.
10% of 40 is 4.
5% of 40 is 2.
35% of 40 is 14.
We need to select 20 full time males, 4 part time males, 2 full time females,
and 14 part time females.
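The proportional allocation for the 180 staff can be checked with a few lines. The staff counts and the sample size of 40 come from the example; the variable names are illustrative.

```python
strata_sizes = {
    "male, full time": 90,
    "male, part time": 18,
    "female, full time": 9,
    "female, part time": 63,
}
sample_size = 40
total = sum(strata_sizes.values())  # 180 staff in all

# Each stratum's share of the sample matches its share of the population.
allocation = {name: round(sample_size * size / total)
              for name, size in strata_sizes.items()}

print(allocation)
# {'male, full time': 20, 'male, part time': 4, 'female, full time': 2, 'female, part time': 14}
```

Here every share is a whole number. In general, rounded allocations may not add up to the sample size and need a small adjustment.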
Cluster Sampling
Steps:
1. Divide the sampling frame into
clusters
2. Randomly select some of the clusters
3. The sample then includes either all elements or a simple
random sample of some of the elements in each of the
clusters selected
Note: The difference between stratified and cluster sampling:
all strata are represented in the sample, but only a subset of the clusters
is in the sample.
Example
Step 1: Define your population
As with other forms of sampling,
you must first begin by clearly
defining the population you wish
to study.
Example
Step 2: Divide your population into clusters
The quality of your clusters and how well they
represent the larger population determines
the validity of your results.
• Each cluster’s population should be as
diverse as possible. You want every potential
characteristic of the entire population to be
represented in each cluster.
• Each cluster should have a similar
distribution of characteristics as the
distribution of the population as a whole.
• Taken together, the clusters should cover the
entire population.
• There should not be any overlap between clusters
(i.e. the same people or units do not appear
in more than one cluster).
Example
Step 3: Randomly select clusters to use as
your sample
• If each cluster is itself a mini-representation
of the larger population, randomly selecting
and sampling from the clusters allows you to
imitate simple random sampling, which in
turn supports the validity of your results.
• Conversely, if the clusters are not
representative, then random sampling will
allow you to gather data on a diverse array of
clusters, which should still provide you with
an overview of the population as a whole
Example
Step 4: Collect data from the sample
• You then conduct your study and collect data
from every unit in the selected clusters.
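Steps 3 and 4 can be sketched as follows, assuming a made-up population of six schools (the clusters) with four students each. The helper `cluster_sample` is hypothetical, and it keeps every unit in each selected cluster.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters and return every unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)        # Step 3: pick clusters
    units = [u for name in chosen for u in clusters[name]]   # Step 4: all their units
    return chosen, units

# Hypothetical population: 6 schools (clusters) of 4 students each.
clusters = {f"school_{s}": [f"s{s}_student_{i}" for i in range(4)]
            for s in range(6)}

chosen, units = cluster_sample(clusters, 2, random.Random(7))
print(chosen)
print(units)
```

Only the selected clusters appear in the sample. A stratified sample would instead draw from every group.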
Probability & Statistics
• Probability is the science of making statements
about what will occur when samples are drawn
from a known population.
• Statistics is the science of organizing sample
data and making inferences about the unknown
population from which the sample is drawn.
Probability is a vehicle of statistics: it lets us justify the accuracy of a
statistical inference from sample data to a population in terms of
its chance of occurring. That is, we want to know the chance that a
similar result will occur if the study is repeated many more times.
Comparison of Probability & Statistics
Probability: Properties of the population are
assumed known. Answer questions about
the sample based on these properties.
Statistics: Use information in the sample to
draw a conclusion about the population
Example
• Probability example: A jar of M&M’s contains 100 candy pieces, 15
of which are red. A handful of 10 is selected.
Probability question:
What is the probability that 3 of the 10 selected are red?
• Statistics example: A handful of 10 M&M’s is selected from a jar
containing 1000 candy pieces. Three M&M’s in
the handful are red.
Statistics question:
What is the proportion of red M&M’s in the entire jar?
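Both questions can be sketched numerically. For the probability question, this sketch assumes the handful is drawn without replacement from the jar of 100 with 15 red, which gives a hypergeometric probability. For the statistics question, it uses the sample proportion as a simple point estimate. The function name `prob_red` is an illustrative choice.

```python
from math import comb

def prob_red(k, jar=100, red=15, handful=10):
    """P(exactly k red) when `handful` pieces are drawn without replacement."""
    return comb(red, k) * comb(jar - red, handful - k) / comb(jar, handful)

# Probability question: the jar's contents are known; ask about the sample.
p3 = prob_red(3)
print(f"P(3 red in a handful of 10) = {p3:.4f}")

# Statistics question: the sample is known; estimate the jar.
estimate = 3 / 10   # sample proportion of red pieces
print(f"estimated proportion of red in the jar = {estimate}")
```

The estimate is only a starting point. How far it may be from the true proportion is what inferential statistics quantifies.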

More Related Content

Similar to Sampling and Data Analysis Essentials

Chapter-one.pptx
Chapter-one.pptxChapter-one.pptx
Chapter-one.pptxAbebeNega
 
Business Stats PPT1.pptx
Business Stats PPT1.pptxBusiness Stats PPT1.pptx
Business Stats PPT1.pptxGautamGulati24
 
Statistical techniques for interpreting and reporting quantitative data i
Statistical techniques for interpreting and reporting quantitative data   iStatistical techniques for interpreting and reporting quantitative data   i
Statistical techniques for interpreting and reporting quantitative data iVijayalakshmi Murugesan
 
Probability_and_Statistics_lecture_notes_1.pptx
Probability_and_Statistics_lecture_notes_1.pptxProbability_and_Statistics_lecture_notes_1.pptx
Probability_and_Statistics_lecture_notes_1.pptxAliMurat5
 
Sampling-A compact study of different types of sample
Sampling-A compact study of different types of sampleSampling-A compact study of different types of sample
Sampling-A compact study of different types of sampleAsith Paul.K
 
Levels of measurement
Levels of measurementLevels of measurement
Levels of measurementdebmahuya
 
STAT 1 - Basic-Concepts-in-Statistics.pptx
STAT 1 - Basic-Concepts-in-Statistics.pptxSTAT 1 - Basic-Concepts-in-Statistics.pptx
STAT 1 - Basic-Concepts-in-Statistics.pptxJerryJunCuizon
 
1.-Lecture-Notes-in-Statistics-POWERPOINT.pptx
1.-Lecture-Notes-in-Statistics-POWERPOINT.pptx1.-Lecture-Notes-in-Statistics-POWERPOINT.pptx
1.-Lecture-Notes-in-Statistics-POWERPOINT.pptxAngelineAbella2
 
Methods of data collection
Methods of data collectionMethods of data collection
Methods of data collectionYogeshSorot
 
Chapter 11 Data Analysis Classification and Tabulation
Chapter 11 Data Analysis Classification and TabulationChapter 11 Data Analysis Classification and Tabulation
Chapter 11 Data Analysis Classification and TabulationInternational advisers
 
Levels of Measurement
Levels of MeasurementLevels of Measurement
Levels of MeasurementSarfraz Ahmad
 
Levels of measurement
Levels of measurementLevels of measurement
Levels of measurementSarfraz Ahmad
 

Similar to Sampling and Data Analysis Essentials (20)

Chapter-one.pptx
Chapter-one.pptxChapter-one.pptx
Chapter-one.pptx
 
Unit 1 - Statistics (Part 1).pptx
Unit 1 - Statistics (Part 1).pptxUnit 1 - Statistics (Part 1).pptx
Unit 1 - Statistics (Part 1).pptx
 
Business Stats PPT1.pptx
Business Stats PPT1.pptxBusiness Stats PPT1.pptx
Business Stats PPT1.pptx
 
Statistical techniques for interpreting and reporting quantitative data i
Statistical techniques for interpreting and reporting quantitative data   iStatistical techniques for interpreting and reporting quantitative data   i
Statistical techniques for interpreting and reporting quantitative data i
 
RM7.ppt
RM7.pptRM7.ppt
RM7.ppt
 
01 Introduction (1).pptx
01 Introduction (1).pptx01 Introduction (1).pptx
01 Introduction (1).pptx
 
Probability_and_Statistics_lecture_notes_1.pptx
Probability_and_Statistics_lecture_notes_1.pptxProbability_and_Statistics_lecture_notes_1.pptx
Probability_and_Statistics_lecture_notes_1.pptx
 
statistics Lesson 1
statistics Lesson 1statistics Lesson 1
statistics Lesson 1
 
Statistics.pptx
Statistics.pptxStatistics.pptx
Statistics.pptx
 
Data Analysis Introduction.pptx
Data Analysis Introduction.pptxData Analysis Introduction.pptx
Data Analysis Introduction.pptx
 
Sampling-A compact study of different types of sample
Sampling-A compact study of different types of sampleSampling-A compact study of different types of sample
Sampling-A compact study of different types of sample
 
Levels of measurement
Levels of measurementLevels of measurement
Levels of measurement
 
STAT 1 - Basic-Concepts-in-Statistics.pptx
STAT 1 - Basic-Concepts-in-Statistics.pptxSTAT 1 - Basic-Concepts-in-Statistics.pptx
STAT 1 - Basic-Concepts-in-Statistics.pptx
 
BMS.ppt
BMS.pptBMS.ppt
BMS.ppt
 
1.-Lecture-Notes-in-Statistics-POWERPOINT.pptx
1.-Lecture-Notes-in-Statistics-POWERPOINT.pptx1.-Lecture-Notes-in-Statistics-POWERPOINT.pptx
1.-Lecture-Notes-in-Statistics-POWERPOINT.pptx
 
Statistics
StatisticsStatistics
Statistics
 
Methods of data collection
Methods of data collectionMethods of data collection
Methods of data collection
 
Chapter 11 Data Analysis Classification and Tabulation
Chapter 11 Data Analysis Classification and TabulationChapter 11 Data Analysis Classification and Tabulation
Chapter 11 Data Analysis Classification and Tabulation
 
Levels of Measurement
Levels of MeasurementLevels of Measurement
Levels of Measurement
 
Levels of measurement
Levels of measurementLevels of measurement
Levels of measurement
 

Recently uploaded

(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts ServiceCall Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Servicejennyeacort
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationBoston Institute of Analytics
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...shivangimorya083
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 

Recently uploaded (20)

(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts ServiceCall Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health Classification
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 

Sampling and Data Analysis Essentials

  • 1. Sampling and Data Dr. Rana - FGCU
  • 2. What Is (Are?) Statistics? • Statistics (a discipline) is a science of dealing with data • It consists of tools and methods to – collect data, organize data, and interpret the information or draw conclusion from data Note: Statistics (plural) sometimes are referred to particular calculations made from data. For instance, mean, median, percentage, etc. are statistics, since these are numbers calculated from a set of sample data collected.
  • 3. Basic Terms • Population: A collection, or set, of individuals or objects or events whose properties are to be analyzed • Sample: A subset of the population • Parameter: A numerical value summarizing all the data of an entire population, for instance, a population mean • Statistic: A numerical value summarizing the sample data, for instance, a sample mean
  • 4. Two Areas of Statistics Two areas of statistics: • Descriptive Statistics: collection, presentation, and description of sample data • Inferential Statistics: making decisions and drawing conclusions about populations
  • 5. What is an Observational Unit? • The person or thing on which a variable is observed or measured, such as a student in the class, is called the observational/experimental unit or simply a case
  • 6. What Are Data? • Data can be numbers, record names, or other labels recorded for the observational unit • Not all data represented by numbers are numerical data – e.g., 1 = male, 2 = female where 1 and 2 are the indicators of gender
  • 7. Data Tables • The following data table clearly shows the context of the data presented: • Notice that this data table tells us the variables (column) and observational units (row) for these data.
  • 8. What is a Variable? • Variables are characteristics recorded about each individual or thing • Each variable should have a name that identifies what has been measured
  • 9. What is Statistics Really About? • Statistics is about variation • Different observational units may have different data values for a variable • Statistics helps us to deal with variation in order to make sense of data
  • 10. Two Kinds of Variables • Qualitative, or Attribute, or Categorical Variable – A variable that identifies a category for each case, for example, gender. Note: Arithmetic operations, such as addition and averaging, are not meaningful for data resulting from a qualitative variable • Quantitative, or Numerical Variable – A variable that records measurements or amounts of something and must have measuring units, for example, height measured in inches. Note: Arithmetic operations, such as addition and averaging, are meaningful for data resulting from a quantitative variable
  • 11. Subdividing Variables Further • Qualitative and quantitative variables may be further subdivided: • Qualitative – Nominal, e.g., blood type, zip code, gender, race, political party, etc. – Ordinal, e.g., socioeconomic status (“low income”, “middle income”, “high income”), education level (“high school”, “BS”, “MS”, “PhD”), income level (“less than 50K”, “50K-100K”, “over 100K”), satisfaction rating (“extremely dislike”, “dislike”, “neutral”, “like”, “extremely like”) • Quantitative – Discrete, e.g., number of students present, number of red marbles in a jar – Continuous, e.g., height of students in class, weight of students in class, time it takes to get to school, distance traveled between classes
  • 12. Key Definitions • Nominal Variable: A qualitative variable that categorizes (or describes, or names) an element of a population, for example, color of a car purchased. • Ordinal Variable: A qualitative variable that incorporates an ordered position, or ranking, for instance, the variable Age recorded with three possible categories: young, middle, and old. • Discrete Variable: A quantitative variable that can assume a countable number of values. That is, the values are counts, for example, number of cars owned. So, a discrete variable can assume values corresponding to integer values along a number line. • Continuous Variable: A quantitative variable whose values are measurements, such as height or weight. The precision of the recorded values depends on the measuring scale used. For example, a recorded weight of 120 lb may actually be 120.1 lb, 120.14 lb, or 120.143 lb if a more precise scale is used. Therefore, a continuous variable can assume any value in an interval along a number line, including every possible value between any two values.
  • 13. Important Reminders! • In many cases, discrete and continuous variables may be distinguished by determining whether the variable is related to a count or a measurement • Discrete variables are usually associated with counting • Continuous variables are usually associated with measurements
  • 14. Example • Example: In a student evaluation of instruction at most universities, one question asks students to evaluate the statement “The instructor was generally interested in teaching” on the following scale: 1 = Disagree Strongly; 2 = Disagree; 3 = Neutral; 4 = Agree; 5 = Agree Strongly. • Question: Is interest in teaching categorical or quantitative?
  • 15. Example (cont.) • Question: Is interest in teaching categorical or quantitative? • There is an order to these ratings, but adding or subtracting two ratings has no meaning. • We conclude that variables like interest in teaching are categorical and, more specifically, ordinal. Just because a variable’s values are numbers, don’t assume that it’s quantitative.
  • 16. Data Collection • First problem a statistician faces: how to obtain the data • Usually the data are sample data collected from a portion of the population. It is important to obtain good or representative sample data • Statistical Inferences to the population are made based on statistics obtained from the sample data collected
  • 17. Process of Data Collection 1. Define the objectives of the survey or experiment – Example: The problem statement is to assign the new input data point to one of the two classes (i.e., A or B) 2. Define the variable and population of interest – Example: two classes of data, namely class A (squares) and class B (triangles), and K, the number of nearest neighbors 3. Define the data-collection and data-measuring schemes. This includes sampling procedures, sample size, and the data-measuring device (questionnaire, scale, ruler, etc.) 4. Determine the appropriate descriptive or inferential data-analysis techniques
  • 18. KNN Algorithm Example • Two classes of data – A (squares) and B (triangles) • The problem statement – Assign the new input data point to one of the two classes by using the KNN algorithm • The first step is to define the value of ‘K’. – But what does the ‘K’ in the KNN algorithm stand for? – ‘K’ stands for the number of Nearest Neighbors
  • 19. KNN Algorithm Example • Define the value of ‘K’ as 3. – The algorithm will consider the three neighbors that are closest to the new data point in order to decide the class of this new data point • The closeness between the data points is calculated by using measures such as Euclidean or Manhattan distance • At ‘K’ = 3, the neighbors include two squares and one triangle. – If I were to classify the new data point based on ‘K’ = 3, then it would be assigned to Class A (squares).
  • 20. KNN Algorithm Example • What if the ‘K’ value is set to 7? – Here, I’m basically telling my algorithm to look for the seven nearest neighbors and classify the new data point into the class it is most similar to. • At ‘K’ = 7, the neighbors include three squares and four triangles. – If I were to classify the new data point based on ‘K’ = 7, then it would be assigned to Class B (triangles), since the majority of its neighbors are of class B.
  • 22. Considerations while implementing KNN • In the image, we measure the distance between P1 and P2 by using the Euclidean distance measure. • The coordinates for P1 and P2 are (1,4) and (5,1) respectively. • The Euclidean distance can be calculated as: d = sqrt((5 - 1)^2 + (1 - 4)^2) = sqrt(16 + 9) = 5
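As a small check of the distance between P1 = (1, 4) and P2 = (5, 1), the following Python sketch computes the Euclidean distance alongside the Manhattan distance mentioned on the earlier KNN slide. The helper names `euclidean` and `manhattan` are illustrative and do not come from the slides.

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

p1, p2 = (1, 4), (5, 1)
print(euclidean(p1, p2))  # sqrt(4**2 + 3**2) = 5.0
print(manhattan(p1, p2))  # 4 + 3 = 7
```

The two measures can rank neighbors differently, so the choice of metric is part of the model, not just an implementation detail.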
  • 23. Steps in KNN • Step 1: Calculate Euclidean Distance • Step 2: Get Nearest Neighbors • Step 3: Make Predictions https://www.askpython.com/python/examples/k-nearest-neighbors-from-scratch
  • 24. Steps in KNN using Numpy • Step 1. Figure out an appropriate distance metric to calculate the distance between the data points. • Step 2. Store the distance in an array and sort it according to the ascending order of their distances (preserving the index i.e. can use NumPy argsort method). • Step 3. Select the first K elements in the sorted list. • Step 4. Perform the majority Voting and the class with the maximum number of occurrences will be assigned as the new class for the data point to be classified. https://www.askpython.com/python/examples/k-nearest-neighbors-from-scratch
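The four steps above can be sketched with NumPy roughly as follows. This is a minimal illustration, not the implementation from the linked article: the function name `knn_predict` and the seven training points are made up, chosen so that K = 3 and K = 7 reproduce the squares-versus-triangles outcome described on the earlier slides.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Steps 2 and 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote over the labels of those neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical data: class "A" = squares, class "B" = triangles
X_train = np.array([[1.0, 0.0], [0.0, 1.2], [3.0, 0.0],               # A
                    [1.5, 0.0], [0.0, 2.0], [2.2, 0.0], [0.0, 2.5]])  # B
y_train = np.array(["A", "A", "A", "B", "B", "B", "B"])
x_new = np.array([0.0, 0.0])

print(knn_predict(X_train, y_train, x_new, k=3))  # two A, one B among the 3 nearest
print(knn_predict(X_train, y_train, x_new, k=7))  # three A, four B among all 7
```

An odd K avoids ties when there are two classes; with an even K, this sketch would silently resolve a tie in favor of the label that sorts first.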
  • 25. Methods Used to Collect Data Data can be collected through performing an Experiment or survey or census: Experiment: The investigator controls or modifies the environment and observes the effect on the variable under study Census: A 100% survey. Every element of the population is listed. Seldom used: difficult and time-consuming to compile, and expensive. Survey: Data are obtained by sampling some of the population of interest. The investigator does not modify the environment.
  • 27. Biased Sampling Biased Sampling Method: A sampling method that produces data which systematically differ from the sampled population. An unbiased sampling method is one that does not. Sampling methods that often result in biased samples: • Volunteer sample: sample collected from those elements of the population which chose to contribute the needed information on their own initiative • Convenience sample: sample selected from elements of a population that are easily accessible
  • 28. Sample Design Sampling Frame: A list of the elements belonging to the population from which the sample will be drawn Note: It is important that the sampling frame be representative of the population Sample Design: The process of selecting sample elements from the sampling frame Note: There are many different types of sample designs. Usually they all fit into two categories: judgment samples and probability samples.
  • 29. Two types of sample designs Probability Samples: Samples in which the elements to be selected are drawn on the basis of probability. Each element in a population has a certain probability of being selected as part of the sample. Judgment Samples: Samples that are selected on the basis of being “typical” – Items are selected that are representative of the population. The validity of the results from a judgment sample reflects the soundness of the collector’s judgment.
  • 30. Probability Sampling Probability sampling includes • random sampling • systematic sampling • stratified sampling • proportional sampling • cluster sampling
  • 31. Random Sampling • A sample selected in such a way that every element in the population has an equal probability of being chosen • Equivalently, all samples of size n have an equal chance of being selected • Random samples are obtained either by sampling with replacement from a finite population or by sampling without replacement from an infinite population Notes: • Inherent in the concept of randomness: the next result (or occurrence) is not predictable • Proper procedure for selecting a random sample: use a random number generator or a table of random numbers
  • 32. Example • Example: An employer is interested in the time it takes each employee to commute to work each morning. A random sample of 35 employees will be selected and their commuting times will be recorded. 1. There are 2712 employees 2. Each employee is numbered: 0001, 0002, 0003, etc., up to 2712 3. Using four-digit random numbers, a sample is identified: 1315, 0987, 1125, etc.
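The commuting-time example can be sketched with Python's standard `random` module in place of a table of random numbers. The seed value and variable names are arbitrary choices for illustration; only the population size (2712) and sample size (35) come from the slide.

```python
import random

N_EMPLOYEES = 2712   # employees numbered 0001 .. 2712
SAMPLE_SIZE = 35

rng = random.Random(2024)  # arbitrary seed, used only to make the run repeatable

# Draw 35 distinct employee numbers; every employee is equally likely to be chosen
sample = rng.sample(range(1, N_EMPLOYEES + 1), SAMPLE_SIZE)

# Format as four-digit labels, as on the slide (e.g. 0987)
labels = [f"{i:04d}" for i in sorted(sample)]
print(labels[:5])
```

`random.sample` draws without replacement, so no employee can be selected twice.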
  • 33. Systematic Sampling A sample in which every kth item of the sampling frame is selected, starting from a first element that is randomly selected from the first k elements Note: The systematic technique is easy to execute. However, it has some inherent dangers when the sampling frame is repetitive or cyclical in nature. In these situations, the results may not approximate a simple random sample.
  • 34. Example Suppose you want to obtain a systematic sample of 8 houses from a street of 120 houses. • First, since 120/8 = 15, choose a random starting point between 1 and 15, say 11. • Then, choose every 15th house after the 11th house. The houses selected are 11, 26, 41, 56, 71, 86, 101, and 116.
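The house example can be reproduced with a short Python sketch. The function name `systematic_sample` is illustrative, and the sketch assumes the frame size is an exact multiple of the sample size, as it is here (120/8 = 15).

```python
import random

def systematic_sample(frame_size, sample_size, start=None, rng=random):
    """Select every k-th item, beginning at a random start within the first k items."""
    k = frame_size // sample_size      # sampling interval
    if start is None:
        start = rng.randint(1, k)      # random start between 1 and k, inclusive
    return [start + i * k for i in range(sample_size)]

# Fixing the start at 11 reproduces the list on the slide
print(systematic_sample(120, 8, start=11))
# [11, 26, 41, 56, 71, 86, 101, 116]
```

Leaving `start` unset draws the starting house at random, which is the only random step in the procedure.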
  • 35. Stratified Sampling Steps: • A sample is obtained by stratifying or grouping the sampling frame • Then a fixed number of items is selected from each of the strata/groups by means of a simple random sampling technique
  • 36. Proportional (or Quota) Sampling Steps: 1. A sample is obtained by stratifying the sampling frame 2. Then a number of items in proportion to the size of each stratum (or by quota) is selected from each stratum by means of a simple random sampling technique
  • 37. Example Suppose that a company has 180 staff: male, full time: 90; male, part time: 18; female, full time: 9; female, part time: 63. We are asked to take a proportional sample of 40 staff, stratified according to the above categories. • The first step is to calculate the percentage of staff in each group: % male, full time = (90/180) x 100 = 0.5 x 100 = 50 % male, part time = (18/180) x 100 = 0.1 x 100 = 10 % female, full time = (9/180) x 100 = 0.05 x 100 = 5 % female, part time = (63/180) x 100 = 0.35 x 100 = 35 • This tells us that of our sample of 40, 50% should be male, full time; 10% should be male, part time; 5% should be female, full time; and 35% should be female, part time. • 50% of 40 is 20, 10% of 40 is 4, 5% of 40 is 2, and 35% of 40 is 14. We therefore need to select 20 full-time males, 4 part-time males, 2 full-time females, and 14 part-time females.
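The allocation arithmetic in this example can be checked with a few lines of Python. The dictionary keys are just labels for the four groups on the slide; the rounding step is an assumption that works here because every share is a whole number, and in general rounded shares may not add up exactly to the sample size.

```python
# Stratum sizes from the slide (180 staff in total)
strata = {
    "male, full time": 90,
    "male, part time": 18,
    "female, full time": 9,
    "female, part time": 63,
}
sample_size = 40
total = sum(strata.values())

# Each stratum contributes in proportion to its share of the population
allocation = {group: round(sample_size * n / total) for group, n in strata.items()}

for group, n in allocation.items():
    print(f"{group}: {n}")
print("total:", sum(allocation.values()))
```

Within each group, the allocated number of staff would then be drawn by simple random sampling.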
  • 38. Cluster Sampling Steps: 1. A sample is obtained by first dividing the sampling frame into clusters 2. Then some clusters are selected at random 3. Finally, the sample includes either all elements or a simple random sample of some of the elements in each of the selected clusters Note: The difference between stratified and cluster sampling: all strata are represented in the sample, but only a subset of the clusters is.
  • 39. Example Step 1: Define your population As with other forms of sampling, you must first begin by clearly defining the population you wish to study.
  • 40. Example Step 2: Divide your population into clusters The quality of your clusters and how well they represent the larger population determines the validity of your results. • Each cluster’s population should be as diverse as possible. You want every potential characteristic of the entire population to be represented in each cluster. • Each cluster should have a similar distribution of characteristics as the distribution of the population as a whole. • Taken together, the clusters should cover the entire population. • There should not be any overlap between clusters (i.e. the same people or units do not appear in more than one cluster).
  • 41. Example Step 3: Randomly select clusters to use as your sample • If each cluster is itself a mini-representation of the larger population, randomly selecting and sampling from the clusters allows you to imitate simple random sampling, which in turn supports the validity of your results. • Conversely, if the clusters are not representative, then random sampling will allow you to gather data on a diverse array of clusters, which should still provide you with an overview of the population as a whole
  • 42. Example Step 4: Collect data from the sample • You then conduct your study and collect data from every unit in the selected clusters.
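Steps 1 to 4 can be sketched in Python as single-stage cluster sampling. The population below (10 clusters of 20 units each), the number of clusters drawn, and the seed are all hypothetical values chosen for illustration.

```python
import random

rng = random.Random(7)  # arbitrary seed so the run is repeatable

# Steps 1-2: a hypothetical population already divided into 10 non-overlapping
# clusters of 20 units each; together the clusters cover the whole population
clusters = {
    f"cluster_{c}": [f"unit_{c}_{u}" for u in range(20)]
    for c in range(10)
}

# Step 3: randomly select 3 of the clusters
chosen = rng.sample(sorted(clusters), 3)

# Step 4: collect data from every unit in the selected clusters
sample = [unit for name in chosen for unit in clusters[name]]

print(chosen)
print(len(sample))  # 3 clusters x 20 units
```

In a two-stage design, the last step would instead draw a simple random sample of units within each chosen cluster.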
  • 43. Probability & Statistics • Probability is the science of making statements about what will occur when samples are drawn from a known population. • Statistics is the science of organizing sample data and making inferences about the unknown population from which the sample is drawn. Probability is a vehicle of statistics: it allows the accuracy of statistical inferences from sample data to a population to be justified by their chance of occurring. That is, we want to know the chance that a similar result will occur if the study is repeated many more times.
  • 44. Comparison of Probability & Statistics Probability: Properties of the population are assumed known. Answer questions about the sample based on these properties. Statistics: Use information in the sample to draw a conclusion about the population
  • 45. Example • Example: A jar of M&M’s contains 100 candy pieces, 15 are red. A handful of 10 is selected. Probability question: What is the probability that 3 of the 10 selected are red? • Example: A handful of 10 M&M’s is selected from a jar containing 1000 candy pieces. Three M&M’s in the handful are red. Statistics question: What is the proportion of red M&M’s in the entire jar?
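The two M&M questions can be contrasted in a short Python sketch. It assumes the handful is drawn without replacement, so the probability question follows a hypergeometric model; the function name `hypergeom_pmf` is illustrative, and the sample proportion is only a point estimate of the unknown population proportion.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(exactly k successes) when drawing n items without replacement
    from a population of N items that contains K successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Probability question: population known (100 pieces, 15 red), sample unknown
p = hypergeom_pmf(3, N=100, K=15, n=10)
print(f"P(3 red in a handful of 10) = {p:.4f}")

# Statistics question: sample known (3 red out of 10), population unknown.
# The sample proportion serves as an estimate of the jar's proportion of red.
p_hat = 3 / 10
print(f"Estimated proportion of red in the jar = {p_hat}")
```

The first calculation reasons from the population to the sample, while the second reasons from the sample back to the population.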

Editor's Notes

  1. A discrete variable is a variable whose value is obtained by counting. Examples: number of students present. number of red marbles in a jar. A continuous variable is a variable whose value is obtained by measuring.
  2. In the above image, we have two classes of data, namely class A (squares) and class B (triangles). The problem statement is to assign the new input data point to one of the two classes by using the KNN algorithm. The first step in the KNN algorithm is to define the value of ‘K’. But what does the ‘K’ in the KNN algorithm stand for? ‘K’ stands for the number of Nearest Neighbors, hence the name K Nearest Neighbors (KNN).
  3. In the above image, I’ve defined the value of ‘K’ as 3. This means that the algorithm will consider the three neighbors that are the closest to the new data point in order to decide the class of this new data point. The closeness between the data points is calculated by using measures such as Euclidean and Manhattan distance, which I’ll be explaining below. At ‘K’ = 3, the neighbors include two squares and 1 triangle. So, if I were to classify the new data point based on ‘K’ = 3, then it would be assigned to Class A (squares).
  4. But what if the ‘K’ value is set to 7? Here, I’m basically telling my algorithm to look for the seven nearest neighbors and classify the new data point into the class it is most similar to. At ‘K’ = 7, the neighbors include three squares and four triangles. So, if I were to classify the new data point based on ‘K’ = 7, then it would be assigned to Class B (triangles) since the majority of its neighbors were of class B.
  6. In practice, there’s a lot more to consider while implementing the KNN algorithm. This will be discussed in the demo section of the blog. Earlier I mentioned that KNN uses Euclidean distance as a measure of the distance between a new data point and its neighbors; let’s see how. It is as simple as that! KNN makes use of a simple measure in order to solve complex problems, which is one of the reasons why KNN is such a commonly used algorithm.
  9. Judgmental sampling, also called purposive sampling or authoritative sampling, is a non-probability sampling technique in which the sample members are chosen only on the basis of the researcher’s knowledge and judgment. As the researcher’s knowledge is instrumental in creating a sample in this sampling technique, there are chances that the results obtained will be highly accurate with a minimum margin of error. The process of selecting a sample using judgmental sampling involves the researchers carefully picking and choosing each item to be a part of the sample. The researcher’s knowledge is primary in this sampling process as the members of the sample are not randomly chosen.