The document discusses different types of sampling methods used for data streams. It defines data sampling as selecting a representative subset of data points from a larger dataset to identify patterns. There are two types of queries for data streams: ad-hoc queries which are asked once, and standing queries which continuously execute. Common problems with data streams include filtering, counting distinct elements, estimating moments, and finding frequent elements. Applications of data sampling on streams include mining query streams, click streams, social network feeds, sensor networks, telephone call records, and monitoring IP packets.
2. What is Sampling?
• The sample method involves taking a representative selection of the
population and using the data collected as research information.
• A sample is a “subgroup of a population”.
• Sampling is a way of obtaining a group of people or objects to study that
is representative of a larger population or universe of interest. (Stacks &
Hocking, 1999)
4. Types of Sampling
• Probability Sampling:
A sampling process where every individual element in the population
has an opportunity to be chosen for the sample.
• Nonprobability Sampling:
A sampling process where some individual elements in the population
may not have an opportunity to be chosen for the sample.
5. Convenience sample: The researcher chooses a sample that is readily available
in some non-random way.
Example: A researcher polls people as they walk by on the street.
Why it's probably biased: The location and time of day and other factors may
produce a biased sample of people.
Voluntary response sample: The researcher puts out a request for members of
a population to join the sample, and people decide whether or not to be in the
sample.
Example: A TV show host asks his viewers to visit his website and respond to an
online poll.
Why it's probably biased: People who take the time to respond tend to have similarly
strong opinions compared to the rest of the population.
Nonprobability Sampling
Bad ways to sample
6. Probability Sampling
• Simple Random Sampling
• Stratified sampling
• Systematic sampling
• Cluster Sampling
• Multi stage Sampling
Good ways to sample
7. Simple Random Sampling
• Every element has an equal chance of being selected as part of the sample.
• It is used when we don’t have any prior information about the target
population.
• The sample is selected at random, without any other procedure or criteria.
For example: Random selection of
20 students from a class of 50
students. Each student has an equal
chance of being selected. Here each
student's probability of ending up in
the sample is 20/50 = 0.4 (1/50 is only
the chance of being the first student drawn).
Why it's good: Random samples are usually fairly representative since they don't favor
certain members.
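The classroom example above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the helper name `simple_random_sample`, the `seed` parameter, and the student labels are assumptions made for the example. It relies on the standard library's `random.sample`, which draws without replacement.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements; every element is equally likely to be chosen."""
    rng = random.Random(seed)          # seed is optional, only for repeatability
    return rng.sample(population, n)   # sampling without replacement

# Hypothetical class of 50 students, from which 20 are drawn.
students = [f"student_{i}" for i in range(1, 51)]
sample = simple_random_sample(students, 20, seed=42)

# Each student's chance of being in the sample is n/N.
inclusion_probability = 20 / len(students)
print(len(sample), inclusion_probability)
```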
8. Stratified Sampling
• This technique divides the elements of the population into small subgroups
(strata) based on similarity, in such a way that the elements within each
subgroup are homogeneous while the subgroups differ from one another.
• The elements are then randomly selected from each of these subgroups.
• We need to have prior information about the population to create the
subgroups.
Example—A student council surveys 100
students by getting random samples of 25
freshmen, 25 sophomores, 25 juniors, and 25
seniors.
Why it's good: A stratified sample guarantees that members from each group will be
represented in the sample, so this sampling method is good when we want some members
from every group.
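A minimal sketch of the student council example follows, assuming each grade has 200 students (a made-up figure) and using hypothetical names such as `stratified_sample` and `school`. It draws a separate simple random sample inside every stratum.

```python
import random

def stratified_sample(strata, per_stratum, seed=None):
    """Draw a simple random sample of fixed size from every stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

# Hypothetical school: 200 students per grade (assumed for illustration).
grades = ("freshman", "sophomore", "junior", "senior")
school = {g: [f"{g}_{i}" for i in range(200)] for g in grades}

survey = stratified_sample(school, 25, seed=1)
total = sum(len(members) for members in survey.values())
print(total)  # 4 strata x 25 students each
```

Every grade is guaranteed to appear in the result, which is the property the slide highlights.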
9. Cluster Sampling
• The sample is chosen by sections (clusters) rather than by individual element.
• Our entire population is divided into clusters or sections and then the
clusters are randomly selected.
• All the elements of each selected cluster are used for sampling.
• Clusters are identified using details such as age, sex, location etc.
Cluster sampling can be done in following ways:
• Single Stage Cluster Sampling
• Two Stage Cluster Sampling
10. • Single Stage Cluster Sampling
Entire clusters are selected randomly for
sampling.
• Two Stage Cluster Sampling
Here we first randomly select clusters,
and then from those selected clusters we
randomly select elements for sampling.
11. Cluster Sampling (cont..)
Example: An airline company wants to survey its customers one day, so they
randomly select 55 flights that day and survey every passenger on those
flights.
Why it's good: A cluster sample gets every member from some of the
groups, so it's good when each group reflects the population as a whole.
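The two variants of cluster sampling can be sketched as below. The flight and passenger counts here are invented to keep the example small and do not match the airline example's figures; the function names are also assumptions.

```python
import random

def single_stage_cluster_sample(clusters, n_clusters, seed=None):
    """Pick whole clusters at random and keep every element in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for c in chosen for member in clusters[c]]

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Pick clusters at random, then pick elements at random inside each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for c in chosen:
        members = clusters[c]
        sample.extend(rng.sample(members, min(n_per_cluster, len(members))))
    return sample

# Hypothetical day of 20 flights with 30 passengers each.
flights = {f"flight_{f}": [f"flight_{f}_passenger_{p}" for p in range(30)]
           for f in range(20)}

everyone_on_5_flights = single_stage_cluster_sample(flights, 5, seed=3)
ten_from_each_of_5 = two_stage_cluster_sample(flights, 5, 10, seed=3)
print(len(everyone_on_5_flights), len(ten_from_each_of_5))
```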
12. Systematic Sampling
• Here the selection of elements is systematic rather than random, except for
the first element.
• Elements of the sample are chosen at regular intervals from the population.
• All the elements are first arranged in a sequence, in which each element
has an equal chance of being selected.
• Example: A principal takes an alphabetized list of student names and picks a
random starting point. Every 20th student is selected to take a survey.
13. Systematic Sampling (cont..)
For a sample of size n, we divide our population of size N into n subgroups of k
elements each.
We select our first element randomly from the first subgroup of k elements.
To select the other elements of the sample, proceed as follows:
The number of elements in each subgroup is k = N/n.
So if our first element is n1, the second element is n2 = n1 + k,
the third element is n3 = n2 + k, and so on.
Taking an example of N = 20, n = 5:
The number of elements in each subgroup is k = N/n = 20/5 = 4.
Now, randomly select the first element from the first subgroup.
If we select n1 = 3, then n2 = n1 + k = 3 + 4 = 7, n3 = n2 + k = 7 + 4 = 11,
n4 = 15 and n5 = 19.
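The interval-based selection for N = 20 and n = 5 can be sketched as follows. The function name and the optional `start` argument (a 1-based position inside the first subgroup) are assumptions added so the worked example with n1 = 3 can be reproduced exactly.

```python
import random

def systematic_sample(population, n, start=None, seed=None):
    """Pick one element from the first k, then every k-th element after it."""
    k = len(population) // n           # interval k = N / n
    if k < 1:
        raise ValueError("sample size n cannot exceed the population size")
    if start is None:
        # Only the starting position is random (1-based, within the first k).
        start = random.Random(seed).randint(1, k)
    return [population[start - 1 + i * k] for i in range(n)]

population = list(range(1, 21))        # elements numbered 1..20
fixed = systematic_sample(population, 5, start=3)
print(fixed)  # [3, 7, 11, 15, 19]

random_start = systematic_sample(population, 5, seed=0)
```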
14. Multi-Stage Sampling
• It is a combination of two or more of the methods
described earlier.
• The population is divided into multiple clusters, and
then these clusters are further divided and
grouped into various subgroups based on
similarity.
• One or more clusters can be randomly selected
from each subgroup.
• This process continues until the clusters can’t be
divided any further.
• For example, a country can be divided into states,
cities, and urban and rural areas, and all the areas with
similar characteristics can be merged together to
form subgroups.
Area Sampling
A process which depends on geographical/prospective positions.
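As a rough sketch of the staged selection described above, the code below samples states, then cities inside each chosen state, then households inside each chosen city. The three-level hierarchy, all counts, and the name `multi_stage_sample` are assumptions made for illustration; real designs can use different methods at each stage.

```python
import random

def multi_stage_sample(country, n_states, n_cities, n_households, seed=None):
    """Randomly narrow down stage by stage: states, then cities, then households."""
    rng = random.Random(seed)
    sample = []
    for state in rng.sample(sorted(country), n_states):
        cities = country[state]
        for city in rng.sample(sorted(cities), n_cities):
            sample.extend(rng.sample(cities[city], n_households))
    return sample

# Hypothetical country: 4 states, 6 cities per state, 50 households per city.
country = {
    f"state_{s}": {
        f"city_{s}_{c}": [f"household_{s}_{c}_{h}" for h in range(50)]
        for c in range(6)
    }
    for s in range(4)
}

picked = multi_stage_sample(country, n_states=2, n_cities=3, n_households=5, seed=7)
print(len(picked))  # 2 states x 3 cities x 5 households
```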
15. QUIZ
1. A restaurant leaves comment cards on all of its tables and encourages
customers to participate in a brief survey to learn about their overall
experience. What type of sampling is this?
A: Convenience sampling B: Voluntary response sampling
Answer: B: Voluntary response sampling
2. A quality control worker at a factory selects the first 10 items she sees
as her sample for the day. What type of sampling is this?
A: Convenience sampling B: Voluntary response sampling
Answer: A: Convenience sampling
16. 3. Each student at a school has a student identification number.
Counselors have a computer generate 50 random identification numbers
and those students are asked to take a survey.
A: Simple random sampling B: Stratified random sampling
C: Cluster random sampling D: Systematic random sampling
Answer: A: Simple random sampling
4. A principal orders t-shirts and wants to check some of them to make
sure they were printed properly. She randomly selects 2 of the 10 boxes
of shirts and checks every shirt in those 2 boxes.
A: Simple random sampling B: Stratified random sampling
C: Cluster random sampling D: Systematic random sampling
Answer: C: Cluster random sampling
17. 5. A school chooses 3 randomly selected athletes from each of its sports
teams to participate in a survey about athletics at the school.
A: Simple random sampling B: Stratified random sampling
C: Cluster random sampling D: Systematic random sampling
Answer: B: Stratified random sampling
6. While students are lined up for school pictures, a teacher passes out a
survey to every 10th student.
A: Simple random sampling B: Stratified random sampling
C: Cluster random sampling D: Systematic random sampling
Answer: D: Systematic random sampling
18. Nonprobability Sampling
• Convenience Sampling
• Purposive Sampling/Judgemental Sampling
• Quota Sampling
• Referral/Snowball Sampling: The process of building a sample stage by
stage, with each stage recruited through recommendations from the previous one.
19. Convenience Sampling
• Here the samples are selected based on availability.
• This method is used when elements of the population are hard to reach or costly to obtain.
• So samples are selected based on convenience.
• It is the process of choosing a sample according to suitability.
For example: Researchers prefer this during the initial stages of survey
research, as it’s quick and easy to deliver results.
20. Purposive Sampling
• This is based on the intention or the purpose of study.
• Only those elements will be selected from the population which suits the
best for the purpose of our study.
• A sample is chosen because it serves a certain purpose.
For example: If we want to understand the thought process of the people who
are interested in pursuing a master’s degree, then the selection criterion would be
“Are you interested in a Master’s in..?”
All the people who respond with a “No” will be excluded from our sample.
21. Quota Sampling
• This type of sampling depends on some pre-set standard.
• It selects a representative sample from the population.
• The proportion of each characteristic/trait in the sample should be the same as in the population.
• Elements are selected until the exact proportions of certain types of data are
obtained or sufficient data in the different categories is collected.
For example: If our population has 45% females and 55% males then our
sample should reflect the same percentage of males and females.
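A minimal sketch of filling quotas for the 45%/55% example follows, assuming a target sample of 20 people (so 9 females and 11 males) and a made-up arrival order of respondents. The names `quota_sample` and `respondents` are hypothetical. Respondents are taken as they come, which is why this is a nonprobability method.

```python
def quota_sample(stream, key, quotas):
    """Accept items in arrival order until every group's quota is filled."""
    counts = {group: 0 for group in quotas}
    sample = []
    for item in stream:
        group = key(item)
        if counts.get(group, 0) < quotas.get(group, 0):
            sample.append(item)
            counts[group] += 1
        if counts == quotas:           # all quotas met, stop collecting
            break
    return sample

# Quotas for a sample of 20 that mirrors a 45% female / 55% male population.
quotas = {"F": 9, "M": 11}

# Hypothetical respondents in the order they happen to show up.
respondents = [(f"person_{i}", "F" if i % 3 == 0 else "M") for i in range(100)]

sample = quota_sample(respondents, key=lambda r: r[1], quotas=quotas)
females = sum(1 for _, sex in sample if sex == "F")
males = sum(1 for _, sex in sample if sex == "M")
print(females, males)
```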
22. Referral /Snowball Sampling
• This technique is used in situations
where the population is largely
unknown or rare.
• Therefore we take help from the
first element that we select from the
population and ask them to recommend
other elements who will fit the
description of the sample needed.
• This referral technique goes on,
increasing the size of the sample like a
snowball.
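The referral process can be sketched as a breadth-first walk over "who recommends whom". The referral map, the names in it, and the function `snowball_sample` are all invented for illustration; in practice the referrals are collected from participants as the study proceeds.

```python
from collections import deque

def snowball_sample(referrals, seeds, target_size):
    """Grow the sample by following referrals from each recruited person."""
    sample = []
    seen = set(seeds)                  # people already recruited or queued
    queue = deque(seeds)
    while queue and len(sample) < target_size:
        person = queue.popleft()
        sample.append(person)
        for referred in referrals.get(person, []):
            if referred not in seen:   # skip people we already know about
                seen.add(referred)
                queue.append(referred)
    return sample

# Hypothetical referral map: each person lists who they would recommend.
referrals = {
    "ana": ["ben", "cy"],
    "ben": ["dee"],
    "cy": ["dee", "eli"],
    "dee": [],
    "eli": ["ana"],
}

wave = snowball_sample(referrals, seeds=["ana"], target_size=4)
print(wave)  # ['ana', 'ben', 'cy', 'dee']
```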
24. Data Sampling?
Data sampling is a statistical analysis technique used to select,
manipulate and analyze a representative subset of data points
in order to identify patterns and trends in the larger data set
being examined.
25. Stream Queries
• There are two ways that queries get asked about streams.
• Ad-hoc Queries: Normal queries asked one time about streams.
• Example: What is the maximum value seen so far in stream S?
• Standing Queries: These queries are, in a sense, permanently
executing, and produce outputs at appropriate times. They are,
in principle, asked about the stream at all times.
• Example: Report each maximum value ever seen in stream S.
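The difference between the two query types can be sketched with the maximum-value examples above. The class `MaxTracker`, the generator `report_new_maxima`, and the sample stream are assumptions for illustration. The ad-hoc query reads the current state once when asked, while the standing query emits output whenever a new maximum arrives.

```python
class MaxTracker:
    """Keeps just enough state to answer 'maximum so far?' whenever asked."""
    def __init__(self):
        self.maximum = None

    def observe(self, x):
        if self.maximum is None or x > self.maximum:
            self.maximum = x

def report_new_maxima(stream):
    """Standing query: yield each value that sets a new maximum."""
    current = None
    for x in stream:
        if current is None or x > current:
            current = x
            yield x

S = [3, 1, 4, 1, 5, 9, 2, 6]           # hypothetical stream contents

# Ad-hoc query, asked once after the first five elements have arrived.
tracker = MaxTracker()
for x in S[:5]:
    tracker.observe(x)
adhoc_answer = tracker.maximum

# Standing query, running over the whole stream.
standing_output = list(report_new_maxima(S))
print(adhoc_answer, standing_output)   # 5 [3, 4, 5, 9]
```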
26. Problems on Data Streams
• Types of queries one wants to answer on a stream:
– Filtering a data stream
• Select elements with property x from the stream
– Counting distinct elements
• Number of distinct elements in the last k elements of the stream
– Estimating moments
• Estimate avg./std. dev. of last k elements
– Finding frequent elements
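The first three query types can be sketched over a window of the last k elements. This is an exact, store-everything illustration with hypothetical names (`SlidingWindow`, `filter_stream`); it assumes the window fits in memory, which is often not the case for real streams, where approximate methods are used instead.

```python
from collections import deque
import statistics

def filter_stream(stream, has_property):
    """Filtering: lazily pass through only the elements with the property."""
    return (x for x in stream if has_property(x))

class SlidingWindow:
    """Keeps the last k elements and answers queries about them."""
    def __init__(self, k):
        self.window = deque(maxlen=k)  # oldest element drops out automatically

    def add(self, x):
        self.window.append(x)

    def distinct_count(self):
        return len(set(self.window))

    def mean(self):
        return statistics.fmean(self.window)

    def std_dev(self):
        return statistics.pstdev(self.window)  # population standard deviation

stream = [2, 7, 2, 9, 4, 4, 8]         # hypothetical stream contents
w = SlidingWindow(k=4)
for x in stream:
    w.add(x)

evens = list(filter_stream(stream, lambda x: x % 2 == 0))
print(evens, w.distinct_count(), w.mean())  # last 4 elements are 9, 4, 4, 8
```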
27. Applications – (1)
• Mining query streams
• Google wants to know what queries are more frequent today than
yesterday
• Mining click streams
• Yahoo wants to know which of its pages are getting an unusual
number of hits in the past hour
• Mining social network news feeds
• e.g., look for trending topics on Twitter, Facebook
28. Applications – (2)
• Sensor Networks
• Many sensors feeding into a central controller
• Telephone call records
• Data feeds into customer bills as well as settlements between
telephone companies
• IP packets monitored at a switch
• Gather information for optimal routing
• Detect denial-of-service attacks