Researchers know we're supposed to sample people in the proper proportions, but how do we know what those proportions are? In this webinar, I will demonstrate how to use census data to determine what your sample should really look like in terms of variables like age, gender, region, and education.
How to Create Census Sampling Targets for Free Using Data Ferret
1. November 20, 2014
How to Sample the Right Percentages of People in Your Study
or
How to Create US Census Sampling Targets for Free Using Data Ferret
180 Montgomery Street, Suite 1700 - San Francisco, CA 94104
2. What is sampling?
• Choosing people to participate in a study
• Types (in very basic terms)
– Random sampling: You have a list of every single person who is relevant to the study
– Stratified sampling: Organizing people into groups so you can select from those groups
– Convenience sampling: You have access to people who are relevant to the study
3. What is a sampling plan?
• Plan for selecting who will be invited to participate in the research
• Moms: Study about diapers
• Teenagers: Study about learning to drive
• Young adults: Study about job hunting for the first time
4. What is a sampling matrix?
• Specific description of who you will sample
• Young adults: Study about job hunting for the first time
                     Percent
Gender   Male        50%
         Female      50%
Age      16 to 19    33%
         20 to 24    33%
         25 to 29    33%
Region   Northeast   25%
         Midwest     25%
         South       25%
         West        25%
5. But what do young adults REALLY look like?
                     Percent
Gender   Male        50%
         Female      50%
Age      16 to 19    33%
         20 to 24    33%
         25 to 29    33%
Region   Northeast   25%
         Midwest     25%
         South       25%
         West        25%

                     Percent
Gender   Male        50%
         Female      50%
Age      16 to 19    25%
         20 to 24    35%
         25 to 29    40%
Region   Northeast   20%
         Midwest     20%
         South       40%
         West        20%
25. But what do young adults REALLY look like?
                     “Fair”   Guessing   Reality
Gender   Male        50%      50%        50.2%
         Female      50%      50%        49.8%
Age      16 to 19    33%      25%        27.7%
         20 to 24    33%      35%        36.8%
         25 to 29    33%      40%        35.5%
Region   Northeast   25%      20%        17.9%
         Midwest     25%      20%        21.2%
         South       25%      40%        37.0%
         West        25%      20%        23.9%
26. Weighting
                     Census Targets   Returns   Weights
Gender   Male        50.2%            52%       104%
         Female      49.8%            48%       96%
Age      16 to 19    27.7%            29%       105%
         20 to 24    36.8%            35%       95%
         25 to 29    35.5%            36%       101%
Region   Northeast   17.9%            21%       117%
         Midwest     21.2%            19%       90%
         South       37.0%            36%       97%
         West        23.9%            24%       100%
27. Thank you!
Annie Pettit
Chief Research Officer
Peanut Labs
annie@peanutlabs.com
jonathan.cheriff@peanutlabs.com
Director of Sales and Marketing
: )
Hi everyone, and thank you for finding the time to join me for the next 30 minutes or so. My name is Annie Pettit and I am the chief research officer at Peanut Labs, a company that specializes in DIY sampling, survey programming, and polling. You can find recordings of webinars we’ve done in the past on the Peanut Labs website under the resources section. Today’s webinar is about a very important topic in sampling. It focuses on how to figure out the right demographics to sample in your study. If you’ve never used or never heard of Data Ferret, then you are in for a real treat. If you have any questions along the way, do feel free to type them in the question bar. I’ll answer as many of them as possible at the end.
First, let me spend a couple of minutes talking about various types of sampling. There are many different types but I’ll just focus on a few concepts here.
First, there is random sampling. In many cases, this is the absolute best type of sampling you can do. The basic premise here is that you have a list of contact information for every single person who is relevant to your study. For instance, you have email addresses for every single adult in the US. Or you have the telephone number for every single woman who has a child aged 3 to 9. Once we have this complete list, we can simply start picking names out of a hat to determine who is part of our sample. Clearly, most of us will never be able to conduct a true random sample like this.
Stratified sampling is another type of sampling. The basic premise in this case is that people or things are divided into groups before you sample from them. So, you might create a group of people who live in the city and a group of people who live in the country, and then sample from each of those two groups.
The last type of sampling I’ll mention is called convenience sampling. Most of our work in market research is done with convenience samples. In other words, people who are easy to access. They might be easy to access because these people live closest to our office building, or they have a telephone, or they chose to join a survey panel.
Most of us in market research use stratified convenience samples. We sample from conveniently available people and we bucket them into groups for sampling purposes.
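To make the idea of a stratified convenience sample concrete, here is a minimal Python sketch. The panel, the `area` field, and the function name `stratified_sample` are all made up for illustration; the point is only that people are bucketed into groups first and then drawn from each group.

```python
import random

def stratified_sample(people, stratum_of, per_stratum, seed=0):
    """Bucket people into strata, then draw up to per_stratum from each bucket."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    groups = {}
    for person in people:
        groups.setdefault(stratum_of(person), []).append(person)
    sample = {}
    for stratum, members in groups.items():
        k = min(per_stratum, len(members))  # cannot draw more people than we have
        sample[stratum] = rng.sample(members, k)
    return sample

# Hypothetical convenience panel: 60 city dwellers and 40 country dwellers.
panel = [{"id": i, "area": "city" if i < 60 else "country"} for i in range(100)]
picked = stratified_sample(panel, lambda p: p["area"], per_stratum=10)
print({area: len(members) for area, members in picked.items()})
```

Drawing the same number from each group is just one choice; the rest of this talk is about choosing those per-group numbers properly.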
A sampling plan is important for every research project. You need to decide what kinds of people are relevant and important to your study. For instance, you wouldn’t want to invite men to participate in research about feminine hygiene products and you wouldn’t want teenagers participating in a study about retirement homes. When that happens, it’s a waste of everyone’s time and energy.
But if we think ahead of time about who we NEED to take our surveys, then we’ll have much better data in the end. For instance, if we’re going to run a study about diapers, we’ll probably want to talk to moms who have a baby. We might also want to talk to new dads since times are slowly changing and dads are getting in on those kinds of decisions too.
Or, if you’re doing a study about learning to drive, you’ll probably want to focus on younger people, people aged 15 to 20 or so.
Let’s take a specific example: people who are job hunting for their first real job. First, we can make some hypotheses about who we want to listen to. We’ve decided that we want to listen to men and women. So let’s say that half of our sample should be men and half should be women. Let’s also say that we want to focus on young adults, perhaps people aged 16 to 29. We can easily divide that age range into three nice groups and then try to put a third of people into each group. Lastly, we might decide that we want to listen to people from all over the USA. We can divide the USA into four groups and then try to put a quarter of all people into each of those groups.
This is a sampling matrix. It makes a lot of sense. We won’t have to worry that everyone in the study is 25 to 29 or that everyone is from the Northeast. This sampling matrix will ensure that we’ve listened to a good range of people.
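As a rough sketch of how a sampling matrix turns into actual quotas, the Python below converts the percentages from the slide into head counts. The total of 400 completes and the function name `quota_counts` are assumptions for illustration. Because the three 33% age shares only add up to 99, the sketch rescales each variable to 100% before rounding.

```python
def quota_counts(percentages, total):
    """Turn target percentages into whole-number quotas that add up to total.

    Uses largest-remainder rounding so no respondent is lost to rounding.
    """
    share_sum = sum(percentages.values())  # rescale in case shares miss 100
    exact = {k: total * v / share_sum for k, v in percentages.items()}
    counts = {k: int(x) for k, x in exact.items()}  # round down first
    leftover = total - sum(counts.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:  # hand out the remaining seats
        counts[k] += 1
    return counts

# The "fair" sampling matrix from the slide.
matrix = {
    "Gender": {"Male": 50, "Female": 50},
    "Age": {"16 to 19": 33, "20 to 24": 33, "25 to 29": 33},
    "Region": {"Northeast": 25, "Midwest": 25, "South": 25, "West": 25},
}

N = 400  # hypothetical number of completes
quotas = {variable: quota_counts(levels, N) for variable, levels in matrix.items()}
for variable, counts in quotas.items():
    print(variable, counts)
```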
But is it as good as we think?
The most obvious problem with the first table you see here, the sampling matrix that we just built, is the region part. We know that when most people divide the USA into four regions, there aren’t equal numbers of people in each region. In fact, we know that the most commonly used regions end up putting more people in the South region. This means that the sampling matrix we guessed at isn’t listening to enough people from the South.
What we could do is put in a little more care and think about what young people aged 16 to 29 really look like. If we think really carefully, we might come up with a sampling matrix that looks closer to the one on the right. We’ve put more people into the South and we’ve put more people into the older age group.
I still don’t know if this is what the USA really looks like but it’s the best I can do with what I know.
But is it really?
If you’re not already familiar with Data Ferret, it is an awesome tool put together by the US Census Bureau. It’s completely free to use and it lets you look at and analyze a lot of the data that the Census Bureau collects from its surveys, for instance the American Community Survey, the American Housing Survey, and the Current Population Survey, as well as population estimates and projections.
We’re going to use Data Ferret to figure out EXACTLY what our sampling matrix should look like. We don’t need to guess because the US Census Bureau conducts regular surveys to learn many, many things about the population, including age, gender, education, income, family status, and much more. The ferret will fill in that sampling matrix for us.
To start I’m going to click on launch data ferret. It is a little bit picky about software so you might find yourself needing to use a different browser, upgrading javascript, or allowing pop-ups. It’s totally fine to allow pop ups on this website.
You’ll be asked to fill out your email address here. It’s fine to do that. They never send spam. At least not yet.
Then you’ll come to this screen. All you really need to do here is choose Step 1. We want to choose which dataset we’re interested in as well as the variables we’re interested in.
On the left hand side, you can see all the surveys that you’re allowed to have access to. For our purposes, we’re interested in the current population survey. We’ll just find that survey, open it up, and then choose the most recent set of data. You can see that they gather fresh data with this survey every single month. So if you want to make sure your data is as accurate as possible, you could come here every month and generate the newest results.
When you click on the dataset you want, you’ll see a menu to view the variables. Once you choose that option, then the list of variables on the right side will appear. For our purposes, we want to see the household, geography, and demographic variables. It will take the ferret a minute or two to find those variables but then it will list them for you. Sometimes, you’ll even get thousands of variables.
Because the list is only 75 variables, it’s simple enough to just scroll through and find the ones you want. In our case, we want age, sex, and region. Sex is down just a little further. And when I double click on it, I get this box. You just need to click on the select box and then the OK box. Then you can do it again for the age variable and the region variable.
Now we can move on to step 2 where we work with the databasket or create a table. Here you can see the three variables that we’ve chosen, sex, age, and region. What I’m most interested in doing here is two things.
What I really need to do is recode the age variable into two brand new variables. The original variable that the Census Bureau gathers is actual individual age. I need to create a variable that groups people according to their age. And I need to make sure that it uses the 3 age groups that I’m interested in. So, I’m going to create an age group variable like this.
The second variable I need to create is for subsample purposes. It seems a little redundant but you’ll see why we’re doing this in just a minute. In this case, I’m going to create three separate groups. Two of the groups identify people that I’m NOT interested in. One group, the 16 to 29 group, identifies the group of people that I AM interested in. So, I’m going to use this variable to identify who is in my target group.
Let’s hop back into Data Ferret now. You’ll see on the right side, there is an option to recode variables. So first I’ll click on the age variable and then I’ll click on recode variable. A pop-up will appear where you can create the groupings for age. I’ll do this twice, once to create the 3 age groups I’m interested in and once to create the subsample groups.
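The two recodes can be written out as a small Python sketch. This only illustrates the logic and is not how Data Ferret itself stores recodes; the function names and the labels for the two out-of-target groups ("Under 16" and "30 and over") are assumptions.

```python
def age_group(age):
    """First recode: the three age groups used in the sampling matrix."""
    if 16 <= age <= 19:
        return "16 to 19"
    if 20 <= age <= 24:
        return "20 to 24"
    if 25 <= age <= 29:
        return "25 to 29"
    return None  # outside the target range, so no matrix group applies

def subsample_group(age):
    """Second recode: one target group plus two groups we are NOT interested in."""
    if age < 16:
        return "Under 16"      # hypothetical label
    if age <= 29:
        return "16 to 29"      # the target group
    return "30 and over"       # hypothetical label

for age in (12, 16, 22, 29, 41):
    print(age, age_group(age), subsample_group(age))
```

Every age lands in exactly one subsample group, which is what lets the second variable act as a filter in the table step.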
Now all the variables are set up and we just need to select databasket/make a table.
That gives us this blank spreadsheet. All we’re going to do now is click and move each variable onto the spreadsheet into the spot we want. Just remember how we set up the sampling matrix before, because we’re going to replicate that here.
Our sampling matrix had sex, age, and region. Just make sure you choose the grouping variable that we created for age. The one that splits the ages 16 to 29 into 3 groups.
The next thing we’re going to do is pick out our subsample. Remember we’re only interested in people who are aged 16 to 29. Here’s where we choose the second age variable, the one we created just for the purposes of sampling. We’ll pop that variable into the top row.
We’re almost there now. First, we’ll tell the ferret that we want to see percentages so we’ll click on the percentage button. Make sure to choose the column percentage button, not the row percentage button. Now we can click on go get data.
And this is our wonderful result!
You’ll see four columns but we’re really only interested in one of them: the third column, which shows the percentages for people who are in our target group aged 16 to 29.
What this table tells us is that, in the USA, among people who are aged 16 to 29, 50.2% of them are male and 49.8% of them are female. And, 27.7% of them are aged 16 to 19 while 36.8% of them are aged 20 to 24. THIS is what our group of people really looks like so this is what we need to include in our sample.
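For readers who want to see what a column percentage within a subsample means, here is a minimal Python sketch. The four records, the `weight` field, and the function name `column_percentages` are invented for illustration; they are not real census data and not Data Ferret's internals.

```python
from collections import defaultdict

def column_percentages(records, variable, in_target):
    """Percent of the target group falling into each level of one variable."""
    totals = defaultdict(float)
    for rec in records:
        if in_target(rec):  # keep only the subsample column we care about
            totals[rec[variable]] += rec.get("weight", 1.0)
    grand = sum(totals.values())
    if grand == 0:
        return {}
    return {level: round(100 * w / grand, 1) for level, w in totals.items()}

# Made-up records; "weight" is a hypothetical per-person survey weight.
records = [
    {"sex": "Male", "age": 17, "weight": 1.0},
    {"sex": "Female", "age": 22, "weight": 1.0},
    {"sex": "Male", "age": 27, "weight": 2.0},
    {"sex": "Female", "age": 45, "weight": 3.0},  # outside the 16 to 29 target
]

def is_target(rec):
    return 16 <= rec["age"] <= 29

pct = column_percentages(records, "sex", is_target)
print(pct)
```

The percentages add up to 100 within the target column, which is why the column percentage button is the right choice here rather than the row percentage button.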
For those of you who are familiar with weighting, this is also the column that you’re going to weight your results to once you get your data back.
Now once you have this sampling matrix based on real data from the US census, you can just pop it into whatever sampling system you’re using.
In this case, we’re looking at the Peanut Labs Samplify system. We can just type the percentages we got from Data Ferret here and then the sampling system will do its job to pull a collection of people who look just like this.
Of course, there is an unlimited number of sampling matrices you might need. Here is another one. In this case, I’ve added the metropolitan variable which basically means people who live in rural vs urban areas. I’ve also added in Hispanic as a variable.
Let’s think back to 20 minutes ago when we first started to think about our sampling matrix. We tried to make things fair by saying that we wanted 25% of people to come from the Northeast. But then we realized that wasn’t very fair at all and we guessed that the number should maybe be 20%. But when we looked at real census data, we discovered that the number should actually be 17.9%. It doesn’t seem like a big deal, but there are probably many cases where you just can’t imagine what the breakouts should be. Well, now we’ve seen just how easy it is to get the right numbers instead of asking all your colleagues to give their best guesstimate.
And now that we know what the true census targets are, we can do any weighting that might be necessary.
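One rough way to see how far each attempt was from the census figures is to add up the gaps, in percentage points, across every row of the comparison slide. The Python sketch below does that using the numbers from the slide; the choice of a simple sum of absolute gaps is an assumption made for illustration, not a standard the talk prescribes.

```python
# (level, "fair" %, guessed %, census %) as shown on the comparison slide
rows = [
    ("Male", 50, 50, 50.2),
    ("Female", 50, 50, 49.8),
    ("16 to 19", 33, 25, 27.7),
    ("20 to 24", 33, 35, 36.8),
    ("25 to 29", 33, 40, 35.5),
    ("Northeast", 25, 20, 17.9),
    ("Midwest", 25, 20, 21.2),
    ("South", 25, 40, 37.0),
    ("West", 25, 20, 23.9),
]

def gaps(rows, column):
    """Absolute gap, in percentage points, between one column and the census."""
    return {row[0]: abs(row[column] - row[3]) for row in rows}

def summarize(name, gap_by_level):
    worst = max(gap_by_level, key=gap_by_level.get)
    total = sum(gap_by_level.values())
    print(f"{name}: total gap {total:.1f} points, "
          f"largest gap is {worst} ({gap_by_level[worst]:.1f})")

fair_gaps = gaps(rows, 1)
guess_gaps = gaps(rows, 2)
summarize("Fair", fair_gaps)
summarize("Guessing", guess_gaps)
```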
That brings me to the end of my talk. If you have any questions, please do feel free to type them in the box and I’ll answer as many as I can.
You mentioned weighting? What is that?
Sometimes you’ll find that even the best sampling plan doesn’t work out perfectly in the end. Maybe you were trying to get responses from 100 men aged 16 to 19 but you only got responses from 80. It could bias the results to not include those other 20 young men in your overall results. What we do in this case is make the men we do have count just a little bit more. Instead of letting each person count as one person, we’ll make each person count as one and a quarter people. 80 people times 1.25 gives us 100 people. It’s not quite as good as actually having 100 unique people. I’d rather you go back to field and wait to get another 20 people, but the real world doesn’t always let that happen.
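The arithmetic in that answer can be sketched in a few lines of Python. The function name `weight` is made up for illustration, and the weight is computed as target divided by what came back, as in the 100 versus 80 example. Note that the "Weights" column on the weighting slide appears to be the reverse ratio, returns divided by targets (for example, 52 / 50.2 is about 104%), so the figures below will not match that column.

```python
def weight(target, actual):
    """How much each respondent must count so the group adds up to its target."""
    if actual <= 0:
        raise ValueError("cannot weight a group with no respondents")
    return target / actual

# The example from the answer: wanted 100 men aged 16 to 19, got 80.
cell_weight = weight(100, 80)
print(cell_weight, 80 * cell_weight)

# Census targets and returns (in percent) from the weighting slide.
targets = {"Male": 50.2, "Female": 49.8, "Northeast": 17.9, "Midwest": 21.2,
           "South": 37.0, "West": 23.9}
returns = {"Male": 52, "Female": 48, "Northeast": 21, "Midwest": 19,
           "South": 36, "West": 24}

# One weight per category: above 1 means the group came back short.
weights = {level: round(weight(targets[level], returns[level]), 2)
           for level in targets}
print(weights)
```

This treats each variable on its own. Balancing several variables at the same time takes an iterative method, which is beyond this sketch.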