IAOS 2018 - Teaching basic statistics using integrated global census and survey (IPUMS) data, L. Cleveland, K. Jeffers, M. King
1. Teaching Basic Statistics Using
Integrated Global Census and
Survey (IPUMS) Data
Lara Cleveland, Kristen Jeffers, Miriam King
IPUMS
University of Minnesota
3. 89 countries – 345 censuses – 988 million person records
IPUMS-International microdata availability
4. • Individual-level microdata samples
• Anonymized
• Harmonized
• Data extract system
• Online data analysis
• Free
6. Challenges of real-world data
• Access
• Data are messy
• Documentation may not be available
• Universe of respondents often unclear
• Changes in wording (and meaning) of questions
• Variables coded differently
• year to year
• country to country
7. Importance of data context
Children ever born
0 1,082,616
1 38,609
2 35,372
3 32,257
4 29,260
5 24,474
6 21,558
7 16,020
8 12,988
9 9,728
10+ 19,097
8. Importance of data context
Who was asked this question or included in this dataset? Does
the subject population change over time or vary across
samples? How can we deal with such inconsistencies?
What was the wording of the question that people responded
to? Did this change over time or vary across samples? Are these
changes or differences likely to influence results?
What proportion of people refused to answer or didn't know
the answer to a question? What are the options for dealing with
these missing data? What are we assuming when we just
exclude the missing cases?
11. Input
Bangladesh
2011
1 = Unmarried
2 = Married
3 = Widowed
4 = Divrc or separated
Mexico
1970
1 = Married, civil & relig
2 = Married, civil
3 = Married, religious
4 = Consensual union
5 = Widowed
6 = Divorced
7 = Separated
8 = Single
Kenya
1999
1 = Never married
2 = Monogamous
3 = Polygamous
4 = Widowed
5 = Divorced
6 = Separated
12. Harmonized
Input
Mexico 1970: 1 = Married, civil & relig; 2 = Married, civil; 3 = Married, religious; 4 = Consensual union; 5 = Widowed; 6 = Divorced; 7 = Separated; 8 = Single
Bangladesh 2011: 1 = Unmarried; 2 = Married; 3 = Widowed; 4 = Divrc or separated
Kenya 1999: 1 = Never married; 2 = Monogamous; 3 = Polygamous; 4 = Widowed; 5 = Divorced; 6 = Separated
Harmonized
Code    Label
1 0 0   Single
2 0 0   Married or in union
2 1 0   Married, formally
2 1 1   Civil
2 1 2   Religious
2 1 3   Civil and religious
2 1 4   Monogamous
2 1 5   Polygamous
2 2 0   Consensual union
3 0 0   Divorced or separated
3 1 0   Separated
3 2 0   Divorced
4 0 0   Widowed
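The composite coding on this slide can be sketched in a few lines of Python. This is an illustration only: the pairing of each source code with a harmonized code is an assumption inferred by matching the labels shown on the slide, and the names `CROSSWALK`, `harmonize`, and `general_category` are hypothetical, not part of any IPUMS tool.

```python
# Hypothetical crosswalk from country-specific marital status codes to the
# three-digit harmonized codes on the slide. The pairings are inferred by
# matching labels; they are not an official IPUMS translation table.
CROSSWALK = {
    "Mexico 1970": {1: 213, 2: 211, 3: 212, 4: 220, 5: 400, 6: 320, 7: 310, 8: 100},
    "Bangladesh 2011": {1: 100, 2: 200, 3: 400, 4: 300},
    "Kenya 1999": {1: 100, 2: 214, 3: 215, 4: 400, 5: 320, 6: 310},
}

# The first digit of the harmonized code is comparable across all samples.
GENERAL_LABELS = {
    1: "Single",
    2: "Married or in union",
    3: "Divorced or separated",
    4: "Widowed",
}


def harmonize(sample, source_code):
    """Translate a country-specific code into the harmonized code."""
    return CROSSWALK[sample][source_code]


def general_category(harmonized_code):
    """Collapse a detailed harmonized code to its first-digit category."""
    return GENERAL_LABELS[harmonized_code // 100]


if __name__ == "__main__":
    for sample, code in [("Mexico 1970", 6), ("Bangladesh 2011", 4), ("Kenya 1999", 6)]:
        detailed = harmonize(sample, code)
        print(sample, code, "->", detailed, general_category(detailed))
```

Under these assumed pairings, Bangladesh's combined "divorced or separated" category keeps only the first digit (3 0 0), while Mexico and Kenya retain detail in the later digits, so all three samples can still be compared at the first-digit level.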
14. Text of Census Questionnaire
(Mexico 2000)
5. Number of Rooms
How many rooms are used for sleeping without counting hallways?
_____ Write the number
Without counting the hallways or bathrooms how many total rooms are in this dwelling? Count
the kitchen
_____Write the number
6. Access to water
Read all of the options until you get an affirmative answer.
Circle only one answer
1 Running water inside the dwelling
2 Running water outside the dwelling but on the land
3 Running water from a public faucet or hydrant
4 Running water that is carried from another dwelling
5 Tanked in by truck
6 Water from a well, river, lake, stream or other
Answers 3, 4, 5, 6 continue with number 8
7. Water supply
How many days of the week is water available?
Circle only one answer
1 Daily
2 Every third day
3 Twice a week
4 Once a week
5 Occasionally
23. 1. Select samples
Argentina 2001 (3.6 million)
Chile 2002 (1.5 million)
Cuba 2002 (1.1 million)
2. Select variables
Water supply
Sex
Education
3. Submit extract
Extract Engine
Pooled data extract (sample, water, sex, education)
1 dataset
3 censuses
4 variables
6.2 million records
Harmonized codes
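What a pooled extract looks like can be sketched with plain Python. The person records below are invented toy values, not IPUMS data, and `pool` is a hypothetical helper; the real extract engine works on full census samples. The point is only that the pooled file stacks the selected variables from every sample and adds a column identifying the sample.

```python
# Toy person records standing in for three census samples. The values are
# invented for illustration; they are not IPUMS data.
samples = {
    "Argentina 2001": [
        {"water": 1, "sex": 1, "education": 2, "age": 34},
        {"water": 1, "sex": 2, "education": 3, "age": 29},
    ],
    "Chile 2002": [
        {"water": 2, "sex": 2, "education": 1, "age": 51},
    ],
    "Cuba 2002": [
        {"water": 1, "sex": 1, "education": 4, "age": 18},
        {"water": 3, "sex": 2, "education": 2, "age": 63},
    ],
}

VARIABLES = ("water", "sex", "education")


def pool(samples, variables):
    """Stack the selected variables from every sample into one dataset,
    adding a 'sample' column so each record's source stays identifiable."""
    pooled = []
    for sample_name, records in samples.items():
        for record in records:
            row = {"sample": sample_name}
            row.update({name: record[name] for name in variables})
            pooled.append(row)
    return pooled


pooled = pool(samples, VARIABLES)

# Record counts from the slide, in millions; pooling simply adds them up.
slide_counts = {"Argentina 2001": 3.6, "Chile 2002": 1.5, "Cuba 2002": 1.1}
total_millions = round(sum(slide_counts.values()), 1)

if __name__ == "__main__":
    for row in pooled:
        print(row)
    print("Total records on the slide (millions):", total_millions)
```

Each pooled row carries four variables (sample, water, sex, education), which matches the "4 variables" count on the slide.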
25. Advantages of IPUMS data
• Data access
• Documentation
• Train students to deal with messy, missing,
inconsistent data
• Customizable, modifiable
• Real-life relevance
26. IPUMS for Teaching Statistics
Specific needs of statistics educators
• Data analysis in Excel; R
• Specific types of data suited to teaching particular
techniques (e.g., continuous variables for OLS
regression analysis)
• Examples of how to use real-world population data to
teach basic statistics concepts and techniques
28. IPUMS-based exercises for teaching basic statistics
9 exercises follow topics covered in 3rd edition of Statistics: The Art and Science of Learning From Data by Alan Agresti and Christine Franklin
29. IPUMS-based exercises for teaching basic statistics
Topics covered:
• Exploring Data (frequency distributions; graphs)
• Probability
• Probability Distributions
• Confidence Intervals
• Hypothesis Testing
• Comparing Two Groups
• Association between Categorical Variables
• Regression Analysis
• Multiple Regression
Describe the IPUMS database
Discuss the advantages of using real-world data from IPUMS in the classroom
Talk about exercises we have developed to support teaching basic statistics with IPUMS data
IPUMS is a collection of 9 data projects housed at the University of Minnesota that provide census and survey data from around the world. The projects are funded by federal research grants from the NSF and NIH, and all the data are available for free online. The two projects most relevant for international audiences are IPUMS-International, which provides harmonized census microdata from nearly 100 countries around the world, and IPUMS Global Health, which provides health survey data for Africa and Asia from the Demographic and Health Surveys and a relatively new survey called PMA 2020.
The map indicates the data available from IPUMS-International. IPUMS Global Health covers a subset of African and Asian countries.
A number of common features are shared across all IPUMS projects.
To be clear, IPUMS disseminates population microdata in which each case represents an individual, and each column or group of columns represents that individual’s response to a census or survey question.
Most IPUMS data are organized into households, so individuals can be analyzed in the context of their families and households.
The value of using real-world data in the classroom is universally acknowledged, but there are significant challenges associated with doing so.
Access: identifying and accessing appropriate datasets can be difficult and time-consuming.
The data are often messy, and students are rarely equipped to deal with messy data on their own.
Documentation about data collection and data treatment may be unavailable or insufficient.
From one census to the next, or from one survey to the next, there are often changes in data collection practices or question wording that make it difficult to make accurate comparisons across time and place.
Similarly, data producers are often inconsistent in the coding schemes they apply to the data, presenting another challenge for comparative analysis.
Despite these challenges, it’s extremely important to expose students to the complexity of real-world data and to train them to think critically about the meaning of underlying data. Take this example from the 2010 census of Zambia. Here is a frequency distribution for the children-ever-born variable as we received it from the Central Statistical Office. We can train students to calculate summary statistics for this variable, but unless they consider the real meaning of these data, their results will be inaccurate. And why is that, you may wonder.
If we consult the documentation that provides context for these data (in this case, the census questionnaire), we learn that only women age 12 and older were asked this question. But the frequency distribution adds up to the total number of cases in the dataset, so where are the males and young females?
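The warning sign in the Zambia 2010 frequency table can be made concrete with a short Python sketch. The counts are copied from the slide; treating the top-coded "10+" category as exactly 10 is a simplifying assumption, and the variable names are hypothetical.

```python
# Children-ever-born frequencies as shown on the slide (Zambia 2010).
# Key 10 stands for the top-coded "10+" category; treating it as exactly 10
# is a simplifying assumption that understates the true total.
cheb_counts = {
    0: 1_082_616,
    1: 38_609,
    2: 35_372,
    3: 32_257,
    4: 29_260,
    5: 24_474,
    6: 21_558,
    7: 16_020,
    8: 12_988,
    9: 9_728,
    10: 19_097,
}

total_cases = sum(cheb_counts.values())

# Share of all cases coded 0. A very high share suggests that code 0 may mix
# childless women with people who were never asked the question.
zero_share = cheb_counts[0] / total_cases

# Naive mean computed over every case in the file, as a student might do
# without checking who was actually asked.
naive_mean = sum(value * count for value, count in cheb_counts.items()) / total_cases

if __name__ == "__main__":
    print(f"Total cases:      {total_cases:,}")
    print(f"Share coded 0:    {zero_share:.1%}")
    print(f"Naive mean (all): {naive_mean:.2f}")
```

The table alone cannot say how many of the zeros belong to women who were asked the question, so a defensible mean requires going back to the microdata and restricting to the question's universe (women age 12 and older) before summarizing.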
So despite the challenges, we contend that it is better to train students with real-world, imperfect data, to prepare them to answer such questions as:
Who was asked this question or included in this dataset? Does the subject population change over time or vary across samples? How can we deal with such inconsistencies?
What was the wording of the question that people responded to? Did this change over time or vary across samples? Are these changes or differences likely to influence results?
What proportion of people refused to answer or didn't know the answer to a question? What are the options for dealing with these missing data? What are we assuming when we just exclude the missing cases?
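The last question, about what we assume when we exclude missing cases, can be illustrated with a small Python sketch. The responses are invented toy values, and bounding the missing answers by the smallest and largest observed values is just one simple assumption among many possible treatments.

```python
from statistics import mean

# Hypothetical responses to a numeric survey question; None marks a person
# who refused to answer or did not know. Toy values, not real survey data.
responses = [0, 2, None, 3, 1, None, 4, 2]

observed = [r for r in responses if r is not None]
n_missing = len(responses) - len(observed)
missing_share = n_missing / len(responses)

# Dropping the missing cases implicitly assumes they resemble the observed ones.
complete_case_mean = mean(observed)

# Simple sensitivity check: suppose every missing answer equals the smallest
# (or largest) observed value, and see how far the mean could move.
low_bound = (sum(observed) + n_missing * min(observed)) / len(responses)
high_bound = (sum(observed) + n_missing * max(observed)) / len(responses)

if __name__ == "__main__":
    print(f"Missing share:      {missing_share:.0%}")
    print(f"Complete-case mean: {complete_case_mean}")
    print(f"Range under bounds: {low_bound} to {high_bound}")
```

The wider the gap between the two bounds, the more the result depends on the assumption that the missing cases look like the observed ones.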
Data we receive from data producers to make data more user-friendly
Zambia 2010
Understanding the meaning of the underlying data is as important for students as knowing how to calculate and interpret summary statistics.
If you don’t understand the way the data were collected and treated, and what they are measuring, you will always risk inaccurate results and conclusions. Working with real-world data teaches students to consider these things as part of responsible statistical analysis.
The majority of IPUMS data users are demographers, economists, and sociologists using the data for substantive research.
During the last several years, IPUMS staff have engaged with statistics educators as a new audience for these free online data from around the world, beginning with an IPUMS workshop at the ICOTS 9 conference and continuing with participation in meetings of the International Association for Statistics Education.
Discussions with statistics educators were eye-opening for the IPUMS staff members, who are primarily trained in the social sciences. We learned the following:
While social and health scientists use statistical packages such as Stata, SAS, and SPSS, statistics educators tend to use Excel and the open-source R programming language for analysis;
While social and health scientists are interested in particular substantive topics, statistics educators need specific types of data suited to teaching particular techniques (e.g., continuous variables for OLS regression analysis);
Statistics educators welcome concrete examples of how to use real-world data to teach basic statistics concepts and techniques.
Teachers of statistics classes are ideally preparing their students to use statistical methods after they finish with the class, in jobs and in further studies requiring analysis of real data and "big data." Our conversations with both students and teachers of statistics suggest that teaching materials that accompany statistics textbooks are all too often unlike real-world data. For example, the number of cases may be very small, and there may be no missing or inconsistent data. In real-world data, from censuses, surveys, corporate or government records, or other sources, the number of cases may be very large, and missing and inconsistent data are a given.
In addition to the enhancements IPUMS makes to the data that make them easier to use for all data users, we have also begun to develop materials to support teaching basic statistics with IPUMS data.
Statistics educators need help identifying, among the thousands of variables available from IPUMS, those suited to teaching particular techniques (e.g., continuous variables for OLS regression analysis);
IPUMS has also built an R package that reads IPUMS microdata and metadata; it is available on CRAN.
Feedback from ICOTS 9, IASE satellite meetings
Make use of the unique quality, complexity and breadth of the IPUMS-International and IPUMS-DHS datasets for teaching purposes
Extend the reach of the IPUMS datasets to a broader audience of both statistics teachers and students from around the world
Expose teachers and students from the field of statistics to the complexity and messiness of real-life data
In response to these special needs of statistics educators, IPUMS hired two Fellows to create exercises for teaching basic statistics using IPUMS-International and IPUMS-DHS data. Erez Garnai, a doctoral student in Sociology with experience teaching statistics to undergraduate Sociology majors, and Stephanie Chen, an undergraduate majoring in Social Statistics, followed the topics covered in a basic statistics textbook to create a series of 9 exercises based on IPUMS-International census data and 9 exercises based on IPUMS-DHS health survey data. Each exercise noted the statistics topics covered, specified the variables used, recommended country- and year-specific samples to use, supplied R programming code, and included answers and interpretive questions to test students' understanding of the topic. Specifically, these exercises covered:
Notes the statistics topics covered
Specifies the variables used
Recommends country- and year-specific samples to use
Supplies R programming code
Includes answers and interpretive questions to test students' understanding of the topic
Such exercises can be modified to fit the needs of teachers or serve as examples that spur the instructor's own creativity in creating new material using census and survey data from nearly 100 countries.
Understand implications of dropping missing values