The document describes a project to link Norwegian and Swedish immigrant men born between 1885 and 1895 from immigration records to census records from 1920 and 1930. Its goal is to analyze their economic and social outcomes. An algorithm links individuals by name using a string distance measure, focusing on men because women's last names may have changed. For the 1920 census, potential matches were found for over 60% of the Norwegian target groups and over 65% of the Swedish target groups. Future work includes expanding to all birth years and censuses and using the pilot results in grant proposals.
2. Introduction
Norwegian & Swedish Immigrant Men
Norwegian: 65,361 born 1885-1895
Swedish: 153,477 born 1885-1895
Populations in Censuses (born 1884-96)
Norwegian: 49,006 in 1920; 51,510 in 1930
Swedish: 79,486 in 1920; 84,698 in 1930
3. Goals
Link as many people as possible from the immigration data to the censuses
Analyze their economic and social outcomes in the 1920s & 1930s
5. Linking Data
Use an algorithm to link individuals from both sources using their names
Focus on men
Women’s last names may change
Born in 1885-1895 (from immigration)
6. Steps for Linking
Extract men from both sources
Match individual to census data
Use string “distance” measure to narrow matches to similar sounding names
Remaining data are analyzed further to find the best matches (ongoing)
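The name-comparison step can be sketched as follows. This is a minimal pure-Python Jaro-Winkler score, not the project's own code (the slides mention Stata). The 0.8 cutoff comes from the results slides. The function names, the prefix weight of 0.1, and the choice to compare a single name string per person are assumptions made for illustration.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two equal characters count as a match only within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro score boosted for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * p * (1 - score)


def keep_similar(name_pairs, threshold=0.8):
    """Keep only the name pairs whose score reaches the threshold."""
    return [(a, b) for a, b in name_pairs if jaro_winkler(a, b) >= threshold]
```

With this sketch, keep_similar([("olsen", "olson"), ("olsen", "berg")]) keeps only the first pair, since "olsen" and "berg" share no matching characters.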
7. Summary
Results for Norwegian Matching (1920)
Potential matches (Jaro-Winkler distance ≥ 0.8): 356,758
From immigration: 40,397 (61.8% of target group)
From 1920 census: 31,423 (65.1% of target group)
8. Summary
Results for Swedish Matching (1920)
Potential matches (Jaro-Winkler distance ≥ 0.8): 2,748,009
From immigration: 102,646 (66.9% of target group)
From 1920 census: 55,844 (74% of target group)
9. Improving Links and Training Data
Maximizing number of immigrant men
Looked up names for possible gender errors
Missing data on gender
Develop “training data”
Multiple possible matches in some cases
Choose the true and best match for names
12. Economic and Social Outcomes
Investigate the outcomes
Where did most Norwegians and Swedes live in the 1920s & 1930s?
What are the characteristics of the places?
What are other demographics of the places?
13. Challenges
Historical data
No unique IDs, disorganized
Making judgement calls
Missing county and state level information
Coding
Learning Stata
Learning linking methods
14. Future Directions
Expand to all birth years and censuses
Use results of the pilot study for grant proposals
Norwegian and Swedish researchers may partner in linking to people in their censuses
16. Unique Matches
Rates are shares of those remaining in the potential matches.
According to...           Norwegian  Rate          Swedish  Rate
Imm. ID                   11,330     28%           17,864   17.4%
Census ID                 7,924      25.2%         8,405    15.1%
Both Imm. ID & Census ID  2,854      7.1% & 9.1%   2,517    2.4% & 4.5%
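One way to read the unique-match counts is sketched below. It assumes that an ID is "unique" when it appears in exactly one potential-match pair; the slides do not state the definition, so this is an interpretation. The function names are hypothetical. The rate check uses the Norwegian figures from the 1920 results (40,397 immigration records and 31,423 census records in the potential matches), which are consistent with the Norwegian rates reported here.

```python
from collections import Counter


def unique_matches(pairs):
    """Split potential (imm_id, census_id) pairs by uniqueness.

    Assumption: an ID is unique if it occurs in exactly one pair.
    """
    imm_counts = Counter(imm for imm, _ in pairs)
    cen_counts = Counter(cen for _, cen in pairs)
    unique_imm = {imm for imm, n in imm_counts.items() if n == 1}
    unique_cen = {cen for cen, n in cen_counts.items() if n == 1}
    # Pairs where both sides have no competing candidate.
    both = [(i, c) for i, c in pairs if i in unique_imm and c in unique_cen]
    return unique_imm, unique_cen, both


def rate(count, remaining):
    """Percentage of the remaining records, rounded to one decimal."""
    return round(100 * count / remaining, 1)


# Made-up pairs: i1 has two candidates, i2 and i3 have one each,
# and census record c2 is claimed by both i1 and i2.
pairs = [("i1", "c1"), ("i1", "c2"), ("i2", "c2"), ("i3", "c3")]
unique_imm, unique_cen, both = unique_matches(pairs)
```

In the made-up example, only ("i3", "c3") is unique on both sides, which illustrates why the "both" row is much smaller than either single-ID row.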
Editor's Notes
On one hand, we have the data on Norwegian and Swedish immigrants (who came through Ellis Island) in the late 1800s and early 1900s
On the other hand, we have Norwegian- and Swedish-born populations who remained in the United States and were enumerated in the 1920 and 1930 censuses.
In the end, we want to say something about those immigrants who came in through Ellis Island and other ports and their experiences in the United States
For this study we want to… link as many people as possible…
And...analyze
Pilot studies of a selected group to see how well the process works
Matching means that a person born in 1885, for example, is matched with individuals in the census who were born in the years 1884-1886
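The birth-year window can be sketched like this. The record layout (dictionaries with "id" and "birth_year" keys) and the function name are assumptions for illustration; only the one-year window on either side comes from the note above.

```python
from collections import defaultdict


def candidate_pairs(immigrants, census, window=1):
    """Pair each immigrant with census records born within +/- window years."""
    by_year = defaultdict(list)
    for rec in census:
        by_year[rec["birth_year"]].append(rec)
    pairs = []
    for imm in immigrants:
        first = imm["birth_year"] - window
        last = imm["birth_year"] + window
        for year in range(first, last + 1):
            for cen in by_year.get(year, []):
                pairs.append((imm["id"], cen["id"]))
    return pairs


# Made-up records: only c1 (1884) and c2 (1886) fall inside the window.
immigrants = [{"id": "i1", "birth_year": 1885}]
census = [
    {"id": "c1", "birth_year": 1884},
    {"id": "c2", "birth_year": 1886},
    {"id": "c3", "birth_year": 1887},
]
pairs = candidate_pairs(immigrants, census)
```

Grouping the census records by birth year first avoids comparing every immigrant with every census record.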
For name “distance” use Jaro-Winkler
Repeat for all censuses and all immigration files
Because the gender field had errors and missing values, I looked at individuals' first names and used a dictionary of first names to determine their gender, which could increase the number of male candidates
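A minimal sketch of the first-name lookup, assuming records are dictionaries with "first_name" and "gender" keys. The tiny lookup table below is made up for illustration and stands in for a real name dictionary. This version only fills in missing values; checking recorded values for errors would need a further comparison step.

```python
# Hypothetical lookup table; a real one would come from a name dictionary.
FIRST_NAME_GENDER = {
    "ole": "M",
    "lars": "M",
    "anders": "M",
    "ingrid": "F",
    "anna": "F",
}


def fill_gender(record):
    """Return the recorded gender, or infer it from the first name.

    Returns None when the gender is missing and the name is not in the table.
    """
    recorded = (record.get("gender") or "").strip().upper()
    if recorded in {"M", "F"}:
        return recorded
    first = (record.get("first_name") or "").strip().lower()
    return FIRST_NAME_GENDER.get(first)
```

Records that come back as None would still need a manual look.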
Training data are data that will be used later to automate the finding of true links.
Show the two similar-name examples (an obvious one and a complicated one)
If the names are too similar, I look at other variables, such as the arrival year reported in the censuses and in the immigration data, to decide which is the true match
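The arrival-year comparison could be sketched as below. In the project this is a manual judgement call, so the rule here (accept a candidate only when exactly one agrees on arrival year) is an assumption, as are the field names and the function name.

```python
def pick_by_arrival_year(immigrant, candidates, tolerance=0):
    """Return the one candidate whose arrival year agrees, else None.

    Candidates with no reported arrival year are ignored.
    """
    agreeing = [
        c for c in candidates
        if c.get("arrival_year") is not None
        and abs(c["arrival_year"] - immigrant["arrival_year"]) <= tolerance
    ]
    # Accept only an unambiguous result; otherwise leave it for manual review.
    return agreeing[0] if len(agreeing) == 1 else None


# Made-up example: two census candidates with similar names.
immigrant = {"id": "i1", "arrival_year": 1905}
candidates = [
    {"id": "c1", "arrival_year": 1905},
    {"id": "c2", "arrival_year": 1910},
]
best = pick_by_arrival_year(immigrant, candidates)
```

Returning None when zero or several candidates agree keeps ambiguous cases out of the training data.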
Using state- and county-level demographic and economic data, I find the characteristics of places with high Norwegian and Swedish populations.
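Finding the places with the most linked individuals could start with a simple count, sketched here. The record layout and the sample places are made up for illustration. Records missing a state or county are skipped, since the slides note that this information is sometimes missing.

```python
from collections import Counter


def top_places(linked_records, n=3):
    """Count linked individuals per (state, county) and return the top n."""
    counts = Counter(
        (r["state"], r["county"])
        for r in linked_records
        if r.get("state") and r.get("county")
    )
    return counts.most_common(n)


# Made-up sample of linked census records.
linked = [
    {"state": "MN", "county": "Hennepin"},
    {"state": "MN", "county": "Hennepin"},
    {"state": "IL", "county": "Cook"},
    {"state": "ND", "county": None},
]
top = top_places(linked, n=1)
```

The resulting place keys could then be joined to county-level demographic and economic tables.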