An Explorative Study of H-1B Visas
Rosy Garcia-Rivas, Kevin Huang, Macaria Robinson, Meredith Valenzuela
May 2015
Abstract
The H-1B visa is a non-immigrant visa in the US that allows US employers to temporarily employ
foreign workers in specialty occupations. Our group conducted a population-based assessment of H-1B
visa approval rates to predict whether or not an applicant’s visa would be approved, based on many
different factors. We found that applicants in math/science industries made up the highest proportion of
applicants for H-1B visas. Furthermore, we found that people do not solely follow areas of highest pay,
as people in the health and math/science fields tended to move to the Northeast, while the Midwest and
West Coast pay higher rates. What we can extrapolate from this information is that people in different
fields tend to flow to their particular regions of interest for different reasons.
Contents
I Introduction
II Questions
III Description of the Sample and Data Collection
IV Variables of the Study and How They Were Measured
V Statistical Methods
1 Logistic Regression
2 Contingency Tables
VI Summary of Findings
VII Conclusions Drawn From the Study
VIII Shortcomings
IX Recommendation for Future Research
X R Code
List of Tables
1 Table of Main Variables
2 Logistic Regression: Wage Rate
3 Logistic Regression: Industry
4 Applicants, Wage by Region
5 Applicants by Industry
6 Proportion of Certified Applicants by Region
7 Proportion of Certified Applicants Applying by Industry
8 Two Way Table of Median Wage
List of Figures
1 Map of Employed H1-B Visa Workers
2 Barplot of Industries by Region
Part I
Introduction
The H-1B visa is a non-immigrant visa in the United States that allows US employers to temporarily employ
foreign workers in specialty occupations. The minimum requirements for obtaining this visa classification
are:
(1) the applicant must have a US employer to sponsor him/her
(2) the job that the applicant is applying for must require a bachelor’s degree or higher
(3) the job duties and the applicant’s education/work experience must be related
(4) the job must pay at least the prevailing wage in the area for that service
More than ten percent of the undergraduate population at the University of California, Los Angeles
is made up of international students. The H-1B visa is important to many of these students because it is one avenue
for them to remain in the US after graduation. Therefore, our group wanted to investigate the likelihood
of, and significant contributors to, obtaining an H-1B visa. Certain industries, such as the technology and
finance industries, congregate in specific locations around the US, for example San Francisco and New York.
Moreover, we have observed that there seems to be a higher proportion of international students in certain
academic fields compared to others at UCLA. Therefore, we expected the strongest contributing factors of
whether an H-1B visa was approved or not to be:
(a) industry the petitioner would be employed in
(b) geographic location of the job
(c) the interaction between industry and geographic location
Unfortunately, our data showed a 99% approval rate for H-1B visas and was not representative
of the actual acceptance rate in the US in 2007, which meant that attempting to determine the strongest
contributing factors of H-1B visa approval and predicting whether a visa would be approved or not would
be meaningless with our data. Therefore, we chose to focus on extracting information about the
applicants of approved H-1B visas, focusing mainly on their chosen industry, academic field, and the
geographic location of their employment. Given that the US is a hub for cutting-edge research and technology
in health, biotechnology, and business/finance, and is also a strong force in pop culture and the arts, we
thought this endeavor would be the most informative, and give us insight on why people wanted to move
to the US, and subsequently where the US stands in the eyes of outsiders. Furthermore, we hope that a
deeper understanding of approved visas will provide future applicants with a better idea of which locations
and industries to consider.
Part II
Questions
The following are the questions that we hope to answer through our analysis of the H1-B Visa dataset.
• Is the approval likelihood higher for people working in certain fields than other fields?
• Is the approval likelihood higher for people applying for an H-1B visa in certain regions than in others?
• Is the mean prevailing wage for people applying for the H-1B within the industries higher in certain
regions than others?
• Can we predict whether a person’s H-1B visa will get approved based on certain variables?
Part III
Description of the Sample and Data Collection
The H-1B data used in this report was provided to the Employment and Training Administration (ETA) by
employers who submitted foreign labor certification applications for the year 2007. The original data set has
426,597 observations and 39 variables, although we chose to limit our study to H-1B visas that were approved.
The sample includes information about the employer’s location, the job position, and salary, but does not offer
any information about the applicants themselves, such as gender, age, or country of origin. It is important to
note that the sample data was not supplied by H-1B applicants themselves, but rather by their employers to the
Bureau of Labor Statistics’ Occupational Employment Statistics survey. The Bureau of Labor Statistics’ (BLS)
Occupational Employment Statistics program provides estimates used to assist in setting the wage levels in the
Foreign Labor Certification (FLC) wage library.
Figure 1: Map of Employed H1-B Visa Workers
Map based on location. Color shows industry. Size shows median wage. Data excludes Alaska
and Hawaii. BLUE = Education. ORANGE = Engineering and Computer Sciences. GREEN = Finance and
Business. RED = Health. PURPLE = Science and Math.
Figure 1 shows the distribution of H-1B workers by industry and wage. From this graph we see that most
applicants are concentrated on the coasts, with a significant portion spread out throughout the Midwest. The
mountainous regions in the West are more sparsely populated. Applicants in the health and science/math
industries appear to make up the largest share of approved visa applicants across the country and to earn the highest wages.
Moreover, this figure shows that health and education jobs are distributed in a way that appears to follow
the US population’s overall distribution. This is in stark contrast to the distribution of business/finance and
science/math applicants, who seem to cluster around big cities.
Part IV
Variables of the Study and How They Were Measured
Variables of interest for this study centered around industry type, geographical location, and salary. The
variables Industry and Region were created from the variables Job Title and State, respectively. The levels of
Industry are Finance and business, Health, Education, Science/Math, and Engineering and Computers, while
the levels of Region are Midwest, Northeast, South, and West. We chose to categorize Industry by these fields
because we felt they encompassed the careers of H-1B visa workers. Additionally, Case
Status initially included four different levels: Certified, Denied, Hold, and Pending. Since we are primarily
concerned with whether a candidate will be approved or not, we subsetted the data to include only certified and
denied cases.
Variable            Description                             Measurement
Industry            Industry of Employment                  Categorical
Region              Region of Employment                    Categorical
Case Status         Approval status: certified or denied    Categorical
Wage_Rate_From_1    Employer’s proposed wage rate           Numerical
Table 1: Table of Main Variables
Part V
Statistical Methods
1 Logistic Regression
Our initial focus for this research was to classify whether or not a candidate’s visa application would be
accepted or denied, and to determine which factors are most important to being approved. Because our
response was binary (“Accepted” vs. “Denied”), we decided to run a logistic regression. Logistic regression
would allow us to classify the visa petitions while steering clear of other complex models that could possibly
overfit the data.
To begin, we wanted to ensure we had a portion of the data to test our model with, so we cut 30% of the
cases into a testing dataset and used the remaining 70% of the cases to build our model. We built our logistic
regression model using backward selection on the training dataset to see which model provided the lowest
Akaike Information Criterion (AIC). The AIC punishes models for complexity, thus preventing a model from
overfitting the data that it was constructed with. From there we used cross-validation techniques to see
which of the models with the lowest AIC also provided the lowest test mean square error. When tabulating our
success rate, we learned that our model performed well when predicting an approved visa but performed very
poorly when predicting a denied visa. To investigate why our model’s accuracy was so lopsided, we tabulated
the data and saw that the data itself was heavily skewed towards approved visas, which
made up 99% of the observations. Moving forward, we decided to fit a simple logistic regression
model for each predictor to determine which levels of each variable were significant at the 5%
level. The simple logistic regressions provided insight that we applied to our next method, contingency tables.
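The accuracy problem described above can be reproduced on simulated data. The following is a minimal sketch, not the analysis from this report: the data frame `sim`, its columns, and the 0.5 cutoff are all assumptions made for illustration, with an outcome that is 99% "Certified" by construction.

```r
# Simulated data only: about 99% "Certified", 1% "Denied", and a predictor
# (wage) that is unrelated to the outcome by construction.
set.seed(1)
n <- 10000
wage <- rnorm(n, mean = 60000, sd = 15000)
status <- factor(ifelse(runif(n) < 0.01, "Denied", "Certified"),
                 levels = c("Certified", "Denied"))
sim <- data.frame(status, wage)

# 70/30 split into training and testing data, as described in the text
test_idx <- sample(seq_len(n), size = floor(0.3 * n))
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]

# With a factor response, glm() models the probability of the second
# factor level, here "Denied"
fit <- glm(status ~ wage, data = train, family = binomial)
p_denied <- predict(fit, newdata = test, type = "response")
pred <- factor(ifelse(p_denied > 0.5, "Denied", "Certified"),
               levels = levels(sim$status))

# Confusion matrix, overall accuracy, and the share of truly denied
# cases that the model actually flags as denied
conf <- table(predicted = pred, actual = test$status)
accuracy <- sum(diag(conf)) / sum(conf)
sensitivity_denied <- conf["Denied", "Denied"] / sum(conf[, "Denied"])
print(conf)
```

Comparing `accuracy` with `sensitivity_denied` shows how a model can score well overall while rarely or never identifying the minority class, which is why overall accuracy alone is a weak yardstick for data this unbalanced.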
2 Contingency Tables
Due to the limitations of our data (99% of cases are certified), we used contingency tables to provide deeper
analyses of the relationships between region, industry, and wage. Contingency tables were beneficial in
comparing the frequencies of different combinations of industries and location. Getting a better understanding
of the popularity of industries in different locations allows us to create a profile of characteristics of
H-1B visa workers and employers, and hopefully will allow us to inform our peers and others of ideal places
to work.
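As a small illustration of the two kinds of proportion tables used in this report, the sketch below builds a contingency table from a toy data frame. The counts are invented for illustration and the object names are assumptions. With industries as rows and regions as columns, `margin = 1` gives shares within each industry (rows sum to 1) and `margin = 2` gives shares within each region (columns sum to 1).

```r
# Toy data; the counts are invented for illustration only.
toy <- data.frame(
  Region   = c("West", "West", "Northeast", "Northeast", "Northeast", "South"),
  Industry = c("Health", "Education", "Health", "Health", "Education", "Education")
)

# Cross-tabulate industry (rows) by region (columns)
counts <- table(toy$Industry, toy$Region)

# Share of each industry's applicants found in each region (rows sum to 1)
by_industry <- prop.table(counts, margin = 1)

# Share of each region's applicants working in each industry (columns sum to 1)
by_region <- prop.table(counts, margin = 2)

print(counts)
print(round(by_industry, 3))
print(round(by_region, 3))
```

The choice of margin determines which question the table answers: where an industry's applicants go, or which industries make up a region's applicants.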
Part VI
Summary of Findings
After building our logistic regression models we found that most of the models we obtained gave a successful
prediction rate of around 99%. Since 99% of the applicants in our data set were “certified”, the logistic
regression by itself was not appropriate to answer our questions.
All the variables, except for the wage, were statistically insignificant (see Table 2). Therefore, we ran
three simple logistic regression models to identify which levels of industry and region were potentially
significant, and then used contingency tables to analyze the prominence of those factors.
The odds ratio for wage is 1.00 to two decimal places, so a one-dollar increase in the proposed wage leaves
the odds essentially unchanged. Because R’s glm models the second level of the Case Status factor (Denied),
the odds in Tables 2 and 3 are odds of denial rather than of certification. Looking at Table 3, applying for
an H-1B visa in the health industry, rather than in business & finance (the reference level), multiplies the
odds of denial by about 1.4, while applying in science & math multiplies them by about 0.72. Read this way,
among the significant levels science & math has the most favorable association with certification and health
the least.
We then investigated region through the use of contingency tables. We began by looking at the candidates
in the Northeast region of the United States. The largest share (33%) of candidates applied in the Northeast,
while only about 17% applied in the Midwest (see Table 4). After investigating the candidates by industry, we found
that the science & math industry dominates in popularity, with the vast majority of applicants (68.7%)
applying for work in these fields (see Table 5). It is our belief that the United States’ gap in science & math
education, in comparison to other developed countries, is one reason for the popularity of the industry.
Although a relatively small percentage of US natives seek careers in science and math, there is
great interest among foreign individuals in obtaining visas to work in the US in these fields.
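Variable            Coefficient    p-value    Odds
Intercept           -3.18          <2e-16     0.04
Wage_Rate_From_1    -2.46e-5       <2e-16     1.00
$y_i = -3.18 - 2.46 \times 10^{-5}\, x_i$, where $y_i$ is the log-odds for application $i$ and $x_i$ is its Wage_Rate_From_1
Table 2: Logistic Regression: Wage Rate
The odds ratio of 1.00 means that a one-unit increase in wage leaves the odds essentially unchanged. The
coefficient is negative, so the odds of the modeled outcome (Denied, the second level of the Case Status
factor in R) fall very slightly as the wage rises.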
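Variable                   Coefficient    p-value    Odds
Intercept                  -5.99          <2e-16     0.003
Industry:Education         -0.1359        0.4495     0.87
Industry:Health            0.3960         0.0313     1.4
Industry:Science & Math    -0.3169        0.0165     0.72
$y_i = -5.99 - 0.1359\, x_{i,\text{Education}} + 0.3960\, x_{i,\text{Health}} - 0.3169\, x_{i,\text{Science \& Math}}$, where each $x_{i,\cdot}$ is a 0/1 industry indicator
Table 3: Logistic Regression: Industry
Applying for an H-1B visa in the Health industry, rather than in Business & Finance (the reference level),
multiplies the odds by about 1.4, and applying in Science & Math multiplies them by about 0.72; the Education
coefficient is not significant (p = 0.45). Because R’s glm models the second level of the Case Status factor
(Denied), these are odds of denial, so among the significant levels Science & Math has the most favorable
association with certification and Health the least.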
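As a worked check of how the Odds columns in Tables 2 and 3 relate to the coefficients, each odds value is the exponential of the corresponding coefficient. The rescaling to a $10,000 wage difference in the last line is an illustration added here under that same formula, not a figure from the original analysis.

```latex
\begin{align*}
\text{odds ratio} &= e^{\beta} \\
e^{-3.18} &\approx 0.042, & e^{-5.99} &\approx 0.0025 \\
e^{-0.1359} &\approx 0.87, & e^{0.3960} &\approx 1.49, & e^{-0.3169} &\approx 0.73 \\
e^{-2.46\times 10^{-5}} &\approx 0.99998, & e^{-2.46\times 10^{-5}\times 10^{4}} &= e^{-0.246} \approx 0.78
\end{align*}
```

The values 1.4 and 0.72 reported in Table 3 appear to be truncated rather than rounded. Per dollar, the wage odds ratio is indistinguishable from 1.00, but over a $10,000 difference the same coefficient corresponds to multiplying the modeled odds by roughly 0.78.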
                          Northeast          West              Midwest           South
Number of Applicants      101,865 (33.3%)    66,304 (21.7%)    51,110 (16.7%)    86,547 (28.3%)
Median Prevailing Wage    $58,000            $70,620           $55,000           $55,000
Table 4: Applicants, Wage by Region
                        Business & Finance    Education         Health            Science & Math
Number of Applicants    41,487 (13.5%)        35,231 (11.5%)    19,206 (6.27%)    210,329 (68.7%)
Table 5: Applicants by Industry
Figure 2: Barplot of Industries by Region
Because we are only interested in the characteristics of the applicants that get certified, we subsetted the
data by removing all denied cases.
Looking at Table 7, we are able to see that in the business & finance industry, 38.9% of the people that
are certified reside in the Northeast, while only 11.5% of the people certified in business & finance reside in
the Midwest. One possible explanation of these results is that the low population density and small number
of big cities cause applicants in the business & finance industry to avoid the Midwest. Interestingly enough,
only 21.27% of the applicants are certified to work in the West. We expect this is because the Northeast
contains NYC and therefore Wall Street, which carries a big name and is the central hub for finance in the
US, drawing people away from the Midwest and West Coast. Interestingly, 39.5% of the certified foreigners
in the education industry (Table 7) are migrating to the South. Looking at the two-way table of the median
start wage by region and by industry (Table 8) in conjunction with Figure 2, we see that all of the regions pay the
highest in the health industry except the Northeast, yet the largest share of approved applicants are moving to
the Northeast in spite of this fact.
                      West      Midwest    Northeast    South
Business & Finance    0.1336    0.0933     0.1584       0.1339
Education             0.0959    0.1226     0.0835       0.1607
Health                0.0522    0.0782     0.0638       0.0599
Science and Math      0.7181    0.7057     0.6941       0.6453
Table 6: Proportion of Certified Applicants by Region
                      West      Midwest    Northeast    South
Business & Finance    0.2317    0.1149     0.3892       0.2796
Education             0.1806    0.1779     0.2415       0.3950
Health                0.1805    0.2084     0.3393       0.2703
Science and Math      0.2263    0.1714     0.3361       0.2654
Table 7: Proportion of Certified Applicants Applying by Industry
                      West       Midwest    Northeast    South
Business & Finance    $61,000    $57,600    $62,000      $53,672
Education             $47,500    $45,000    $47,000      $42,500
Health                $85,000    $65,000    $56,650      $71,760
Science and Math      $75,000    $55,000    $58,240      $58,000
Table 8: Two Way Table of Median Wage
To summarize, only the business & finance industry followed the region that pays the most. For the
health industry, the West pays the H-1B applicants the most; however, 33.9% of the applicants in health are
migrating to the Northeast while only 18% are certified in the West. The same is true of the science & math
industry: 33.6% of certified H-1B foreigners are migrating to the Northeast, even though the West pays about 1.29
times more ($75,000 versus $58,240 in Table 8). Also, looking at the map (Figure 1), we see that the jobs in the health and education industries
are distributed in a similar way to the distribution of the US population, indicating an evenly growing
demand throughout the US for education and health workers. On the other hand, the jobs in business &
finance and science & math are more clustered in bigger US cities, indicating a need for large populations
for these industries and possibly suggesting a need for faster communication in these industries.
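The comparison between pay and popularity can be checked directly from the published tables. The sketch below transcribes Tables 7 and 8 into matrices (the object names are assumptions made for illustration) and, for each industry, compares the region with the highest median wage against the region with the largest share of certified applicants.

```r
regions    <- c("West", "Midwest", "Northeast", "South")
industries <- c("Business & Finance", "Education", "Health", "Science and Math")

# Median wages transcribed from Table 8 (rows = industry, columns = region)
wage <- matrix(c(61000, 57600, 62000, 53672,
                 47500, 45000, 47000, 42500,
                 85000, 65000, 56650, 71760,
                 75000, 55000, 58240, 58000),
               nrow = 4, byrow = TRUE, dimnames = list(industries, regions))

# Shares of certified applicants transcribed from Table 7
share <- matrix(c(0.2317, 0.1149, 0.3892, 0.2796,
                  0.1806, 0.1779, 0.2415, 0.3950,
                  0.1805, 0.2084, 0.3393, 0.2703,
                  0.2263, 0.1714, 0.3361, 0.2654),
                nrow = 4, byrow = TRUE, dimnames = list(industries, regions))

# Region with the highest wage and the largest share, per industry
top_wage  <- regions[apply(wage, 1, which.max)]
top_share <- regions[apply(share, 1, which.max)]
comparison <- data.frame(industry = industries, top_wage, top_share,
                         follows_pay = top_wage == top_share)
print(comparison)

# Ratio of West to Northeast median wage in science and math
west_vs_ne <- wage["Science and Math", "West"] / wage["Science and Math", "Northeast"]
print(round(west_vs_ne, 2))
```

Working from the two-way tables rather than from regional totals keeps the comparison within each industry, which is the level at which the summary's claim is made.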
Part VII
Conclusions Drawn From the Study
According to Table 4, the Midwest, when compared to the Northeast, is a popular location for immigrant
workers granted H-1B visa status in the Education, Engineering/Computer Sciences, Finance and Business,
and Science and Math industries, but not in the Health industry. It also appears as though the Northeast, when
compared to the South, is a less popular location of work for those same industries, excluding Health.
After analyzing all the contingency tables, we discovered that in both the Business & Finance and
Education industries, the highest proportion of applicants for the H-1B visa tends to follow the regions
where the prevailing wage is the highest. In the Health industry, however, the Midwest actually pays the
most, yet 33.9% of the applicants in Health are migrating to the Northeast instead, while only 20% are
migrating to the Midwest. The same happens for Science and Math: 33.6% of certified H-1B foreigners are
migrating to the Northeast, even though the West pays 1.19 times more. We recommend that individuals
seeking H-1B visa status in the field of:
• Health move to the Midwest
• Business and Finance move to the West, although with slightly lower chances of getting certified
• Science and Math, which are wanted in all regions, move to the West to get the highest pay.
Part VIII
Shortcomings
The shortcomings of our research are due to a non-representative sample of H-1B visa applicants. A sample
with a rate of visa approval more consistent with reality would have allowed us to create a
more useful logistic regression model. Moreover, a lack of information about applicants, e.g. gender, country
of origin, net worth, etc., kept us from drawing more comprehensive conclusions about H-1B applicants and
the likelihood of certification. Because of this, we have too many unknown covariates to account for.
Working with a dataset that was highly skewed was problematic, but by this point in the project we had
already run into other issues. We were initially assigned an education dataset. However, given that we had already
worked with similar data in another class, we felt it best to gain exposure to something different. Since all of
our groupmates were interested in criminal data, we found a dataset that surveyed prisoners. Unfortunately,
this dataset had a lot of missing information and asked over three thousand questions. Thus, we ultimately
moved toward the dataset of H-1B visas, which appealed to our entire group and was rich with information.
Part IX
Recommendation for Future Research
We recommend doing a more comprehensive analysis using both employer data and applicant data. The
country of origin of the applicant or the gender of the applicant might have a more significant impact on
visa approval than location of employment or industry.
Part X
R Code
load("~/Downloads/H1B.RData")
# Getting Only Cases that are Certified or Denied
#Index names chosen so they do not shadow the base function c()
cert=which(h1b$Case_Status=="Certified")
den=which(h1b$Case_Status=="Denied")
k=c(cert,den)
hb=h1b[k,]
#Getting to know the data
#Dealing with dates for the glm
#What time of the year is the petition submitted?
library(lubridate)
hb$timeyr=month(as.POSIXlt(hb$Submitted_Date, format="%d/%m/%Y"))
#Gets the month that it was submitted
#Breaks down into what part of the year
hb$timeoyr=rep("End",length(hb$timeyr))
hb$timeoyr[which(hb$timeyr<9)]="Middle"
hb$timeoyr[which(hb$timeyr<5)]="Beginning"
#Variable: how long the visa would last, in years
hb$visa_last=as.numeric(hb$End_Date-hb$Begin_Date)/365.25
hb$visa_last=round(as.numeric(hb$visa_last),1)
#Cleaning up State Variable and Creating Regions
#States were sorted according to Regions using the Census data
hb$State=as.character(hb$State)
temp=nchar(hb$State, type = "chars", allowNA = FALSE)
st.rm=which(temp>2)
hb$State[which(hb$State=="Newton")]="MA"
hb$State[which(hb$State=="Seattle")]="WA"
hb$State[which(hb$State=="Chantilly")]="VA"
hb$State[which(hb$State=="New York")]="NY"
#Create a Region Variable to get rid of all the levels in State
hb$Region<-rep("",dim(hb)[1])
NE<-c("CT","PA","NJ","NY","RI","NH","VT","ME","MA")
MW<-c("ND","SD","NE","KS","MO","IA","MN","WI","IL","IN","OH","MI")
S<-c("MD","DE","DC","VA","WV","KY","TN","GA","AL","MS"
,"FL","AR","LA","OK","TX","NC","SC")
W<-c("WA","ID","MT","WY","CO","UT","AZ","NM","NV","CA","OR","AK","HI")
hb$Region[which(hb$State %in% NE)]="NorthEast"
hb$Region[which(hb$State %in% MW)]="MidWest"
hb$Region[which(hb$State %in% S)]="South"
hb$Region[which(hb$State %in% W)]="West"
#Recoding Occupational Code by Industry
hb$Industry=rep(NA,dim(hb)[1])
hb$Job_Title=as.character(hb$Job_Title)
hb$Job_Title<-iconv(enc2utf8(hb$Job_Title),sub="byte")
#tolower() is vectorized, so the column stays a character vector rather than a list
hb$Job_Title=tolower(hb$Job_Title)
#Engineers and Computer Related
archit<-grep("archit",hb$Job_Title)
eng<-grep("engin",hb$Job_Title)
tech<-grep("tech",hb$Job_Title)
comp<-grep("com",hb$Job_Title)
prog<-grep("progr",hb$Job_Title)
dev<-grep("developer",hb$Job_Title)
sysans<-grep("systems analyst",hb$Job_Title)
sysan<-grep("system analyst",hb$Job_Title)
soft<-grep("software",hb$Job_Title)
CE<-c(archit,eng,tech,comp,prog,dev,sysans,sysan,soft)
hb$Industry[CE]="Engineering & Computer"
dat<-grep("data",hb$Job_Title)
math<-grep("math",hb$Job_Title)
stats<-grep("statist",hb$Job_Title)
chem<-grep("chemis",hb$Job_Title)
SM<-c(dat,math,stats,chem)
hb$Industry[SM]="Science & Math"
med<-grep("medic",hb$Job_Title)
clinic<-grep("clinic",hb$Job_Title)
phys<-grep("physic",hb$Job_Title)
physo<-grep("physio",hb$Job_Title)
dentist<-grep("denti",hb$Job_Title)
dental<-grep("dental",hb$Job_Title)
pathol<-grep("pathol",hb$Job_Title)
pharm<-grep("pharm",hb$Job_Title)
sci<-grep("scientist",hb$Job_Title)
########## Following R Code utilizes the cleaned data set we submitted ##########
#********************************BEGIN USE OF CLEANED DATA SET***********************************#
hb$Industry<-relevel(as.factor(hb$Industry),"Engineering & Computer")
#splitting into training and testing or
set.seed(7)
test=sample(1:dim(hb)[1],(dim(hb)[1]*.3))
train=hb[-test,]
testing=hb[test,]
#######MODELING ATTEMPTS######################################
######t<-glm(Case_Status~visa_last,data=hb,family="binomial")
######tt<-glm(Case_Status~timeoyr,data=hb,family="binomial")
######ttt<-glm(Case_Status~Nbr_Immigrants,data=hb,family="binomial")
######t4<-glm(Case_Status~Prevailing_Wage_1,data=hb,family="binomial")
######t5<-glm(Case_Status~Region,data=hb,family="binomial",subset=train)
######total<-glm(Case_Status~visa_last+Region+timeoyr+Program.Designation+
######Nbr_Immigrants+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1+
###### Prevailing_Wage_1,data=train,family="binomial")
######step(total,direction="backward")
######sb1<-glm(formula = Case_Status ~ visa_last + timeoyr + Program.Designation +
###### Wage_Rate_From_1 + Wage_Rate_Per_1 + Part_Time_1 + Prevailing_Wage_1,
###### family = "binomial", data = train)
#AIC 9549
######FUll<-glm(Case_Status~visa_last+timeoyr+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1+
###### Prevailing_Wage_1+Withdrawn+Industry,data=train,family="binomial")
#AIC 9580
######step(FUll,direction="backward")
######AIC(FUll)
#full with Visa last interactions/ AIC 9356
######FwithvlI<-glm(Case_Status~visa_last+timeoyr+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1
###### Prevailing_Wage_1+Withdrawn+Industry+visa_last*timeoyr+visa_last*Wage_Rate_From_1+
###### visa_last*Wage_Rate_From_1+visa_last*Wage_Rate_Per_1+visa_last*Part_Time_1+
###### visa_last*Prevailing_Wage_1+visa_last*Withdrawn+visa_last*Industry,data=train,family="binomial")
#full with visa_last & time of year interaction/AIC 9287.995
######F2I<-glm(Case_Status~visa_last+timeoyr+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1+
###### Prevailing_Wage_1+Withdrawn+Industry+visa_last*timeoyr+visa_last*Wage_Rate_From_1+
###### visa_last*Wage_Rate_From_1+visa_last*Wage_Rate_Per_1+visa_last*Part_Time_1+
###### visa_last*Prevailing_Wage_1+visa_last*Withdrawn+visa_last*Industry+
###### timeoyr*Wage_Rate_From_1+timeoyr*Wage_Rate_Per_1+timeoyr*Part_Time_1+
###### timeoyr*Prevailing_Wage_1+timeoyr*Withdrawn+timeoyr*Industry,data=train,family="binomial")
#AIC=58633
######F3I<-glm(Case_Status~visa_last+timeoyr+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1+
###### Prevailing_Wage_1+Withdrawn+Industry+visa_last*timeoyr+visa_last*Wage_Rate_From_1+
###### visa_last*Wage_Rate_From_1+visa_last*Wage_Rate_Per_1+visa_last*Part_Time_1+
###### visa_last*Prevailing_Wage_1+visa_last*Withdrawn+visa_last*Industry+
###### timeoyr*Wage_Rate_From_1+timeoyr*Wage_Rate_Per_1+timeoyr*Part_Time_1+
###### timeoyr*Prevailing_Wage_1+timeoyr*Withdrawn+timeoyr*Industry+
###### Wage_Rate_From_1*Wage_Rate_Per_1+Wage_Rate_From_1*Part_Time_1+
###### Wage_Rate_From_1*Prevailing_Wage_1+Wage_Rate_From_1*Withdrawn+Wage_Rate_From_1*Indu
###### data=train,family="binomial")
######F4I<-glm(Case_Status~visa_last+timeoyr+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1+
###### Prevailing_Wage_1+Withdrawn+Industry+visa_last*timeoyr+visa_last*Wage_Rate_From_1+
###### visa_last*Wage_Rate_From_1+visa_last*Wage_Rate_Per_1+visa_last*Part_Time_1+
###### visa_last*Prevailing_Wage_1+visa_last*Withdrawn+visa_last*Industry+
###### timeoyr*Wage_Rate_From_1+timeoyr*Wage_Rate_Per_1+timeoyr*Part_Time_1+
###### timeoyr*Prevailing_Wage_1+timeoyr*Withdrawn+timeoyr*Industry+
###### Wage_Rate_From_1*Wage_Rate_Per_1+Wage_Rate_From_1*Part_Time_1+
###### Wage_Rate_From_1*Prevailing_Wage_1+Wage_Rate_From_1*Withdrawn+Wage_Rate_From_1*Indu
###### Part_Time_1*Prevailing_Wage_1+Part_Time_1*Withdrawn+Part_Time_1*Industry
###### ,data=train,family="binomial")
#COmpletely full with all possible Interaction Plots
###### CFI<-glm(Case_Status~visa_last+timeoyr+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1+
###### Prevailing_Wage_1+Withdrawn+Industry+visa_last*timeoyr+visa_last*Wage_Rate_From_1+
###### visa_last*Wage_Rate_From_1+visa_last*Wage_Rate_Per_1+visa_last*Part_Time_1+
###### visa_last*Prevailing_Wage_1+visa_last*Withdrawn+visa_last*Industry+
###### timeoyr*Wage_Rate_From_1+timeoyr*Wage_Rate_Per_1+timeoyr*Part_Time_1+
###### timeoyr*Prevailing_Wage_1+timeoyr*Withdrawn+timeoyr*Industry+
###### Wage_Rate_From_1*Wage_Rate_Per_1+Wage_Rate_From_1*Part_Time_1+
###### Wage_Rate_From_1*Prevailing_Wage_1+Wage_Rate_From_1*Withdrawn+Wage_Rate_From_1*Indu
###### Part_Time_1*Prevailing_Wage_1+Part_Time_1*Withdrawn+Part_Time_1*Industry+
###### Prevailing_Wage_1*Withdrawn+Prevailing_Wage_1*Industry+Withdrawn*Industry
###### ,data=train,family="binomial")
######step(CFI,direction="backward")
####### Semi full model, missing some interaction plots
######Csf<-glm(Case_Status~visa_last+timeoyr+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1+
###### Prevailing_Wage_1+Withdrawn+Industry+visa_last*timeoyr+visa_last*Wage_Rate_From_1+
###### visa_last*Wage_Rate_From_1+visa_last*Wage_Rate_Per_1+visa_last*Part_Time_1+
###### visa_last*Prevailing_Wage_1+visa_last*Withdrawn+visa_last*Industry+
###### timeoyr*Wage_Rate_From_1+timeoyr*Wage_Rate_Per_1+timeoyr*Part_Time_1+
###### timeoyr*Prevailing_Wage_1+timeoyr*Withdrawn+timeoyr*Industry+
###### Wage_Rate_From_1*Wage_Rate_Per_1+Wage_Rate_From_1*Part_Time_1+
###### Wage_Rate_From_1*Prevailing_Wage_1+Wage_Rate_From_1*Withdrawn+Wage_Rate_From_1*Indu
###### Part_Time_1*Prevailing_Wage_1+Part_Time_1*Withdrawn+Part_Time_1*Industry,
###### data=train,family="binomial")
######step(Csf,direction="backward")
######phony<-glm(Case_Status~Wage_Rate_From_1+Part_Time_1+
###### Prevailing_Wage_1+Withdrawn+visa_last*Wage_Rate_From_1+visa_last*Part_Time_1+
###### visa_last*Prevailing_Wage_1+visa_last*Withdrawn
###### ,data=train,family="binomial")
################################################################
#Simplify the Industry Data
table(hb$Industry)
#Start from an all-NA character column, then fill in the merged labels
hb$newIndustry <- NA_character_
hb$newIndustry[hb$Industry %in% c("Engineering & Computer", "Science & Math")] <- "Science & Math"
hb$newIndustry[hb$Industry %in% c("Finance and Business")] <- "Business and Finance"
hb$newIndustry[hb$Industry %in% c("Health")] <- "Health"
hb$newIndustry[hb$Industry %in% c("Education")] <- "Education"
#removing the levels of Case_status
hb$Case_Status <- as.character(hb$Case_Status)
hb$Case_Status <- factor(hb$Case_Status)
#disregard all other observations that are not paid by year
newhb <- hb[hb$Wage_Rate_Per_1 %in% c("Year"),]
#then we are disregarding all other observations
newhb <- newhb[newhb$newIndustry %in% c("Science & Math", "Business and Finance", "Health",
"Education"),]
newhb2 <- newhb
# subset to use for Tableau
h1bTab= subset(hb1, Case_Status=="Certified", select=c(State, Region, Industry, Wage_Rate_From_1,
Zip_Code))
write.csv(h1bTab, "h1bTab.csv")
# Contingency Table of Certified Cases
table(newhb$Region, newhb$Industry, newhb$Case_Status)
#Contingency tables
table(newhb2$Case_Status)
table(newhb2$newIndustry)
table(newhb2$Region[!newhb2$Region==""])
tapply(newhb2$Wage_Rate_From_1[!newhb2$Region==""], newhb2$Region[!newhb2$Region==""], median)
tapply(newhb2$Wage_Rate_From_1, newhb2$newIndustry, median)
tapply(newhb2$Wage_Rate_From_1[!newhb2$Region==""],
list(newhb2$newIndustry[!newhb2$Region==""],
newhb2$Region[!newhb2$Region==""]), median)
#The Median Starting Wage in Business and Finance is Highest in the NorthEast
#The Median Starting Wage in Education is Highest in the South
#The Median Starting Wage in Science and Math is Highest in the West
#The Median Starting Wage in Health is Highest in the West
#Distribution of People who are certified
table(newhb$newIndustry[newhb$Case_Status=="Certified"])
table(newhb$Region[newhb$Case_Status=="Certified"])
table(newhb$Region[newhb$Case_Status=="Certified"],
newhb$newIndustry[newhb$Case_Status=="Certified"])
prop.table(table(newhb$Region[newhb$Case_Status=="Certified"],
newhb$newIndustry[newhb$Case_Status=="Certified"]), 1)
#In all of the Regions, we see that the highest percentage of people that are
#getting denied are all from the Science and Math Industry.
#In the Midwest, the lowest denial rate is in the Business and Finance Industry
#In the Northeast, the lowest denial rates are in Education and Health
#In the West, the lowest denial rate is in Education
#In the South, the lowest denial rate is in the Health Industry
prop.table(table(newhb$Region[newhb$Case_Status=="Certified"],
newhb$newIndustry[newhb$Case_Status=="Certified"]), 2)
#In the Business and Finance Industry the lowest denial rate is from the NorthEast Region
#In the Education Industry the lowest denial rate is from the South Region
#In the Health Industry the lowest denial rate is from the Midwest Region
#In the Science and Math Industry the lowest denial rate is from the NorthEast Region
## ANOTHER, MUCH SIMPLER LOGISTIC REGRESSION
#Creating training and testing data. 70% Training and 30% Testing data
set.seed(55555)
train <- newhb2[sample(nrow(newhb2), nrow(newhb2)*.70), ]
test <- newhb2[!(row.names(newhb2) %in% row.names(train)),]
#All simple models are statistically significant
glm1 <- glm(Case_Status~Region, data=train, family="binomial")
summary(glm1)
glm2 <- glm(Case_Status~newIndustry, data=train, family="binomial")
summary(glm2)
glm3 <- glm(Case_Status~Wage_Rate_From_1, data=train, family="binomial")
summary(glm3)
#2 factors+interaction model+covariate model
#Use newIndustry consistently in the main effect and the interaction
glm.model <- glm(Case_Status~Region+newIndustry+
Wage_Rate_From_1+Region:newIndustry, data=train, family="binomial")
summary(glm.model)
#REGIONS ARE NOT STATISTICALLY SIGNIFICANT
#Only the WAGE IS STATISTICALLY SIGNIFICANT
#THE TYPE OF INDUSTRY IS NOT SIGNIFICANT
#NONE OF THE INTERACTIONS ARE SIGNIFICANT.
#CHECKING PREDICTION ERROR
pred.vals=predict(glm.model, test, type="response")
#
pred=ifelse(pred.vals >median(pred.vals), "Denied", "Certified")
table(pred, test$Case_Status)/(length(pred.vals))
error <- 1-sum(diag(table(pred, test$Case_Status)/(length(pred.vals))))/
sum(table(pred, test$Case_Status)/(length(pred.vals)))
error
#NULL DEVIANCE TEST
glm.model$null.deviance
glm.model$df.null
#Degrees of freedom = number of predictors added relative to the null model
pchisq(glm.model$null.deviance-glm.model$deviance,
glm.model$df.null-glm.model$df.residual, lower.tail=FALSE)
#Very small p-value, therefore we reject the null hypothesis: we have statistically
#significant evidence that at least one coefficient of the logistic regression is not equal to zero
#RESIDUAL DEVIANCE TEST
pchisq(glm.model$deviance,glm.model$df.residual,lower=FALSE)
#VERY large pvalue. we fail to reject our null hypothesis.
Appendix
Variable                    Description
Submitted_Date              Date the application was submitted
Program.Designation         Types of H-1B Visas
Employer_Name               Employer’s name
Address_1                   Employer’s address
City                        Employer’s city
State                       Employer’s state
Zip_Code                    Employer’s postal code
Nbr_Immigrants              Number of job openings
Begin_Date                  Proposed begin date
End_Date                    Proposed end date
Job_Title                   Job title
DOL_DecisionDate            Date certified or denied
Certified_Begin_Date        Certification start date
Certified_End_Date          Certification end date
Occupation_Code             Three digit occupational group
Case_Status                 Approval status: certified or denied
Wage_Rate_From_1            Employer’s proposed wage rate
Wage_Rate_Per_1             Unit of pay for proposed wage rate
Wage_Rate_To_1              Maximum proposed wage rate
Part_Time_1                 Y = Part time; N = Full time position
Work_City_1                 Work city (location of the job opening)
Work_State_1                Work state (location of the job opening)
Prevailing_Wage_1           Prevailing wage rate
Prevailing_Wage_Source_1    Collective bargaining; SESA; Other
Year_Source_Published_1     Year that the prevailing wage data was published
Other_Wage_Source_1         Year that the prevailing wage data was published
Other_Wage_Source_2         Description of the Other wage source