An Explorative Study of H1-B Visas
Rosy Garcia-Rivas, Kevin Huang, Macaria Robinson, Meredith Valenzuela
May 2015
Abstract
The H-1B visa is a non-immigrant visa in the US that allows US employers to temporarily employ
foreign workers in specialty occupations. Our group conducted a population-based assessment of H-1B
visa approval rates to predict whether or not a visa would be approved for an applicant, based on many
different factors. We found that applicants in math/science industries made up the highest proportion of
applicants for H-1B visas. Furthermore, we found that people do not solely follow areas of highest pay,
as people in the health and math/science fields tended to move to the Northeast, while the Midwest and
West coast pay higher rates. What we can extrapolate from this information is that people in different
fields tend to flow to their particular regions of interest for different reasons.
Contents
I Introduction
II Questions
III Description of the Sample and Data Collection
IV Variables of the Study and How They Were Measured
V Statistical Methods
1 Logistic Regression
2 Contingency Tables
VI Summary of Findings
VII Conclusions Drawn From the Study
VIII Shortcomings
IX Recommendation for Future Research
X R Code
List of Tables
1 Table of Main Variables
2 Logistic Regression: Wage Rate
3 Logistic Regression: Industry
4 Applicants, Wage by Region
5 Applicants by Industry
6 Proportion of Certified Applicants by Region
7 Proportion of Certified Applicants Applying by Industry
8 Two Way Table of Median Wage
List of Figures
1 Map of Employed H1-B Visa Workers
2 Barplot of Industries by Region
Part I
Introduction
The H-1B visa is a non-immigrant visa in the United States that allows US employers to temporarily employ
foreign workers in specialty occupations. The minimum requirements for obtaining this visa classification
are:
(1) the applicant must have a US employer to sponsor him/her
(2) the job that the applicant is applying for must require a bachelor’s degree or higher
(3) the job duties and the applicant’s education/work experience must be related
(4) the job must pay at least the prevailing wage in the area for that service
More than ten percent of the undergraduate population at the University of California, Los Angeles
are international students. The H-1B visa is important to many of these students because it is one avenue
for them to remain in the US after graduation. Therefore, our group wanted to investigate the likelihood
of, and significant contributors to, obtaining an H-1B visa. Certain industries, such as the technology and
finance industries, congregate in specific locations around the US, for example San Francisco and New York.
Moreover, we have observed that there seems to be a higher proportion of international students in certain
academic fields compared to others at UCLA. Therefore, we expected the strongest contributing factors of
whether an H-1B visa was approved or not to be:
(a) industry the petitioner would be employed in
(b) geographic location of the job
(c) the interaction between industry and geographic location
Unfortunately, our data showed a 99% approval rate for H-1B visa applications and was not representative
of the actual acceptance rate in the US in 2007. Attempting to determine the strongest contributing
factors of H-1B visa approval, or to predict whether a visa would be approved, would therefore be
meaningless with our data. Instead, we chose to focus on extracting information about the
applicants of approved H-1B visas, focusing mainly on their chosen industry, academic field, and the
geographic location of their employment. Given that the US is a hub for cutting edge research and technology
in health, biotechnology, and business/finance, and is also a strong force in pop culture and the arts, we
thought this endeavor would be the most informative, and give us insight on why people wanted to move
to the US, and subsequently where the US stands in the eyes of outsiders. Furthermore, we hope that a
deeper understanding of approved visas will provide future applicants with a better idea of which locations
and industries to consider.
Part II
Questions
The following are the questions that we hope to answer through our analysis of the H1-B Visa dataset.
• Is the approval likelihood higher for people working in certain fields than other fields?
• Is the approval likelihood higher for people applying for an H-1B visa in certain regions than in others?
• Is the mean prevailing wage for people applying for the H-1B within the industries higher in certain
regions than others?
• Can we predict whether a person’s H-1B visa will get approved based on certain variables?
Part III
Description of the Sample and Data Collection
The H1-B data used in this report was provided to ETA by employers who submitted foreign labor
certification applications for the year 2007. The original data set has 426,597 observations and 39 variables,
although we chose to limit our study to H1-B visas that were approved. The sample includes information
about the employer’s location, the job position, and salary, but does not offer any information about the
applicants themselves, such as gender, age, or country of origin. It is important to note that the sample
data was not supplied by H1-B applicants themselves, but rather by their employers to the Bureau of Labor
Statistics’ Occupational Employment Statistics survey. The Bureau of Labor Statistics’ (BLS) Occupational
Employment Statistics program provides estimates used to assist in setting the wage levels in the Foreign
Labor Center (FLC) wage library.
Figure 1: Map of Employed H1-B Visa Workers
Map based on location. Color shows detail about Industry. Size shows median wage. Data excludes Alaska
and Hawaii. BLUE = Education. ORANGE= Engineering and Computer Sciences. GREEN= Finance and
Business. RED= Health. PURPLE= Science and Math.
Figure 1 shows the distribution of H1-B workers by industry and wage. From this graph we see that most
applicants are concentrated on the coasts, with a significant portion spread throughout the Midwest. The
mountainous regions in the West are more sparsely populated. Applicants in the health and science/math
industries appear to make up the highest proportion of approved visa applicants across the country, and to be the highest paid.
Moreover, this figure shows that health and education jobs are distributed in a way that appears to follow
the US population’s overall distribution. This is in stark contrast to the distribution of business/finance and
science/math applicants, who seem to cluster around big cities.
Part IV
Variables of the Study and How They Were Measured
Variables of interest for this study centered around industry type, geographical location, and salary. The
variables Industry and Region were created from the variables Job Title and State, respectively. The levels of
Industry are Finance and business, Health, Education, Science/Math, and Engineering and Computers, while
the levels of Region are Midwest, Northeast, South, and West. We chose to categorize Industry by these fields
because we felt these professions encompassed the careers of H1-B visa workers. Additionally, Case
Status initially included four different levels: Certified, Denied, Hold, and Pending. Since we are primarily
concerned with whether a candidate will be approved or not, we subsetted the data to include only certified and
denied cases.
Variable            Description                              Measurement
Industry            Industry of Employment                   Categorical
Region              Region of Employment                     Categorical
Case Status         Approval status (certified or denied)    Categorical
Wage_Rate_From_1    Employer’s proposed wage rate            Numerical
Table 1: Table of Main Variables
Part V
Statistical Methods
1 Logistic Regression
Our initial focus for this research was to classify whether or not a candidate’s visa application would be
accepted or denied, and to determine which factors are most important to being approved. Because our
response was binary (“Accepted” vs. “Denied”), we decided to run a logistic regression. Logistic regression
would allow us to classify the visa petitions while steering clear of other complex models that could possibly
overfit the data.
To begin, we wanted to ensure we had a portion of the data to test our model with, so we cut 30% of the
cases into a testing dataset and used the remaining 70% of the cases to build our model. We built our logistic
regression model using backward selection on the training dataset to see which model provided the lowest
Akaike Information Criterion (AIC). The AIC punishes models for complexity, thus preventing a model from
overfitting the data that it was constructed with. From there we used cross-validation techniques to see
which of the models with the lowest AIC also provided the lowest test mean squared error. When we tabulated our
success rate, we learned that our model performed well when predicting an approved visa but performed very
poorly when predicting a denied visa. To investigate why our model’s accuracy was so lopsided, we tabulated
the data and saw that the data itself was heavily skewed towards approved visas, which
made up 99% of the observations. Moving forward, we decided to construct simple logistic regression
models, one for each predictor, to determine which levels of each variable were significant at the 5%
level. The simple logistic regressions provided insight that we applied to our next method, contingency tables.
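The effect of the class imbalance described above can be sketched in R. This is a minimal illustration on simulated data, not the fitted model from this study: the data frame `sim`, its columns `wage_k` and `denied`, and the 1% denial rate are all assumptions chosen to mimic the skew in the text, and the backward selection step is left out for brevity.

```r
# Simulated stand-in for the visa data (hypothetical; not the H-1B file).
set.seed(7)
n <- 20000
sim <- data.frame(wage_k = rlnorm(n, meanlog = 4, sdlog = 0.4))  # wage in $1000s
sim$denied <- rbinom(n, size = 1, prob = 0.01)  # about 1% denied, unrelated to wage

# 70/30 split into training and testing sets
test_idx <- sample(seq_len(n), size = floor(0.3 * n))
train    <- sim[-test_idx, ]
testing  <- sim[test_idx, ]

# Logistic regression fitted on the training set only
fit <- glm(denied ~ wage_k, data = train, family = binomial)

# Classify test cases at a 0.5 cutoff and tabulate against the truth
prob <- predict(fit, newdata = testing, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)
confusion <- table(actual    = factor(testing$denied, levels = c(0, 1)),
                   predicted = factor(pred, levels = c(0, 1)))

accuracy      <- mean(pred == testing$denied)               # dominated by the majority class
recall_denied <- confusion["1", "1"] / sum(confusion["1", ])  # share of denials caught
```

Under these assumptions the overall accuracy is high while almost no denied cases are identified, which is the pattern reported for the real data: accuracy alone says little when one class makes up 99% of the observations.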
2 Contingency Tables
Due to the limitations of our data (99% of cases are certified), we used contingency tables to provide deeper
analyses of the relationships between region, industry, and wage. Contingency tables were beneficial in
comparing the frequencies of different combinations of industries and location. Getting a better understanding
of the popularity of industries in different locations allows us to create a profile of characteristics of
H1-B visa workers and employers, and hopefully will allow us to inform our peers and others of ideal places
to work.
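As a rough illustration of the tables used in this section, the sketch below builds a two-way frequency table, row and column proportions, and a two-way table of medians. The data frame `toy` and every value in it are hypothetical and serve only to show the mechanics of `table`, `prop.table`, and `tapply`.

```r
# Toy records (hypothetical values, for illustration only)
toy <- data.frame(
  Region   = c("Northeast", "Northeast", "West", "West",
               "South", "Midwest", "Northeast", "South"),
  Industry = c("Health", "Science & Math", "Science & Math", "Health",
               "Education", "Science & Math", "Science & Math", "Education"),
  Wage     = c(56000, 58000, 75000, 85000, 42000, 55000, 60000, 43000)
)

counts      <- table(toy$Industry, toy$Region)      # two-way frequency table
by_industry <- prop.table(counts, margin = 1)       # each row (industry) sums to 1
by_region   <- prop.table(counts, margin = 2)       # each column (region) sums to 1

# Median wage for every industry/region combination (NA where no records exist)
median_wage <- tapply(toy$Wage, list(toy$Industry, toy$Region), median)
```

The `margin` argument decides which question the proportions answer: `margin = 1` gives the regional split within each industry, while `margin = 2` gives the industry mix within each region.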
Part VI
Summary of Findings
After building our logistic regression models we found that most of the models we obtained gave a successful
prediction rate of around 99%. Since 99% of the applicants in our data set were “certified”, the logistic
regression by itself was not appropriate to answer our questions.
All the variables, except for the wage, were statistically insignificant. See table 2. Therefore, we ran
three simple logistic regression models to inform us of which levels of factors were potentially significant for
industry and region, and then used contingency tables to analyze the prominence of those factors.
The odds ratio for wage is 1.00 (see Table 2): a one unit increase in wage multiplies the odds by a factor of
essentially 1, so the practical effect of a single unit of wage is negligible even though the coefficient is
statistically significant. Looking at Table 3, we can see that applying for an H1-B visa in the health industry,
versus the business & finance industry, multiplies the odds by about 1.4. Applying in the science & math
industry, by contrast, multiplies the odds by 0.72 relative to business & finance. We see that the health and business &
finance industries have the greatest positive impacts on certification.
We then investigated region through the use of contingency tables. We began by looking at the candidates
in the Northeast region of the United States. The largest share (33%) of candidates applied in the Northeast
while only 16% applied in the Midwest (see Table 4). After investigating the candidates by industry, we found
that the science & math industry dominates in popularity, with the vast majority of immigrants (68.7%)
applying for work in these fields (see Table 5). It is our belief that the United States’ gap in science & math
education, in comparison to other developed countries, is one reason for the popularity of the industry.
Although a relatively small percentage of US natives seek careers in science and math, there is
great interest among foreign individuals in obtaining visas to work in the US in these fields.
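Variable            Coefficient    p-value    Odds
Intercept           -3.18          <2e-16     0.04
Wage_Rate_From_1    -2.46e-5       <2e-16     1.00
$y_i = -3.18 + \beta_{\text{Wage}}\, x_i$, with $\beta_{\text{Wage}} = -2.46 \times 10^{-5}$, where $y_i$ is the fitted log-odds and $x_i$ is Wage_Rate_From_1
Table 2: Logistic Regression: Wage Rate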
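Variable                   Coefficient    p-value    Odds
Intercept                  -5.99          <2e-16     0.003
Industry:Education         -0.1359        0.4495     0.87
Industry:Health            0.3960         0.0313     1.4
Industry:Science & Math    -0.3169        0.0165     0.72
$y_i = -5.99 + \beta_{\text{Industry}}\, x_i$, where $y_i$ is the fitted log-odds and $x_i$ indicates the industry level (baseline: Business & Finance)
Table 3: Logistic Regression: Industry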
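The Odds columns in Tables 2 and 3 are the exponentiated coefficients. The short sketch below reproduces them from the printed coefficients; the object names are illustrative, and the $10,000 rescaling is an added convenience, not part of the original analysis. One point worth checking against the fitted models: for a factor response, R's binomial `glm` treats the first factor level as "failure", so if Case Status has the levels Certified and Denied in alphabetical order, these odds would describe denial rather than certification.

```r
# Coefficients as printed in Tables 2 and 3
wage_coef     <- c(Intercept = -3.18, Wage_Rate_From_1 = -2.46e-5)
industry_coef <- c(Intercept = -5.99, Education = -0.1359,
                   Health = 0.3960, "Science & Math" = -0.3169)

# exp() turns a log-odds coefficient into a multiplicative change in the odds
wage_odds     <- exp(wage_coef)
industry_odds <- exp(industry_coef)

# A one-unit wage change is negligible; a 10,000-unit change is easier to read
odds_per_10k <- exp(10000 * wage_coef[["Wage_Rate_From_1"]])

round(wage_odds, 2)
round(industry_odds, 3)
```

Because the wage coefficient is so small, its odds ratio rounds to 1.00 per unit of wage, which is why a per-unit interpretation is uninformative and a larger step such as 10,000 units is easier to interpret.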
                          Northeast          West              Midwest           South
Number of Applicants      101,865 (33.3%)    66,304 (21.7%)    51,110 (16.7%)    86,547 (28.3%)
Median Prevailing Wage    $58,000            $70,620           $55,000           $55,000
Table 4: Applicants, Wage by Region
                          Business & Finance    Education         Health            Science & Math
Number of Applicants      41,487 (13.5%)        35,231 (11.5%)    19,206 (6.27%)    210,329 (68.7%)
Table 5: Applicants by Industry
Figure 2: Barplot of Industries by Region
Because we are only interested in the characteristics of the applicants that get certified, we subsetted the
data by removing all denied cases.
Looking at Table 7, we are able to see that in the business & finance industry, 38.9% of the people that
are certified reside in the Northeast, while only 11.5% of the people certified in business & finance reside in
the Midwest. One possible explanation of these results is that the Midwest's low population density and small number
of big cities cause immigrants in the business & finance industry to avoid it. Interestingly,
only 21.27% of the applicants are certified to work in the West. We expect this is because the Northeast
contains NYC and therefore Wall Street, which carries a big name and is the central hub for finance in the
US, drawing people away from the Midwest and West coast. In addition, 39.5% of the certified foreigners
in the education industry (Table 7) are migrating to the South. Looking at the two-way table of the median
start wage by region and by industry (Table 8) in conjunction with Figure 2, we see that all of the regions pay the
highest in the health industry except the Northeast, yet the largest share of approved applicants is moving to
the Northeast in spite of this fact.
                      West      Midwest    Northeast    South
Business & Finance    0.1336    0.0933     0.1584       0.1339
Education             0.0959    0.1226     0.0835       0.1607
Health                0.0522    0.0782     0.0638       0.0599
Science and Math      0.7181    0.7057     0.6941       0.6453
Table 6: Proportion of Certified Applicants by Region
                      West      Midwest    Northeast    South
Business & Finance    0.2317    0.1149     0.3892       0.2796
Education             0.1806    0.1779     0.2415       0.3950
Health                0.1805    0.2084     0.3393       0.2703
Science and Math      0.2263    0.1714     0.3361       0.2654
Table 7: Proportion of Certified Applicants Applying by Industry
                      West       Midwest    Northeast    South
Business & Finance    $61,000    $57,600    $62,000      $53,672
Education             $47,500    $45,000    $47,000      $42,500
Health                $85,000    $65,000    $56,650      $71,760
Science and Math      $75,000    $55,000    $58,240      $58,000
Table 8: Two Way Table of Median Wage
To summarize, only the business & finance industry followed the region that pays the most. For the
health industry, the West pays the H-1B applicants the most; however, 33.9% of the applicants in health are
migrating to the Northeast while only 18% are certified in the West. The same is true of the science & math
industry: 33.6% of certified H-1B foreigners are migrating to the Northeast, even though the West pays about 1.29
times as much (Table 8). Also, looking at the map (Figure 1), we see that the jobs in the health and education industries
are distributed in a similar way to the distribution of the US population, indicating an evenly growing
demand throughout the US for education and health workers. On the other hand, the jobs in business &
finance and science & math are more clustered in bigger US cities, indicating a need for large populations
for these industries and possibly suggesting a need for faster communication in these industries.
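The comparison between where applicants go and where pay is highest can be checked directly from Tables 7 and 8. The sketch below re-enters the printed values as matrices; the object names are illustrative and the code is a convenience check, not part of the original analysis.

```r
regions    <- c("West", "Midwest", "Northeast", "South")
industries <- c("Business & Finance", "Education", "Health", "Science and Math")

# Table 7: share of each industry's certified applicants by region
share <- matrix(c(0.2317, 0.1149, 0.3892, 0.2796,
                  0.1806, 0.1779, 0.2415, 0.3950,
                  0.1805, 0.2084, 0.3393, 0.2703,
                  0.2263, 0.1714, 0.3361, 0.2654),
                nrow = 4, byrow = TRUE, dimnames = list(industries, regions))

# Table 8: median wage by industry and region
wage <- matrix(c(61000, 57600, 62000, 53672,
                 47500, 45000, 47000, 42500,
                 85000, 65000, 56650, 71760,
                 75000, 55000, 58240, 58000),
               nrow = 4, byrow = TRUE, dimnames = list(industries, regions))

# Region with the largest applicant share and with the highest median wage
top_share   <- regions[apply(share, 1, which.max)]
top_wage    <- regions[apply(wage, 1, which.max)]
follows_pay <- setNames(top_share == top_wage, industries)

# Ratio of West to Northeast median wage in science and math
west_premium <- wage["Science and Math", "West"] / wage["Science and Math", "Northeast"]
```

With the values as printed, `follows_pay` is TRUE only for Business & Finance, and `west_premium` is about 1.29.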
Part VII
Conclusions Drawn From the Study
According to Table 4, the Midwest, when compared to the Northeast, is a popular location for immigrant
workers granted H1-B visa status in the Education, Engineering/Computer Sciences, Finance and Business,
and Science and Math industries, but not in the Health industry. It also appears as though the Northeast, when compared to
the South, is a less popular location of work for those same industries, excluding Health.
After analyzing all the contingency tables, we discovered that in both the Business & Finance industry and
the Education industry, the highest proportion of applicants for the H-1B visa tends to follow the regions
where the prevailing wage is the highest. For the Health industry, the Midwest actually pays the
most; however, 33.9% of the applicants in Health are migrating to the Northeast instead, while only 20% are
migrating to the Midwest. The same happens for Science and Math: 33.6% of certified H-1B foreigners are
migrating to the Northeast, even though the West pays about 1.29 times as much (Table 8). We recommend that individuals
seeking H1-B visa status in the field of:
• Health move to the Midwest
• Business and Finance move to the West, while accepting a slightly lower chance of getting certified
• Science and Math, who are wanted in all regions, move to the West to get the highest pay.
Part VIII
Shortcomings
The shortcomings of our research are due to a non-representative sample of H1-B visa applicants. A
sample with a rate of visa approval more consistent with reality would have allowed us to create a
better logistic regression model. Moreover, a lack of information about applicants, e.g. gender, country
of origin, net worth, etc., kept us from drawing more comprehensive conclusions about H1-B applicants and
the likelihood of certification. Because of this, we have too many unknown covariates to account for.
Working with a dataset that was highly skewed was problematic, but by this point in the project we had
already run into issues. We were initially assigned an education dataset. However, given that we had already
worked with similar data in another class we felt it best to gain exposure in something different. Since all of
our groupmates were interested in criminal data we found a dataset that surveyed prisoners. Unfortunately,
this dataset had a lot of missing information and asked over three thousand questions. Thus, we ultimately
moved toward the dataset of H-1B visas which appealed to our entire group and was rich with information.
Part IX
Recommendation for Future Research
We recommend doing a more comprehensive analysis using both employer data and applicant data. The
country of origin of the applicant or the gender of the applicant might have a more significant impact on
visa approval than location of employment or industry.
Part X
R Code
load("~/Downloads/H1B.RData")
# Getting Only Cases that are Certified or Denied
c=which(h1b$Case_Status=="Certified")
d=which(h1b$Case_Status=="Denied")
k=c(c,d)
hb=h1b[k,]
#Getting to know the data
#Dealing with dates for the glm
#What time of the year is the petition submitted?
library(lubridate)
hb$timeyr=month(as.POSIXlt(hb$Submitted_Date, format="%d/%m/%Y"))
#Gets the month that it was submitted
#Breaks down into what part of the year
hb$timeoyr=rep("End",length(hb$timeyr))
hb$timeoyr[which(hb$timeyr<9)]="Middle"
hb$timeoyr[which(hb$timeyr<5)]="Beginning"
#Variable the time the visa would last?
#Length of the visa in years (365.25 days per year on average)
hb$visa_last=as.numeric(hb$End_Date-hb$Begin_Date)/365.25
hb$visa_last=round(as.numeric(hb$visa_last),1)
#Cleaning up State Variable and Creating Regions
#States were sorted according to Regions using the Census data
hb$State=as.character(hb$State)
temp=nchar(hb$State, type = "chars", allowNA = FALSE)
st.rm=which(temp>2)
hb$State[which(hb$State=="Newton")]="MA"
hb$State[which(hb$State=="Seattle")]="WA"
hb$State[which(hb$State=="Chantilly")]="VA"
hb$State[which(hb$State=="New York")]="NY"
#Create a Region Variable to get rid of all the levels in State
hb$Region<-rep("",dim(hb)[1])
NE<-c("CT","PA","NJ","NY","RI","NH","VT","ME","MA")
MW<-c("ND","SD","NE","KS","MO","IA","MN","WI","IL","IN","OH","MI")
S<-c("MD","DE","DC","VA","WV","KY","TN","GA","AL","MS"
,"FL","AR","LA","OK","TX","NC","SC")
W<-c("WA","ID","MT","WY","CO","UT","AZ","NM","NV","CA","OR","AK","HI")
hb$Region[which(hb$State %in% NE)]="NorthEast"
hb$Region[which(hb$State %in% MW)]="MidWest"
hb$Region[which(hb$State %in% S)]="South"
hb$Region[which(hb$State %in% W)]="West"
#Recoding Occupational Code by Industry
hb$Industry=rep(NA,dim(hb)[1])
hb$Job_Title=as.character(hb$Job_Title)
hb$Job_Title<-iconv(enc2utf8(hb$Job_Title),sub="byte")
#tolower is vectorized; lapply would turn the column into a list
hb$Job_Title=tolower(hb$Job_Title)
#Engineers and Computer Related
archit<-grep("archit",hb$Job_Title)
eng<-grep("engin",hb$Job_Title)
tech<-grep("tech",hb$Job_Title)
comp<-grep("com",hb$Job_Title)
prog<-grep("progr",hb$Job_Title)
dev<-grep("developer",hb$Job_Title)
sysans<-grep("systems analyst",hb$Job_Title)
sysan<-grep("system analyst",hb$Job_Title)
soft<-grep("software",hb$Job_Title)
CE<-c(archit,eng,tech,comp,prog,dev,sysans,sysan,soft)
hb$Industry[CE]="Engineering & Computer"
dat<-grep("data",hb$Job_Title)
math<-grep("math",hb$Job_Title)
stats<-grep("statist",hb$Job_Title)
chem<-grep("chemis",hb$Job_Title)
SM<-c(dat,math,stats,chem)
hb$Industry[SM]="Science & Math"
med<-grep("medic",hb$Job_Title)
clinic<-grep("clinic",hb$Job_Title)
phys<-grep("physic",hb$Job_Title)
physo<-grep("physio",hb$Job_Title)
dentist<-grep("denti",hb$Job_Title)
dental<-grep("dental",hb$Job_Title)
pathol<-grep("pathol",hb$Job_Title)
pharm<-grep("pharm",hb$Job_Title)
sci<-grep("scientist",hb$Job_Title)
olog<-grep("ologist",hb$Job_Title)
nurse<-grep("nurse",hb$Job_Title)
trist<-grep("trist",hb$Job_Title)
ped<-grep("pediat",hb$Job_Title)
H<-c(med,clinic,phys,physo,dentist,dental,pathol,pharm,sci,olog,nurse,trist,ped)
hb$Industry[H]="Health"
acc<-grep("account",hb$Job_Title)
actuar<-grep("actuar",hb$Job_Title)
fin<-grep("finan",hb$Job_Title)
budg<-grep("budg",hb$Job_Title)
econ<-grep("econom",hb$Job_Title)
bus<-grep("busine",hb$Job_Title)
assoc<-which(hb$Job_Title=="associate")
re<-grep("real",hb$Job_Title)
mark<-grep("market",hb$Job_Title)
pm<-grep("project manage",hb$Job_Title)
sales<-grep("sales",hb$Job_Title)
BIZ<-c(acc,actuar,fin,budg,econ,bus,assoc,re,mark,pm,sales)
hb$Industry[BIZ]="Finance and Business"
prof<-grep("prof",hb$Job_Title)
fel<-grep("fellow",hb$Job_Title)
res<-grep("research associate",hb$Job_Title)
phd<-grep("postdoc",hb$Job_Title)
teach<-grep("teacher",hb$Job_Title)
lect<-grep("lectur",hb$Job_Title)
instr<-grep("instruct",hb$Job_Title)
ot<-grep("occupational therapist",hb$Job_Title)
pd<-grep("post doctoral",hb$Job_Title)
EDU<-c(prof,fel,res,phd,teach,lect,instr,ot,pd)
hb$Industry[EDU]="Education"
law<-grep("law",hb$Job_Title)
attorn<-grep("attor",hb$Job_Title)
leg<-grep("legal",hb$Job_Title)
L<-c(law,attorn,leg)
hb$Industry[L]="Law"
design<-grep("designe",hb$Job_Title)
graph<-grep("graphic",hb$Job_Title)
fash<-grep("fash",hb$Job_Title)
ARTS<-c(design,graph,fash)
hb$Industry[ARTS]="Arts"
urb<-grep("urban",hb$Job_Title)
social<-grep("social",hb$Job_Title)
worker<-grep("worker",hb$Job_Title)
SOC<-c(urb,social,worker)
hb$Industry[SOC]="Public Service"
hb$Industry[is.na(hb$Industry)]= "Other"
phony<-c(eng,comp,acc,fin,prog,soft,prof,fel,res,phd,sysan,sysans,dev,dat,fash,tech,teach,law,attorn
,actuar,lect,mark,math,design,phys,physo,dentist,pathol,budg,econ,pharm,sci,stats,bus
,clinic,social,med,pm,sales,leg,re,archit,olog,nurse,chem,trist,dental,worker,instr,assoc,
ot)
View(hb[-phony,])
#Job titles that did not match any pattern
ph<-as.character(hb$Job_Title[-phony])
##########
########## End of Cleaning Code.
########## Following R Code utilizes the cleaned data set we submitted
##########
#********************************BEGIN USE OF CLEANED DATA SET***********************************#
hb$Industry<-relevel(as.factor(hb$Industry),"Engineering & Computer")
#splitting into training and testing sets
set.seed(7)
test=sample(1:dim(hb)[1],(dim(hb)[1]*.3))
train=hb[-test,]
testing=hb[test,]
#######MODELING ATTEMPTS######################################
######t<-glm(Case_Status~visa_last,data=hb,family="binomial")
######tt<-glm(Case_Status~timeoyr,data=hb,family="binomial")
######ttt<-glm(Case_Status~Nbr_Immigrants,data=hb,family="binomial")
######t4<-glm(Case_Status~Prevailing_Wage_1,data=hb,family="binomial")
######t5<-glm(Case_Status~Region,data=hb,family="binomial",subset=train)
######total<-glm(Case_Status~visa_last+Region+timeoyr+Program.Designation+
######Nbr_Immigrants+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1+
###### Prevailing_Wage_1,data=train,family="binomial")
######step(total,direction="backward")
######sb1<-glm(formula = Case_Status ~ visa_last + timeoyr + Program.Designation +
###### Wage_Rate_From_1 + Wage_Rate_Per_1 + Part_Time_1 + Prevailing_Wage_1,
###### family = "binomial", data = train)
#AIC 9549
######FUll<-glm(Case_Status~visa_last+timeoyr+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1+
###### Prevailing_Wage_1+Withdrawn+Industry,data=train,family="binomial")
#AIC 9580
######step(FUll,direction="backward")
######AIC(FUll)
#full with Visa last interactions/ AIC 9356
######FwithvlI<-glm(Case_Status~visa_last+timeoyr+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1
###### Prevailing_Wage_1+Withdrawn+Industry+visa_last*timeoyr+visa_last*Wage_Rate_From_1+
###### visa_last*Wage_Rate_From_1+visa_last*Wage_Rate_Per_1+visa_last*Part_Time_1+
###### visa_last*Prevailing_Wage_1+visa_last*Withdrawn+visa_last*Industry,data=train,family="binomial")
#full with visa_last & time of year interaction/AIC 9287.995
######F2I<-glm(Case_Status~visa_last+timeoyr+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1+
###### Prevailing_Wage_1+Withdrawn+Industry+visa_last*timeoyr+visa_last*Wage_Rate_From_1+
###### visa_last*Wage_Rate_From_1+visa_last*Wage_Rate_Per_1+visa_last*Part_Time_1+
###### visa_last*Prevailing_Wage_1+visa_last*Withdrawn+visa_last*Industry+
###### timeoyr*Wage_Rate_From_1+timeoyr*Wage_Rate_Per_1+timeoyr*Part_Time_1+
###### timeoyr*Prevailing_Wage_1+timeoyr*Withdrawn+timeoyr*Industry,data=train,family="binomial")
#AIC=58633
######F3I<-glm(Case_Status~visa_last+timeoyr+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1+
###### Prevailing_Wage_1+Withdrawn+Industry+visa_last*timeoyr+visa_last*Wage_Rate_From_1+
###### visa_last*Wage_Rate_From_1+visa_last*Wage_Rate_Per_1+visa_last*Part_Time_1+
###### visa_last*Prevailing_Wage_1+visa_last*Withdrawn+visa_last*Industry+
###### timeoyr*Wage_Rate_From_1+timeoyr*Wage_Rate_Per_1+timeoyr*Part_Time_1+
###### timeoyr*Prevailing_Wage_1+timeoyr*Withdrawn+timeoyr*Industry+
###### Wage_Rate_From_1*Wage_Rate_Per_1+Wage_Rate_From_1*Part_Time_1+
###### Wage_Rate_From_1*Prevailing_Wage_1+Wage_Rate_From_1*Withdrawn+Wage_Rate_From_1*Indu
###### data=train,family="binomial")
######F4I<-glm(Case_Status~visa_last+timeoyr+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1+
###### Prevailing_Wage_1+Withdrawn+Industry+visa_last*timeoyr+visa_last*Wage_Rate_From_1+
###### visa_last*Wage_Rate_From_1+visa_last*Wage_Rate_Per_1+visa_last*Part_Time_1+
###### visa_last*Prevailing_Wage_1+visa_last*Withdrawn+visa_last*Industry+
###### timeoyr*Wage_Rate_From_1+timeoyr*Wage_Rate_Per_1+timeoyr*Part_Time_1+
###### timeoyr*Prevailing_Wage_1+timeoyr*Withdrawn+timeoyr*Industry+
###### Wage_Rate_From_1*Wage_Rate_Per_1+Wage_Rate_From_1*Part_Time_1+
###### Wage_Rate_From_1*Prevailing_Wage_1+Wage_Rate_From_1*Withdrawn+Wage_Rate_From_1*Indu
###### Part_Time_1*Prevailing_Wage_1+Part_Time_1*Withdrawn+Part_Time_1*Industry
###### ,data=train,family="binomial")
#COmpletely full with all possible Interaction Plots
###### CFI<-glm(Case_Status~visa_last+timeoyr+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1+
###### Prevailing_Wage_1+Withdrawn+Industry+visa_last*timeoyr+visa_last*Wage_Rate_From_1+
###### visa_last*Wage_Rate_From_1+visa_last*Wage_Rate_Per_1+visa_last*Part_Time_1+
###### visa_last*Prevailing_Wage_1+visa_last*Withdrawn+visa_last*Industry+
###### timeoyr*Wage_Rate_From_1+timeoyr*Wage_Rate_Per_1+timeoyr*Part_Time_1+
###### timeoyr*Prevailing_Wage_1+timeoyr*Withdrawn+timeoyr*Industry+
###### Wage_Rate_From_1*Wage_Rate_Per_1+Wage_Rate_From_1*Part_Time_1+
###### Wage_Rate_From_1*Prevailing_Wage_1+Wage_Rate_From_1*Withdrawn+Wage_Rate_From_1*Indu
###### Part_Time_1*Prevailing_Wage_1+Part_Time_1*Withdrawn+Part_Time_1*Industry+
###### Prevailing_Wage_1*Withdrawn+Prevailing_Wage_1*Industry+Withdrawn*Industry
###### ,data=train,family="binomial")
######step(CFI,direction="backward")
####### Semi full model, missing some interaction plots
######Csf<-glm(Case_Status~visa_last+timeoyr+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1+
###### Prevailing_Wage_1+Withdrawn+Industry+visa_last*timeoyr+visa_last*Wage_Rate_From_1+
###### visa_last*Wage_Rate_From_1+visa_last*Wage_Rate_Per_1+visa_last*Part_Time_1+
###### visa_last*Prevailing_Wage_1+visa_last*Withdrawn+visa_last*Industry+
###### timeoyr*Wage_Rate_From_1+timeoyr*Wage_Rate_Per_1+timeoyr*Part_Time_1+
###### timeoyr*Prevailing_Wage_1+timeoyr*Withdrawn+timeoyr*Industry+
###### Wage_Rate_From_1*Wage_Rate_Per_1+Wage_Rate_From_1*Part_Time_1+
###### Wage_Rate_From_1*Prevailing_Wage_1+Wage_Rate_From_1*Withdrawn+Wage_Rate_From_1*Indu
###### Part_Time_1*Prevailing_Wage_1+Part_Time_1*Withdrawn+Part_Time_1*Industry,
###### data=train,family="binomial")
######step(Csf,direction="backward")
######phony<-glm(Case_Status~Wage_Rate_From_1+Part_Time_1+
###### Prevailing_Wage_1+Withdrawn+visa_last*Wage_Rate_From_1+visa_last*Part_Time_1+
###### visa_last*Prevailing_Wage_1+visa_last*Withdrawn
###### ,data=train,family="binomial")
################################################################
#Simplify the Industry Data
table(hb$Industry)
#Start from an all-NA column so unmatched industries stay NA
hb$newIndustry <- NA_character_
hb$newIndustry[hb$Industry %in% c("Engineering & Computer", "Science & Math")] <- "Science & Math"
hb$newIndustry[hb$Industry %in% c("Finance and Business")] <- "Business and Finance"
hb$newIndustry[hb$Industry %in% c("Health")] <- "Health"
hb$newIndustry[hb$Industry %in% c("Education")] <- "Education"
#removing the levels of Case_status
hb$Case_Status <- as.character(hb$Case_Status)
hb$Case_Status <- factor(hb$Case_Status)
#disregard all other observations that are not paid by year
newhb <- hb[hb$Wage_Rate_Per_1 %in% c("Year"),]
#then keep only observations that fall in one of the four simplified industries
newhb <- newhb[newhb$newIndustry %in% c("Science & Math", "Business and Finance", "Health",
"Education"),]
newhb2 <- newhb
# subset to use for Tableau
h1bTab= subset(hb1, Case_Status=="Certified", select=c(State, Region, Industry, Wage_Rate_From_1,
Zip_Code))
write.csv(h1bTab, "h1bTab.csv")
# Contingency Table of Certified Cases
table(newhb$Region, newhb$Industry, newhb$Case_Status)
#Contingency tables
table(newhb2$Case_Status)
table(newhb2$newIndustry)
table(newhb2$Region[!newhb2$Region==""])
tapply(newhb2$Wage_Rate_From_1[!newhb2$Region==""], newhb2$Region[!newhb2$Region==""], median)
tapply(newhb2$Wage_Rate_From_1, newhb2$newIndustry, median)
tapply(newhb2$Wage_Rate_From_1[!newhb2$Region==""],
list(newhb2$newIndustry[!newhb2$Region==""],
newhb2$Region[!newhb2$Region==""]), median)
#The Median Starting Wage in Business and Finance is Highest in the NorthEast
#The Median Starting Wage in Education is Highest in the South
#The Median Starting Wage in Science and Math is Highest in the West
#The Median Starting Wage in Health is Highest in the West
#Distribution of People who are certified
table(newhb$newIndustry[newhb$Case_Status=="Certified"])
table(newhb$Region[newhb$Case_Status=="Certified"])
table(newhb$Region[newhb$Case_Status=="Certified"],
newhb$newIndustry[newhb$Case_Status=="Certified"])
prop.table(table(newhb$Region[newhb$Case_Status=="Certified"],
newhb$newIndustry[newhb$Case_Status=="Certified"]), 1)
#In every region, the highest percentage of people getting denied
#comes from the Science and Math industry.
#In the Midwest, the lowest denial rate is in the Business and Finance industry
#In the Northeast, the lowest denial rates are in the Education and Health industries
#In the West, the lowest denial rate is in the Education industry
#In the South, the lowest denial rate is in the Health industry
prop.table(table(newhb$Region[newhb$Case_Status=="Certified"],
newhb$newIndustry[newhb$Case_Status=="Certified"]), 2)
#In the Business and Finance industry, the lowest denial rate is in the Northeast region
#In the Education industry, the lowest denial rate is in the South region
#In the Health industry, the lowest denial rate is in the Midwest region
#In the Science and Math industry, the lowest denial rate is in the Northeast region
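The two proportion tables above are built from certified cases only, so they describe how certified applicants are spread across regions and industries. A minimal sketch of a table that measures denial rates directly is given below. It runs on a small invented data frame; `toy`, its values, `counts`, `rates`, and `denial_rate` are illustrative names that do not appear in the original analysis.

```r
# Toy data standing in for newhb; every value here is invented for illustration.
toy <- data.frame(
  Region      = c("West", "West", "West", "South", "South", "South", "South", "West"),
  newIndustry = c("Health", "Health", "Education", "Health", "Health",
                  "Education", "Education", "Education"),
  Case_Status = c("Certified", "Denied", "Certified", "Certified", "Certified",
                  "Certified", "Denied", "Certified")
)

# Three-way counts: Region x Industry x Case_Status
counts <- table(toy$Region, toy$newIndustry, toy$Case_Status)

# Conditioning on the first two margins makes each Region/Industry cell sum to 1
# across Case_Status, so the "Denied" slice is the denial rate for that cell.
rates <- prop.table(counts, margin = c(1, 2))
denial_rate <- rates[, , "Denied"]
denial_rate
```

Applied to the real data, the same call would take `newhb$Region`, `newhb$newIndustry`, and `newhb$Case_Status` as its three arguments.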
## ANOTHER, MUCH SIMPLER LOGISTIC REGRESSION
#Creating training and testing data. 70% Training and 30% Testing data
set.seed(55555)
train <- newhb2[sample(nrow(newhb2), nrow(newhb2)*.70), ]
test <- newhb2[!(row.names(newhb2) %in% row.names(train)),]
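Because denied cases are rare in this data, a simple random split can leave very few of them in the test set. A minimal sketch of a stratified 70/30 split follows, as one possible alternative. The helper `stratified_split` and the toy data frame are hypothetical and not part of the original code.

```r
# Hypothetical helper: sample a fixed fraction of row indices within each outcome class
stratified_split <- function(data, outcome, frac = 0.70) {
  idx_by_class <- split(seq_len(nrow(data)), data[[outcome]])
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    # sample.int on the length avoids sample()'s special case for length-one vectors
    idx[sample.int(length(idx), size = floor(length(idx) * frac))]
  }))
  list(train = data[train_idx, , drop = FALSE],
       test  = data[-train_idx, , drop = FALSE])
}

set.seed(55555)
# Invented data with a 99% / 1% class balance
toy <- data.frame(Case_Status = c(rep("Certified", 990), rep("Denied", 10)),
                  wage = round(runif(1000, 40000, 90000)))
parts <- stratified_split(toy, "Case_Status")
table(parts$train$Case_Status)
table(parts$test$Case_Status)
```

Each class contributes 70% of its rows to the training set, so the class balance is the same in both parts.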
#All simple models are statistically significant
glm1 <- glm(Case_Status~Region, data=train, family="binomial")
summary(glm1)
glm2 <- glm(Case_Status~newIndustry, data=train, family="binomial")
summary(glm2)
glm3 <- glm(Case_Status~Wage_Rate_From_1, data=train, family="binomial")
summary(glm3)
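Coefficients reported by `summary()` for a binomial `glm` are on the log-odds scale. A minimal sketch of converting them to odds ratios is shown below on simulated data. The data frame `sim`, its columns, and the effect size of 0.4 are assumptions made for illustration only.

```r
set.seed(1)
n <- 2000
sim <- data.frame(industry = factor(sample(c("Business", "Health"), n, replace = TRUE)))
# Simulate a binary outcome whose log-odds are 0.4 higher for "Health"
p <- plogis(-1 + 0.4 * (sim$industry == "Health"))
sim$denied <- rbinom(n, size = 1, prob = p)

fit <- glm(denied ~ industry, data = sim, family = "binomial")

# exp() maps log-odds coefficients to odds ratios relative to the baseline level
odds_ratios <- exp(coef(fit))
# Wald 95% intervals, transformed to the odds-ratio scale
ci <- exp(confint.default(fit))
cbind(odds_ratio = odds_ratios, ci)
```

An odds ratio above 1 for a level means higher odds of the modeled outcome than in the baseline level. Here the baseline is "Business", the first factor level alphabetically.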
#2 factors+interaction model+covariate model
glm.model <- glm(Case_Status~Region+factor(newIndustry)+
Wage_Rate_From_1+Region:newIndustry, data=train, family="binomial")
summary(glm.model)
#REGIONS ARE NOT STATISTICALLY SIGNIFICANT
#Only the WAGE IS STATISTICALLY SIGNIFICANT
#THE TYPE OF INDUSTRY IS NOT SIGNIFICANT
#NONE OF THE INTERACTIONS ARE SIGNIFICANT.
#CHECKING PREDICTION ERROR
pred.vals=predict(glm.model, test, type="response")
#
pred=ifelse(pred.vals >median(pred.vals), "Denied", "Certified")
table(pred, test$Case_Status)/(length(pred.vals))
error <- 1-sum(diag(table(pred, test$Case_Status)/(length(pred.vals))))/
sum(table(pred, test$Case_Status)/(length(pred.vals)))
error
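With roughly 99% of cases certified, the overall error rate says little about how well denied cases are detected. A minimal sketch of reading per-class rates off the same kind of confusion table is given below. The labels are invented, and `actual`, `predicted`, and `conf` are hypothetical names.

```r
# Invented labels: 95 certified and 5 denied cases
actual <- factor(c(rep("Certified", 95), rep("Denied", 5)),
                 levels = c("Certified", "Denied"))
# A classifier that labels every case "Certified"
predicted <- factor(rep("Certified", 100), levels = c("Certified", "Denied"))

# Fixing the factor levels keeps the table square even when a class is never predicted
conf <- table(predicted, actual)

# Overall error looks small because the majority class dominates
overall_error <- 1 - sum(diag(conf)) / sum(conf)

# Share of truly denied cases that were predicted "Denied"
sensitivity <- conf["Denied", "Denied"] / sum(conf[, "Denied"])
# Share of truly certified cases that were predicted "Certified"
specificity <- conf["Certified", "Certified"] / sum(conf[, "Certified"])

c(overall_error = overall_error, sensitivity = sensitivity, specificity = specificity)
```

In this toy case the overall error is 5% even though no denied case is identified, which is the pattern described in the Statistical Methods section.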
#NULL DEVIANCE TEST
glm.model$null.deviance
glm.model$df.null
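#Likelihood-ratio test against the intercept-only model; df = number of coefficients added
pchisq(glm.model$null.deviance-glm.model$deviance,
glm.model$df.null-glm.model$df.residual, lower.tail=FALSE)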
#Very small p-value; therefore we reject the null hypothesis: there is statistically
#significant evidence that the slope of the logistic regression line is not equal to zero
#RESIDUAL DEVIANCE TEST
pchisq(glm.model$deviance,glm.model$df.residual,lower=FALSE)
#Very large p-value; we fail to reject our null hypothesis.
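The null deviance comparison is a likelihood-ratio test whose degrees of freedom equal the number of coefficients added to the intercept-only model. A minimal sketch on simulated data is given below, checking the manual calculation against `anova()`. The names `sim`, `x1`, `x2`, `y`, `full`, and `null` are assumed for illustration.

```r
set.seed(2)
n <- 500
sim <- data.frame(x1 = rnorm(n),
                  x2 = factor(sample(c("a", "b", "c"), n, replace = TRUE)))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 0.8 * sim$x1))

full <- glm(y ~ x1 + x2, data = sim, family = "binomial")
null <- glm(y ~ 1, data = sim, family = "binomial")

# Manual likelihood-ratio test: drop in deviance, with df = coefficients added
lr_stat  <- full$null.deviance - full$deviance
lr_df    <- full$df.null - full$df.residual
p_manual <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# The same comparison from anova(); the second row holds the test
p_anova <- anova(null, full, test = "Chisq")[2, "Pr(>Chi)"]

c(df = lr_df, p_manual = p_manual, p_anova = p_anova)
```

Here the full model adds one slope for `x1` and two dummy coefficients for `x2`, so the test uses three degrees of freedom.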
Appendix
Variable                  Description
Submitted_Date            Date the application was submitted
Program.Designation       Type of H-1B visa
Employer_Name             Employer's name
Address_1                 Employer's address
City                      Employer's city
State                     Employer's state
Zip_Code                  Employer's postal code
Nbr_Immigrants            Number of job openings
Begin_Date                Proposed begin date
End_Date                  Proposed end date
Job_Title                 Job title
DOL_DecisionDate          Date certified or denied
Certified_Begin_Date      Certification start date
Certified_End_Date        Certification end date
Occupation_Code           Three-digit occupational group
Case_Status               Approval status: certified or denied
Wage_Rate_From_1          Employer's proposed wage rate
Wage_Rate_Per_1           Unit of pay for proposed wage rate
Wage_Rate_To_1            Maximum proposed wage rate
Part_Time_1               Y = part-time position; N = full-time position
Work_City_1               Work city (location of the job opening)
Work_State_1              Work state (location of the job opening)
Prevailing_Wage_1         Prevailing wage rate
Prevailing_Wage_Source_1  Collective bargaining; SESA; Other
Year_Source_Published_1   Year that the prevailing wage data was published
Other_Wage_Source_1       Year that the prevailing wage data was published
Other_Wage_Source_2       Description of the other wage source
141_Project_try2

  • 1. An Explorative Study of H1-B Visas Rosy Garcia-Rivas, Kevin Huang, Macaria Robinson, Meredith Valenzuela May 2015 Abstract The H-1B visa is a non-immigrant visa in the US that allows US employers to temporarily employ foreign workers in specialty occupations. Our group conducted a population-based assessment of H-1B visa approval ratings to predict whether or not a visa would be approved to an applicant, based on many different factors. We found that applicants in math/science industries were the highest proportion of applicants for H-1B visas. Furthermore, we found that people do not solely follow areas of highest pay, as people in the health and math/science fields tended to move to the Northeast, while the Midwest and West coast pay higher rates. What we can extrapolate from this information is that people in different fields tend to flow to their particular regions of interest for different reasons. 1
  • 2. Contents I Introduction 3 II Questions 3 III Description of the Sample and Data Collection 3 IV Variables of the Study and How They Were Measured 4 V Statistical Methods 5 1 Logistic Regression 5 2 Contingency Tables 5 VI Summary of Findings 6 VII Conclusions Drawn From the Study 8 VIII Shortcomings 9 IX Recomendation for Future Research 9 X R Code 9 List of Tables 1 Table of Main Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Logistic Regression: Wage Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3 Logistic Regression: Industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 4 Applicants, Wage by Region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 5 Applicants by Industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 6 Proportion of Certified Applicants by Region . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 7 Proportion of Certified Applicants Applying by Industry . . . . . . . . . . . . . . . . . . . . . 8 8 Two Way Table of Median Wage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 List of Figures 1 Map of Employed H1-B Visa Workers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Barplot of Industries by Region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2
  • 3. Part I Introduction The H-1B visa is a non-immigrant visa in the United States that allows US employers to temporarily employ foreign workers in specialty occupations. The minimum requirements for obtaining this visa classification are: (1) the applicant must have a US employer to sponsor him/her (2) the job that the applicant is applying must be requiring bachelor’s degree or higher (3) the job duties and the applicant’s education/work experience must be correlated (4) the job must pay at least the prevailing wage in the area for that service More than ten percent of the undergraduate population at the University of California, Los Angeles are international students. The H-1B visa is important tomany of these students because it is one avenue for them to remain in the US after graduation. Therefore, our group wanted to investigate the likelihood of, and significant contributors to, obtaining an H-1B visa. Certain industries, such as the technology and finance industries, congregate in specific locations around the US, for example San Francisco and New York. Moreover, we have observed that there seems to be a higher proportion of international students in certain academic fields compared to others at UCLA. Therefore, we expected the strongest contributing factors of whether an H-1B visa was approved or not to be: (a) industry the petitioner would be employed in (b) geographic location of the job (c) the interaction between industry and geographic location Unfortunately, our data for showed a 99% approval rate for H-1B visa approvals and was not represen- tative of the actual acceptance rate in the US in 2007, which meant attempting to determine the strongest contributing factors of H-1B visa approval and predicting whether a visa would be approved or not would be meaningless with our data. 
Therefore, we chose to focus on finding extracting information about the applicants of approved H-1B visas, focusing mainly on their chosen industry, academic field, and the geo- graphic location of their employment. Given that the US is a hub for cutting edge research and technology in health, biotechnology, and business/finance, and is also a strong force in pop culture and the arts, we thought this endeavor would be the most informative, and give us insight on why people wanted to move to the US, and subsequently where the US stands in the eyes of outsiders. Furthermore, we hope that a deeper understanding of approved visas will provide future applicants with a better idea of which locations and industries to consider. Part II Questions The following are the questions that we hope to answer through our analysis of the H1-B Visa dataset. • Is the approval likelihood higher for people working in certain fields than other fields? • Is the likelihood higher for people applying for an H-1B visa in different regions? • Is the mean prevailing wage for people applying for the H-1B within the industries higher in certain regions than others? • Can we predict whether a person’s H-1B visa will get approved based on certain variables? 3
  • 4. Part III Description of the Sample and Data Collection The H1-B data used in this report was provided to ETA by employers who submitted foreign labor cer- tification applications for the year 2007. The original data set has 426,597 observations and 39 variables, although we chose to limit our study to H1-B visas that were approved. The sample includes information about the employer’s location, the job position, and salary, but does not offer any information about the applicants themselves, such as gender, age, or country of origin. It is important to note that the sample data was not supplied by H1-B applicants themselves, but rather by their employers to the Bureau of Labor Statistics’ Occupational Employment Statistics survey. The Bureau of Labor Statistics’ (BLS) Occupational Employment Statistics program provides estimates used to assist in setting the wage levels in the Foreign Labor Center (FLC) wage library. Figure 1: Map of Employed H1-B Visa Workers Map based on location. Color shows detail about Industry. Size shows median wage. Data excludes Alaska and Hawaii. BLUE = Education. ORANGE= Engineering and Computer Sciences. GREEN= Finance and Business. RED= Health. PURPLE= Science and Math. Figure 1 shows the distribution of H1-B workers by industry and wage. From this graph we see that most applicants are focused on the coasts, with a significant portion spread out throughout the Midwest. The mountainous regions in the west are more sparsely populated. Applicants in the health and science/math industries appear to be the highest proportion of, and richest, approved visa applicants across the country. Moreover, this figure shows that health and education jobs are distributed in a way that appears to follow the US population’s overall distribution. This is in stark contrast to the distribution of business/finance and science/math applicants, who seem to cluster around big cities. 4
  • 5. Part IV Variables of the Study and How They Were Measured Variables of interest for this study centered around industry type, geographical location, and salary. The variables Industry and Region were created from the variables Job Title and State, respectively. The levels of Industry are Finance and business, Health, Education, Science/Math, and Engineering and Computers, while the levels of Region are Midwest, Northeast, South, and West. We chose to categorize Industry by these fields because we felt these professions were encompassing of the careers of H1-B visa workers. Additionally, Case Status initially included four different levels: Certified, Denied, Hold, and Pending. Since we are primarily concerned whether a candidate will be approved or not we subsetted the data to include only certified and denied cases. Variable Description Measurement Industry Industry of Employment Categorical Region Region of Employment Categorical Case Status Approval status- certified or denied Categorical Wage_Rate_From_1 Employer’s proposed wage rate Numerical Table 1: Table of Main Variables Part V Statistical Methods 1 Logistic Regression Our initial focus for this research was to classify whether or not a candidate’s visa application would be accepted or denied, and to determine which factors are most important to being approved. Because our response was binary (“Accepted” vs. “Denied”), we decided to run a logistic regression. Logistic regression would allow us to classify the visa petitions while steering clear of other complex models that could possibly overfit the data. To begin, we wanted to ensure we had a portion of the data to test our model with, so we cut 30% of the cases into a testing dataset and used the remaining 70% of the cases to build our model. We built our logistic regression model using backward selection on the training dataset to see which model provided the lowest Akaike Information Criterion (AIC). 
The AIC punishes models for complexity, thus preventing a model from overfitting the data that it was constructed with. From there we used cross-validation techniques to see which of the models with the lowest AIC also provided the lowest test mean square error. When tabling our success rate, we learned that our model performed well when predicting an approved visa but performed very poorly when predicting a denied visa. To investigate why our model’s accuracy was highly skewed we tabled the data and saw that the data itself was significantly skewed towards approved visas, where approved visas made up 99% of the observations. Moving forward, we then decided to construct a simple logistic regression models with each predictor to decide which of the levels of each variable were above the 95% significance level. The simple logistic regression provided insight that we applied to our next method, contingency tables. 2 Contingency Tables Due to the limitations of our data (99% of cases are certified), we used contingency tables to provide deeper analyses of the relationships between region, industry, and wage. Contingency tables were beneficial in com- paring the frequencies of different combinations of industries and location. Getting a better understanding 5
  • 6. of the popularity of industries in different locations allowed allows us to create a profile of characteristics of H1-B visa workers and employers, and hopefully will allow us to inform our peers and others of ideal places to work. Part VI Summary of Findings After building our logistic regression models we found that most of the models we obtained gave a successful prediction rate of around 99%. Since 99% of the applicants in our data set were “certified”, the logistic regression by itself was not appropriate to answer our questions. All the variables, except for the wage, were statistically insignificant. See table 2. Therefore, we ran three simple logistic regression models to inform us of which levels of factors were potentially significant for industry and region, and then used contingency tables to analyze the prominence of those factors. For every one unit increase in wage, the likelihood of certification increases by 1. Looking at the con- tingency tables, we can see that applicants that have applied for a H1-B visa in the health industry, versus the business & finance industry, changes the odds by 1.4. On the other hand, . . . . (Can we add another example here because I’m still not understanding the table of odds) We see that the health and business & finance industries have the greatest positive impacts on certification. We then investigated region through the use of contingency tables. We began by looking at the candidates in the Northeast region of the United States. The majority (33%) of candidates applied in the Northeast while only 16% applied in the Midwest (see table 3). After investigating the candidates by industry, we found that the science & math industry dominates in popularity, with the vast majority of immigrants (68.7%) applying for work in these fields (see table 3). 
It is our belief that the United State’s gap in science & math education - in comparison to other developed countries - is one reason for the popularity of the industry. Although there is a relatively small percentage of US natives seeking careers in science and math there is a great interest of foreign individuals wanting to obtain visas to work in the U.S.’s in these fields. Variable Coefficient p-value Intercept -3.18 <2e-16 Wage_Rate_From_1 -2.46e-5 < 2e-16 Odds 0.04 1.00 yi = 3.18 + WageRateF rom1 ⇤ xi Table 2: Logistic Regression: Wage Rate For every one unit increase in wage, the likelihood of certification increases by 1. Variable Coefficient p-value Intercept -5.99 <2e-16 Industry:Education -0.1359 0.4495 Industry:Health 0.3960 0.0313 Industry:Science & Math -0.3169 0.0165 Odds 0.003 0.87 1.4 0.72 yi = 5.99 + Industry ⇤ xi Table 3: Logistic Regression: Industry Having applied for a H1-B visa in the Health industry, versus the Business & Finance industry, changes the odds by 1.4. We see that the Health and Business & Finance industries have the greatest positive impacts on certification. We then investigated region through the use of two by two tables. We began by looking at the candidates in the Northeast region of the United States. The majority (33%) of candidates applied in the Northeast 6
  • 7. while only 16% applied in the Midwest region (see table 3). After investigating the candidates by industry, we found that the Science and Math industry dominates in popularity, with the vast majority of immigrants (68.7%) applying for work in these fields (see table 3). It is our belief that the United State’s gap in science/math education - in comparison to other developed countries- is one reason for the popularity of the industry. Although there is a relatively small percentage of U.S. natives seeking careers in science and math there is a great interest of foreign individuals wanting to obtain visas to work in the U.S.’s in these fields. Northeast West Midwest South Number of Applicants 101,865 (33.3%) 66,304 (21.7%) 51,110 (16.7%) 86,547 (28.3%) Median Prevailing Wage $58,000 $70,620 $55,000 $55,000 Table 4: Applicants, Wage by Region Business&Finance Education Health Science & Math 41,487 (13.5%) 35,231 (11.5%) 19,206 (6.27%) 210,329 (68.7%) Table 5: Applicants by Industry Figure 2: Barplot of Industries by Region Because we are only interested in the characteristics of the applicants that get certified, we subsetted the data by removing all denied cases. Looking at Table 7, we are able to see that in the business & finance industry, 38.9% of the people that are certified reside in the Northeast, while only 11.50% of the people certified in business & finance reside in the Midwest. One possible explanation of these results is that the low population density and small number of big cities causes immigrants in the business & finance industry to avoid the Midwest. Interestingly enough, only 21.27% of the applicants are certified to work in the West. We expect this is because the Northeast contains NYC and therefore Wall Street, which carries a big name and is the central hub for finance in the US, drawing people away from the Midwest and West coast. 
Interestingly, 39.5% of the certified foreigners in the education industry (Table 7) are migrating to the South. Looking at the two-way table of the median 7
  • 8. start wage by region and by industry in conjunction with Figure 2, we see that all of the regions pay the highest in the health industry except the Northeast, yet the majority of approved applicants are moving to the Northeast in spite of this fact. West Midwest Northeast South Business & Finance 0.1336 0.0933 0.1584 0.1339 Education 0.0959 0.1226 0.0835 0.1607 Health 0.0522 0.0782 0.0638 0.0599 Science and Math 0.7181 0.7057 0.6941 0.6453 Table 6: Proportion of Certified Applicants by Region West Midwest Northeast South Business & Finance 0.2317 0.1149 0.3892 0.2796 Education 0.1806 0.1779 0.2415 0.3950 Health 0.1805 0.2084 0.3393 0.2703 Science and Math 0.2263 0.1714 0.3361 0.2654 Table 7: Proportion of Certified Applicants Applying by Industry West Midwest Northeast South Business & Finance $61,000 $57,600 $62,000 $53,672 Education $47,500 $45,000 $47,000 $42,500 Health $85,000 $65,000 $56,650 $71,760 Science and Math $75,000 $55,000 $58,240 $58,000 Table 8: Two Way Table of Median Wage To summarize, only the business & finance industry followed the region that pays the most. For the health industry, the West pays the H-1B applicants the most; however, 33.9% of the applicants in health are migrating to the Northeast while only 18% are certified in the West. The same is true of the science & math industry: 33% of certified H-1B foreigners are migrating to the Northeast, even though the West pays 1.28 times more. Also, looking at the map (Figure 1), we see that the jobs in the health and education industries are distributed in a similar way to the distribution of the US population, indicating an evenly growing demand throughout the US for education and health workers. On the other hand, the jobs in business & finance and science & math are more clustered in bigger US cities, indicating a need for large populations for these industries and possibly suggesting a need for faster communication in these industries. 
Part VII Conclusions Drawn From the Study According to Table 4 the Midwest, when compared to the Northeast is a popular location for immigrant workers granted H1-B visas status for the Education, Engineering/ Comp. Sciences, Finance and Business, and Science and Math but the Health industry. It also appears as though the Northeast, when compared to the South, is a less popular location of work for those same industries, excluding Health. After analyzing all the contingency tables, we discovered that in both Business & Finance industry and for the Education industry, the highest proportion of applicants for the H-1B visa tend to follow the regions where the prevailing wage is the highest. However, for the Health industry, the Midwest actually pays the most; however, 33.9% of the applicants in Health are migrating to the Northeast instead, while only 20% are migrating to the Midwest. The same happens for Science and Math; 33.6% of certified H-1B foreigners are 8
  • 9. migrating to the Northeast, even though the West pays 1.19 times more. We recommend that individuals seeking H1-B visa status in the field of: • Health to move to the Midwest • Business and Finance to move to the west, but slightly less chances of getting certified science • Math are wanted in all regions but to get the highest pay to move to the west. Part VIII Shortcomings The shortcomings of our research are due to a non-representative sample of H1-B visa applicants. Given a sample with a more rate of visa approval more consistent with reality would have allowed us to create a more ideal logistic regression model. Moreover, a lack of information about applicants, i.e. gender, country of origin, networth, etc., kept us from drawing more comprehensive conclusions about H1-B applicants and the likelihood of certification. Because of this, we have too many unknown covariates to account for. Working with a dataset that was highly skewed was problematic but at this time in the project we had already ran into issues. We were initially assigned an education dataset. However, given that we had already worked with similar data in another class we felt it best to gain exposure in something different. Since all of our groupmates were interested in criminal data we found a dataset that surveyed prisoners. Unfortunately, this dataset had a lot of missing information and asked over three thousand questions. Thus, we ultimately moved toward the dataset of H-1B visas which appealed to our entire group and was rich with information. Part IX Recomendation for Future Research We recommend doing a more comprehensive analysis using both employer data and applicant data. 
The country of origin or the gender of the applicant might have a more significant impact on visa approval than the location of employment or the industry.

Part X R Code

load("~/Downloads/H1B.RData")
# Getting Only Cases that are Certified or Denied
certified=which(h1b$Case_Status=="Certified")
denied=which(h1b$Case_Status=="Denied")
keep=c(certified,denied)
hb=h1b[keep,]
#Getting to know the data
#Dealing with dates for the glm
#What time of the year is the petition submitted?
library(lubridate)
hb$timeyr=month(as.POSIXlt(hb$Submitted_Date, format="%d/%m/%Y")) #Gets the month that it was submitted
#Breaks down into what part of the year
hb$timeoyr=rep("End",length(hb$timeyr))
hb$timeoyr[which(hb$timeyr<9)]="Middle"
hb$timeoyr[which(hb$timeyr<5)]="Beginning"
#Variable: how long would the visa last (in years)?
hb$visa_last=as.numeric(hb$End_Date-hb$Begin_Date)/365.25 #365.25 days per year on average
hb$visa_last=round(as.numeric(hb$visa_last),1)
#Cleaning up State Variable and Creating Regions
#States were sorted according to Regions using the Census data
hb$State=as.character(hb$State)
#Rows whose State entry is longer than a two-letter code
temp=nchar(hb$State, type = "chars", allowNA = FALSE)
st.rm=which(temp>2)
hb$State[which(hb$State=="Newton")]="MA"
hb$State[which(hb$State=="Seattle")]="WA"
hb$State[which(hb$State=="Chantilly")]="VA"
hb$State[which(hb$State=="New York")]="NY"
#Create a Region Variable to get rid of all the levels in State
hb$Region<-rep("",dim(hb)[1])
NE<-c("CT","PA","NJ","NY","RI","NH","VT","ME","MA")
MW<-c("ND","SD","NE","KS","MO","IA","MN","WI","IL","IN","OH","MI")
S<-c("MD","DE","DC","VA","WV","KY","TN","GA","AL","MS",
     "FL","AR","LA","OK","TX","NC","SC")
W<-c("WA","ID","MT","WY","CO","UT","AZ","NM","NV","CA","OR","AK","HI")
hb$Region[which(hb$State %in% NE)]="NorthEast"
hb$Region[which(hb$State %in% MW)]="MidWest"
hb$Region[which(hb$State %in% S)]="South"
hb$Region[which(hb$State %in% W)]="West"
#Recoding Occupational Code by Industry
hb$Industry=rep(NA,dim(hb)[1])
hb$Job_Title=as.character(hb$Job_Title)
hb$Job_Title<-iconv(enc2utf8(hb$Job_Title),sub="byte")
hb$Job_Title=tolower(hb$Job_Title) #keeps Job_Title a character vector
#Engineers and Computer Related
archit<-grep("archit",hb$Job_Title)
eng<-grep("engin",hb$Job_Title)
tech<-grep("tech",hb$Job_Title)
comp<-grep("com",hb$Job_Title)
prog<-grep("progr",hb$Job_Title)
dev<-grep("developer",hb$Job_Title)
sysans<-grep("systems analyst",hb$Job_Title)
sysan<-grep("system analyst",hb$Job_Title)
soft<-grep("software",hb$Job_Title)
CE<-c(archit,eng,tech,comp,prog,dev,sysans,sysan,soft)
hb$Industry[CE]="Engineering & Computer"
#Science and Math
dat<-grep("data",hb$Job_Title)
math<-grep("math",hb$Job_Title)
stats<-grep("statist",hb$Job_Title)
chem<-grep("chemis",hb$Job_Title)
SM<-c(dat,math,stats,chem)
hb$Industry[SM]="Science & Math"
#Health
med<-grep("medic",hb$Job_Title)
clinic<-grep("clinic",hb$Job_Title)
phys<-grep("physic",hb$Job_Title)
physo<-grep("physio",hb$Job_Title)
dentist<-grep("denti",hb$Job_Title)
dental<-grep("dental",hb$Job_Title)
pathol<-grep("pathol",hb$Job_Title)
pharm<-grep("pharm",hb$Job_Title)
sci<-grep("scientist",hb$Job_Title)
olog<-grep("ologist",hb$Job_Title)
nurse<-grep("nurse",hb$Job_Title)
trist<-grep("trist",hb$Job_Title)
ped<-grep("pediat",hb$Job_Title)
H<-c(med,clinic,phys,physo,dentist,dental,pathol,pharm,sci,olog,nurse,trist,ped)
hb$Industry[H]="Health"
#Finance and Business
acc<-grep("account",hb$Job_Title)
actuar<-grep("actuar",hb$Job_Title)
fin<-grep("finan",hb$Job_Title)
budg<-grep("budg",hb$Job_Title)
econ<-grep("econom",hb$Job_Title)
bus<-grep("busine",hb$Job_Title)
assoc<-which(hb$Job_Title=="associate")
re<-grep("real",hb$Job_Title)
mark<-grep("market",hb$Job_Title)
pm<-grep("project manage",hb$Job_Title)
sales<-grep("sales",hb$Job_Title)
BIZ<-c(acc,actuar,fin,budg,econ,bus,assoc,re,mark,pm,sales)
hb$Industry[BIZ]="Finance and Business"
#Education
prof<-grep("prof",hb$Job_Title)
fel<-grep("fellow",hb$Job_Title)
res<-grep("research associate",hb$Job_Title)
phd<-grep("postdoc",hb$Job_Title)
teach<-grep("teacher",hb$Job_Title)
lect<-grep("lectur",hb$Job_Title)
instr<-grep("instruct",hb$Job_Title)
ot<-grep("occupational therapist",hb$Job_Title)
pd<-grep("post doctoral",hb$Job_Title)
EDU<-c(prof,fel,res,phd,teach,lect,instr,ot,pd)
hb$Industry[EDU]="Education"
#Law
law<-grep("law",hb$Job_Title)
attorn<-grep("attor",hb$Job_Title)
leg<-grep("legal",hb$Job_Title)
L<-c(law,attorn,leg)
hb$Industry[L]="Law"
#Arts
design<-grep("designe",hb$Job_Title)
graph<-grep("graphic",hb$Job_Title)
fash<-grep("fash",hb$Job_Title)
ARTS<-c(design,graph,fash)
hb$Industry[ARTS]="Arts"
#Public Service
urb<-grep("urban",hb$Job_Title)
social<-grep("social",hb$Job_Title)
worker<-grep("worker",hb$Job_Title)
SOC<-c(urb,social,worker)
hb$Industry[SOC]="Public Service"
#Every job title not matched above is labelled Other
hb$Industry[is.na(hb$Industry)]= "Other"
#Inspect the job titles that none of these keyword groups matched
phony<-c(eng,comp,acc,fin,prog,soft,prof,fel,res,phd,sysan,sysans,dev,dat,fash,tech,teach,law,attorn,
         actuar,lect,mark,math,design,phys,physo,dentist,pathol,budg,econ,pharm,sci,stats,bus,
         clinic,social,med,pm,sales,leg,re,archit,olog,nurse,chem,trist,dental,worker,instr,assoc,
         ot)
View(hb[-phony,])
ph<-as.character(hb$Job_Title[-phony])
##########
########## End of Cleaning Code.
########## Following R Code utilizes the cleaned data set we submitted
##########
#********************************BEGIN USE OF CLEANED DATA SET***********************************#
hb$Industry<-relevel(as.factor(hb$Industry),"Engineering & Computer")
#splitting into training and testing sets (30% of the rows are held out for testing)
set.seed(7)
test=sample(1:dim(hb)[1],(dim(hb)[1]*.3))
train=hb[-test,]
testing=hb[test,]
#######MODELING ATTEMPTS######################################
######t<-glm(Case_Status~visa_last,data=hb,family="binomial")
######tt<-glm(Case_Status~timeoyr,data=hb,family="binomial")
######ttt<-glm(Case_Status~Nbr_Immigrants,data=hb,family="binomial")
######t4<-glm(Case_Status~Prevailing_Wage_1,data=hb,family="binomial")
######t5<-glm(Case_Status~Region,data=hb,family="binomial",subset=train)
######total<-glm(Case_Status~visa_last+Region+timeoyr+Program.Designation+
######Nbr_Immigrants+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1+
###### Prevailing_Wage_1,data=train,family="binomial")
######step(total,direction="backward")
######sb1<-glm(formula = Case_Status ~ visa_last + timeoyr + Program.Designation +
###### Wage_Rate_From_1 + Wage_Rate_Per_1 + Part_Time_1 + Prevailing_Wage_1,
###### family = "binomial", data = train) #AIC 9549
######FUll<-glm(Case_Status~visa_last+timeoyr+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1+
###### Prevailing_Wage_1+Withdrawn+Industry,data=train,family="binomial") #AIC 9580
######step(FUll,direction="backward")
######AIC(FUll)
#full with Visa last interactions/ AIC 9356
######FwithvlI<-glm(Case_Status~visa_last+timeoyr+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1
###### Prevailing_Wage_1+Withdrawn+Industry+visa_last*timeoyr+visa_last*Wage_Rate_From_1+
###### visa_last*Wage_Rate_From_1+visa_last*Wage_Rate_Per_1+visa_last*Part_Time_1+
###### visa_last*Prevailing_Wage_1+visa_last*Withdrawn+visa_last*Industry,data=train,family="binomial")
#full with visa_last & time of year interaction/AIC 9287.995
######F2I<-glm(Case_Status~visa_last+timeoyr+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1+
###### Prevailing_Wage_1+Withdrawn+Industry+visa_last*timeoyr+visa_last*Wage_Rate_From_1+
###### visa_last*Wage_Rate_From_1+visa_last*Wage_Rate_Per_1+visa_last*Part_Time_1+
###### visa_last*Prevailing_Wage_1+visa_last*Withdrawn+visa_last*Industry+
###### timeoyr*Wage_Rate_From_1+timeoyr*Wage_Rate_Per_1+timeoyr*Part_Time_1+
###### timeoyr*Prevailing_Wage_1+timeoyr*Withdrawn+timeoyr*Industry,data=train,family="binomial")
#AIC=58633
######F3I<-glm(Case_Status~visa_last+timeoyr+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1+
###### Prevailing_Wage_1+Withdrawn+Industry+visa_last*timeoyr+visa_last*Wage_Rate_From_1+
###### visa_last*Wage_Rate_From_1+visa_last*Wage_Rate_Per_1+visa_last*Part_Time_1+
###### visa_last*Prevailing_Wage_1+visa_last*Withdrawn+visa_last*Industry+
###### timeoyr*Wage_Rate_From_1+timeoyr*Wage_Rate_Per_1+timeoyr*Part_Time_1+
###### timeoyr*Prevailing_Wage_1+timeoyr*Withdrawn+timeoyr*Industry+
###### Wage_Rate_From_1*Wage_Rate_Per_1+Wage_Rate_From_1*Part_Time_1+
###### Wage_Rate_From_1*Prevailing_Wage_1+Wage_Rate_From_1*Withdrawn+Wage_Rate_From_1*Indu
###### data=train,family="binomial")
######F4I<-glm(Case_Status~visa_last+timeoyr+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1+
###### Prevailing_Wage_1+Withdrawn+Industry+visa_last*timeoyr+visa_last*Wage_Rate_From_1+
###### visa_last*Wage_Rate_From_1+visa_last*Wage_Rate_Per_1+visa_last*Part_Time_1+
###### visa_last*Prevailing_Wage_1+visa_last*Withdrawn+visa_last*Industry+
###### timeoyr*Wage_Rate_From_1+timeoyr*Wage_Rate_Per_1+timeoyr*Part_Time_1+
###### timeoyr*Prevailing_Wage_1+timeoyr*Withdrawn+timeoyr*Industry+
###### Wage_Rate_From_1*Wage_Rate_Per_1+Wage_Rate_From_1*Part_Time_1+
###### Wage_Rate_From_1*Prevailing_Wage_1+Wage_Rate_From_1*Withdrawn+Wage_Rate_From_1*Indu
###### Part_Time_1*Prevailing_Wage_1+Part_Time_1*Withdrawn+Part_Time_1*Industry
###### ,data=train,family="binomial")
#Completely full with all possible interaction terms
###### CFI<-glm(Case_Status~visa_last+timeoyr+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1+
###### Prevailing_Wage_1+Withdrawn+Industry+visa_last*timeoyr+visa_last*Wage_Rate_From_1+
###### visa_last*Wage_Rate_From_1+visa_last*Wage_Rate_Per_1+visa_last*Part_Time_1+
###### visa_last*Prevailing_Wage_1+visa_last*Withdrawn+visa_last*Industry+
###### timeoyr*Wage_Rate_From_1+timeoyr*Wage_Rate_Per_1+timeoyr*Part_Time_1+
###### timeoyr*Prevailing_Wage_1+timeoyr*Withdrawn+timeoyr*Industry+
###### Wage_Rate_From_1*Wage_Rate_Per_1+Wage_Rate_From_1*Part_Time_1+
###### Wage_Rate_From_1*Prevailing_Wage_1+Wage_Rate_From_1*Withdrawn+Wage_Rate_From_1*Indu
###### Part_Time_1*Prevailing_Wage_1+Part_Time_1*Withdrawn+Part_Time_1*Industry+
###### Prevailing_Wage_1*Withdrawn+Prevailing_Wage_1*Industry+Withdrawn*Industry
###### ,data=train,family="binomial")
######step(CFI,direction="backward")
####### Semi full model, missing some interaction terms
######Csf<-glm(Case_Status~visa_last+timeoyr+Wage_Rate_From_1+Wage_Rate_Per_1+Part_Time_1+
###### Prevailing_Wage_1+Withdrawn+Industry+visa_last*timeoyr+visa_last*Wage_Rate_From_1+
###### visa_last*Wage_Rate_From_1+visa_last*Wage_Rate_Per_1+visa_last*Part_Time_1+
###### visa_last*Prevailing_Wage_1+visa_last*Withdrawn+visa_last*Industry+
###### timeoyr*Wage_Rate_From_1+timeoyr*Wage_Rate_Per_1+timeoyr*Part_Time_1+
###### timeoyr*Prevailing_Wage_1+timeoyr*Withdrawn+timeoyr*Industry+
###### Wage_Rate_From_1*Wage_Rate_Per_1+Wage_Rate_From_1*Part_Time_1+
###### Wage_Rate_From_1*Prevailing_Wage_1+Wage_Rate_From_1*Withdrawn+Wage_Rate_From_1*Indu
###### Part_Time_1*Prevailing_Wage_1+Part_Time_1*Withdrawn+Part_Time_1*Industry,
###### data=train,family="binomial")
######step(Csf,direction="backward")
######phony<-glm(Case_Status~Wage_Rate_From_1+Part_Time_1+
###### Prevailing_Wage_1+Withdrawn+visa_last*Wage_Rate_From_1+visa_last*Part_Time_1+
###### visa_last*Prevailing_Wage_1+visa_last*Withdrawn
###### ,data=train,family="binomial")
################################################################
#Simplify the Industry Data
table(hb$Industry)
#start from an all-NA column so the assignments below always match the number of rows
hb$newIndustry <- NA_character_
hb$newIndustry[hb$Industry %in% c("Engineering & Computer", "Science & Math")] <- "Science & Math"
hb$newIndustry[hb$Industry %in% c("Finance and Business")] <- "Business and Finance"
hb$newIndustry[hb$Industry %in% c("Health")] <- "Health"
hb$newIndustry[hb$Industry %in% c("Education")] <- "Education"
#dropping the unused levels of Case_Status
hb$Case_Status <- as.character(hb$Case_Status)
hb$Case_Status <- factor(hb$Case_Status)
#disregard all observations that are not paid by year
newhb <- hb[hb$Wage_Rate_Per_1 %in% c("Year"),]
#then keep only the four simplified industries
newhb <- newhb[newhb$newIndustry %in% c("Science & Math", "Business and Finance", "Health",
                                        "Education"),]
newhb2 <- newhb
# subset to use for Tableau
h1bTab= subset(hb, Case_Status=="Certified", select=c(State, Region, Industry,
               Wage_Rate_From_1, Zip_Code))
write.csv(h1bTab, "h1bTab.csv")
# Contingency Table of Certified Cases
table(newhb$Region, newhb$Industry, newhb$Case_Status)
#Contingency tables
table(newhb2$Case_Status)
table(newhb2$newIndustry)
table(newhb2$Region[!newhb2$Region==""])
tapply(newhb2$Wage_Rate_From_1[!newhb2$Region==""], newhb2$Region[!newhb2$Region==""], median)
tapply(newhb2$Wage_Rate_From_1, newhb2$newIndustry, median)
tapply(newhb2$Wage_Rate_From_1[!newhb2$Region==""],
       list(newhb2$newIndustry[!newhb2$Region==""],
            newhb2$Region[!newhb2$Region==""]), median)
#The Median Starting Wage in Business and Finance is Highest in the NorthEast
#The Median Starting Wage in Education is Highest in the South
#The Median Starting Wage in Science and Math is Highest in the West
#The Median Starting Wage in Health is Highest in the West
#Distribution of People who are certified
table(newhb$newIndustry[newhb$Case_Status=="Certified"])
table(newhb$Region[newhb$Case_Status=="Certified"])
table(newhb$Region[newhb$Case_Status=="Certified"],
      newhb$newIndustry[newhb$Case_Status=="Certified"])
#Row proportions: share of each industry within a region
prop.table(table(newhb$Region[newhb$Case_Status=="Certified"],
                 newhb$newIndustry[newhb$Case_Status=="Certified"]), 1)
#In all of the Regions, we see that the highest percentage of people that are
#getting denied are all from Science and Math Industry.
#The lowest denying rate in the Midwest is in the Business and Finance Industry
#The lowest denying rate in the Northeast is in Education and Health
#The lowest denying rate in the West is in Education
#The lowest denying rate in the South is in the Health Industry
#Column proportions: share of each region within an industry
prop.table(table(newhb$Region[newhb$Case_Status=="Certified"],
                 newhb$newIndustry[newhb$Case_Status=="Certified"]), 2)
#In the Business and Finance Industry the lowest denying rate is from the NorthEast Region
#In the Education Industry the lowest denying rate is from the South Region
#In the Health Industry the lowest denying rate is from the Midwest Region
#In the Science and Math Industry the lowest denying rate is from the NorthEast Region
## ANOTHER, MUCH SIMPLER LOGISTIC REGRESSION
#Creating training and testing data. 70% Training and 30% Testing data
set.seed(55555)
train <- newhb2[sample(nrow(newhb2), nrow(newhb2)*.70), ]
test <- newhb2[!(row.names(newhb2) %in% row.names(train)),]
#All simple models are statistically significant
glm1 <- glm(Case_Status~Region, data=train, family="binomial")
summary(glm1)
glm2 <- glm(Case_Status~newIndustry, data=train, family="binomial")
summary(glm2)
glm3 <- glm(Case_Status~Wage_Rate_From_1, data=train, family="binomial")
summary(glm3)
#2 factors+interaction model+covariate model
glm.model <- glm(Case_Status~Region+factor(newIndustry)+
                   Wage_Rate_From_1+Region:newIndustry, data=train, family="binomial")
summary(glm.model)
#REGIONS ARE NOT STATISTICALLY SIGNIFICANT
#Only the WAGE IS STATISTICALLY SIGNIFICANT
#THE TYPE OF INDUSTRY IS NOT SIGNIFICANT
#NONE OF THE INTERACTIONS ARE SIGNIFICANT.
#CHECKING PREDICTION ERROR
pred.vals=predict(glm.model, test, type="response")
#Predicted probabilities above the median are classified as Denied
pred=ifelse(pred.vals >median(pred.vals), "Denied", "Certified")
table(pred, test$Case_Status)/(length(pred.vals))
error <- 1-sum(diag(table(pred, test$Case_Status)/(length(pred.vals))))/
  sum(table(pred, test$Case_Status)/(length(pred.vals)))
error
#NULL DEVIANCE TEST
glm.model$null.deviance
glm.model$df.null
#Note: 1 degree of freedom is used here; a likelihood-ratio test against the
#intercept-only model would use glm.model$df.null-glm.model$df.residual
pchisq(glm.model$null.deviance-glm.model$deviance, 1, lower.tail=FALSE)
#Very small pvalue, therefore, we reject the null hypothesis, indicating that we have
#statistically significant evidence that the slope of the logistic regression line is not equal to zero
#RESIDUAL DEVIANCE TEST
pchisq(glm.model$deviance,glm.model$df.residual,lower.tail=FALSE)
#VERY large pvalue. We fail to reject our null hypothesis.
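The null deviance test in the R code above compares the fitted model with an intercept-only model. A minimal sketch of that comparison on simulated data follows. The data frame toy, its columns region, wage, and status, and the simulated relationship between wage and denial are all hypothetical and are not drawn from the H-1B data set. In this sketch the degrees of freedom are taken as the difference between df.null and df.residual, the usual choice for a likelihood-ratio test, and the statistic is cross-checked with anova().

```r
# Simulated stand-in data; none of these values come from the H-1B data set.
set.seed(1)
n <- 500
toy <- data.frame(
  region = factor(sample(c("MidWest", "NorthEast", "South", "West"),
                         n, replace = TRUE)),
  wage   = round(runif(n, 40, 150))  # hypothetical wage in thousands of dollars
)
# Assumption of the simulation: the chance of denial falls as the wage rises.
p_denied <- plogis(1 - 0.03 * toy$wage)
toy$status <- factor(ifelse(runif(n) < p_denied, "Denied", "Certified"))

# With levels Certified < Denied, glm models the probability of "Denied".
fit <- glm(status ~ region + wage, data = toy, family = "binomial")

# Likelihood-ratio statistic: drop in deviance from the intercept-only model.
lr_stat <- fit$null.deviance - fit$deviance
# Degrees of freedom: number of coefficients added beyond the intercept.
lr_df <- fit$df.null - fit$df.residual
lr_p  <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# Cross-check: anova() on the two nested models reports the same statistic.
null_fit <- glm(status ~ 1, data = toy, family = "binomial")
lr_check <- anova(null_fit, fit, test = "Chisq")
```

Here lr_df counts the three region contrasts plus the wage slope, so the reference chi-squared distribution reflects every term the model adds to the intercept.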
Appendix

Variable                   Description
Submitted_Date             Date the application was submitted
Program.Designation        Types of H-1B Visas
Employer_Name              Employer's name
Address_1                  Employer's address
City                       Employer's city
State                      Employer's state
Zip_Code                   Employer's postal code
Nbr_Immigrants             Number of job openings
Begin_Date                 Proposed begin date
End_Date                   Proposed end date
Job_Title                  Job title
DOL_DecisionDate           Date certified or denied
Certified_Begin_Date       Certification start date
Certified_End_Date         Certification end date
Occupation_Code            Three digit occupational group
Case_Status                Approval status: certified or denied
Wage_Rate_From_1           Employer's proposed wage rate
Wage_Rate_Per_1            Unit of pay for proposed wage rate
Wage_Rate_To_1             Maximum proposed wage rate
Part_Time_1                Y = Part time; N = Full time position
Work_City_1                Work city (location of the job opening)
Work_State_1               Work state
Prevailing_Wage_1          Prevailing wage rate
Prevailing_Wage_Source_1   Collective bargaining; SESA; Other
Year_Source_Published_1    Year that the prevailing wage data was published
Other_Wage_Source_1        Year that the prevailing wage data was published
Other_Wage_Source_2        Description of the Other wage source
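The regional comparison in Part VII rests on two summaries: the share of each industry's applicants found in each region, and the median proposed wage for each industry and region. A minimal sketch of both, with invented numbers, is given below. The column names Region, newIndustry, and Wage_Rate_From_1 follow the R code in Part X, but every count and wage in the data frame toy is hypothetical, so the resulting ratio is not one of the study's findings.

```r
# Toy records; the counts and wages below are invented for illustration only.
toy <- data.frame(
  Region = c("NorthEast", "NorthEast", "NorthEast", "MidWest", "MidWest",
             "West", "West", "West", "South", "South"),
  newIndustry = c("Health", "Health", "Education", "Health", "Education",
                  "Health", "Education", "Education", "Health", "Education"),
  Wage_Rate_From_1 = c(60000, 64000, 50000, 80000, 48000,
                       70000, 52000, 56000, 58000, 61000)
)

# Share of each industry's applicants found in each region (columns sum to 1).
region_share <- prop.table(table(toy$Region, toy$newIndustry), margin = 2)

# Median proposed wage for every industry/region pair.
median_wage <- tapply(toy$Wage_Rate_From_1,
                      list(toy$newIndustry, toy$Region), median)

# Region with the most Health applicants vs. region with the highest Health wage.
top_share_region <- rownames(region_share)[which.max(region_share[, "Health"])]
top_wage_region  <- colnames(median_wage)[which.max(median_wage["Health", ])]

# How many times more the best-paying region pays than the most popular one.
wage_ratio <- median_wage["Health", top_wage_region] /
  median_wage["Health", top_share_region]
```

When top_share_region and top_wage_region differ, as they do in this toy data, applicants are not concentrating where the median wage is highest, which is the pattern Part VII describes for Health and for Science and Math.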