Bridge_US

IMMIGRATION SIMPLIFIED
Classifying Immigration Documents for Bridge
US
By Baolin Liu

About Bridge US:
• Immigration service provider that
streamlines the US visa application
process for employers.
• Works with companies like Getaround,
Allegiant, and American Family Insurance.
• Over 50k immigration related documents
managed.

Motivation
• Typical process includes over 20 files
• Identifying documents takes time
• Files are misplaced
• No immediate feedback

Goals:
• Accurately identify documents
• Guide users to provide what’s missing
• Real-time feedback
• less intensive document oversight

Data Collect and “Munging”
• Text Extraction with Textract (OCR)
• Regex commands
• Stemmed the words
• PySpark
Before: After:

Classification Algorithm
• 2 pathways:
-Selectable text
-Image text
• Assigned probability
http://cheaper-than-tuition.com/shop/fake-diploma/

Modeling: SVC
• Linear Kernel
• Handled imbalanced classes
• Dealt with low samples.
http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html

Modeling: Random Forest
http://blog.yhat.com/posts/random-forests-in-python.html
• Repeated sampling, or “bootstrapping”
• “Wisdom of the Crowd”

Confusion Matrix
SVC: 48 misclassified
Random Forest: 60 misclassified

Cons for both models:
• SVC: False high confidence
• Random Forest: Bad at low samples
Winner: SVC-imbalanced classes more important

Moving code to
production:
• Classify only the first page
• Classify the document as an Image
• Took out spell corrector
• Model Persistence, pickling
• Py files to launch from command line
• AWS

Production Testing
Data:
• 1664 PDF files from 87 case files
Initial Findings:
• more classes: job offer, wage surveys, tax returns, org charts, W-2
Did Well:
• government forms, pay stubs and resumes
Did Moderate:
• Transcripts, Diplomas, Birth Certificates, Marriage Certificates
Did Poor:
• Passports

Preliminary Data Engineering
Architecture
Reporting Bucket
Case files, monitored for changes Temp Folder, Classify Files
https://www.eyemovic.com/blog_it/6749.php
https://www.conceptdraw.com
Boto, the S3 interface
Sends details of this bucket
Keep “Misclassified” Documents

Future Work:
• Retrain the model
• Apply statistics
• Add to Data Engineering Architecture
• Apply an Image classifier

Retrain Model:
• Use boxplot to identify outliers
of low probability that don’t
belong to that class
• only below *minimum*
• Convert all documents into
matrix form
• Use Dot product to find
document similarity (0,1)
• document similarity = hashvector
* hashvector.T
http://stackoverflow.com/questions/8897593/similarity-between-two-text-documents

Retrain Model(cont):
• Use heat map or correlation matrix to identify new
groups

Retrain Model(cont)
• Look at the file contents, give the documents a label, repeat to validate

Retrain Model(cont):
• Use statistics to identify sample size for class
• Bigger standard deviation, more samples(randomized) to
needed capture the variance
myweb.facstaff.wwu.edu

Final Thoughts
• No one else has this technology in
immigration
• Used all my skillsets
• Learned a whole new set of skills

Bridge_US

More Related Content

Viewers also liked

Similar to Bridge_US

Bridge_US

Editor's Notes