IMMIGRATION SIMPLIFIED
Classifying Immigration Documents for Bridge
US
By Baolin Liu
About Bridge US:
• Immigration service provider that
streamlines the US visa application
process for employers.
• Works with companies like Getaround,
Allegiant, and American Family Insurance.
• Over 50k immigration related documents
managed.
Motivation
• Typical process includes over 20 files
• Identifying documents takes time
• Files are misplaced
• No immediate feedback
Goals:
• Accurately identify documents
• Guide users to provide what’s missing
• Real-time feedback
• less intensive document oversight
The Data:
Data Collect and “Munging”
• Text Extraction with Textract (OCR)
• Regex commands
• Stemmed the words
• PySpark
Before: After:
Classification Algorithm
• 2 pathways:
-Selectable text
-Image text
• Assigned probability
http://cheaper-than-tuition.com/shop/fake-diploma/
Modeling: SVC
• Linear Kernel
• Handled imbalanced classes
• Dealt with low samples.
http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html
Modeling: Random Forest
http://blog.yhat.com/posts/random-forests-in-python.html
• Repeated sampling, or “bootstrapping”
• “Wisdom of the Crowd”
Confusion Matrix
SVC: 48 misclassified
Random Forest: 60 misclassified
ROC Plots
Cons for both models:
• SVC: False high confidence
• Random Forest: Bad at low samples
Winner: SVC-imbalanced classes more important
Moving code to
production:
• Classify only the first page
• Classify the document as an Image
• Took out spell corrector
• Model Persistence, pickling
• Py files to launch from command line
• AWS
Production Testing
Data:
• 1664 PDF files from 87 case files
Initial Findings:
• more classes: job offer, wage surveys, tax returns, org charts, W-2
Did Well:
• government forms, pay stubs and resumes
Did Moderate:
• Transcripts, Diplomas, Birth Certificates, Marriage Certificates
Did Poor:
• Passports
Preliminary Data Engineering
Architecture
Reporting Bucket
Case files, monitored for changes Temp Folder, Classify Files
https://www.eyemovic.com/blog_it/6749.php
https://www.conceptdraw.com
Boto, the S3 interface
Sends details of this bucket
Keep “Misclassified” Documents
Future Work:
• Retrain the model
• Apply statistics
• Add to Data Engineering Architecture
• Apply an Image classifier
Retrain Model:
• Use boxplot to identify outliers
of low probability that don’t
belong to that class
• only below *minimum*
• Convert all documents into
matrix form
• Use Dot product to find
document similarity (0,1)
• document similarity = hashvector
* hashvector.T
http://stackoverflow.com/questions/8897593/similarity-between-two-text-documents
Retrain Model(cont):
• Use heat map or correlation matrix to identify new
groups
Retrain Model(cont)
• Look at the file contents, give the documents a label, repeat to validate
Retrain Model(cont):
• Use statistics to identify sample size for class
• Bigger standard deviation, more samples(randomized) to
needed capture the variance
myweb.facstaff.wwu.edu
Final Thoughts
• No one else has this technology in
immigration
• Used all my skillsets
• Learned a whole new set of skills

Bridge_US

  • 1.
    IMMIGRATION SIMPLIFIED Classifying ImmigrationDocuments for Bridge US By Baolin Liu
  • 2.
    About Bridge US: •Immigration service provider that streamlines the US visa application process for employers. • Works with companies like Getaround, Allegiant, and American Family Insurance. • Over 50k immigration related documents managed.
  • 3.
    Motivation • Typical processincludes over 20 files • Identifying documents takes time • Files are misplaced • No immediate feedback
  • 4.
    Goals: • Accurately identifydocuments • Guide users to provide what’s missing • Real-time feedback • less intensive document oversight
  • 5.
  • 6.
    Data Collect and“Munging” • Text Extraction with Textract (OCR) • Regex commands • Stemmed the words • PySpark Before: After:
  • 7.
    Classification Algorithm • 2pathways: -Selectable text -Image text • Assigned probability http://cheaper-than-tuition.com/shop/fake-diploma/
  • 8.
    Modeling: SVC • LinearKernel • Handled imbalanced classes • Dealt with low samples. http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html
  • 9.
    Modeling: Random Forest http://blog.yhat.com/posts/random-forests-in-python.html •Repeated sampling, or “bootstrapping” • “Wisdom of the Crowd”
  • 10.
    Confusion Matrix SVC: 48misclassified Random Forest: 60 misclassified
  • 11.
  • 12.
    Cons for bothmodels: • SVC: False high confidence • Random Forest: Bad at low samples Winner: SVC-imbalanced classes more important
  • 13.
    Moving code to production: •Classify only the first page • Classify the document as an Image • Took out spell corrector • Model Persistence, pickling • Py files to launch from command line • AWS
  • 14.
    Production Testing Data: • 1664PDF files from 87 case files Initial Findings: • more classes: job offer, wage surveys, tax returns, org charts, W-2 Did Well: • government forms, pay stubs and resumes Did Moderate: • Transcripts, Diplomas, Birth Certificates, Marriage Certificates Did Poor: • Passports
  • 15.
    Preliminary Data Engineering Architecture ReportingBucket Case files, monitored for changes Temp Folder, Classify Files https://www.eyemovic.com/blog_it/6749.php https://www.conceptdraw.com Boto, the S3 interface Sends details of this bucket Keep “Misclassified” Documents
  • 16.
    Future Work: • Retrainthe model • Apply statistics • Add to Data Engineering Architecture • Apply an Image classifier
  • 17.
    Retrain Model: • Useboxplot to identify outliers of low probability that don’t belong to that class • only below *minimum* • Convert all documents into matrix form • Use Dot product to find document similarity (0,1) • document similarity = hashvector * hashvector.T http://stackoverflow.com/questions/8897593/similarity-between-two-text-documents
  • 18.
    Retrain Model(cont): • Useheat map or correlation matrix to identify new groups
  • 19.
    Retrain Model(cont) • Lookat the file contents, give the documents a label, repeat to validate
  • 20.
    Retrain Model(cont): • Usestatistics to identify sample size for class • Bigger standard deviation, more samples(randomized) to needed capture the variance myweb.facstaff.wwu.edu
  • 21.
    Final Thoughts • Noone else has this technology in immigration • Used all my skillsets • Learned a whole new set of skills

Editor's Notes

  • #4 won’t find out until a week later that the wrong file was submitted or something is missing Sometimes users don’t even care and start dropping files into buckets
  • #5 -act and move cases forward -help users place things in correct buckets, ESL students -marks instead of transcript -can get people to use the software, but doesn’t have to switch lawyers. Selling point.
  • #6 -had severe class imbalance issues -lack of information -some data was easier to obtain than others. -split columns
  • #7 -textract was the choice, all inclusive, pdf, jpg, doc, docx, etc -tied all the different packages under one roof -scrape out unicode with regex commands, don’t want that in the classifier -porter stemmer, snowball, lancaster was too aggressive -pyspark to distribute the workload, left my computer running overnight -AWS
  • #8 -selectable text fast -image use ocr
  • #10 -wanted to benefit from the wisdom of the crowd -repeat sampling
  • #12 -can’t build efficient trees
  • #13 Randomforest had a more natural probability SVC had to use distances away from the decision boundary Randomforest samples from nothing because of the low samples I chose SVC because not all samples will be big, need to use the entire sample. Also fast and accurate, production importance
  • #14 Cover page was most important Image was faster than pdf, exact same thing Pdf has more layers, image does not Used built in functions in python instead of using memory Picked a .41/hr ec2 instance, best price per speed Might do 2 pages
  • #15 Gov forms consistent, even with low samples Pay stubs after a while is continent Had huge sample for resume. Transcripts varied quit a bit Wasn’t able to OCR from Passports-image classifier
  • #16 Was not able to get lambda to work Download file classify, open csv, save send back to s3
  • #17 Strengthen existing classes Add new classes, at least twice maybe even 3 times as many classes as I have currently With enough data, write code to easily reclassify in a py file Preliminary model, seems to be working, but could add other bells and whistles to it , SQL database Image classifier with passports and sideways images, train a computer that this is a sideways diploma
  • #20 - (Vijeth Lomada used my code)
  • #22 Huge advantage is processing cases for immigration Drop a massive case folder and return results within minutes Big difference for a company that is processing hundreds of cases a week