How to Train Your Classifier: Create a Serverless Machine Learning System with AWS and Python
PyData ✤ November 27th, 2017 ✤ apmetadata@ap.org
Classification
Parrots
Sandwiches
Tags
Why do you want tags on your text content?
● Search, navigation, recommendations
● Aggregation, routing
● Discoverability
○ properties
○ relationships
Taxonomy
Jordan Larson
<http://cv.ap.org/id/9A7FD8FA87AD4A43BDD522B65147A808> ,
ap:associatedState <http://cv.ap.org/id/8083[Nebraska]43E>;
ap:displayLabel "Jordan Larson (Women's volleyball)"@en;
ap:hometown "Hooper, NE"@en;
ap:olympicTeam2016 <http://cv.ap.org/id/46[United States Olympic Team]B73H>;
ap:sport <http://cv.ap.org/id/DA[Volleyball]C8EA>;
dbprop:birthdate "1986-10-16"^^xsd:date;
dcterms:created "2012-07-11T14:30:26-04:00"^^xsd:dateTime;
dcterms:modified "2017-07-25T10:37:49-04:00"^^xsd:dateTime;
a <http://cv.ap.org/c/ProfessionalAthlete>, skos:Concept;
skos:broader <http://cv.ap.org/id/384[Professional Athlete]88>;
skos:definition "American volleyball player."@en;
skos:inScheme <http://cv.ap.org/a#person>;
skos:prefLabel "Jordan Larson"@en;
foaf:gender "Female"@en.
Applying taxonomy to text
Manually
Airlines
Industry
Pan American Airlines Co.
Travel
Applying taxonomy to text
Rules-based classifier
<Hurricane Harvey>
(AND,
  (MINOC_2,
    (SENT,
      (NOTIN,
        (OR,"Harvey_C","HARVEY_C"),
        (OR,"[Fullname female]","[Fullname male]","[Person]")),
      (OR,"texas","landfall","storm","hurricane","nws","National weather service",
          "evacuate@","surge@","flood@","rain@N","coastal","sandbag@N"...
      )
    )
  )...
https://www.flickr.com/photos/notionscapital/15556898221/
Applying taxonomy to text
Statistical classifier
Training data → Training engine → Trained model
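A minimal sketch of the training data, training engine, trained model flow. It uses a tiny multinomial naive Bayes written with the standard library only, as a stand-in for whatever engine you actually use (scikit-learn, for example). The sample documents, the tag names taken from the Parrots and Sandwiches slide, and all function names are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real systems use better tokenizers."""
    return text.lower().split()


def train(documents):
    """Training engine: turn (text, tag) pairs into a trained model (a dict)."""
    tag_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, tag in documents:
        tag_counts[tag] += 1
        for word in tokenize(text):
            word_counts[tag][word] += 1
            vocabulary.add(word)
    return {
        "tag_counts": tag_counts,
        "word_counts": word_counts,
        "vocabulary": vocabulary,
        "total_docs": len(documents),
    }


def classify(model, text):
    """Return the tag with the highest log-probability (add-one smoothing)."""
    vocab_size = len(model["vocabulary"])
    best_tag, best_score = None, float("-inf")
    for tag, doc_count in model["tag_counts"].items():
        score = math.log(doc_count / model["total_docs"])
        total_words = sum(model["word_counts"][tag].values())
        for word in tokenize(text):
            count = model["word_counts"][tag][word]
            score += math.log((count + 1) / (total_words + vocab_size))
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


# Hypothetical training data: (text, tag) pairs.
TRAINING_DATA = [
    ("the parrot squawked from its perch", "Parrots"),
    ("a green parrot mimics human speech", "Parrots"),
    ("toasted bread with ham and cheese", "Sandwiches"),
    ("a club sandwich with bacon and bread", "Sandwiches"),
]

model = train(TRAINING_DATA)
print(classify(model, "the parrot learned new speech"))
```

The trained model here is just a dictionary of counts, which makes it easy to see that "training" means summarizing tagged examples so that new text can be scored against each tag.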
AP Metadata Services
Tag with AP taxonomy
APMS Custom Tagging
Simple four step REST API
Add your own tags and taxonomy
Let’s create a classifier! For dragons
What if I like the AP Taxonomy
but I want to classify with some additional tags?
In this case, documents about dragons
A taxonomy of dragons
(borrowed from screencrush.com)
New documents about dragons
To be classified
A map (with some * )
A fully automated workflow for training and deploying a Lambda-based classifier
Sadly, the expression hic sunt dracones (here be dragons) is an anachronism, but it does appear at least once, on the Hunt-Lenox globe (ca 1510).
The Hunt-Lenox Globe (NYPL)
* Dragon emojis indicate problems found and (mostly) solved
[Architecture diagram with four columns (Workflow, Scaling, Worker, Classifier). Components: Client, Step Functions, Auto Scaling, EC2 worker (download training data, download dependencies, train model, deploy model), API Gateway, and Lambda (classifier.py, classifier.pkl, tags.json).]
Creating a classifier
A Lambda-based classifier
• AWS Lambda: run event-driven code without provisioning or managing a server or servers
  • Cost-efficient solution to ensure capacity meets demand
• What do we need?
  • Code to invoke classifier and return results to user
  • Code dependencies (e.g. scikit-learn)
  • Other supporting artifacts (the trained model, the taxonomy)
  • Permissions for Lambda function to interact with other AWS services
  • API endpoint for accessing Lambda function
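The first items on that list (handler code plus the model and taxonomy artifacts) might look roughly like the sketch below. It assumes an API Gateway proxy integration that delivers the request as a JSON string in `event["body"]`, a request field named `text`, a pickled model exposing a scikit-learn style `predict` method, and a `tags.json` that maps labels to tag records. Only the file names `classifier.pkl` and `tags.json` come from the slides; everything else is hypothetical.

```python
import json
import os
import pickle

# File names follow the slides; keeping them next to the handler is an assumption.
BASE_DIR = os.environ.get("LAMBDA_TASK_ROOT", os.getcwd())
MODEL_PATH = os.path.join(BASE_DIR, "classifier.pkl")
TAGS_PATH = os.path.join(BASE_DIR, "tags.json")

_cache = {}  # survives between invocations while the container stays warm


def load_artifacts():
    """Load the trained model and the taxonomy once per container."""
    if not _cache:
        with open(MODEL_PATH, "rb") as f:
            _cache["model"] = pickle.load(f)
        with open(TAGS_PATH) as f:
            _cache["tags"] = json.load(f)
    return _cache["model"], _cache["tags"]


def handler(event, context, artifacts=None):
    """Classify the posted text and return an API Gateway proxy response.

    `artifacts` is a hypothetical hook for tests; Lambda only passes
    `event` and `context`.
    """
    model, tags = artifacts or load_artifacts()
    try:
        text = json.loads(event.get("body") or "{}")["text"]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON body with a 'text' field"}
        return {"statusCode": 400, "body": json.dumps(error)}
    label = str(model.predict([text])[0])
    result = {"label": label, "tag": tags.get(label)}
    return {"statusCode": 200, "body": json.dumps(result)}
```

Permissions and the API endpoint are configuration rather than code, so they are not shown here.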
Processing user requests
Validate and train
Adding complexity: a workflow for algorithm selection
AWS Step Functions: use visual workflows to coordinate microservices into a single application
Triggers auto-scaling, sends training request to worker in the cloud.
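A validate-then-train workflow of this kind could be described in Amazon States Language along the following lines. This is only a sketch: the state names, the `$.valid` field, and the Lambda ARNs are hypothetical, and the talk's actual state machine (including its algorithm-selection branch) is not shown in the slides.

```python
import json

# Hypothetical ARNs: replace with the Lambda functions in your own account.
VALIDATE_ARN = "arn:aws:lambda:us-east-1:123456789012:function:validate-request"
TRAIN_ARN = "arn:aws:lambda:us-east-1:123456789012:function:start-training"


def build_state_machine(validate_arn, train_arn):
    """Return an Amazon States Language definition as a Python dict."""
    return {
        "Comment": "Validate a training request, then hand it to a worker",
        "StartAt": "ValidateRequest",
        "States": {
            "ValidateRequest": {
                "Type": "Task",
                "Resource": validate_arn,
                "Next": "IsValid",
            },
            "IsValid": {
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": "$.valid",
                        "BooleanEquals": True,
                        "Next": "StartTraining",
                    }
                ],
                "Default": "RejectRequest",
            },
            "StartTraining": {"Type": "Task", "Resource": train_arn, "End": True},
            "RejectRequest": {
                "Type": "Fail",
                "Error": "InvalidRequest",
                "Cause": "Training request failed validation",
            },
        },
    }


definition = json.dumps(build_state_machine(VALIDATE_ARN, TRAIN_ARN), indent=2)
print(definition)
```

The resulting JSON string is the kind of value accepted as the `definition` argument when creating a state machine, for example through the boto3 Step Functions client.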
Training and deploying
Training in the cloud
• AWS EC2: scalable computing capacity in the cloud
• Register an Amazon Machine Image (AMI) specifically for training
  • Speeds up provisioning your server
  • Ensures versions match between dependencies and your model
• Prepare dependencies ahead of time to beat AWS Lambda’s size limits
  • If you are using scikit-learn, sklearn-build-lambda can generate an appropriately sized zip
• Save model and taxonomy to disk, add to dependency zip
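The last bullet (save the model and taxonomy to disk, then add them to the dependency zip) can be sketched with the standard library alone. The function names and the `work_dir` parameter are hypothetical; the file names `classifier.pkl` and `tags.json` follow the architecture diagram. The size helper is included because the uncompressed package size is what Lambda's limit applies to.

```python
import json
import os
import pickle
import zipfile


def package_classifier(model, taxonomy, zip_path, work_dir="."):
    """Save the model and taxonomy to disk, then append both to the zip."""
    model_path = os.path.join(work_dir, "classifier.pkl")
    tags_path = os.path.join(work_dir, "tags.json")
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(tags_path, "w") as f:
        json.dump(taxonomy, f)
    # Mode "a" appends to an existing dependency zip (or creates a new one).
    with zipfile.ZipFile(zip_path, "a", zipfile.ZIP_DEFLATED) as archive:
        archive.write(model_path, arcname="classifier.pkl")
        archive.write(tags_path, arcname="tags.json")
    return zip_path


def uncompressed_size(zip_path):
    """Total uncompressed size in bytes of everything in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(info.file_size for info in archive.infolist())
```

Calling `uncompressed_size` on the finished zip before deploying gives an early warning if the package is about to exceed the limit.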
Automating deployments
• Serverless Framework: Node.js application for rapid deployment of serverless architectures
• Simplifies the task of creating (and deleting) our classifier Lambdas
  • Provider agnostic, though you may not be
  • Zip artifact support for Lambda creation
Classifying with AWS Lambda
• Be mindful of cold starts
  • Allocating more memory may help
• Store large models in S3 and take advantage of container reuse
  • Download assets to /tmp
  • Check /tmp for cached data before invocation

Item                                Limit
Deployment package (compressed)     50MB
Deployment package (uncompressed)   250MB
Non-persistent disk space in /tmp   500MB
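One way to follow the container-reuse advice is to check an in-memory cache first, then `/tmp`, and only then download from S3. The sketch below assumes the model is a pickle stored under a bucket and key of your choosing; the function names and the injectable `download` parameter are hypothetical conveniences, while `download_file` is the standard boto3 S3 client call.

```python
import os
import pickle

CACHE_PATH = "/tmp/classifier.pkl"  # /tmp persists while the container is reused
_models = {}  # in-memory cache, kept across warm invocations


def download_from_s3(bucket, key, destination):
    """Fetch the model with boto3 (imported lazily, only on a cache miss)."""
    import boto3

    boto3.client("s3").download_file(bucket, key, destination)


def get_model(bucket, key, cache_path=CACHE_PATH, download=download_from_s3):
    """Return the model, downloading it only when no cached copy exists."""
    if cache_path not in _models:
        if not os.path.exists(cache_path):
            download(bucket, key, cache_path)
        with open(cache_path, "rb") as f:
            _models[cache_path] = pickle.load(f)
    return _models[cache_path]
```

On a cold start this pays for one download and one unpickle; warm invocations return the cached object immediately.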
How do I measure results?
Confusion matrix

                 Predicted  Predicted  Predicted  Sum of items
                 Eagles     Doves      Pigeons    = 300
Actual Eagles    95         3          2          100 Eagles
Actual Doves     3          72         25         100 Doves
Actual Pigeons   2          23         75         100 Pigeons
How do I measure results?
Measure your model’s performance per class
• Precision (number of correct positive predictions for a class divided by the total number of predictions of that class)
• Recall (number of correct positive predictions for a class divided by the total number of actual members of that class)
Model accuracy (number of correct predictions divided by the total number in the dataset):
242 / 300 ≈ 81%
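The per-class figures can be worked out directly from the Eagles, Doves, and Pigeons confusion matrix. The sketch below copies the counts from the slide (rows are actual classes, columns are predicted classes); the function names are illustrative.

```python
LABELS = ["Eagles", "Doves", "Pigeons"]

# Rows are actual classes, columns are predicted classes (values from the slide).
MATRIX = [
    [95, 3, 2],
    [3, 72, 25],
    [2, 23, 75],
]


def per_class_metrics(matrix, labels):
    """Compute precision and recall for each class from a confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])  # row total
        metrics[label] = {
            "precision": true_positives / predicted,
            "recall": true_positives / actual,
        }
    return metrics


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all items."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


for label, values in per_class_metrics(MATRIX, LABELS).items():
    print(f"{label}: precision={values['precision']:.3f} recall={values['recall']:.3f}")
print(f"accuracy={accuracy(MATRIX):.3f}")
```

For Doves, precision is 72 / 98 and recall is 72 / 100, which shows how the Doves and Pigeons confusion drags both classes down while Eagles stay at 95 / 100 on both measures.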
How do I improve results?
Training data
• Correctly tagged - quality matters
• Quantity matters too - as long as it’s ‘good’ data!
• Balanced training sets across classes
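A quick way to check the balance point before training is to count examples per tag. This is a rough sketch: the `tolerance` threshold is an arbitrary assumption, not a recommendation from the talk, and the function names are hypothetical.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of the training set."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def underrepresented(labels, tolerance=0.5):
    """Return classes with fewer than tolerance * (largest class count) examples."""
    counts = Counter(labels)
    largest = max(counts.values())
    return sorted(label for label, count in counts.items() if count < tolerance * largest)
```

Classes flagged this way are candidates for collecting more correctly tagged examples, in keeping with the quality-first advice above.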
How do I improve results?
Taxonomy
• Clean taxonomy nodes and structure
• Distinct semantics, use relationships
• Avoid overlapping concepts between nodes
Thank You!
dfox@ap.org
smyles@ap.org
vzielinska@ap.org
Learn more about AP Metadata Services
https://developer.ap.org/ap-metadata-services
