How to Train Your Classifier: Create a Serverless Machine Learning System with AWS and Python

How to train a custom tagger to classify text using scikit-learn, with practical tuning advice to get more accurate results. How to create a REST API to train and host your tagger using AWS services including Lambda, API Gateway and Step Functions. Tips on how to overcome limitations in AWS and scikit-learn when creating your own custom tagger.

Presented at PyData NYC 2017 by Stuart Myles, Veronika Zielinska and David Fox
https://pydata.org/nyc2017/schedule/presentation/21/

Published in: Technology

  1. 1. How to Train Your Classifier: Create a Serverless Machine Learning System with AWS and Python PyData ✤ November 27th, 2017 ✤ apmetadata@ap.org
  2. 2. Classification: Parrots, Sandwiches
  3. 3. Tags: Why do you want tags on your text content?
       ● Search, navigation, recommendations
       ● Aggregation, routing
       ● Discoverability
         ○ properties
         ○ relationships
  4. 4. Taxonomy
  5. 5. Taxonomy: Jordan Larson
       <http://cv.ap.org/id/9A7FD8FA87AD4A43BDD522B65147A808> ,
         ap:associatedState <http://cv.ap.org/id/8083[Nebraska]43E>;
         ap:displayLabel "Jordan Larson (Women's volleyball)"@en;
         ap:hometown "Hooper, NE"@en;
         ap:olympicTeam2016 <http://cv.ap.org/id/46[United States Olympic Team]B73H>;
         ap:sport <http://cv.ap.org/id/DA[Volleyball]C8EA>;
         dbprop:birthdate "1986-10-16"^^xsd:date;
         dcterms:created "2012-07-11T14:30:26-04:00"^^xsd:dateTime;
         dcterms:modified "2017-07-25T10:37:49-04:00"^^xsd:dateTime;
         a <http://cv.ap.org/c/ProfessionalAthlete>, skos:Concept;
         skos:broader <http://cv.ap.org/id/384[Professional Athlete]88>;
         skos:definition "American volleyball player."@en;
         skos:inScheme <http://cv.ap.org/a#person>;
         skos:prefLabel "Jordan Larson"@en;
         foaf:gender "Female"@en.
  6. 6. Applying taxonomy to text: Manually. Airlines Industry Pan American Airlines Co. Travel
  7. 7. Applying taxonomy to text: Rules-based classifier
       <Hurricane Harvey> (AND, (MINOC_2, (SENT, (NOTIN, (OR,"Harvey_C","HARVEY_C"), (OR,"[Fullname female]","[Fullname male]","[Person]")), (OR,"texas","landfall","storm", "hurricane","nws","National weather service","evacuate@","surge@","flood@", "rain@N","coastal","sandbag@N"... ) ) )...
       https://www.flickr.com/photos/notionscapital/15556898221/
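
The rule on this slide is written in a rule language whose operators the deck does not define. One plausible reading, offered only as an assumption, is: count sentences (SENT) that mention "Harvey" outside a person's name (NOTIN) together with a storm-related term, and require at least two such sentences (MINOC_2). The Python sketch below illustrates that reading; the function name, the person-name pattern and the treatment of "@" as a word-stem wildcard are all hypothetical simplifications, not the rule engine's actual behaviour.

```python
import re

# Storm-related stems taken from the slide's term list; matching on stems
# is an assumption about what the "@" suffix means.
WEATHER = re.compile(
    r"\b(?:texas|landfall|storm|hurricane|nws|national weather service|"
    r"evacuat|surge|flood|rain|coastal|sandbag)\w*",
    re.IGNORECASE,
)
HARVEY = re.compile(r"\b(?:Harvey|HARVEY)\b")
# Hypothetical stand-in for the rule's person-name lists:
# "Harvey" followed by a capitalised surname.
PERSON = re.compile(r"\bHarvey\s+[A-Z][a-z]+")


def matches_hurricane_harvey(text, min_sentences=2):
    """Return True if enough sentences mention Harvey (not a person) plus a storm term."""
    hits = 0
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if not HARVEY.search(sentence):
            continue
        if PERSON.search(sentence):
            continue  # looks like a person's name, so the NOTIN clause excludes it
        if WEATHER.search(sentence):
            hits += 1
    return hits >= min_sentences
```

Hand-written rules like this are transparent but brittle, which is part of the motivation for the statistical approach on the next slide.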
  8. 8. Applying taxonomy to text: Statistical classifier. Training data → Training engine → Trained model
  9. 9. AP Metadata Services: Tag with AP taxonomy. APMS Custom Tagging: Simple four-step REST API. Add your own tags and taxonomy
  10. 10. Let’s create a classifier! For dragons. What if I like the AP Taxonomy but I want to classify with some additional tags? In this case, documents about dragons
  11. 11. A taxonomy of dragons (borrowed from screencrush.com). New documents about dragons, to be classified
  12. 12. A map (with some *): A fully automated workflow for training and deploying a Lambda-based classifier
       Sadly, the expression hic sunt dracones (here be dragons) is an anachronism, but it does appear at least once, on the Hunt-Lenox Globe (ca. 1510). Image: The Hunt-Lenox Globe (NYPL)
       * Dragon emojis indicate problems found and (mostly) solved
  13. 13. Creating a classifier (architecture diagram)
       Client
       Workflow: Step Functions
       Scaling: EC2 Auto Scaling
       Worker: EC2 (Download training data, Download dependencies, Train model, Deploy model)
       Classifier: API Gateway, Lambda (classifier.py, classifier.pkl, tags.json)
  14. 14. A Lambda-based classifier
       • AWS Lambda: run event-driven code without provisioning or managing servers
         • Cost-efficient solution to ensure capacity meets demand
       • What do we need?
         • Code to invoke classifier and return results to user
         • Code dependencies (e.g. scikit-learn)
         • Other supporting artifacts (the trained model, the taxonomy)
         • Permissions for Lambda function to interact with other AWS services
         • API endpoint for accessing Lambda function
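
The first item in that list, "code to invoke classifier and return results to user", might look roughly like the sketch below. It assumes an API Gateway proxy-style event with a JSON body containing a `text` field; the `KeywordModel` class is a hypothetical stand-in for the unpickled scikit-learn model, and the handler name and response shape are illustrative rather than the presenters' actual code.

```python
import json


class KeywordModel:
    """Hypothetical stand-in for the trained model loaded from classifier.pkl."""

    def predict(self, texts):
        return ["dragons" if "dragon" in t.lower() else "other" for t in texts]


# Created once per container, outside the handler, so warm invocations reuse it.
MODEL = KeywordModel()


def handler(event, context):
    """Classify the 'text' field of a JSON request body and return the tags."""
    try:
        body = json.loads(event.get("body") or "{}")
        text = body["text"]
    except (ValueError, KeyError, TypeError):
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "expected a JSON body with a 'text' field"}),
        }
    tags = MODEL.predict([text])
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"tags": list(tags)}),
    }
```

Loading the model at module level rather than inside the handler means the cost is paid once per container instead of once per request.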
  15. 15. Processing user requests (same architecture diagram as slide 13)
  16. 16. Processing user requests: Validate and train
       • AWS Step Functions: use visual workflows to coordinate microservices into a single application
       • Triggers auto-scaling, sends training request to worker in the cloud
       • Adding complexity: a workflow for algorithm selection
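
A validate-then-train workflow of this kind could be expressed in the Amazon States Language along the following lines. This is a minimal sketch built as a Python dictionary; the state names, the `$.valid` field and the ARNs passed in are hypothetical, and the deck does not show the presenters' real state machine.

```python
import json


def build_definition(validate_arn, train_arn):
    """Return a state machine definition: validate the request, then start training."""
    return {
        "Comment": "Validate a training request, then hand it to a worker",
        "StartAt": "ValidateRequest",
        "States": {
            "ValidateRequest": {
                "Type": "Task",
                "Resource": validate_arn,  # hypothetical Lambda that sets $.valid
                "Next": "IsValid",
            },
            "IsValid": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": "$.valid", "BooleanEquals": True, "Next": "StartTraining"}
                ],
                "Default": "Rejected",
            },
            "StartTraining": {
                "Type": "Task",
                "Resource": train_arn,  # hypothetical Lambda that triggers scaling and training
                "End": True,
            },
            "Rejected": {
                "Type": "Fail",
                "Error": "InvalidRequest",
                "Cause": "Training request failed validation",
            },
        },
    }


def to_json(definition):
    """Serialise the definition to the JSON text a state machine expects."""
    return json.dumps(definition, indent=2)
```

Keeping the definition in code makes it easy to add further branches, such as the algorithm-selection step mentioned on the slide.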
  17. 17. Training and deploying (same architecture diagram as slide 13)
  18. 18. Training in the cloud
       • AWS EC2: scalable computing capacity in the cloud
       • Register an Amazon Machine Image (AMI) specifically for training
         • Speeds up provisioning your server
         • Ensures versions match between dependencies and your model
       • Prepare dependencies ahead of time to beat AWS Lambda’s size limits
         • If you are using scikit-learn, sklearn-build-lambda can generate an appropriately sized zip
       • Save model and taxonomy to disk, add to dependency zip
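
The last step, adding the saved model and taxonomy to the dependency zip, can be sketched with the Python standard library as below. The file names `classifier.pkl` and `tags.json` come from the architecture diagram; the function name, the use of `pickle` for the model and the size check against the 50MB compressed limit quoted on slide 21 are assumptions made for illustration.

```python
import json
import os
import pickle
import zipfile

# Compressed deployment package limit as quoted in the talk (slide 21).
MAX_COMPRESSED_BYTES = 50 * 1024 * 1024


def add_artifacts(zip_path, model, tags):
    """Append the pickled model and the taxonomy to an existing dependency zip."""
    with zipfile.ZipFile(zip_path, "a", zipfile.ZIP_DEFLATED) as bundle:
        bundle.writestr("classifier.pkl", pickle.dumps(model))
        bundle.writestr("tags.json", json.dumps(tags))
    size = os.path.getsize(zip_path)
    if size > MAX_COMPRESSED_BYTES:
        raise ValueError("bundle is %d bytes, over the compressed package limit" % size)
    return size
```

Pickled scikit-learn models should be loaded with the same library versions used to train them, which is one reason the slide recommends a dedicated training image.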
  19. 19. Automating deployments
       • Serverless Framework: Node.js application for rapid deployment of serverless architectures
       • Simplifies the task of creating (and deleting) our classifier Lambdas
         • Provider agnostic, though you may not be
         • Zip artifact support for Lambda creation
  20. 20. Classifying with AWS Lambda (same architecture diagram as slide 13)
  21. 21. Classifying with AWS Lambda
       • Be mindful of cold starts
         • Allocating more memory may help
       • Store large models in S3 and take advantage of container reuse
         • Download assets to /tmp
         • Check /tmp for cached data before invocation

       AWS Lambda limits (as presented in this 2017 talk):
       Item                                   Limit
       Deployment package (compressed)        50MB
       Deployment package (uncompressed)      250MB
       Non-persistent disk space in /tmp      500MB
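
The "check /tmp before downloading" advice can be sketched as a small helper. Here `fetch` is a hypothetical callable that writes the asset to the given path; in practice it might wrap an S3 download, but that wiring is an assumption and is left out so the example stays self-contained.

```python
import os


def cached_path(name, fetch, cache_dir="/tmp"):
    """Return the local path of an asset, downloading it only if it is not cached.

    `fetch(path)` is a hypothetical callable that writes the asset to `path`,
    for example a thin wrapper around an S3 download.
    """
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        fetch(path)  # cold container: download once
    return path  # warm container: reuse the file left by an earlier invocation
```

Because a warm container keeps its /tmp contents between invocations, only the first request after a cold start pays the download cost.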
  22. 22. How do I measure results? Confusion matrix
                         Predicted Eagles   Predicted Doves   Predicted Pigeons   Total
       Actual Eagles                   95                 3                   2     100
       Actual Doves                     3                72                  25     100
       Actual Pigeons                   2                23                  75     100
       Sum of items = 300
  23. 23. How do I measure results? Measure your model’s performance per class
       • Precision (number of correct predictions for a class divided by the total number of items predicted as that class)
       • Recall (number of correct predictions for a class divided by the total number of items actually in that class)
                         Predicted Eagles   Predicted Doves   Predicted Pigeons   Total
       Actual Eagles                   95                 3                   2     100
       Actual Doves                     3                72                  25     100
       Actual Pigeons                   2                23                  75     100
       Sum of items = 300
       Model accuracy: 242 / 300 ≈ 81%
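
The per-class figures can be recomputed directly from the confusion matrix above. In this sketch, rows are actual classes and columns are predicted classes; precision for a class is its diagonal cell divided by its column sum, and recall is the diagonal cell divided by its row sum. The function names are illustrative.

```python
LABELS = ["Eagles", "Doves", "Pigeons"]
# Rows are actual classes, columns are predicted classes (values from the slide).
MATRIX = [
    [95, 3, 2],
    [3, 72, 25],
    [2, 23, 75],
]


def per_class_metrics(matrix, labels):
    """Return precision and recall for each class."""
    metrics = {}
    for i, label in enumerate(labels):
        correct = matrix[i][i]
        predicted_as_class = sum(row[i] for row in matrix)  # column sum
        actually_in_class = sum(matrix[i])  # row sum
        metrics[label] = {
            "precision": correct / predicted_as_class,
            "recall": correct / actually_in_class,
        }
    return metrics


def accuracy(matrix):
    """Return the share of all items that were predicted correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total
```

For the Doves class this gives a precision of 72 / 98 and a recall of 72 / 100, which shows how the confusion with Pigeons pulls both numbers down even though overall accuracy is 242 / 300.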
  24. 24. How do I improve results? Training data
       • Correctly tagged - quality matters
       • Quantity matters too - as long as it’s ‘good’ data!
       • Balanced training sets across classes
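
The slide does not say how the training sets were balanced. One simple option, shown here purely as an illustration, is to downsample every class to the size of the smallest one. The function name and the `(text, label)` tuple format are assumptions.

```python
import random
from collections import defaultdict


def downsample_to_balance(examples, seed=0):
    """Return a subset of (text, label) pairs with the same number of items per label."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    smallest = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, smallest))
    return balanced
```

Downsampling discards data, so when examples are scarce it may be preferable to collect more documents for the under-represented classes instead.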
  25. 25. How do I improve results? Taxonomy
       • Clean taxonomy nodes and structure
       • Distinct semantics, use relationships
       • Avoid overlapping concepts between nodes
  26. 26. Thank You! dfox@ap.org smyles@ap.org vzielinska@ap.org apmetadata@ap.org
       Learn more about AP Metadata Services: https://developer.ap.org/ap-metadata-services
