
Amazon SageMaker: Infinitely Scalable Machine Learning Algorithms



Edo Liberty, Director of Research at AWS, head of Amazon AI Labs



  1. Amazon SageMaker Algorithms. Edo Liberty, Director of Amazon AI Labs. Zohar Karnin, Bing Xiang, Baris Coskun, Ramesh Nallapati, Phillip Gautier, Madhav Jha, Ran Ding, Tim Januschowski, David Salinas, Bernie Wang, Jan Gasthaus, Laurence Rouesnel, Amir Sadoughi, Piali Das, Julio Delgado Mangas, Yury Astashonok, Can Balioglu, Saswata Chakravarty, and Alex Smola
  2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is Amazon SageMaker? Exploration, Training, Hosting
  3. Machine Learning
  4. Large Scale Machine Learning
  5. Large Scale Machine Learning
  6. Our customers use ML at a massive scale! “We collect 160M events daily in the ML pipeline and run training over the last 15 days, and need it to complete in one hour. Effectively there are 100M features in the model.” (Valentino Volonghi, CTO) “We process 3 million ad requests a second, 100,000 features per request. That’s 250 trillion per day. Not your run-of-the-mill data science problem!” (Bill Simmons, CTO) “Our data warehouse is 100TB and we are processing 2TB daily. We're running mostly gradient boosting (trees), LDA, K-Means clustering, and collaborative filtering.” (Shahar Cizer Kobrinsky, VP Architecture)
  7. Scalable Training Challenges • Competency • Handoff • Production Readiness • Model Selection • Model Freshness • Ephemeral Data • Pause/Resume • Incremental Training • Stability • Predictability • Elasticity • Cost • Time • Accuracy • Scale • Data Access
  8. Cost vs. Time [chart: cost ($ to $$$$) vs. training time (minutes to months) for a single machine]
  9. Cost vs. Time [chart adds the ideal case: low cost, short time]
  10. Cost vs. Time [chart adds distributed training with strong machines]
  11. Cost vs. Time [same chart]
  12. Model Selection [diagram]
  13. Incremental Training [diagram]
  14. Production Readiness [chart: required effort grows with data/model size, crossing the acceptable effort into an infeasible region]
  15. Architecture and Design Choices
  16. Streaming [diagram: a streaming algorithm maintaining a small state]
  17. Stability + Predictability [charts: memory vs. data size; time/cost vs. data size]
  18. Incremental Training [diagram]
  19. Cost vs. Time [diagram: a single GPU with its state]
  20. Cost vs. Time [diagram: several GPUs, each with its own state]
  21. Cost vs. Time [diagram: GPUs with local state synchronizing through a shared state]
  22. Cost vs. Time [chart: single machine, distributed with strong machines, and the ideal case]
  23. Production Readiness + Handoff [stack: SageMaker training container management; optimized machine learning base container; SageMaker algorithms SDK; algorithm logic]
  24. Production Readiness + Handoff [chart: required effort crosses acceptable effort, leaving an infeasible region]
  25. Production Readiness + Handoff [same chart, with no infeasible region]
  26. Streaming Machine Learning: A Scientific Challenge
  27. Streaming Median Example. Frugal Streaming for Estimating Quantiles: One (or two) memory suffices. Qiang Ma, S. Muthukrishnan, Mark Sandler
  28. Streaming Median Example
  29. Streaming Median Example: sampling and sketching. Optimal Quantile Approximation in Streams. Zohar Karnin, Kevin Lang, Edo Liberty
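The one-counter estimator from the Frugal Streaming paper cited above fits in a few lines. This is an illustrative Python rendition of that published algorithm, not SageMaker code:

```python
import random

def frugal_median(stream):
    """Frugal-1U median sketch (Ma, Muthukrishnan, Sandler): keep a single
    running estimate and nudge it one unit toward each arriving sample.
    Memory stays O(1) no matter how long the stream is; the price is slow,
    random-walk-style convergence toward the true median."""
    m = 0.0
    for x in stream:
        if x > m:
            m += 1.0
        elif x < m:
            m -= 1.0
    return m

random.seed(7)
samples = (random.gauss(100.0, 10.0) for _ in range(200_000))
estimate = frugal_median(samples)  # drifts to, then oscillates near, 100
```

The same trick estimates any quantile q by stepping up with probability q and down with probability 1-q; the sketching approach on the next slide trades that single counter for provably optimal accuracy.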
  30. Amazon SageMaker Algorithms
  31. Linear Learner. Regression: estimate a real-valued function. Binary classification: predict a 0/1 class.
  32. Linear Learner. Train, fit thresholds, and select the model with the best validation performance. More than 8x speedup over naïve parallel training!
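The "fit thresholds and select" step can be illustrated with a small sketch: train the linear model once, then sweep many cheap thresholds over held-out predictions and keep the best. The candidate grid and the F1 criterion here are illustrative assumptions, not SageMaker internals:

```python
import numpy as np

def best_threshold(scores, labels):
    """Pick the classification threshold with the best validation F1.
    Each threshold costs only one pass over the validation predictions,
    which is what makes evaluating many candidates so cheap."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.linspace(0.0, 1.0, 101):   # illustrative candidate grid
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        f1 = 2.0 * tp / (2.0 * tp + fp + fn) if tp > 0 else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

t, f1 = best_threshold(np.array([0.1, 0.4, 0.6, 0.9]), np.array([0, 0, 1, 1]))
# any threshold between 0.4 and 0.6 separates these toy classes perfectly
```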
  33. Linear Learner. Regression (mean squared error), SageMaker vs. other: 1.02 / 1.06, 1.09 / 1.02, 0.332 / 0.183, 0.086 / 0.129, 83.3 / 84.5. Classification (F1 score), SageMaker vs. other: 0.980 / 0.981, 0.870 / 0.930, 0.997 / 0.997, 0.978 / 0.964, 0.914 / 0.859, 0.470 / 0.472, 0.903 / 0.908, 0.508 / 0.508. 30 GB datasets for web-spam and web-url classification. [chart: cost in dollars vs. billable time in minutes for sagemaker-url, sagemaker-spam, other-url, other-spam]
  34. Factorization Machines. Click prediction on a 1 TB advertising dataset, m4.4xlarge machines, perfect scaling. Log loss / F1 score / seconds: SageMaker 0.494 / 0.277 / 820; other (10 iterations) 0.516 / 0.190 / 650; other (20 iterations) 0.507 / 0.254 / 1300; other (50 iterations) 0.481 / 0.313 / 3250. [chart: cost in dollars vs. billable time in hours for 10, 20, 30, 40, and 50 machines]
  35. K-Means Clustering
  36. K-Means Clustering. Method / accurate? / passes / efficient tuning / comments: Lloyd's [1]: yes* / 5-10 / no. K-Means++ [2]: yes / k+5 to k+10 / no / scikit-learn. K-Means|| [3]: yes / 7-12 / no / spark.ml. Online [4]: no / 1 / no. Streaming [5,6]: no / 1 / no / impractical. Webscale [7]: no / 1 / no / spark streaming. Coresets [8]: no / 1 / yes / impractical. SageMaker: yes / 1 / yes. [1] Lloyd, IEEE TIT, 1982. [2] Arthur et al., ACM-SIAM, 2007. [3] Bahmani et al., VLDB, 2012. [4] Liberty et al., 2015. [5] Shindler et al., NIPS, 2011. [6] Guha et al., IEEE Trans. Knowl. Data Eng., 2003. [7] Sculley, WWW, 2010. [8] Feldman et al.
  37. K-Means Clustering. k / SageMaker / other: Text (1.2GB): 10: 1.18E3 / 1.18E3; 100: 1.00E3 / 9.77E2; 500: 9.18E2 / 9.03E2. Images (9GB): 10: 3.29E2 / 3.28E2; 100: 2.72E2 / 2.71E2; 500: 2.17E2 / failed. Videos (27GB): 10: 2.19E2 / 2.18E2; 100: 2.03E2 / 2.02E2; 500: 1.86E2 / 1.85E2. Advertising (127GB): 10: 1.72E7 / failed; 100: 1.30E7 / failed; 500: 1.03E7 / failed. Synthetic (1100GB): 10: 3.81E7 / failed; 100: 3.51E7 / failed; 500: 2.81E7 / failed. Running time vs. number of clusters: ~10x faster! [chart: billable time in minutes vs. number of clusters (10, 100, 500), sagemaker vs. other]
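For contrast with the single-pass column in the table above, a one-pass baseline in the online/web-scale style [4,7] takes only a few lines. This sketch is purely illustrative and is not the SageMaker algorithm:

```python
import numpy as np

def online_kmeans(stream, k):
    """One-pass k-means: seed the centers with the first k points, then
    move each arriving point's nearest center toward it with a 1/count
    step, so every center is the running mean of the points it absorbed."""
    centers, counts = [], []
    for x in stream:
        x = np.asarray(x, dtype=float)
        if len(centers) < k:
            centers.append(x.copy())   # seed a new center
            counts.append(1)
            continue
        i = int(np.argmin([np.sum((c - x) ** 2) for c in centers]))
        counts[i] += 1
        centers[i] += (x - centers[i]) / counts[i]
    return np.array(centers)

# Two well-separated 1-D clusters; the centers land near 0 and 10.
data = [[0.1], [10.2], [-0.1], [9.9], [0.0], [10.1], [0.2], [9.8]]
centers = online_kmeans(iter(data), k=2)
```

As the table notes, such online variants stay within one pass but sacrifice accuracy guarantees, which is exactly the trade-off the SageMaker row claims to avoid.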
  38. Principal Component Analysis (PCA)
  39. Principal Component Analysis (PCA): more than 10x faster at a fraction of the cost! [charts: cost in dollars vs. billable time in minutes, and throughput (MB/sec/machine) vs. number of machines (8, 10, 20), for other, sagemaker-deterministic, and sagemaker-randomized]
  40. Time Series Forecasting. One hour on a p2.xlarge, $1. Mean absolute percentage error (DeepAR / R) and P90 loss (DeepAR / R): traffic (hourly occupancy rate of 963 Bay Area freeways): 0.14 / 0.27 and 0.13 / 0.24. electricity (electricity use of 370 homes over time): 0.07 / 0.11 and 0.08 / 0.09. pageviews (page view hits of websites), 10k: 0.32 / 0.32 and 0.44 / 0.31; 180k: 0.32 / 0.34 and 0.29 / NA. [diagram: input network]
  41. Topic Modeling: learning topics in a large document corpus. What are topic models? • Unsupervised ML algorithm • A topic is a distribution over the words in a vocabulary • Topics are represented by the 10 most likely words in the distribution • The words in a document are drawn from a mixture of topics • Documents can be soft-tagged with topics. Use cases: • Discovering the topics in a corpus automatically • Indexing documents by topic • Searching for similar documents
  42. SageMaker Neural Topic Model (NTM) • Based on variational autoencoders • Encoder network q(z|x): bag of words → latent variables z • Decoder network p(x|z): latent variables z → word distribution • The latent variables z represent the topic distribution for the document. [charts: perplexity on 20NG data (lower is better) and topic coherence (NPMI) on 20NG data (higher is better) for LDA-Mean Field, LDA-Gibbs, NVDM, GSM, ProdLDA, and NTM] NTM offers a good balance between perplexity and topic coherence.
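The encoder/decoder pipeline on this slide can be sketched as a forward pass with random, untrained weights. All layer sizes and the tanh/softmax choices below are illustrative assumptions, not NTM's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden, topics = 2000, 128, 20           # illustrative sizes

W_h = rng.normal(0.0, 0.01, (vocab, hidden))    # encoder q(z|x) weights
W_mu = rng.normal(0.0, 0.01, (hidden, topics))
W_lv = rng.normal(0.0, 0.01, (hidden, topics))
W_dec = rng.normal(0.0, 0.01, (topics, vocab))  # decoder p(x|z) weights

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def forward(bow):
    """Encode a bag-of-words vector to latent z, then decode z back to a
    distribution over the vocabulary (the reconstruction target)."""
    h = np.tanh(bow @ W_h)                        # encoder hidden layer
    mu, logvar = h @ W_mu, h @ W_lv               # Gaussian parameters of q(z|x)
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=topics)  # reparameterization
    topic_mix = softmax(z)                        # document's topic distribution
    return topic_mix, softmax(topic_mix @ W_dec)  # word distribution p(x|z)

bow = np.zeros(vocab)
bow[[3, 17, 256]] = [2.0, 1.0, 1.0]               # toy word counts
topic_mix, p_words = forward(bow)
```

Training would maximize the usual evidence lower bound (reconstruction likelihood minus a KL term); the reparameterization step is what lets gradients flow through the sampling of z.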
  43. NTM: representative topics. Top words from topics in the 20 Newsgroups data (human-assigned topic labels): Religion: jesus, scripture, christian, religion, belief, islam, god, christianity, atheism, christ. Sports: scoring, team, season, playoff, win, scorer, detroit, game, league, nhl. Computer Hardware: ide, scsi, controller, scsi-2, drive, simms, scsi-1, isa, motherboard, floppy. Computer Security: encryption, escrow, encrypted, rsa, crypto, secure, algorithm, key, nsa, clipper. Mechanics: tire, noise, rear, engine, lock, brake, inch, radar, mile, detector. WikiText-103 dataset: Navy: admiral, fleet, cruiser, hm, austro, battleship, dreadnought, ship, battlecruisers, squadron. Biology: protein, genetic, enzyme, dna, gene, disease, rna, molecule, organism, bacteria. Games: enix, video, remix, xbox, remixes, d, remixed, downloads, playstation, nintendo. Films: film, filming, script, animated, filmed, animation, episode, screenplay, disney, movie. Music: liner, recording, guitarist, beatles, orchestra, musician, opera, studio, band, concert
  44. Object2Vec: learning embeddings of high-dimensional objects • Learns embeddings of entity pairs: token pairs, sequence pairs, and token-sequence pairs • Preserves the semantic relationship between the entities of each pair in the embedding space • The learned embeddings can be used for nearest neighbor search, for clustering and visualization, and as features in downstream tasks. [architecture: left input → left encoder and right input → right encoder, both feeding a comparator that produces the label] Inputs can be tokens or sequences of tokens. Encoders can be layers of pooled embeddings/CNNs/RNNs, and left and right can be asymmetric. The comparator is a combination of Hadamard product, absolute difference, and concatenation, followed by a feed-forward network. The model can be trained with cross-entropy or MSE loss.
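The left-encoder/right-encoder/comparator layout on this slide can be sketched with pooled-embedding encoders. The sizes and the single linear layer standing in for the feed-forward network are illustrative assumptions; real encoders may be CNNs or RNNs:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 16))     # shared token-embedding table
W_ff = rng.normal(size=(4 * 16, 1))   # toy stand-in for the FF network

def encode(token_ids):
    """Pooled-embedding encoder: average the embeddings of a token sequence."""
    return emb[np.asarray(token_ids)].mean(axis=0)

def comparator(left, right):
    """Combine the two encodings as on the slide: Hadamard product,
    absolute difference, and concatenation, then a feed-forward layer."""
    feats = np.concatenate([left * right, np.abs(left - right), left, right])
    return float(feats @ W_ff)

# e.g. left = a user's watch history, right = a candidate movie
score = comparator(encode([3, 17, 42]), encode([7]))
```

Because the pairwise features are symmetric-friendly but the encoders need not be, the same skeleton covers token-token (recommendation), sequence-sequence (inference), and token-sequence (classification) pairings.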
  45. Object2Vec: benchmarking. Prediction of the relationship between token pairs: movie recommendation [chart: RMSE on MovieLens ratings prediction]. Prediction of the relationship between sequence pairs: natural language inference [chart: accuracy on Stanford Natural Language Inference for InferSent vs. Object2Vec with CNN and RNN encoders]. Prediction of similarity between embeddings of pairs of sequences: sentence similarity [chart: Pearson correlation on Semantic Text Similarity (STS'12-STS'16) for pooled embeddings, InferSent, and Object2Vec]. Prediction of the relationship between sequences and tokens: multi-label document classification.
  46. Pipe Mode (made available May 23rd) [charts: throughput, job startup time, and job execution time for PCA and K-Means]
  47. Using Amazon SageMaker
  48. From Amazon SageMaker notebooks [notebook: set parameters and hardware, start training, host the model]
  49. From Amazon EMR [notebook: start training, set parameters and hardware, apply the model]
  50. From the command line, specifying the input data, hardware, and algorithm:

profile=<your_profile>
arn_role=<your_arn_role>
training_image=382416733822.dkr.ecr.us-east-1.amazonaws.com/kmeans:1
training_job_name=clustering_text_documents_`date '+%Y_%m_%d_%H_%M_%S'`
aws --profile $profile --region us-east-1 sagemaker create-training-job \
    --training-job-name $training_job_name \
    --algorithm-specification TrainingImage=$training_image,TrainingInputMode=File \
    --hyper-parameters k=10,feature_dim=1024,mini_batch_size=1000 \
    --role-arn $arn_role \
    --input-data-config '{"ChannelName": "train", "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": "s3://kmeans_demo/train", "S3DataDistributionType": "ShardedByS3Key"}}, "CompressionType": "None", "RecordWrapperType": "None"}' \
    --output-data-config S3OutputPath=s3://training_output/$training_job_name \
    --resource-config InstanceCount=2,InstanceType=ml.c4.8xlarge,VolumeSizeInGB=50 \
    --stopping-condition MaxRuntimeInSeconds=3600
  51. Thank you. Edo Liberty, Director of Amazon AI Labs. Zohar Karnin, Bing Xiang, Baris Coskun, Ramesh Nallapati, Phillip Gautier, Madhav Jha, Ran Ding, Tim Januschowski, David Salinas, Bernie Wang, Jan Gasthaus, Laurence Rouesnel, Amir Sadoughi, Piali Das, Julio Delgado Mangas, Yury Astashonok, Can Balioglu, Saswata Chakravarty, and Alex Smola
