7. What is SageMaker?
● Machine Learning as a Service
● A cloud-based training and deployment framework
● Hosted notebooks for interactive development
● A set of optimised algorithms and open-source containers
Notebook → Jobs → Models → Endpoint
8. Jupyter Notebooks
● Open source web app
● Create and share documents with live code, visualisations, and documentation
● Data exploration, cleaning & transformation
● Statistical modelling
● Numerical simulation
● … and machine learning
10. Flexible Storage
● Share workspaces with teams
● Persist data between sessions/instances
● Availability & reliability
● Dynamically scalable storage
● Large file support
● Read-after-write consistency
● Low-ish latency
● High throughput
12. Clean, Fetch, Prepare Big Data (>100GB)
● Kinesis Firehose & Data Pipeline - ETL
● Glue - Catalogue, Crawl S3, ETL
● EMR/Hadoop/YARN - Distributed Compute and Storage
● Spark - Data Science Friendly Distributed Compute
● Athena / Redshift Spectrum - Use SQL with S3 data
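Athena (and Redshift Spectrum) query S3 data in place with standard SQL. A minimal sketch of the pattern; the table schema, column names, and bucket path are made up for illustration:

```sql
-- Define an external table over CSV files already sitting in S3
-- (schema and s3:// location are hypothetical)
CREATE EXTERNAL TABLE events (
  user_id    string,
  event_type string,
  ts         timestamp
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/events/';

-- Then query it like any SQL table; Athena scans the S3 objects directly
SELECT event_type, count(*) AS n
FROM events
GROUP BY event_type
ORDER BY n DESC;
```

No cluster to provision: you pay per query for the data scanned, which is why this pairs well with keeping the raw data in S3.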
13. Training in SageMaker
● Scalable Algorithms
● Streaming Data
● Incremental Training
● Containerized
● Accelerated Computing
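Streaming and incremental training both rest on the same idea: update model state one batch at a time instead of holding the full dataset in memory. As an illustration of that idea only (not SageMaker's actual implementation), here is an incremental mean/variance update in plain Python using Welford's algorithm:

```python
class RunningStats:
    """Incrementally track mean and variance of a stream (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, x):
        # One pass, constant memory: works however the data arrives
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n if self.n > 1 else 0.0


stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
print(stats.mean)      # → 5.0
print(stats.variance)  # → 4.0
```

Because the update never needs earlier observations, training can resume from saved state — which is exactly what makes incremental training on new data cheap.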
14. Out-of-the-Box Algorithms
● Supervised Learning: Linear Learner, XGBoost, Factorisation Machines
● Forecasting: DeepAR
● Text Mining: BlazingText (word2vec), Neural Topic Modelling, Latent Dirichlet Allocation, seq2seq
● Computer Vision: Image Classification
● Anomaly Detection: Random Cut Forest
● Unsupervised Learning: k-means, PCA
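PCA, the last item above, projects data onto the directions of maximal variance. A minimal NumPy sketch of the classic SVD formulation (a local illustration, not SageMaker's distributed version):

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)                  # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / (X.shape[0] - 1)  # variance along each component
    return Xc @ Vt[:n_components].T, explained[:n_components]

# Points lying almost on the line y = 2x: one component captures nearly everything
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2 * t]) + 0.01 * rng.normal(size=(200, 2))
proj, var = pca(X, 2)
print(var[0] / var.sum())  # close to 1.0
```

The explained-variance ratio is the usual guide for how many components to keep.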
21. Notebooks vs Distributed Training
scikit-learn
● 1 month of data (2GB)
● 13 million rows
● 7 minutes to train ($0.50)
● 1 x m4.16xlarge
● sklearn.cluster.KMeans
SageMaker
● 1 year of data (24GB)
● 172 million rows
● 10 minutes to train ($3)
● 4 x m4.16xlarge
● sagemaker.KMeans
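Both columns above run the same underlying algorithm, Lloyd's k-means; only the scale differs. To make the comparison concrete, a compact NumPy sketch of the iteration loop:

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain Lloyd's k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = X[:k].astype(float).copy()   # simple deterministic init
    for _ in range(iters):
        # pairwise distances, shape (n_points, k)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # guard against empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Two well-separated blobs: k-means should recover them exactly
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
centroids, labels = kmeans(X, 2)
print(labels)  # → [0 0 0 1 1 1]
```

The expensive part is the distance matrix over all points, which is what both sklearn and SageMaker parallelise — SageMaker across machines, which is why it handles 12× the data in comparable wall-clock time.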
22. Out-of-the-Box vs Custom Algorithms
Out-of-the-Box (SageMaker k-means)
● Pay only for hours used
● Data lives in S3
● Data scientist provisions compute
● Easy to host models
● Limited Model Options
Custom (e.g. scikit-learn)
● Use any model you like
● Distributed training is hard
● Convert models
● Build Containers
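"Build Containers" means following SageMaker's container contract: training data is mounted under /opt/ml/input/data/&lt;channel&gt; and the model artifact must be written to /opt/ml/model. A toy sketch of a train entrypoint following that convention — the CSV format and the trivial "column means" model are made up for illustration, and the paths are parameters so it runs anywhere:

```python
import csv
import os
import pickle

def train(input_dir="/opt/ml/input/data/train", model_dir="/opt/ml/model"):
    """Toy SageMaker-style entrypoint: read CSVs from the training channel,
    'train' a trivial model (the per-column means), and save it as model.pkl."""
    rows = []
    for name in sorted(os.listdir(input_dir)):
        with open(os.path.join(input_dir, name)) as f:
            rows.extend([float(v) for v in line] for line in csv.reader(f))
    n_cols = len(rows[0])
    model = [sum(r[i] for r in rows) / len(rows) for i in range(n_cols)]
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "model.pkl"), "wb") as f:
        pickle.dump(model, f)
    return model
```

SageMaker tars up whatever lands in /opt/ml/model and uploads it to S3, so the same artifact can then be loaded by a serving container.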
34. 6 Months of SageMaker
● Very large datasets
● Incremental training
● Scaling “Time to Solution”
● Simple deployment of models
● Easy hosted notebooks
● No auto hyperparameter tuning
● Local mode needs improvement
● Evolving TensorBoard support
38. Alternatives
Google CloudML
● Integrated into GCP
● Efficient hyperparameter search
● Mature (as much as anything in ML is)
KubeFlow
● Defines a TFJob resource in K8s
● Allows easy creation of a TF cluster
● Immature (but promising)
● Cloud Agnostic
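KubeFlow's TFJob is a Kubernetes custom resource, so a TF cluster is declared as YAML. A hedged sketch of what a minimal manifest looks like — the API version has changed across KubeFlow releases and the image name is hypothetical, so check the docs for your version:

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: example-tfjob
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2          # two worker pods form the TF cluster
      template:
        spec:
          containers:
            - name: tensorflow
              image: my-registry/my-tf-training:latest   # hypothetical image
              command: ["python", "train.py"]
```

Because this is plain Kubernetes, the same manifest runs on any cluster — which is the "cloud agnostic" point above.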