7. What is SageMaker?
● Machine Learning as a Service
● A cloud-based training and deployment framework
● Hosted notebooks for interactive development
● A set of optimised algorithms and open-source containers
Notebook → Jobs → Models → Endpoint
8. Jupyter Notebooks
● Open source web app
● Create and share documents with live code, visualisations, and documentation
● Data exploration, cleaning & transformation
● Statistical modelling
● Numerical simulation
● … and machine learning
10. Flexible Storage
● Share workspaces with teams
● Persist data between sessions/instances
● Availability & reliability
● Dynamically scalable storage
● Large file support
● Read-after-write consistency
● Low-ish latency
● High throughput
12. Clean, Fetch, Prepare Big Data (>100GB)
● Kinesis Firehose & Data Pipeline - ETL
● Glue - Catalogue, Crawl S3, ETL
● EMR/Hadoop/YARN - Distributed Compute and Storage
● Spark - Data Science Friendly Distributed Compute
● Athena / Redshift Spectrum - Use SQL with S3 data
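Athena (and Redshift Spectrum) query S3 data in place with standard SQL. A minimal sketch of the pattern; the table schema, column names, and bucket path are made up for illustration:

```sql
-- Define an external table over CSV files already sitting in S3
-- (schema and s3:// location are hypothetical)
CREATE EXTERNAL TABLE events (
  user_id    string,
  event_type string,
  ts         timestamp
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/events/';

-- Then query it like any SQL table; Athena scans the S3 objects directly
SELECT event_type, count(*) AS n
FROM events
GROUP BY event_type
ORDER BY n DESC;
```

No cluster to provision: you pay per query for the data scanned, which is why this pairs well with keeping the raw data in S3.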
13. Training in SageMaker
● Scalable Algorithms
● Streaming Data
● Incremental Training
● Containerized
● Accelerated Computing
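Streaming and incremental training both rest on the same idea: update model state one batch at a time instead of holding the full dataset in memory. As an illustration of that idea only (not SageMaker's actual implementation), here is an incremental mean/variance update in plain Python using Welford's algorithm:

```python
class RunningStats:
    """Incrementally track mean and variance of a stream (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, x):
        # One pass, constant memory: works however the data arrives
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n if self.n > 1 else 0.0


stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
print(stats.mean)      # → 5.0
print(stats.variance)  # → 4.0
```

Because the update never needs earlier observations, training can resume from saved state — which is exactly what makes incremental training on new data cheap.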
14. Out-of-the-Box Algorithms
● Supervised Learning: Linear Learner, XGBoost, Factorisation Machines
● Forecasting: DeepAR
● Text Mining: BlazingText (word2vec), Neural Topic Modelling, Latent Dirichlet Allocation, seq2seq
● Computer Vision: Image Classification
● Anomaly Detection: Random Cut Forest
● Unsupervised Learning: k-means, PCA
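PCA, the last item above, projects data onto the directions of maximal variance. A minimal NumPy sketch of the classic SVD formulation (a local illustration, not SageMaker's distributed version):

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)                  # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / (X.shape[0] - 1)  # variance along each component
    return Xc @ Vt[:n_components].T, explained[:n_components]

# Points lying almost on the line y = 2x: one component captures nearly everything
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2 * t]) + 0.01 * rng.normal(size=(200, 2))
proj, var = pca(X, 2)
print(var[0] / var.sum())  # close to 1.0
```

The explained-variance ratio is the usual guide for how many components to keep.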
21. Notebooks vs Distributed Training
scikit-learn
● 1 month of data (2GB)
● 13 million rows
● 7 minutes to train ($0.50)
● 1 x m4.16xlarge
● sklearn.cluster.KMeans
SageMaker
● 1 year of data (24GB)
● 172 million rows
● 10 minutes to train ($3)
● 4 x m4.16xlarge
● sagemaker.KMeans
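Both columns above run the same underlying algorithm, Lloyd's k-means; only the scale differs. To make the comparison concrete, a compact NumPy sketch of the iteration loop:

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain Lloyd's k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = X[:k].astype(float).copy()   # simple deterministic init
    for _ in range(iters):
        # pairwise distances, shape (n_points, k)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # guard against empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Two well-separated blobs: k-means should recover them exactly
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
centroids, labels = kmeans(X, 2)
print(labels)  # → [0 0 0 1 1 1]
```

The expensive part is the distance matrix over all points, which is what both sklearn and SageMaker parallelise — SageMaker across machines, which is why it handles 12× the data in comparable wall-clock time.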
22. Out-of-the-Box vs Custom Algorithms
Out-of-the-Box (SageMaker k-means)
● Pay only for hours used
● Data lives in S3
● Data scientist provisions compute
● Easy to host models
● Limited Model Options
Custom (e.g. scikit-learn)
● Use any model you like
● Distributed training is hard
● Convert models
● Build Containers
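"Build Containers" means following SageMaker's container contract: training data is mounted under /opt/ml/input/data/&lt;channel&gt; and the model artifact must be written to /opt/ml/model. A toy sketch of a train entrypoint following that convention — the CSV format and the trivial "column means" model are made up for illustration, and the paths are parameters so it runs anywhere:

```python
import csv
import os
import pickle

def train(input_dir="/opt/ml/input/data/train", model_dir="/opt/ml/model"):
    """Toy SageMaker-style entrypoint: read CSVs from the training channel,
    'train' a trivial model (the per-column means), and save it as model.pkl."""
    rows = []
    for name in sorted(os.listdir(input_dir)):
        with open(os.path.join(input_dir, name)) as f:
            rows.extend([float(v) for v in line] for line in csv.reader(f))
    n_cols = len(rows[0])
    model = [sum(r[i] for r in rows) / len(rows) for i in range(n_cols)]
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "model.pkl"), "wb") as f:
        pickle.dump(model, f)
    return model
```

SageMaker tars up whatever lands in /opt/ml/model and uploads it to S3, so the same artifact can then be loaded by a serving container.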
34. 6 Months of SageMaker
● Very large datasets
● Incremental training
● Scaling “Time to Solution”
● Simple deployment of models
● Easy hosted notebooks
● No auto hyperparameter tuning
● Local mode needs improvement
● Evolving TensorBoard support
38. Alternatives
Google CloudML
● Integrated into GCP
● Efficient hyperparameter search
● Mature (as much as anything in ML is)
KubeFlow
● Defines a TFJob resource in K8s
● Allows easy creation of a TF cluster
● Immature (but promising)
● Cloud Agnostic
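KubeFlow's TFJob is a Kubernetes custom resource, so a TF cluster is declared as YAML. A hedged sketch of what a minimal manifest looks like — the API version has changed across KubeFlow releases and the image name is hypothetical, so check the docs for your version:

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: example-tfjob
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2          # two worker pods form the TF cluster
      template:
        spec:
          containers:
            - name: tensorflow
              image: my-registry/my-tf-training:latest   # hypothetical image
              command: ["python", "train.py"]
```

Because this is plain Kubernetes, the same manifest runs on any cluster — which is the "cloud agnostic" point above.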