
Build, train, and deploy ML models with Amazon SageMaker - AIM302 - New York AWS Summit

Amazon SageMaker is a fully managed service that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. In this workshop, you learn how to build an ML model using Amazon SageMaker's built-in algorithms and frameworks. You then train the model with automatic model tuning to quickly reach a high level of accuracy. Finally, you deploy the model to production, where it can start serving predictions. By the end, you will understand how Amazon SageMaker removes the complexity and barriers that typically slow down developers working with ML.


  1. Building, training, and deploying ML models with Amazon SageMaker (AIM302). Emily Webber, Machine Learning Specialist, Amazon Web Services. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. For this workshop, you need • A personal laptop • An AWS account • Access to Amazon SageMaker, Amazon S3, and Amazon ECS • Credits will be provided to offset the cost of AWS services charged to your account for this workshop
  3. (Image slide: 1837 and 1969)
  4. Our mission at AWS: Put machine learning in the hands of every developer
  5. The Amazon ML stack: Broadest & deepest set of capabilities • AI services: easily add intelligence to applications without machine learning skills (vision | documents | speech | language | chatbots | forecasting | recommendations) • ML services: build, train, and deploy machine learning models quickly and easily (data labeling | pre-built algorithms & notebooks | one-click training and deployment) • ML frameworks & infrastructure: flexibility & choice, highest-performing infrastructure (support for ML frameworks | compute options purpose-built for ML)
  6. Amazon SageMaker • Notebook instances with 200+ examples • 17 built-in algorithms • ML training service • Hyperparameter tuning service • ML hosting service • Batch transform (inferencing and data transformation) • SDKs: Python and Spark • Documentation, whitepapers, and blog posts
  7. Problem statement: Healthcare insurance fraud is a pressing problem that causes substantial and increasing costs in medical insurance programs. Because of the large volume of claims submitted, reviewing individual claims is difficult, which encourages the use of automated pre-payment controls and better post-payment decision-support tools to enable subject-matter-expert analysis. We will demonstrate the application of unsupervised outlier-detection techniques to a minimal set of metrics available in the CMS Medicare inpatient claims from 2008.
  8. Dataset: CMS Medicare inpatient claims from 2008 • Each record is an inpatient claim incurred by a 5% sample of Medicare beneficiaries • Beneficiary identities are not provided • Zip codes of the facilities where patients were treated are not provided • The file contains eight (8) variables: one primary key and seven analytic variables • A data dictionary required to interpret the codes in the dataset is provided
  9. Data variables
  10. Techniques and algorithms used in the workshop • Outlier detection • Word embedding and Word2Vec • Principal component analysis • Mahalanobis distance calculation
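As a quick illustration of the last item, here is a minimal NumPy sketch (an illustration, not the workshop code) that computes the Mahalanobis distance of each record from the mean of the data; the records with the largest distances are the outlier candidates.

```python
import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance of each row of X from the column means."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)          # covariance of the variables
    cov_inv = np.linalg.inv(cov)
    diff = X - mu
    # d_i = sqrt((x_i - mu)^T  S^-1  (x_i - mu)) for every row i
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

# Toy data: 99 typical points plus one planted outlier at the end.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(99, 3)), [10.0, 10.0, 10.0]])
d = mahalanobis_distances(X)
print(int(np.argmax(d)))  # 99: the planted outlier is the most distant point
```

Ranking by Mahalanobis distance rather than plain Euclidean distance accounts for the scale of, and correlation between, the variables.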
  11. Outlier detection: "Observation which deviates so much from other observations as to arouse suspicion it was generated by a different mechanism" (Hawkins, 1980). Challenges: • Modeling normal objects and outliers effectively; the border between data normality and abnormality (outliers) is often not clear-cut • Outlier detection methods can be application-specific; for example, in clinical data a small deviation may be an outlier, while in a marketing application a large deviation is required to justify an outlier • Noise in data may be present as deviations in attribute values or even as missing values; noise may hide an outlier or may flag a deviation as an outlier • Justifying an outlier from an understandability point of view may be difficult
  12. Word embedding and Word2Vec. Word embedding: • Texts are converted into numbers, and there may be different numerical representations of the same text • Many techniques exist; CBOW (continuous bag of words) and skip-gram are popular and effective for large corpora of documents • For example, CBOW predicts the probability of a word given its context. Word2Vec: • Shallow neural networks that map word(s) to a target variable, which is also a word (or words) • The learned weights act as word-vector representations
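To make the CBOW idea concrete, here is a tiny NumPy sketch of a single forward pass (an illustration with a toy vocabulary, not the workshop code or a trained model): the embeddings of the context words are averaged, projected through an output matrix, and passed through a softmax to give a probability for each candidate center word.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["claim", "provider", "payment", "inpatient", "code"]  # toy vocabulary
V, D = len(vocab), 4            # vocabulary size, embedding dimension

W_in = rng.normal(0, 0.1, (V, D))   # input embeddings (one row per word)
W_out = rng.normal(0, 0.1, (D, V))  # output projection to vocabulary scores

def cbow_probs(context_ids):
    """P(center word | context) in a CBOW model: average the context
    embeddings, project to one score per vocabulary word, softmax."""
    h = W_in[context_ids].mean(axis=0)   # averaged context vector
    scores = h @ W_out
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()

p = cbow_probs([vocab.index("claim"), vocab.index("payment")])
print(p.shape)  # (5,) -- one probability per vocabulary word, summing to 1
```

Training adjusts W_in and W_out so the true center word gets high probability; the rows of W_in are then the word vectors.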
  13. Principal component analysis. When: • Do you want to reduce the number of variables, but aren't able to identify variables to completely remove from consideration? • Do you want to ensure your variables are independent of one another? • Are you comfortable making your independent variables less interpretable? How: • A measure of how each variable is associated with one another (covariance matrix) • The directions in which our data are dispersed (eigenvectors) • The relative importance of these different directions (eigenvalues)
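The three "how" ingredients map directly onto a few lines of NumPy (a sketch of the standard technique, not the workshop code): compute the covariance matrix, eigendecompose it, and rank directions by eigenvalue.

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k principal components via the
    eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # how variables associate
    eigvals, eigvecs = np.linalg.eigh(cov)  # directions + their importance
    order = np.argsort(eigvals)[::-1]       # largest variance first
    components = eigvecs[:, order[:k]]
    return Xc @ components, eigvals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))   # 200 records, 5 variables
Z, ev = pca(X, 2)
print(Z.shape)  # (200, 2): the same records in 2 uncorrelated dimensions
```

The projected columns are uncorrelated by construction, which is what enables the Mahalanobis-style distance scoring used later in the workshop.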
  14. Isolation of different environments: Notebook instance | Training | Inference. Amazon SageMaker provides a clean separation between your Jupyter notebook instance, training environment instances, and inference environment instances by launching a new stack on new machines for each of the following functions. • Notebook instance: your IDE environment for writing scripts and testing your algorithm on a small subset of data; choose smaller instances for this work • Training: the training environment created when you make a fit call using the SageMaker SDK from within a Jupyter notebook; choose one or more larger instances to train your learning algorithm on a large dataset; the environment is terminated automatically when training completes • Inference: the hosting environment for your trained models; you can scale it out using the automatic scaling feature and choose smaller or larger instances to meet your demand for inference
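The fit-then-deploy flow described above looks roughly like the following SageMaker Python SDK (v2) sketch. This is a hedged outline, not the workshop's notebook: the role ARN, bucket name, instance types, and PCA hyperparameter values are placeholders, and running it requires an AWS account and incurs charges, so the calls are wrapped in a function rather than executed here.

```python
def train_and_deploy(role_arn, bucket):
    """Sketch of the SageMaker training -> hosting flow.
    role_arn and bucket are placeholders; calling this uses AWS resources."""
    import sagemaker
    from sagemaker.estimator import Estimator

    session = sagemaker.Session()
    estimator = Estimator(
        # Built-in PCA algorithm image for the current region.
        image_uri=sagemaker.image_uris.retrieve("pca", session.boto_region_name),
        role=role_arn,
        instance_count=1,
        instance_type="ml.m5.xlarge",          # training environment instances
        output_path=f"s3://{bucket}/output",
        sagemaker_session=session,
    )
    # Example hyperparameters for the built-in PCA algorithm (values illustrative).
    estimator.set_hyperparameters(feature_dim=7, num_components=3,
                                  mini_batch_size=200)
    estimator.fit({"train": f"s3://{bucket}/train"})  # launches the training stack
    # Hosting environment: separate instances behind a managed endpoint.
    predictor = estimator.deploy(initial_instance_count=1,
                                 instance_type="ml.t2.medium")
    return predictor
```

Note how the notebook, the `fit` call, and the `deploy` call each target a different set of instances, which is exactly the environment isolation the slide describes.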
  15. Complete the workshop: https://tinyurl.com/y3ch2kjg
  16. Thank you!
