ML Workflows with Amazon SageMaker and AWS Step Functions (API325) - AWS re:Invent 2018
2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Machine Learning Workflows with
Amazon SageMaker and AWS Step Functions
API325
Tom Faulhaber
Principal Engineer, AI Platforms
Amazon Web Services
Jeremy Irwin
Solution Architect
Cox Automotive Inc.
Andy Katz
Sr. Product Manager
Amazon Web Services
3.
Today’s agenda
• Build, train, and deploy machine learning models with Amazon SageMaker
• Build serverless workflows with less code to write and maintain using AWS Step Functions
• Learn how Cox Automotive combined SageMaker and Step Functions to improve collaboration between data scientists and software engineers
• New features to build and manage ML workflows even faster
5.
SageMaker manages ML infrastructure
Build
• Pre-built notebook instances
• Highly optimized machine learning algorithms
Train
• One-click training for ML, deep learning, and custom algorithms
• Automatic model tuning (hyperparameter optimization)
Deploy
• Fully managed hosting at scale
• Deployment without engineering effort
6.
Customers building and deploying on SageMaker
7.
Machine learning cycle
Business
Problem
ML problem
framing
Data collection
Data integration
Data preparation
and cleaning
Data visualization
and analysis
Feature
engineering
Model training and
parameter tuning
Model evaluation
Monitoring and
debugging
Model deployment
Predictions
YES / NO
8.
Manage data on AWS
[Diagram: the machine learning cycle from slide 7]
9.
Build and train models using SageMaker
[Diagram: the machine learning cycle from slide 7]
10.
Deploy models using SageMaker
[Diagram: the machine learning cycle from slide 7]
11.
What about the lines between the steps?
[Diagram: the machine learning cycle from slide 7]
13.
What is Step Functions?
[Diagram: example state machine showing Task, Choice, Fail, and Parallel states, with state labels Mountains, People, Snow, and NotSupportedImageType]
14.
Step Functions uses Amazon States Language (JSON)
{
  "Comment": "Image Processing workflow",
  "StartAt": "ExtractImageMetadata",
  "States": {
    "ExtractImageMetadata": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:::function:photo-backendExtractImageMetadata-...",
      "InputPath": "$",
      "ResultPath": "$.extractedMetadata",
      "Next": "ImageTypeCheck",
      "Catch": [ {
        "ErrorEquals": [ "ImageIdentifyError" ],
        "Next": "NotSupportedImageType"
      } ],
      "Retry": [ {
        "ErrorEquals": [ "States.ALL" ],
        "IntervalSeconds": 1,
        "MaxAttempts": 2,
        "BackoffRate": 1.5
      }, ...
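To make the Retry block above concrete, here is a minimal Python sketch of the wait before each retry, computed from IntervalSeconds, MaxAttempts, and BackoffRate. The function name retry_delays is hypothetical; it only illustrates the arithmetic and is not part of any AWS SDK.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Return the wait (in seconds) before each retry attempt.

    The first retry waits interval_seconds; each later retry waits
    backoff_rate times longer than the previous one.
    """
    return [interval_seconds * backoff_rate ** attempt
            for attempt in range(max_attempts)]


# Values taken from the slide: IntervalSeconds=1, MaxAttempts=2, BackoffRate=1.5
print(retry_delays(1, 2, 1.5))  # prints [1.0, 1.5]
```

With the slide's settings the task is retried at most twice, after roughly 1 second and then 1.5 seconds, before the error is handed to a Catch clause or fails the execution.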
15.
Run tasks with any compute resource
• Activity worker on a traditional server: long poll
• AWS Lambda function: synchronous request
16.
Customers running workflows on Step Functions
17. “Back to our story…”
Amazon
SageMaker
AWS Step
Functions
18.
Machine learning cycle
[Diagram: the machine learning cycle from slide 7]
19.
Cox Automotive
OUR VISION
TRANSFORM THE WAY THE WORLD
BUYS, SELLS, OWNS, AND USES
CARS
20. “As Data Scientists, one of our biggest concerns with ML is that over
time the models learn bad behaviors from spoiled data.
We need to interject human expert oversight in our model
deployment process, in order to continuously deliver quality
models with minimal human intervention.”
Jeff Keller, Senior Decision Scientist
Cox Automotive
21.
Digital advertising recommendations
Enable car dealers to make better-informed digital advertising decisions
At Cox Automotive, ML-related product development is bifurcated:
• Decision Science builds prediction models
• Engineering integrates models into applications used by Cox Automotive clients
Challenge: How can we reduce the friction between Data Science and Engineering so that both teams’ needs are fulfilled?
24.
Engineering != Decision Science
Engineering
• Background: Computer Science
• Skills: automation, deployment, reusability, Java
• Imperatives: security, operability, scalability
• Cadence: 2-week sprints
Decision Science
• Background: Statistics
• Skills: statistics, modeling, analysis, R, Python
• Imperatives: accuracy, precision, interpretability
• Cadence: varies
25.
AWS Compute Blog: Starting point
26.
Amazon SageMaker model deployment pipeline
VPC VPC
Event
Data Scientist
Email
Requirements
• Model artifacts are created as .zip files
• Models are created as .tar.gz files
Configurable Parameters
• Source S3 buckets (landing zone for newly built models)
• Destination S3 buckets (Engineering-owned)
• Email address
27.
AWS Step Functions state machine definition
…
"StartAt": "GetNewModel",
"States": {
  "GetNewModel": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:${region}:${act}:function:model-review-GetNewModelFunction",
    "ResultPath": "$",
    "Next": "GetManualReview"
  },
  "GetManualReview": {
    "Type": "Task",
    "Resource": "arn:aws:states:${region}:${act}:activity:model-review-getModelReviewDecision",
    "ResultPath": "$.taskresult",
    "TimeoutSeconds": 604800,
    "Next": "ApproveOrRejectNewModel"
  },
…
28.
State machine activity workers
Call-work-respond: An external worker gets
token, does work, and updates activity with
success or failure
Call-work-delegate…respond: Our external
worker gets the token and then delegates
responsibility for updating the activity to
downstream AWS services
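[Diagram, call-work-respond: a traditional server calls GetActivityTask and receives the JSON input plus a TaskToken, then calls SendTaskSuccess with the JSON result plus the TaskToken]
[Diagram, call-work-delegate…respond: the server delegates the TaskToken, and the delegate calls SendTaskSuccess]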
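The call-work-respond pattern described on this slide can be sketched in Python as follows. This is a minimal illustration, not the presenters' code: run_worker_once, do_work, and the default worker name are hypothetical, and sfn_client is assumed to behave like boto3.client('stepfunctions') (get_activity_task, send_task_success, send_task_failure).

```python
import json


def run_worker_once(sfn_client, activity_arn, do_work, worker_name="example-worker"):
    """Poll once for an activity task, run do_work on its input, report back.

    Returns True if a task was handled, False if the long poll returned
    without a task (empty or missing taskToken).
    """
    task = sfn_client.get_activity_task(activityArn=activity_arn,
                                        workerName=worker_name)
    token = task.get("taskToken")
    if not token:
        return False
    try:
        # The task input arrives as a JSON string
        result = do_work(json.loads(task["input"]))
        sfn_client.send_task_success(taskToken=token, output=json.dumps(result))
    except Exception as exc:
        # Report the failure so the state machine can retry or catch it
        sfn_client.send_task_failure(taskToken=token,
                                     error=type(exc).__name__,
                                     cause=str(exc))
    return True
```

In the call-work-delegate…respond variant, the worker would stop after obtaining the token and pass it on (for example in an email link), leaving the send_task_success call to whichever downstream component finishes the work.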
29.
Activity token journey: Send models for review
import boto3
from urllib.parse import quote

# Poll the activity for a task; the response carries the task token
sfnClient = boto3.client('stepfunctions')
getActivityTaskResponse = sfnClient.get_activity_task(
    activityArn=activityArn,
    workerName='checkStateMachineActivityStatus'
)
taskToken = getActivityTaskResponse['taskToken']
sendEmail(taskToken, diagnosticsFileName,
          diagnosticsFile, diagnosticsFilePath, apiUrl)
…
def sendEmail(taskToken, diagnosticsFileName,
              diagnosticsFile, diagnosticsFilePath, apiUrl):
    sesClient = boto3.client('ses')
    # URL-encode the token so it is safe to use as a path segment
    encodedtaskToken = quote(taskToken, safe='')
    approveLink = apiUrl + '/approve/' + encodedtaskToken
    rejectLink = apiUrl + '/reject/' + encodedtaskToken
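The token has to survive a round trip through a URL: it is encoded into the approve and reject links in the email and decoded again when the reviewer clicks one. A minimal sketch of that round trip, using only the standard library, is shown here. The helper name build_review_links, the example API URL, and the sample token are all hypothetical.

```python
from urllib.parse import quote, unquote


def build_review_links(api_url, task_token):
    """Build approve/reject links that embed the URL-encoded task token."""
    # safe='' also encodes '/', which would otherwise split the path segment
    encoded = quote(task_token, safe="")
    return {
        "approve": f"{api_url}/approve/{encoded}",
        "reject": f"{api_url}/reject/{encoded}",
    }


links = build_review_links("https://api.example.com/dev", "AAAA/bb+cc==")
print(links["approve"])

# The receiving side recovers the original token from the last path segment
recovered = unquote(links["approve"].rsplit("/", 1)[1])
print(recovered == "AAAA/bb+cc==")
```

Encoding with safe='' matters because task tokens can contain characters such as '/', '+', and '=' that would otherwise change the meaning of the URL path.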
30.
Activity token journey: Generate review request
Data Scientist
Event
31.
Activity token journey: Amazon API Gateway configuration
GetReviewDecisionFunction:
  handler: handler.getReviewDecision
  role: "${self:custom.terraformed.service.role}"
  events:
    - http:
        path: approve/{taskToken}
        method: get
        request:
          parameters:
            paths:
              taskToken: true
    - http:
        path: reject/{taskToken}
        method: get
        request:
          parameters:
            paths:
              taskToken: true
32.
Activity token journey: Prepare arguments & output
from urllib.parse import unquote

path = event['path']
# API Gateway passes the URL-encoded token as a path parameter
taskToken = unquote(event['pathParameters']['taskToken'])
taskSuccessOutput = '{"decision": "Approved"}'
taskFailureOutput = '{"decision": "Rejected"}'
status = None  # stays None when the path is neither /approve nor /reject
if path.startswith('/reject'):
    message = "The model has been rejected and will not be promoted"
    status = 'rejected'
    kwargs = {
        'taskToken': taskToken,
        'output': taskFailureOutput
    }
elif path.startswith('/approve'):
    message = "The model has been approved and will be promoted"
    status = 'approved'
    kwargs = {
        'taskToken': taskToken,
        'output': taskSuccessOutput
    }
else:
    message = "The parameter does not match the expected parameter"
    print(message)
33.
Activity token journey: Set activity status
# Both decisions are reported with send_task_success; the workflow
# branches later on the "decision" value in the task output.
if status in ('approved', 'rejected'):
    sfnClient.send_task_success(**kwargs)
    responseData = {
        "statusCode": 200,
        "body": json.dumps({"decision": message})
    }
else:
    # Unrecognized path: no task update is sent
    responseData = {
        "statusCode": 400,
        "body": json.dumps({"decision": message})
    }
return responseData
34.
State input & output processing
Lambda state can be shared with downstream (subsequent) states via the state output, which is a mutable JSON object used to carry input and output data between states.
Benefits:
• Upstream worker output can be used as input for downstream workers (reducing the number of repeat calls)
• The state of upstream states is maintained
35.
State input & output processing: Append to output
{
  "name": "GetNewModel",
  "output": {
    "diagnosticsFilePath": "20181102/model_diagnostics.zip",
    "diagnosticsFileName": "model_diagnostics.zip"
  }
}
# State is configured to append the decision to its input
{
  "name": "GetManualReview",
  "output": {
    "diagnosticsFilePath": "20181102/model_diagnostics.zip",
    "diagnosticsFileName": "model_diagnostics.zip",
    "taskresult": {
      "decision": "Approved"
    }
  }
}
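The "append to output" behavior above comes from the state's ResultPath setting. The following Python sketch is a simplified model of it, handling only "$" and dotted paths such as "$.taskresult"; the function apply_result_path is hypothetical and is not how Step Functions is implemented.

```python
import copy


def apply_result_path(state_input, result, result_path):
    """Simplified model of ResultPath for "$" and "$.a.b" style paths."""
    if result_path == "$":
        # The result replaces the input entirely
        return result
    output = copy.deepcopy(state_input)
    keys = result_path[2:].split(".")  # strip the leading "$."
    node = output
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    # The result is grafted onto a copy of the input at the given path
    node[keys[-1]] = result
    return output


state_input = {
    "diagnosticsFilePath": "20181102/model_diagnostics.zip",
    "diagnosticsFileName": "model_diagnostics.zip",
}
output = apply_result_path(state_input, {"decision": "Approved"}, "$.taskresult")
print(output)
```

With "$.taskresult" the upstream fields are preserved and the decision is added alongside them, while "$" would discard the upstream fields and keep only the task result.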
36.
State input & output processing: Choice states
"ApproveOrRejectNewModel": {
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.taskresult.decision",
      "StringEquals": "Approved",
      "Next": "ApproveNewModel"
    },
    {
      "Variable": "$.taskresult.decision",
      "StringEquals": "Rejected",
      "Next": "RejectNewModel"
    }
  ]
}
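How the Choice state above selects a branch can be sketched in a few lines of Python. This is a simplified model that supports only StringEquals rules and dotted "$.a.b" variables; resolve and next_state are hypothetical helper names, not an AWS API.

```python
def resolve(path, data):
    """Look up a simple "$.a.b" style path in a nested dict."""
    node = data
    for key in path[2:].split("."):
        node = node[key]
    return node


def next_state(choice_state, state_input):
    """Return the Next target of the first matching StringEquals rule."""
    for rule in choice_state["Choices"]:
        if resolve(rule["Variable"], state_input) == rule["StringEquals"]:
            return rule["Next"]
    # Step Functions itself fails the execution when no rule matches and
    # no Default is set; this sketch simply returns None in that case.
    return choice_state.get("Default")


choice = {
    "Type": "Choice",
    "Choices": [
        {"Variable": "$.taskresult.decision", "StringEquals": "Approved",
         "Next": "ApproveNewModel"},
        {"Variable": "$.taskresult.decision", "StringEquals": "Rejected",
         "Next": "RejectNewModel"},
    ],
}
print(next_state(choice, {"taskresult": {"decision": "Rejected"}}))  # prints RejectNewModel
```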
37.
AWS Compute Blog: What we changed
• Step Functions
  • Automating invocation of the state machine
  • Using state input & output to pass upstream Lambda state/data to downstream Lambdas
  • > 1 state
• Amazon Simple Email Service (Amazon SES)
  • Initial setup
  • Attachments
• Model delivery to Engineering
• Infrastructure as code
38. “Engineering & Data Science
development cadences are different.
An ability to asynchronously collaborate
reduces wait states and frustration.”
Cox Automotive
40.
What Decision Science learned about Engineering
• How to share AWS resources amongst different projects
• Infrastructure-as-code repo hierarchy and management
• An approach for working in multiple AWS environments (lab, non-prod, prod)
41.
What Engineering learned about Decision Science
• Human oversight is required to prevent unintended results and bias
• Data access & availability are real issues
• Are we collecting the right data to support future modeling efforts?
42.
Example ML workflow
import boto3
import sagemaker

def upload_to_s3(channel, file):
    # `bucket` is assumed to be defined earlier in the notebook
    s3 = boto3.resource('s3')
    key = channel + '/' + file
    with open(file, "rb") as data:
        s3.Bucket(bucket).put_object(Key=key, Body=data)

train = sagemaker.s3_input('s3://{}/train/'.format(bucket),
                           content_type='application/x-recordio')
validation = sagemaker.s3_input('s3://{}/validation/'.format(bucket),
                                content_type='application/x-recordio')

input_data = 's3://batch-test-data/caltech256/'
output_data = 's3://batch-test-output/DEMO-image-classification'
# `training_job` is assumed to be an already fitted estimator
transformer = training_job.transformer(2, 'ml.p3.2xlarge', output_path=output_data,
    assemble_with='Line', max_payload=8, max_concurrent_transforms=8)
transformer.transform(input_data, content_type='application/x-image')
43.
ML workflow in Step Functions
44.
Manage asynchronous jobs without writing code!
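For contrast, this is roughly the kind of polling code a workflow owner would otherwise have to write and maintain to wait for an asynchronous SageMaker job. It is a minimal sketch under stated assumptions: wait_for_training_job is a hypothetical helper, and sm_client is assumed to behave like boto3.client('sagemaker'), whose describe_training_job response includes a TrainingJobStatus field.

```python
import time

# Statuses after which a training job will not change any further
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(sm_client, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll a training job until it reaches a terminal status and return it."""
    while True:
        response = sm_client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        # Still running: wait before asking again
        sleep(poll_seconds)
```

A loop like this also needs a process to run in, a timeout, and error handling, which is the bookkeeping the managed workflow is meant to take over.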
45.
Simplify machine learning workflows
46.
Add AWS Glue ETL jobs in your workflows
"Synchronously Run a Glue Job": {
  "Type": "Task",
  "Resource": "arn:aws:states:::glue:startJobRun.sync",
  "Parameters": {
    "JobName.$": "$.myJobName",
    "AllocatedCapacity": 3
  },
  "Catch": [ {
    "ErrorEquals": ["States.TaskFailed"],
    "ResultPath": "$.cause",
    "Next": "Notify on Error"
  } ],
  "ResultPath": "$.jobInfo",
  "Next": "Report Success"
}
47.
Add Amazon SageMaker training and transform jobs in your workflows
"Synchronously Run a Training Job": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
  "Parameters": {
    "AlgorithmSpecification": {...},
    "HyperParameters": {...},
    "InputDataConfig": [...],
    ...
  },
  "Catch": [ {
    "ErrorEquals": ["States.TaskFailed"],
    "ResultPath": "$.cause",
    "Next": "Notify on Error"
  } ],
  "ResultPath": "$.jobInfo",
  "Next": "Report Success"
}
"Synchronously Run a Transform Job": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sagemaker:createTransformJob.sync",
  "Parameters": {
    "TransformJobName.$": "$.transform",
    "ModelName.$": "$.model",
    "MaxConcurrentTransforms": 8,
    ...
  },
  "Catch": [ {
    "ErrorEquals": ["States.TaskFailed"],
    "ResultPath": "$.cause",
    "Next": "Notify on Error"
  } ],
  "ResultPath": "$.jobInfo",
  "Next": "Report Success"
}
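The Glue, training, and transform states share the same shape: a Task with a ".sync" service-integration resource, a Parameters block, a Catch for States.TaskFailed, and a ResultPath. As a sketch, that shape can be generated from Python and serialized to JSON. The helper sync_task_state and the sample job name are hypothetical; the state names are the ones used on the slides.

```python
import json


def sync_task_state(resource, parameters, next_state, error_state):
    """Build a Task state dict in the shape used on the slides."""
    return {
        "Type": "Task",
        "Resource": resource,
        "Parameters": parameters,
        "Catch": [{
            "ErrorEquals": ["States.TaskFailed"],
            "ResultPath": "$.cause",
            "Next": error_state,
        }],
        "ResultPath": "$.jobInfo",
        "Next": next_state,
    }


transform_state = sync_task_state(
    "arn:aws:states:::sagemaker:createTransformJob.sync",
    {
        "TransformJobName.$": "$.transform",  # value read from the state input
        "ModelName.$": "$.model",
        "MaxConcurrentTransforms": 8,
    },
    next_state="Report Success",
    error_state="Notify on Error",
)
print(json.dumps({"Synchronously Run a Transform Job": transform_state}, indent=2))
```

Keys ending in ".$" take their value from the state input at the given path, while plain keys such as MaxConcurrentTransforms are passed as literal values.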
48.
Define workflows in JSON
{
  "StartAt": "Download",
  "States": {
    "Download": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCT:function:download_data",
      "Next": "Train"
    },
    "Train": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
      "ResultPath": "$.training_job",
      "Parameters": {
        "AlgorithmSpecification": {
          "TrainingImage": "811284229777.dkr.ecr.us-east-1.amazonaws.com/image-classification:latest",
          "TrainingInputMode": "File"
        }…
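Hand-written JSON definitions are easy to get subtly wrong, for example a Next that names a state that does not exist. A small pre-deployment check can be sketched in Python as below. The function find_dangling_targets is hypothetical and covers only StartAt, Next, Catch, Choices, and Default; it is not a full Amazon States Language validator.

```python
def find_dangling_targets(definition):
    """Return (source, target) pairs whose target is not a defined state."""
    states = definition["States"]
    targets = [("StartAt", definition["StartAt"])]
    for name, state in states.items():
        if "Next" in state:
            targets.append((name, state["Next"]))
        if "Default" in state:
            targets.append((name, state["Default"]))
        for catcher in state.get("Catch", []):
            targets.append((name, catcher["Next"]))
        for rule in state.get("Choices", []):
            targets.append((name, rule["Next"]))
    return [(src, dst) for src, dst in targets if dst not in states]


definition = {
    "StartAt": "Download",
    "States": {
        "Download": {"Type": "Task", "Next": "Train"},
        # "CreateModel" is deliberately left undefined to show a finding
        "Train": {"Type": "Task", "Next": "CreateModel"},
    },
}
print(find_dangling_targets(definition))  # prints [('Train', 'CreateModel')]
```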
49.
AWS Cloud Development Kit
JavaScript
TypeScript
Java
C#
Define your cloud resources using an imperative programming interface
50.
Work in progress: Define workflows in Python
# Define an AWS Lambda task state
xfer_step = stepfunctions.task(self,
    name='Download',
    resource=lambda_.Function(self,
        name='xfer_recio',
        code=lambda_.Code.file('CodeFile.zip'),
        handler='download_data',
        runtime=lambda_.Runtime.python36,
        timeout=15 * 60
    ),
    result_path='$.training_data',
)
# Define an Amazon SageMaker task state
train_step = stepfunctions.task(self,
    "Train",
    resource='arn:aws:states:::sagemaker:createTrainingJob.sync',
    parameters=dict(
        TrainingJobName='string',
        HyperParameters={
            ...
# Define workflow in Python
sfn_state_machine = (
    xfer_step
    .next(train_step
        .add_catch(training_failure)
    )
    .next(create_model_step)
    .next(transform_step
        .add_catch(transform_failure)
    )
    .next(transform_success)
)
# Create an AWS Step Functions state machine
stepfunctions.StateMachine(self,
    name='ML Workflow',
    definition=sfn_state_machine,
    timeoutSec=30000
)
51.
Amazon
SageMaker
AWS Step
Functions
52.
Related breakouts
Tuesday, November 27
API302 - Serverless State Management & Orchestration for Modern Apps
10:45 AM – 11:45 AM | MGM, Level 1, Grand Ballroom 122
Wednesday, November 28
SRV373 - Building Massively Parallel Event-Driven Architectures
6:15 PM – 7:15 PM | Venetian, Level 3, Murano 3205
Thursday, November 29
AIM403 - Integrate Amazon SageMaker with Apache Spark
4:00 PM – 5:00 PM | Mirage, Grand Ballroom F
53.
Resources
https://aws.amazon.com/machine-learning/
https://aws.amazon.com/modern-apps/
https://github.com/awslabs/aws-cdk