Speed up data preparation for ML pipelines on AWS

© 2021, Amazon Web Services, Inc. or its Affiliates.
Francesco Marelli
Senior Solutions Architect
https://www.linkedin.com/in/marellifrancesco/ - @frankmarelli
Data Science Milan Meetup
21 April 2021

Customers moving from a traditional data warehouse approach
From data silos to a data lake
Data silos:
OLTP, ERP, CRM, LOB: DW Silo 1, Business Intelligence
Devices, Web, Sensors, Social: DW Silo 2, Business Intelligence
Data lake: non-relational databases, machine learning, data warehousing, log analytics, big data processing, relational databases

Lake House architecture on AWS
Scalable data lakes
Purpose-built data services
Seamless data movement
Unified governance
Performant and cost-effective
Services shown: Amazon DynamoDB, Amazon SageMaker, Amazon Redshift, Amazon Elasticsearch Service, Amazon EMR, Amazon Aurora, Amazon Athena, Amazon S3

The AWS analytics portfolio
Data, visualization, engagement, & machine learning: QuickSight, SageMaker, Comprehend, Lex, Polly, Rekognition, Translate, Pinpoint, Data Exchange, + many more
Analytics: Redshift, EMR (Spark & Hadoop), Athena, Elasticsearch Service, Kinesis Data Analytics, AWS Glue (Spark & Python)
Data lake infrastructure & management: S3/Glacier, AWS Glue, Lake Formation
Data movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Streams | Kinesis Data Firehose | Managed Streaming for Apache Kafka

Accelerate your predictive analytics & machine learning journey
Broadest and most complete set of machine learning capabilities
AI SERVICES
VISION: Amazon Rekognition
SPEECH: Amazon Polly, Amazon Transcribe (+Medical)
TEXT: Amazon Comprehend (+Medical), Amazon Translate, Amazon Textract
SEARCH: Amazon Kendra
CHATBOTS: Amazon Lex
PERSONALIZATION: Amazon Personalize
FORECASTING: Amazon Forecast
FRAUD: Amazon Fraud Detector
CONTACT CENTERS: Voice ID for Amazon Connect, Contact Lens
CODE AND DEVOPS: Amazon CodeGuru, Amazon DevOps Guru (NEW)
INDUSTRIAL AI: Amazon Monitron (NEW), AWS Panorama + Appliance (NEW), Amazon Lookout for Vision (NEW), Amazon Lookout for Equipment (NEW)
HEALTH AI: Amazon HealthLake (NEW), Amazon Transcribe for Medical, Amazon Comprehend for Medical
ANOMALY DETECTION: Amazon Lookout for Metrics (NEW)
ML SERVICES
Amazon SageMaker (SAGEMAKER STUDIO IDE): Label data, Aggregate & prepare data (NEW), Store & share features (NEW), Auto ML, Spark/R, Detect bias (NEW), Visualize in notebooks, Pick algorithm, Train models, Tune parameters, Debug & profile (NEW), Deploy in production, Manage & monitor, CI/CD (NEW), Human review
NEW: Model management for edge devices
NEW: SageMaker JumpStart
FRAMEWORKS & INFRASTRUCTURE
DeepGraphLibrary, Deep Learning AMIs & Containers, GPUs & CPUs, Elastic Inference, Trainium, Inferentia, FPGA

Amazon SageMaker: Built to make ML more accessible
SageMaker Studio IDE: Pick algorithm, Visualize in notebooks, Label data, Collect and prepare data, Store features, Check data, Train models, Tune parameters, Deploy in production, Manage and monitor, CI/CD

AWS Glue
Serverless data integration for complex workloads
Serverless
There is no infrastructure to maintain. Allocate the compute power you need and run jobs.
Cost-effective
The all-in-one pricing model includes infrastructure and is 55% cheaper than other cloud data integration options.
Handles complex workloads
Glue connects to hundreds of data sources and processes petabytes of data in real-time, batch, and event-driven modes.
No lock-in
Develop data integration pipelines in open source Spark SQL, PySpark, and Scala.
Data integration for every user
Development environments tailored to different skill sets: visual ETL development for data engineers, notebook-style development for data scientists, and no-code development for data analysts.

How customers use AWS Glue
Prepare data for machine learning
Migrate from expensive traditional ETL solutions to gain flexibility and reduce costs
Process petabytes of data in both batch and real time using Apache Spark
Build data lakes and lake houses for scalable data analysis
Catalog data assets to make them available to AWS analytics services

Glue: data integration platform for building Lake Houses faster
Connect
Sources: Amazon RDS, other databases, on-premises data, streaming data
Connect to data sources using Glue connectors
Catalog
Catalog streaming data in Glue Schema Registry
Catalog structured and semi-structured data in Glue Catalog
Discover schema with Glue Crawlers
Transform
Transform without writing code using Glue DataBrew
Interactively transform data using Dev Endpoints
Visually transform data using Glue Studio
Easily replicate data across the Lake House with Glue Elastic Views
LAKE HOUSE: data lake, NoSQL, data warehouse, log analytics, big data, relational, machine learning, SaaS

Glue is used to modernize on premises ETL tools

Glue is used to prepare raw data for Machine Learning
Sources: logs, app data, Amazon RDS, other databases, on-premises data, streaming data
Flow: AWS Glue ingests raw data; AWS Glue and AWS Glue DataBrew transforms produce cleaned and enriched data, extracted features, and training data
Notebooks: data exploration, experimentation
Glue Catalog

AWS Glue components
Crawlers: load and maintain the data catalog; infer metadata (schema, table structure); support schema evolution
Data Catalog: Apache Hive Metastore compatible; integrated with many analytic services
Extract, Transform, and Load: serverless execution; Apache Spark / Python shell jobs; interactive development; auto-generated ETL code
Workflow Management: orchestrate triggers, crawlers, and jobs; build and monitor complex flows; integrated alerting

Unified Data Catalog with Automated Schema Discovery
Breaking down data silos with a unified metadata catalog for the entire data landscape
Sources: OLTP, ERP, CRM, transactional systems, data warehouse, data lake, devices, web, sensors
Automated schema discovery and management
Structured and semi-structured discovery (Glue Crawlers)
No movement of data = low costs/admin
All metadata centrally available for search and query = productivity
Automate data discovery = productivity
Unify structured and semi-structured data = speed to insight
Data Catalog consumers: machine learning, DW queries, big data processing, interactive, real-time, business intelligence

Glue Workflows
Built-in scheduler to orchestrate jobs
Multiple triggering mechanisms
§ Schedule-based: e.g., time of day
§ Event-based: e.g., job completion
§ On-demand: e.g., AWS Lambda
Easy to access logs and monitor progress
Example workflow: Marketing: ad spend by customer segment; Sales: revenue by customer segment; Central: ROI by customer segment
Triggers shown: event-based (Lambda trigger), schedule (weekly sales), data-based
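The three triggering mechanisms can be sketched as request payloads for the boto3 Glue `create_trigger` call. This is a minimal sketch, not the workflow from the slide: the workflow, trigger, and job names are hypothetical, and the weekly schedule is an assumption. Nothing is sent to AWS unless `create_all` is called with valid credentials.

```python
def scheduled_trigger(workflow, name, cron, job):
    """Schedule-based start trigger (Glue uses a six-field cron expression)."""
    return {
        "Name": name,
        "WorkflowName": workflow,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",
        "Actions": [{"JobName": job}],
        "StartOnCreation": True,
    }


def conditional_trigger(workflow, name, watched_job, next_job):
    """Event-based trigger: run next_job once watched_job has succeeded."""
    return {
        "Name": name,
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": watched_job, "State": "SUCCEEDED"}
            ]
        },
        "Actions": [{"JobName": next_job}],
        "StartOnCreation": True,
    }


def on_demand_trigger(workflow, name, job):
    """On-demand start trigger, fired explicitly (for example from AWS Lambda)."""
    return {
        "Name": name,
        "WorkflowName": workflow,
        "Type": "ON_DEMAND",
        "Actions": [{"JobName": job}],
    }


# Hypothetical names: a weekly sales job followed by an ROI job.
SALES_WORKFLOW = "sales-roi"
TRIGGERS = [
    scheduled_trigger(SALES_WORKFLOW, "weekly-sales", "0 6 ? * MON *", "sales-revenue-by-segment"),
    conditional_trigger(SALES_WORKFLOW, "after-sales", "sales-revenue-by-segment", "roi-by-segment"),
]
# A separate hypothetical workflow that is started on demand.
ON_DEMAND = on_demand_trigger("marketing-adspend", "adspend-start", "adspend-by-segment")


def create_all(triggers):
    """Send the payloads to AWS Glue; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the payloads can be inspected offline

    glue = boto3.client("glue")
    for trigger in triggers:
        glue.create_trigger(**trigger)
```

A Lambda function could start the on-demand workflow with `glue.start_workflow_run(Name="marketing-adspend")`.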

AWS Glue 2.0 Engine

Glue Dynamic Frames

Glue Transforms – Relationalize (example)
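The slide's own example is not preserved in this transcript. As a stand-in, here is a plain-Python approximation of what Relationalize does conceptually: nested structures are flattened into dotted column names and arrays are pivoted into child tables linked by a join key. It does not use the Glue API, and the table and column naming is illustrative only.

```python
from collections import defaultdict


def relationalize(records, root="root"):
    """Flatten nested dicts into dotted columns and pivot lists into child tables."""
    tables = defaultdict(list)
    counter = 0  # generates join keys that link parent rows to child rows

    def flatten(value, prefix, row):
        nonlocal counter
        for key, item in value.items():
            column = f"{prefix}.{key}" if prefix else key
            if isinstance(item, dict):
                flatten(item, column, row)  # nested struct becomes dotted columns
            elif isinstance(item, list):
                counter += 1
                row[column] = counter  # the parent row keeps only the join key
                for index, element in enumerate(item):
                    child = {"id": counter, "index": index}
                    if isinstance(element, dict):
                        flatten(element, "", child)
                    else:
                        child["val"] = element
                    tables[f"{root}_{column}"].append(child)
            else:
                row[column] = item

    for record in records:
        row = {}
        flatten(record, "", row)
        tables[root].append(row)
    return dict(tables)


# Hypothetical nested order record.
orders = [
    {
        "order_id": 1,
        "customer": {"name": "Ada", "city": "Milan"},
        "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
    }
]
tables = relationalize(orders)
```

Here `tables["root"]` holds one flat row per order and `tables["root_items"]` holds one row per array element, joined back through `id`.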

Glue Connectors

Glue – Build Your Own Connector

Glue – Clean and prepare real time and batch data

AWS Glue Studio: New visual ETL interface
MAKES IT EASY TO AUTHOR, RUN, AND MONITOR AWS GLUE ETL JOBS
Author AWS Glue jobs visually without coding
Monitor 1000s of jobs through a single pane of glass
Distributed processing without the learning curve
Advanced transforms through code snippets

Demo
AWS Glue Studio

“Our teams spend too much time on the undifferentiated, repetitive, and mundane tasks associated with data preparation.”

Preparing data involves several complex tasks
Extraction & loading
Cleaning & normalization
Orchestrating at scale
Needs a lot of code-based heavy lifting to work at scale

As much as 80% of time is spent preparing data today
Needs the right tool for the right persona

Challenges with traditional data preparation
Time consuming
Needs the right tools for the right persona that are integrated
Manual
Needs a lot of code-based heavy-lifting for it to work at scale
Siloed
Often requires moving large amounts of data into silos, at times out of VPCs

Clean and normalize data up to 80% faster

Data preparation made easy
Built for data analysts and data scientists
Clean and normalize data: over 250 built-in transformations
Understand data quality: understand patterns and detect anomalies using profiles
Visually map data lineage: understand the steps that the data has been through
Automate at scale: save transformations and apply them to new data as it comes in

Demo
AWS Glue DataBrew

What we saw in the demo
Build a recipe
Profile the data
Run a job
Operationalize at scale
Schedule jobs
Use APIs/SDK
Reuse recipes
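The "operationalize at scale" items can be sketched with the boto3 DataBrew client: one published recipe reused across two datasets, with both jobs attached to a daily schedule. This is a sketch under assumptions. The dataset, recipe, role, and bucket names are hypothetical, and the payloads are only sent to AWS if `deploy` is called with valid credentials.

```python
def recipe_job(dataset, recipe, version, role_arn, bucket):
    """Payload for create_recipe_job: apply an existing recipe to one dataset."""
    return {
        "Name": f"{recipe}-on-{dataset}",
        "DatasetName": dataset,
        "RecipeReference": {"Name": recipe, "RecipeVersion": version},
        "RoleArn": role_arn,
        "Outputs": [
            {"Location": {"Bucket": bucket, "Key": f"clean/{dataset}/"}, "Format": "PARQUET"}
        ],
    }


def daily_schedule(name, jobs, hour_utc):
    """Payload for create_schedule: run all the given jobs once a day."""
    return {
        "Name": name,
        "JobNames": [job["Name"] for job in jobs],
        "CronExpression": f"cron(0 {hour_utc} * * ? *)",
    }


# Hypothetical resources: the same recipe is reused for two datasets.
ROLE = "arn:aws:iam::123456789012:role/databrew-demo"
JOBS = [
    recipe_job("sales-eu", "clean-sales", "1.0", ROLE, "demo-output-bucket"),
    recipe_job("sales-us", "clean-sales", "1.0", ROLE, "demo-output-bucket"),
]
SCHEDULE = daily_schedule("nightly-clean-sales", JOBS, hour_utc=2)


def deploy(jobs, schedule):
    """Create the jobs and the schedule; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the payloads can be inspected offline

    databrew = boto3.client("databrew")
    for job in jobs:
        databrew.create_recipe_job(**job)
    databrew.create_schedule(**schedule)
```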

Popular use cases

One-time data analysis for business reporting
Data sources: Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon RDS, Data Catalog, local file
Flow: data sources, AWS Glue DataBrew, Amazon S3 output bucket, Amazon QuickSight

Set up data quality rules with AWS Lambda
Flow: recurring raw data feed, Amazon S3, AWS Glue DataBrew, Amazon EventBridge, AWS Lambda, Amazon Simple Notification Service, email notification
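A Lambda function for this use case might look like the sketch below. The rule set, the per-column statistics shape, and the event fields (`column_stats`, `topic_arn`) are all hypothetical; a real function would read the DataBrew profile output from Amazon S3 and map it into this shape. Only the Amazon SNS `publish` call is a real API, and it runs only when a rule is violated.

```python
# Hypothetical rules: no missing customer IDs, few missing or negative revenues.
RULES = {
    "customer_id": {"max_missing_ratio": 0.0},
    "revenue": {"max_missing_ratio": 0.05, "min": 0.0},
}


def find_violations(column_stats, rules):
    """Compare per-column statistics against the rules; return readable messages."""
    violations = []
    for column, rule in rules.items():
        stats = column_stats.get(column)
        if stats is None:
            violations.append(f"{column}: column missing from profile")
            continue
        if stats["missing_ratio"] > rule.get("max_missing_ratio", 1.0):
            violations.append(f"{column}: missing ratio {stats['missing_ratio']:.2f} too high")
        if "min" in rule and stats["min"] < rule["min"]:
            violations.append(f"{column}: minimum {stats['min']} below {rule['min']}")
    return violations


def lambda_handler(event, context):
    """Check the statistics in the (hypothetical) event and notify on failure."""
    violations = find_violations(event["column_stats"], RULES)
    if violations:
        import boto3  # only needed when a notification is actually sent

        boto3.client("sns").publish(
            TopicArn=event["topic_arn"],
            Subject="Data quality check failed",
            Message="\n".join(violations),
        )
    return {"passed": not violations, "violations": violations}
```

Keeping the rule evaluation in a pure function makes it testable without AWS access.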

Data preprocessing for machine learning
Flow: Amazon S3, AWS Glue DataBrew, Amazon S3 output bucket, JupyterLab environment, model training, inference

Orchestrating data preparation in workflows
AWS Step Functions workflow: AWS Glue DataBrew, AWS Glue crawler, AWS Glue Data Catalog, Amazon Redshift
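One way to express such a workflow is an Amazon States Language definition that runs a DataBrew job and then starts a Glue crawler to refresh the Data Catalog. The sketch below builds that definition as a Python dictionary. The job and crawler names are hypothetical, and the Amazon Redshift step from the diagram is left out because the slide does not say how it is invoked.

```python
import json


def build_state_machine(databrew_job, crawler):
    """Return a two-step Amazon States Language definition as a dict."""
    return {
        "Comment": "Prepare data with DataBrew, then refresh the Glue Data Catalog",
        "StartAt": "PrepareData",
        "States": {
            "PrepareData": {
                "Type": "Task",
                # Optimized integration: waits for the DataBrew job run to finish.
                "Resource": "arn:aws:states:::databrew:startJobRun.sync",
                "Parameters": {"Name": databrew_job},
                "Next": "RefreshCatalog",
            },
            "RefreshCatalog": {
                "Type": "Task",
                # AWS SDK integration: starts the crawler and returns immediately.
                "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
                "Parameters": {"Name": crawler},
                "End": True,
            },
        },
    }


# Hypothetical resource names.
definition = build_state_machine("clean-sales-job", "clean-sales-crawler")
definition_json = json.dumps(definition, indent=2)
```

The resulting JSON string can be passed as the `definition` argument when creating a state machine, for example with the boto3 Step Functions client's `create_state_machine`.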

AWS Data Wrangler

Demo
AWS Data Wrangler

SageMaker Data Wrangler
The fastest and easiest way to prepare data for machine learning
Quickly select and query data
Support for data from multiple sources
Easily transform data with built-in data transformations
Use built-in data transformations to convert raw data to features for machine learning
Customize data transformations
Complete flexibility to bring your own custom transformations in PySpark, SQL, or Pandas
Understand data visually
Quickly detect outliers or extreme values, all without writing code
Quickly estimate ML model accuracy
Diagnose potential issues in data preparation workflows that could hinder ML model accuracy
Deploy data preparation workflows into production with a single click
Manage all steps of the data preparation workflow through a single visual interface to quickly operationalize workflows into production settings

SageMaker Data Wrangler
Use Cases
Cleanse & Explore Data: use built-in data transformations to accelerate data cleansing and exploration
Visualize & Understand Data: quickly detect outliers or extreme values within a data set without the need to write code
Enrich Data: use built-in data transformation tools to transform data into formats that can be used to build accurate ML models

Quickly select and query data
Select data from Amazon Athena, Amazon Redshift, AWS Lake Formation, Amazon S3, and features from SageMaker Feature Store
Write queries for data sources before importing data into SageMaker Data Wrangler
Import data in various file formats, such as CSV files, Parquet files, and database tables, directly into Amazon SageMaker

Easily transform data
Transform your data without writing a single line of code using over 300 built-in data transformations
Built-in data transformations include convert column type, rename column, and delete column
Author custom transformations in PySpark, SQL, and Pandas

Understand your data visually
Intuitively understand your data with a set of pre-configured visualization templates
Pre-configured visualization templates include histograms, scatter plots, box and whisker plots, line plots, and bar charts
Interactively create and edit your own visualizations so you can quickly detect outliers or extreme values
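For readers who want the arithmetic behind a box and whisker plot, the conventional rule flags values more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule with the Python standard library. It illustrates the general technique only and is not a description of how Data Wrangler computes its visualizations; the sample values are made up.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up order quantities with one extreme value.
quantities = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(quantities))  # prints [95]
```

For this sample the quartiles are 11 and 12.5, so the whisker limits are 8.75 and 14.75 and only 95 falls outside them.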

Quickly estimate model accuracy
Identify inconsistencies in data preparation workflows and diagnose issues before ML models are deployed into production
Select subsets of data to identify errors
Identify which features are contributing to model performance relative to others
Determine if additional feature engineering is needed to improve model performance

Deploy data preparation workflows into production
Export data preparation workflows to a notebook or Python code
Integrate your workflow with SageMaker Pipelines to automate model deployment and management
Publish created features to SageMaker Feature Store for reuse and syndication across teams and projects

SageMaker Data Wrangler pricing and availability
Generally available
Priced per instance usage
Available in all regions where SageMaker Studio is available

Thank you
Francesco Marelli
Senior Solutions Architect
https://www.linkedin.com/in/marellifrancesco/ - @frankmarelli
