© 2021, Amazon Web Services, Inc. or its Affiliates.
Speed up data preparation for ML
pipelines on AWS
Francesco Marelli
Senior Solutions Architect
https://www.linkedin.com/in/marellifrancesco/ - @frankmarelli
Data Science Milan Meetup
21 April 2021
Customers moving from traditional data warehouse approach
From data silos:
DW Silo 1: OLTP, ERP, CRM, LOB sources feeding Business Intelligence
DW Silo 2: Devices, Web, Sensors, Social sources feeding Business Intelligence
To a data lake serving: non-relational databases, machine learning, data warehousing, log analytics, big data processing, relational databases
Lake House architecture on AWS
Scalable data lakes
Purpose-built data services
Seamless data movement
Unified governance
Performant and cost-effective
Services shown: Amazon S3, Amazon Athena, Amazon Aurora, Amazon DynamoDB, Amazon EMR, Amazon Elasticsearch Service, Amazon Redshift, Amazon SageMaker
The AWS analytics portfolio
Data, visualization, engagement, & machine learning: QuickSight, SageMaker, Comprehend, Lex, Polly, Rekognition, Translate, Pinpoint, Data Exchange, + many more
Analytics: Redshift, EMR (Spark & Hadoop), Athena, Elasticsearch Service, Kinesis Data Analytics, AWS Glue (Spark & Python)
Data lake infrastructure & management: S3/Glacier, AWS Glue, Lake Formation
Data movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Streams | Kinesis Data Firehose | Managed Streaming for Apache Kafka
Accelerate your predictive analytics & machine learning journey
Broadest and most complete set of machine learning capabilities
AI SERVICES
Categories: Vision, Speech, Text, Search, Chatbots, Personalization, Forecasting, Fraud, Contact centers, Code and DevOps, Industrial AI, Anomaly detection, Health AI
Services: Amazon Rekognition, Amazon Polly, Amazon Transcribe (+Medical), Amazon Lex, Amazon Personalize, Amazon Forecast, Amazon Comprehend (+Medical), Amazon Textract, Amazon Kendra, Amazon CodeGuru, Amazon Fraud Detector, Amazon Translate, Amazon DevOps Guru (NEW), Voice ID and Contact Lens for Amazon Connect, Amazon Monitron (NEW), AWS Panorama + Appliance (NEW), Amazon Lookout for Vision (NEW), Amazon Lookout for Equipment (NEW), Amazon HealthLake (NEW), Amazon Lookout for Metrics (NEW), Amazon Transcribe for Medical, Amazon Comprehend for Medical
ML SERVICES
Amazon SageMaker (SageMaker Studio IDE): Label data; Aggregate & prepare data (NEW); Store & share features (NEW); Auto ML; Spark/R; Detect bias (NEW); Visualize in notebooks; Pick algorithm; Train models; Tune parameters; Debug & profile (NEW); Deploy in production; Manage & monitor; CI/CD (NEW); Human review
NEW: Model management for edge devices
NEW: SageMaker JumpStart
FRAMEWORKS & INFRASTRUCTURE
Deep Learning AMIs & Containers, GPUs & CPUs, Elastic Inference, Trainium, Inferentia, FPGA, DeepGraphLibrary
Amazon SageMaker: Built to make ML more accessible
SageMaker Studio IDE
Label data
Collect and prepare data
Store features
Check data
Pick algorithm
Visualize in notebooks
Train models
Tune parameters
Deploy in production
Manage and monitor
CI/CD
AWS Glue: Serverless Data Integration for Complex Workloads
Serverless: There is no infrastructure to maintain. Allocate the compute power you need and run jobs.
Cost-effective: All-in-one pricing model includes infrastructure and is 55% cheaper than other cloud data integration options.
Handles complex workloads: Glue connects to hundreds of data sources and processes petabytes of data in real-time, batch, and event-driven modes.
No lock-in: Develop data integration pipelines in open source SparkSQL, PySpark, and Scala.
Data integration for every user: Development environments cater to different skill sets, with visual ETL development for data engineers, notebook-style development for data scientists, and no-code development for data analysts.
How customers use AWS Glue
Prepare data for machine learning
Migrate from expensive traditional ETL solutions to gain flexibility and reduce costs
Process petabytes of data in both batch and real time using Apache Spark
Build data lakes and Lake Houses for scalable data analysis
Catalog data assets to make them available to AWS analytics services
Glue: data integration platform for building Lake Houses faster
Connect: connect to data sources (Amazon RDS, other databases, on-premises data, streaming data) using Glue connectors
Catalog: discover schemas with Glue crawlers; catalog structured and semi-structured data in the Glue Data Catalog; catalog streaming data in the Glue Schema Registry
Transform: visually transform data using Glue Studio; transform without writing code using Glue DataBrew; interactively transform data using dev endpoints; easily replicate data across the Lake House with Glue Elastic Views
Lake House targets: data lake, NoSQL, data warehouse, log analytics, big data, relational, machine learning, SaaS
Glue is used to modernize on-premises ETL tools
Glue is used to prepare raw data for Machine Learning
Sources: logs and app data, Amazon RDS, other databases, on-premises data, streaming data
Pipeline (diagram): AWS Glue ingests the sources as raw data; transform steps run in AWS Glue and AWS Glue DataBrew; the data stages shown are raw data, cleaned and enriched data, extracted features, and training data
Notebooks: data exploration, experimentation
Glue Catalog
AWS Glue components
Crawlers: load and maintain the data catalog; infer metadata (schema, table structure); support schema evolution
Data Catalog: Apache Hive Metastore compatible; integrated with many analytic services
Extract, Transform, and Load: serverless execution; Apache Spark / Python shell jobs; interactive development; auto-generated ETL code
Workflow Management: orchestrate triggers, crawlers, and jobs; build and monitor complex flows; integrated alerting
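To make the crawler capabilities listed for AWS Glue components more concrete (inferring metadata and supporting schema evolution), here is a minimal pure-Python sketch. It is not how Glue crawlers are implemented; `type_name` and `infer_schema` are hypothetical helpers, and the type names only loosely mirror catalog-style types. The idea shown is that the schema is the union of fields seen so far, and a field's type is widened when later data disagrees with earlier data.

```python
from typing import Any, Dict, Iterable


def type_name(value: Any) -> str:
    """Map a Python value to a coarse, catalog-style type name (illustrative)."""
    if isinstance(value, bool):  # bool must be checked before int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        return "struct"
    if isinstance(value, list):
        return "array"
    return "string"


def infer_schema(records: Iterable[Dict[str, Any]]) -> Dict[str, str]:
    """Union the fields seen across records and widen conflicting types."""
    schema: Dict[str, str] = {}
    for record in records:
        for field, value in record.items():
            if value is None:
                continue  # a null carries no type information
            seen = type_name(value)
            previous = schema.get(field)
            if previous is None or previous == seen:
                schema[field] = seen
            elif {previous, seen} == {"bigint", "double"}:
                schema[field] = "double"  # numeric widening
            else:
                schema[field] = "string"  # fall back to the widest type
    return schema


# Hypothetical data: the second batch adds a field and changes a numeric type.
day1 = [{"id": 1, "price": 10}, {"id": 2, "price": 12.5}]
day2 = [{"id": 3, "price": 9.0, "tags": ["a"]}]
schema = infer_schema(day1 + day2)
```

Re-running the inference as new files arrive is the simplest way to picture schema evolution: new fields appear in the result without any manual table definition.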
Unified Data Catalog with Automated Schema Discovery
Breaking down data silos with a unified metadata catalog for the entire data landscape
Sources: transactional systems (OLTP, ERP, CRM), data warehouse, data lake, devices, web, sensors
Automated schema discovery and management: structured and semi-structured discovery (Glue Crawlers)
No movement of data = Low Costs/Admin
All metadata centrally available for search and query = Productivity
Automate data discovery = Productivity
Unify structured, semi-structured data = Speed to Insight
Consumers of the Data Catalog: machine learning, DW queries, big data processing, interactive, real-time, business intelligence
Glue Workflows
Inbuilt Scheduler to orchestrate jobs
Multiple triggering mechanisms
§ Schedule-based: e.g., time of day
§ Event-based: e.g., job completion
§ On-demand: e.g., AWS Lambda
Easy to access logs and monitor progress
Example workflow (diagram): Marketing (ad-spend by customer segment), Sales (revenue by customer segment), and Central (ROI by customer segment) jobs, with trigger labels Event Based, Lambda Trigger, Schedule (weekly sales), and Data based
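The three triggering mechanisms on the Glue Workflows slide can be sketched as request payloads for the boto3 `glue.create_trigger` call. This is a sketch under assumptions: the workflow name, trigger names, job names, and cron expression are hypothetical, the helper functions are not part of any AWS library, and the actual API call is left in comments because it needs AWS credentials plus an existing workflow and jobs.

```python
from typing import Any, Dict, List, Optional

WORKFLOW = "roi-by-customer-segment"  # hypothetical workflow name
SALES_JOB = "sales-revenue-by-segment"  # hypothetical job names
MARKETING_JOB = "marketing-ad-spend-by-segment"
CENTRAL_JOB = "central-roi-by-segment"


def _base(name: str, kind: str, jobs: List[str],
          workflow: Optional[str]) -> Dict[str, Any]:
    """Fields shared by every trigger payload."""
    payload: Dict[str, Any] = {
        "Name": name,
        "Type": kind,
        "Actions": [{"JobName": job} for job in jobs],
    }
    if workflow is not None:
        payload["WorkflowName"] = workflow
    return payload


def scheduled_trigger(name: str, cron: str, jobs: List[str],
                      workflow: Optional[str] = None) -> Dict[str, Any]:
    """Schedule-based: fire at a time of day or week."""
    return {**_base(name, "SCHEDULED", jobs, workflow),
            "Schedule": cron, "StartOnCreation": True}


def conditional_trigger(name: str, upstream: List[str], jobs: List[str],
                        workflow: Optional[str] = None) -> Dict[str, Any]:
    """Event-based: fire once all upstream jobs have succeeded."""
    conditions = [
        {"LogicalOperator": "EQUALS", "JobName": job, "State": "SUCCEEDED"}
        for job in upstream
    ]
    return {**_base(name, "CONDITIONAL", jobs, workflow),
            "Predicate": {"Logical": "AND", "Conditions": conditions},
            "StartOnCreation": True}


def on_demand_trigger(name: str, jobs: List[str],
                      workflow: Optional[str] = None) -> Dict[str, Any]:
    """On-demand: started explicitly, for example from an AWS Lambda function."""
    return _base(name, "ON_DEMAND", jobs, workflow)


triggers = [
    scheduled_trigger("weekly-start", "cron(0 6 ? * MON *)",
                      [SALES_JOB, MARKETING_JOB], WORKFLOW),
    conditional_trigger("after-sales-and-marketing",
                        [SALES_JOB, MARKETING_JOB], [CENTRAL_JOB], WORKFLOW),
    on_demand_trigger("adhoc-roi-refresh", [CENTRAL_JOB]),
]

# To register the triggers (requires credentials and existing resources):
#   import boto3
#   glue = boto3.client("glue")
#   for trigger in triggers:
#       glue.create_trigger(**trigger)
```

Keeping the payloads as plain dictionaries makes them easy to review or unit test before anything is created in an account.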
AWS Glue 2.0 Engine
Glue Dynamic Frames
Glue Transforms – Relationalize (example)
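The slide's example was an image and did not survive extraction. As a stand-in, the following pure-Python sketch illustrates the general idea behind a relationalize-style transform: nested structs are flattened into dotted column names, and arrays are moved into a child table linked back to the parent by a generated id. This is an illustration only, not Glue's implementation; the `relationalize` function here is hypothetical, and its table and column naming may differ from what the Glue transform actually produces.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def relationalize(records: List[Row], root: str = "root") -> Dict[str, List[Row]]:
    """Flatten nested records into a set of flat tables keyed by table name."""
    tables: Dict[str, List[Row]] = {root: []}
    next_id = 0

    def flatten(prefix: str, value: Any, row: Row) -> None:
        nonlocal next_id
        if isinstance(value, dict):
            # Structs become dotted column names on the same row.
            for key, child in value.items():
                flatten(f"{prefix}.{key}" if prefix else key, child, row)
        elif isinstance(value, list):
            # Arrays become rows in a child table; the parent keeps a join id.
            next_id += 1
            list_id = next_id
            row[prefix] = list_id
            child_rows = tables.setdefault(f"{root}_{prefix}", [])
            for index, item in enumerate(value):
                child_row: Row = {"id": list_id, "index": index}
                flatten(prefix, item, child_row)
                child_rows.append(child_row)
        else:
            row[prefix] = value

    for record in records:
        row: Row = {}
        flatten("", record, row)
        tables[root].append(row)
    return tables


# Hypothetical nested order record.
orders = [{
    "id": 1,
    "customer": {"name": "Ann", "city": "Milan"},
    "items": [{"sku": "a", "qty": 2}, {"sku": "b", "qty": 1}],
}]
tables = relationalize(orders, root="orders")
```

The result holds one `orders` table with flat columns and one `orders_items` table whose rows join back to the parent through the `id` column, which is the shape relational targets such as a data warehouse expect.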
Glue Connectors
Glue – Build Your Own Connector
Glue – Clean and prepare real time and batch data
AWS Glue Studio: New visual ETL interface
Makes it easy to author, run, and monitor AWS Glue ETL jobs
Author AWS Glue jobs visually without coding
Monitor 1000s of jobs through a single pane of glass
Distributed processing without the learning curve
Advanced transforms through code snippets
Demo
AWS Glue Studio
“Our teams spend too much time on the
undifferentiated, repetitive, and
mundane tasks associated with data
preparation.”
Preparing data involves several complex tasks
Extraction & loading
Cleaning & normalization
Orchestrating at scale
Needs a lot of code-based heavy lifting to work at scale
As much as 80% of time is spent preparing data today
Needs the right tool for the right persona
Challenges with traditional data preparation
Time consuming
Needs the right tools for the right persona that are integrated
Manual
Needs a lot of code-based heavy-lifting for it to work at scale
Siloed
Often requires moving large amounts of data into silos, at times out of VPCs
Clean and normalize data up to 80% faster
Data preparation made easy
Built for data analysts and data scientists
Clean and normalize data: over 250 built-in transformations
Understand data quality: understand patterns and detect anomalies using profiles
Visually map data lineage: understand the steps that the data has been through
Automate at scale: save transformations and apply them to new data as it comes in
Demo
AWS Glue DataBrew
What we saw in the demo
Build a recipe, profile the data, run a job
Operationalize at scale: schedule jobs, use APIs/SDK, reuse recipes
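The "reuse recipes" idea from the demo recap can be pictured as an ordered list of steps that is saved once and replayed on every new batch of data. The sketch below is a toy model in plain Python, not the DataBrew engine: the step names (`rename_column`, `drop_missing`, `upper_case`), the recipe layout, and the sample rows are all hypothetical and do not follow DataBrew's recipe action reference.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
Step = Dict[str, Any]


def apply_step(rows: List[Row], step: Step) -> List[Row]:
    """Apply one recipe step to every row and return the new rows."""
    operation, params = step["operation"], step["parameters"]
    if operation == "rename_column":
        source, target = params["source"], params["target"]
        return [{(target if key == source else key): value
                 for key, value in row.items()} for row in rows]
    if operation == "drop_missing":
        column = params["column"]
        return [row for row in rows if row.get(column) not in (None, "")]
    if operation == "upper_case":
        column = params["column"]
        return [{**row, column: row[column].upper()}
                if isinstance(row.get(column), str) else dict(row)
                for row in rows]
    raise ValueError(f"unknown operation: {operation}")


def run_recipe(recipe: List[Step], rows: List[Row]) -> List[Row]:
    """Replay the saved steps, in order, on a batch of rows."""
    for step in recipe:
        rows = apply_step(rows, step)
    return rows


# A recipe is defined once...
recipe: List[Step] = [
    {"operation": "rename_column",
     "parameters": {"source": "segment_name", "target": "segment"}},
    {"operation": "drop_missing", "parameters": {"column": "segment"}},
    {"operation": "upper_case", "parameters": {"column": "segment"}},
]

# ...and reused on each new batch as it arrives.
monday = [{"customer": "c1", "segment_name": "retail"},
          {"customer": "c2", "segment_name": ""}]
tuesday = [{"customer": "c3", "segment_name": "smb"}]
clean_monday = run_recipe(recipe, monday)
clean_tuesday = run_recipe(recipe, tuesday)
```

In DataBrew itself the equivalent automation would go through the console, a schedule, or the APIs/SDK mentioned on the slide rather than hand-written code like this.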
Popular use cases
One-time data analysis for business reporting
Data sources: Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon RDS, Data Catalog, local file
Flow (diagram): data sources, AWS Glue DataBrew, Amazon S3 output bucket, Amazon QuickSight
Set up data quality rules with AWS Lambda
Components (diagram): recurring raw data feed, Amazon S3, AWS Glue DataBrew, AWS Lambda, Amazon EventBridge, Amazon Simple Notification Service, email notification
Data preprocessing for machine learning
Components (diagram): Amazon S3, AWS Glue DataBrew, Amazon S3 output bucket, JupyterLab environment, model training, inference
Orchestrating data preparation in workflows
Components (diagram): AWS Step Functions workflow, AWS Glue DataBrew, AWS Glue crawler, AWS Glue Data Catalog, Amazon Redshift
AWS Data Wrangler
Demo
AWS Data Wrangler
SageMaker Data Wrangler
The fastest and easiest way to prepare data for machine learning
Quickly select and query data: support for data from multiple sources
Easily transform data with built-in data transformations: use built-in data transformations to convert raw data to features for machine learning
Customize data transformations: complete flexibility to bring your own custom transformations in PySpark, SQL, or Pandas
Understand data visually: quickly detect outliers or extreme values, all without writing code
Quickly estimate ML model accuracy: diagnose potential issues in data preparation workflows that could hinder ML model accuracy
Deploy data preparation workflows into production with a single click: manage all steps of the data preparation workflow through a single visual interface to quickly operationalize workflows into production settings
SageMaker Data Wrangler Use Cases
Cleanse & Explore Data: use built-in data transformations to accelerate data cleansing and exploration
Visualize & Understand Data: quickly detect outliers or extreme values within a data set without the need to write code
Enrich Data: use built-in data transformation tools to transform data into formats that can be used to build accurate ML models
Quickly select and query data
Select data from Amazon Athena,
Amazon Redshift, AWS Lake
Formation, Amazon S3, and features
from SageMaker Feature Store
Write queries for data sources before
importing data over to SageMaker
Data Wrangler
Import data in various file formats,
such as CSV files, Parquet files, and
database tables directly into Amazon
SageMaker
Easily transform data
Transform your data without writing a
single line of code using over 300 built-in
data transformations
Built-in data transformations include
convert column type, rename column, and
delete column
Author custom transformations in
PySpark, SQL, and Pandas
Understand your data visually
Intuitively understand your data with a set
of pre-configured visualization templates
Pre-configured visualization templates
include histograms, scatter plots, box and
whisker plots, line plots, and bar charts
Interactively create and edit your own
visualizations so you can quickly detect
outliers or extreme values
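For readers who want to see what "detecting outliers" with a box and whisker plot amounts to, here is a minimal standard-library sketch of the usual 1.5 x IQR rule that such plots draw as whiskers. The slides do not say which method Data Wrangler uses internally, so treat this as an assumption-labeled illustration; `iqr_fences`, `find_outliers`, and the sample values are hypothetical.

```python
import statistics
from typing import List, Tuple


def iqr_fences(values: List[float], k: float = 1.5) -> Tuple[float, float]:
    """Return the lower and upper fences used by a box and whisker plot."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Return the values that fall outside the fences."""
    low, high = iqr_fences(values, k)
    return [value for value in values if value < low or value > high]


# Hypothetical column of order amounts with one extreme value.
amounts = [10, 12, 11, 13, 12, 11, 95]
outliers = find_outliers(amounts)
```

A larger `k` widens the fences and flags fewer points, which is the main knob to adjust when the data is naturally skewed.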
Quickly estimate model accuracy
Identify inconsistencies in data
preparation workflows and diagnose
issues before ML models are deployed
into production
Select subsets of data to identify errors
Identify which features are contributing
to model performance relative to others
Determine if additional feature
engineering is needed to improve model
performance
Deploy data preparation workflows into production
Export data preparation workflows to a
notebook or Python code
Integrate your workflow with
SageMaker Pipelines to automate
model deployment and management
Publish created features to SageMaker
Feature Store for reuse and syndication
across teams and projects
SageMaker Data Wrangler pricing and availability
Generally Available
Priced per instance
usage
Available in all regions
where SageMaker
Studio is available
Thank you
Francesco Marelli
Senior Solutions Architect
https://www.linkedin.com/in/marellifrancesco/ - @frankmarelli
