© 2021, Amazon Web Services, Inc. or its Affiliates.
Speed up data preparation for ML
pipelines on AWS
Francesco Marelli
Senior Solutions Architect
https://www.linkedin.com/in/marellifrancesco/ - @frankmarelli
Data Science Milan Meetup
21 April 2021
Customers moving from traditional data
warehouse approach
Data silos to
OLTP ERP CRM LOB
DW Silo 1
Business
Intelligence
Devices Web Sensors Social
DW Silo 2
Business
Intelligence
Data Lake
Non-relational databases
Machine
learning
Data
warehousing
Log
analytics
Big data
processing
Relational
databases
Lake House architecture on AWS
Scalable data lakes
Purpose-built
data services
Seamless
Data movement
Unified governance
Performant and
cost-effective
Amazon
DynamoDB
Amazon
SageMaker
Amazon
Redshift
Amazon
Elasticsearch
Service
Amazon
EMR
Amazon
Aurora
Amazon
Athena
Amazon
S3
The AWS analytics portfolio
Data movement
Analytics
Data lake infrastructure & management
Data, visualization, engagement, & machine learning
+ many more
Redshift
EMR (Spark &
Hadoop)
Athena
Elasticsearch
Service
Kinesis Data
Analytics
AWS Glue
(Spark &
Python)
S3/Glacier AWS Glue
Lake
Formation
QuickSight SageMaker Comprehend Lex Polly Rekognition Translate
Database Migration Service | Snowball | Snowmobile | Kinesis Data Streams | Kinesis Data Firehose | Managed Streaming for Apache Kafka
Pinpoint
Data
Exchange
Accelerate your predictive analytics & machine learning journey
Broadest and most complete set of machine learning capabilities
Amazon
SageMaker
VISION SPEECH TEXT SEARCH CHATBOTS PERSONALIZATION FORECASTING FRAUD CONTACT CENTERS
Deep
Learning
AMIs &
Containers
GPUs &
CPUs
Elastic
Inference
Trainium Inferentia FPGA
AI SERVICES
ML SERVICES
FRAMEWORKS & INFRASTRUCTURE
DeepGraphLibrary
Amazon
Rekognition
Amazon
Polly
Amazon
Transcribe
+Medical
Amazon
Lex
Amazon
Personalize
Amazon
Forecast
Amazon
Comprehend
+Medical
Amazon
Textract
Amazon
Kendra
Amazon
CodeGuru
Amazon
Fraud Detector
Amazon
Translate
INDUSTRIAL AI CODE AND DEVOPS
NEW
Amazon
DevOps Guru
Voice ID
For Amazon Connect
Contact Lens
NEW
Amazon
Monitron
NEW
AWS Panorama
+ Appliance
NEW
Amazon Lookout
for Vision
NEW
Amazon Lookout
for Equipment
NEW
Amazon
HealthLake
HEALTH AI
NEW
Amazon Lookout
for Metrics
ANOMALY DETECTION
Amazon
Transcribe
for Medical
Amazon
Comprehend
for Medical
Label
data
NEW
Aggregate &
prepare data
NEW
Store & share
features
Auto ML Spark/R
NEW
Detect
bias
Visualize in
notebooks
Pick
algorithm
Train
models
Tune
parameters
NEW
Debug &
profile
Deploy in
production
Manage
& monitor
NEW
CI/CD
Human
review
NEW: Model management for edge devices
NEW: SageMaker JumpStart
SAGEMAKER STUDIO IDE
Amazon SageMaker: Built to make ML more accessible
Pick
algorithm
Visualize in
notebooks
Label
data
Collect and
prepare data
Store
features
Check
data
Train
models
Tune
parameters
Deploy in
production
Manage
and monitor
CI/CD
SageMaker Studio IDE
AWS Glue
Serverless Data
Integration for
Complex Workloads
Serverless
There is no infrastructure to maintain. Allocate needed compute power and run
jobs.
Cost-effective
All-in-one pricing model includes infrastructure and is 55% cheaper than other
cloud data integration options
Handles complex workloads
Glue connects to hundreds of data sources and processes petabytes of data in real-time, batch and event-driven modes
No lock-in
Develop data integration pipelines in open source SparkSQL, PySpark and
Scala
Data Integration for every user
Development environments tailored to different skill sets: visual ETL development for
Data Engineers, notebook-style development for Data Scientists and no-code
development for Data Analysts
How customers use
AWS Glue
Prepare data for Machine Learning
Migrate from expensive traditional ETL solutions
to gain flexibility and reduce costs
Process petabytes of data both in batch and real-time
using Apache Spark
Build Data Lakes and Lake Houses for scalable
data analysis
Catalog data assets to make them available to
AWS Analytics services
Glue: data integration platform for building Lake Houses faster
Connect
Connect to data sources using Glue connectors: Amazon RDS, other databases, on-premises data, streaming data
Catalog
Catalog streaming data in the Glue Schema Registry
Catalog structured and semi-structured data in the Glue Data Catalog
Discover schemas with Glue crawlers
Transform
Transform without writing code using Glue DataBrew
Interactively transform data using dev endpoints
Visually transform data using Glue Studio
Easily replicate data across the lake house with Glue Elastic Views
LAKE HOUSE
Data lake
NoSQL
Data Warehouse
Log Analytics
Big Data
Relational
Machine
Learning
SaaS
Glue is used to modernize on-premises ETL tools
Glue is used to prepare raw data for Machine Learning
Logs, app data
Amazon RDS
Other databases
On-premises data
Streaming data
AWS Glue
ingest
cleaned and
enriched data
extracted
features
training
data
Notebooks:
data exploration,
experimentation
raw data
AWS Glue
DataBrew
transform
AWS Glue
DataBrew
transform
AWS Glue
transform
Glue Catalog
AWS Glue components
Crawlers
load and maintain the data catalog
infer metadata: schema, table structure
support schema evolution
Data Catalog
Apache Hive Metastore compatible
many integrated analytic services
Extract, Transform, and Load
serverless execution
Apache Spark / Python shell jobs
interactive development
auto-generate ETL code
Workflow Management
orchestrate triggers, crawlers and jobs
build and monitor complex flows
integrated alerting
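To make the crawler component a little more concrete, here is a minimal sketch that builds the arguments for the boto3 Glue `create_crawler` call. The crawler name, role ARN, database name and S3 path are hypothetical placeholders, not values from the talk, and nothing is sent to AWS unless `create_and_start` is called in an environment with boto3 and valid credentials.

```python
def crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for glue.create_crawler (all names are placeholders)."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Schema evolution: update catalog tables when the crawler sees changes,
        # and only log (rather than delete) tables whose data has disappeared.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }
    if schedule is not None:
        request["Schedule"] = schedule  # e.g. "cron(0 2 * * ? *)"
    return request


def create_and_start(request):
    """Send the request to AWS Glue; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical example values, for illustration only.
example = crawler_request(
    name="raw-sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="raw_zone",
    s3_path="s3://example-bucket/raw/sales/",
)
```

Keeping the request builder separate from the AWS call makes the configuration easy to review and unit test before a crawler is actually created.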
Unified Data Catalog with Automated Schema Discovery
Breaking down data silos with a unified metadata catalog for the entire data landscape
OLTP ERP CRM
Data Warehouse
Data Lake
Devices Web Sensors
Automated Schema discovery and management
Transactional
systems
Structured and Semi-Structured
discovery (Glue Crawlers)
No movement of data = Low
Costs/Admin
All metadata centrally available for
search and query = Productivity
Automate data discovery = Productivity
Unify structured, semi-structured data
= Speed to Insight
Machine Learning
DW
Queries
Big data
processing
Interactive Real-time Business
Intelligence
Data Catalog
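As an illustration of metadata being centrally available for search, the sketch below summarizes the response of the boto3 Glue `get_tables` call into a simple table-to-columns mapping. The database name and the sample response are made up for the example; `list_catalog` only works with boto3 installed and AWS credentials configured.

```python
def summarize_tables(get_tables_response):
    """Map each table name to its (column name, column type) pairs."""
    summary = {}
    for table in get_tables_response.get("TableList", []):
        columns = table.get("StorageDescriptor", {}).get("Columns", [])
        summary[table["Name"]] = [(c["Name"], c.get("Type", "")) for c in columns]
    return summary


def list_catalog(database):
    """Collect the schema of every table in one Data Catalog database."""
    import boto3  # imported lazily; needs AWS credentials at call time

    glue = boto3.client("glue")
    merged = {}
    for page in glue.get_paginator("get_tables").paginate(DatabaseName=database):
        merged.update(summarize_tables(page))
    return merged


# Hand-written response fragment (hypothetical table), shaped like get_tables output.
sample_response = {
    "TableList": [
        {
            "Name": "sales",
            "StorageDescriptor": {
                "Columns": [
                    {"Name": "order_id", "Type": "bigint"},
                    {"Name": "amount", "Type": "double"},
                ]
            },
        }
    ]
}
catalog_view = summarize_tables(sample_response)
```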
Glue Workflows
Inbuilt Scheduler to orchestrate jobs
Multiple triggering mechanisms
- Schedule-based: e.g., time of day
- Event-based: e.g., job completion
- On-demand: e.g., AWS Lambda
Easy to access logs and monitor
progress
Marketing: Ad-spend by
customer segment
Event Based
Lambda Trigger
Sales: Revenue by
customer segment
Schedule
Data
based
Central: ROI by
customer
segment
Weekly
sales
Data
based
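The three triggering mechanisms on this slide correspond to trigger types accepted by the boto3 Glue `create_trigger` call. The following is a minimal sketch that only builds the request dictionaries; the workflow and job names are hypothetical, loosely modelled on the marketing, sales and central jobs in the diagram.

```python
def scheduled_trigger(workflow, job, cron):
    """Schedule-based: start a job at a fixed time (cron expression)."""
    return {
        "Name": f"{workflow}-schedule",
        "WorkflowName": workflow,
        "Type": "SCHEDULED",
        "Schedule": cron,
        "Actions": [{"JobName": job}],
        "StartOnCreation": True,
    }


def conditional_trigger(workflow, upstream_job, downstream_job):
    """Event-based: start a job once another job has succeeded."""
    return {
        "Name": f"{workflow}-after-{upstream_job}",
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": upstream_job, "State": "SUCCEEDED"}
            ]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


def on_demand_trigger(workflow, job):
    """On-demand: fires when the workflow is started explicitly, e.g. from AWS Lambda."""
    return {
        "Name": f"{workflow}-on-demand",
        "WorkflowName": workflow,
        "Type": "ON_DEMAND",
        "Actions": [{"JobName": job}],
    }


# Hypothetical workflow and job names.
scheduled = scheduled_trigger("segment-roi", "sales-revenue-by-segment", "cron(0 6 ? * MON *)")
conditional = conditional_trigger("segment-roi", "sales-revenue-by-segment", "central-roi-by-segment")
on_demand = on_demand_trigger("segment-roi", "marketing-ad-spend-by-segment")
```

Each dictionary could be passed as `glue.create_trigger(**request)`, and an on-demand workflow run could be started with `glue.start_workflow_run(Name="segment-roi")`, for example from a Lambda function.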
AWS Glue 2.0 Engine
Glue Dynamic Frames
Glue Transforms – Relationalize (example)
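The slide's own example did not survive extraction. As a stand-in, the sketch below illustrates the idea behind Relationalize in plain Python: nested structures are flattened into dotted column names, and arrays are moved into child tables linked back to the parent row by a generated key. This is not the Glue implementation; Glue's transform operates on DynamicFrames, and its exact table and column naming may differ. The `orders` data is invented for the illustration.

```python
from itertools import count


def relationalize(records, root="root"):
    """Flatten nested dicts and move lists into child tables linked by an id."""
    tables = {root: []}
    next_id = count(1)  # join keys linking parent rows to child-table rows

    def add_row(table, record, base):
        row = dict(base)
        stack = [("", record)]
        while stack:
            prefix, struct = stack.pop()
            for key, value in struct.items():
                name = f"{prefix}{key}"
                if isinstance(value, dict):
                    # Nested struct: revisit its fields with a dotted prefix.
                    stack.append((f"{name}.", value))
                elif isinstance(value, list):
                    # Array: store a join key here and pivot items into a child table.
                    link = next(next_id)
                    row[name] = link
                    child = f"{table}_{name}"
                    tables.setdefault(child, [])
                    for index, item in enumerate(value):
                        item = item if isinstance(item, dict) else {"val": item}
                        add_row(child, item, {"id": link, "index": index})
                else:
                    row[name] = value
        tables[table].append(row)

    for record in records:
        add_row(root, record, {})
    return tables


# Invented sample record with one nested struct and one array.
orders = [
    {
        "order_id": 1,
        "customer": {"name": "Ada", "address": {"city": "Milan"}},
        "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
    }
]
tables = relationalize(orders)
```

The result contains a flat `root` table and a `root_items` table whose `id` column matches the `items` value in the parent row, which is the shape a relational database or warehouse can load directly.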
Glue Connectors
Glue – Build Your Own Connector
Glue – Clean and prepare real time and batch data
AWS Glue Studio: New visual ETL interface
MAKES IT EASY TO AUTHOR, RUN, AND MONITOR AWS GLUE ETL JOBS
Author AWS Glue jobs visually without coding
Monitor 1000s of jobs through a single pane of glass
Distributed processing without the learning curve
Advanced transforms through code snippets
Demo
AWS Glue Studio
“Our teams spend too much time on the
undifferentiated, repetitive, and
mundane tasks associated with data
preparation.”
Extraction & loading
Cleaning &
normalization
Orchestrating
at scale
Preparing data involves several complex tasks
Needs a lot of code-based heavy-lifting to work at scale
As much as 80% of time is spent preparing data today
Needs the right tool for the right persona
Challenges with traditional data preparation
Time consuming
Needs the right tools for the right persona that are integrated
Manual
Needs a lot of code-based heavy-lifting for it to work at scale
Siloed
Often requires moving large amounts of data into silos, at times out of VPCs
Clean and normalize data up to 80% faster
Built for data analysts and data scientists
Clean and normalize data
Over 250 built-in transformations
Understand data quality
Understand patterns and detect anomalies using profiles
Visually map data lineage
Understand the steps that the data has been through
Automate at scale
Save transformations and apply them to new data as it comes in
Data preparation made easy
Demo
AWS Glue DataBrew
What we saw in the demo
Build a recipe
Profile the data
Run a job
Operationalize at scale
Schedule jobs
Use APIs/SDK
Reuse recipes
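For the "use APIs/SDK" part of this recap, here is a minimal sketch of how a recipe, a recipe job and a schedule might be set up through the boto3 `databrew` client. The recipe, dataset, job, role and bucket names are hypothetical, and the operation names (`RENAME`, `LOWER_CASE`) and their parameters are assumptions that should be checked against the DataBrew recipe actions reference before use.

```python
def step(operation, **parameters):
    """One recipe step in the shape expected by databrew.create_recipe."""
    return {"Action": {"Operation": operation, "Parameters": parameters}}


# Assumed operation names and parameters; verify against the DataBrew reference.
recipe_steps = [
    step("RENAME", sourceColumn="cust_nm", targetColumn="customer_name"),
    step("LOWER_CASE", sourceColumn="customer_name"),
]


def recipe_job_request(job, dataset, recipe, role_arn, bucket, prefix):
    """Keyword arguments for databrew.create_recipe_job (placeholder values)."""
    return {
        "Name": job,
        "DatasetName": dataset,
        "RecipeReference": {"Name": recipe, "RecipeVersion": "LATEST_PUBLISHED"},
        "RoleArn": role_arn,
        "Outputs": [{"Location": {"Bucket": bucket, "Key": prefix}, "Format": "PARQUET"}],
    }


def operationalize(recipe, steps, job_request, cron):
    """Create, publish and schedule the recipe; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    brew = boto3.client("databrew")
    brew.create_recipe(Name=recipe, Steps=steps)
    brew.publish_recipe(Name=recipe)
    brew.create_recipe_job(**job_request)
    brew.create_schedule(
        Name=f"{job_request['Name']}-schedule",
        CronExpression=cron,
        JobNames=[job_request["Name"]],
    )


job_request = recipe_job_request(
    job="clean-customers-job",
    dataset="customers-raw",
    recipe="clean-customers",
    role_arn="arn:aws:iam::123456789012:role/ExampleDataBrewRole",
    bucket="example-bucket",
    prefix="curated/customers/",
)
```

Because the recipe is stored separately from the job, the same published recipe can be referenced by several jobs over different datasets, which is one way to read "reuse recipes".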
Popular use cases
One-time data analysis for business reporting
Amazon S3
AWS Glue
DataBrew
Amazon QuickSight
Amazon S3
output bucket
Amazon Redshift
Amazon RDS
Data catalog
data sources
Amazon Simple
Storage Service
(Amazon S3)
Local file
Amazon Simple
Notification Service
Amazon EventBridge
Email notification
AWS Lambda
Amazon S3
AWS Glue
DataBrew
Recurring raw
data feed
Set up data quality rules with AWS Lambda
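One possible shape for the Lambda function in this pattern is sketched below. It assumes the function receives column statistics in a simplified, hypothetical `profile` structure (not the actual DataBrew profile format), checks them against a few example rules, and publishes an Amazon SNS message when a rule fails. The rules, column names and the `ALERT_TOPIC_ARN` environment variable are invented for the illustration.

```python
import os

# Example rules (assumed): allowed share of missing values and minimum value per column.
RULES = {
    "customer_id": {"max_missing_ratio": 0.0},
    "amount": {"max_missing_ratio": 0.05, "min_value": 0},
}


def find_violations(profile, rules):
    """Return a human-readable message for every rule the profile breaks."""
    problems = []
    for column, rule in rules.items():
        stats = profile.get(column)
        if stats is None:
            problems.append(f"{column}: column missing from profile")
            continue
        limit = rule.get("max_missing_ratio", 1.0)
        if stats["missing_ratio"] > limit:
            problems.append(f"{column}: missing ratio {stats['missing_ratio']} above {limit}")
        if "min_value" in rule and stats["min"] < rule["min_value"]:
            problems.append(f"{column}: minimum {stats['min']} below {rule['min_value']}")
    return problems


def handler(event, context):
    """Lambda entry point; the event shape used here is a simplifying assumption."""
    problems = find_violations(event["profile"], RULES)
    if problems:
        import boto3  # only needed when an alert is actually sent

        boto3.client("sns").publish(
            TopicArn=os.environ["ALERT_TOPIC_ARN"],
            Subject="Data quality rules failed",
            Message="\n".join(problems),
        )
    return {"passed": not problems, "problems": problems}


bad_profile = {
    "customer_id": {"missing_ratio": 0.0, "min": 1},
    "amount": {"missing_ratio": 0.1, "min": -5},
}
violations = find_violations(bad_profile, RULES)
```

In the architecture on the slide, a function like this would be invoked through Amazon EventBridge after a DataBrew job on the recurring feed completes, with SNS delivering the email notification.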
Data preprocessing for machine learning
Amazon S3 AWS Glue
DataBrew
JupyterLab environment
Inference
Amazon S3
output bucket
Model training
Orchestrating data preparation in workflows
AWS Step Functions workflow
AWS Glue
DataBrew
AWS Glue
Data catalog
Amazon Redshift
Crawler
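A workflow like the one in this diagram could be described in Amazon States Language roughly as follows. This is a sketch under assumptions: it uses the Step Functions DataBrew integration (`databrew:startJobRun.sync`) to run the job and wait for it, then an AWS SDK integration to start a Glue crawler. The job and crawler names are hypothetical, and the load into Amazon Redshift is left out.

```python
import json


def databrew_workflow(job_name, crawler_name):
    """State machine definition: run a DataBrew job, then start a Glue crawler."""
    return {
        "Comment": "Prepare data with DataBrew, then refresh the Glue Data Catalog",
        "StartAt": "RunDataBrewJob",
        "States": {
            "RunDataBrewJob": {
                "Type": "Task",
                # .sync makes Step Functions wait until the DataBrew job finishes.
                "Resource": "arn:aws:states:::databrew:startJobRun.sync",
                "Parameters": {"Name": job_name},
                "Next": "StartCrawler",
            },
            "StartCrawler": {
                "Type": "Task",
                # Starts the crawler and returns immediately; it does not wait for completion.
                "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
                "Parameters": {"Name": crawler_name},
                "End": True,
            },
        },
    }


# Hypothetical resource names.
definition = json.dumps(
    databrew_workflow("clean-customers-job", "curated-customers-crawler"), indent=2
)
```

The resulting JSON string is what would be supplied as the state machine definition, for example through the Step Functions console or the `create_state_machine` API.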
AWS Data Wrangler
AWS Data Wrangler
AWS Data Wrangler
Demo
AWS Data Wrangler
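The demo itself is not captured in the slide text. As a rough idea of what working with the open-source AWS Data Wrangler library (`awswrangler`) can look like, the sketch below reads CSV files from S3 into pandas, applies a trivial clean-up, writes the result as a Parquet dataset registered in the Glue Data Catalog, and queries it through Athena. Bucket, prefix, database and table names are hypothetical, the target database is assumed to exist already, and the functions need `awswrangler` plus AWS credentials to run.

```python
def curated_path(bucket, prefix):
    """Build the S3 dataset path used by the functions below."""
    return f"s3://{bucket}/{prefix.strip('/')}/"


def csv_to_catalogued_parquet(source_path, bucket, prefix, database, table):
    """Read CSV from S3, clean it lightly, and write catalogued Parquet."""
    import awswrangler as wr  # pip install awswrangler; needs AWS credentials

    df = wr.s3.read_csv(path=source_path)
    df = df.dropna(how="all").drop_duplicates()  # example clean-up only
    result = wr.s3.to_parquet(
        df=df,
        path=curated_path(bucket, prefix),
        dataset=True,        # write as a dataset so it can be catalogued
        mode="overwrite",
        database=database,   # Glue Data Catalog database (assumed to exist)
        table=table,
    )
    return result["paths"]


def preview(database, table, limit=10):
    """Query the catalogued table through Athena into a pandas DataFrame."""
    import awswrangler as wr

    return wr.athena.read_sql_query(
        sql=f'SELECT * FROM "{table}" LIMIT {int(limit)}',
        database=database,
    )


target = curated_path("example-bucket", "/curated/sales/")
```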
© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved
SageMaker
Data Wrangler
The fastest and easiest
way to prepare data for
machine learning
Quickly select and query data
Support for data from multiple sources
Easily transform data with built-in data transformations
Use built-in data transformations to convert raw data to features for machine learning
Customize data transformations
Complete flexibility to bring your own custom transformations in PySpark, SQL, or Pandas
Understand data visually
Quickly detect outliers or extreme values, all without writing code
Quickly estimate ML model accuracy
Diagnose potential issues in data preparation workflows that could hinder ML model accuracy
Deploy data preparation workflows into production with a single click
Manage all steps of the data preparation workflow through a single visual interface to quickly operationalize workflows into production settings
SageMaker Data Wrangler
Use Cases
Cleanse & Explore Data
Use built-in data transformations to accelerate data cleansing and exploration
Visualize & Understand Data
Quickly detect outliers or extreme values within a data set without the need to write code
Enrich Data
Use built-in data transformation tools to transform data into formats that can be used to build accurate ML models
Quickly select and query data
Select data from Amazon Athena,
Amazon Redshift, AWS Lake
Formation, Amazon S3, and features
from SageMaker Feature Store
Write queries for data sources before
importing data over to SageMaker
Data Wrangler
Import data in various file formats,
such as CSV files, Parquet files, and
database tables directly into Amazon
SageMaker
Easily transform data
Transform your data without writing a
single line of code using over 300 built-in
data transformations
Built-in data transformations include
convert column type, rename column, and
delete column
Author custom transformations in
PySpark, SQL, and Pandas
Understand your data visually
Intuitively understand your data with a set
of pre-configured visualization templates
Pre-configured visualization templates
include histograms, scatter plots, box and
whisker plots, line plots, and bar charts
Interactively create and edit your own
visualizations so you can quickly detect
outliers or extreme values
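To show what "detecting outliers or extreme values" with a box-and-whisker plot amounts to numerically, here is a small sketch of the common 1.5 × IQR rule using only the Python standard library. It is an illustration of the general technique, not a description of how SageMaker Data Wrangler computes its visualizations, and the sample values are invented.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values outside the box-plot whiskers (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1  # interquartile range
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Invented sample: one value is far away from the rest.
latencies = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
outliers = iqr_outliers(latencies)
```

A larger `k` widens the whiskers and flags fewer points; 1.5 is the conventional default for box plots.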
Quickly estimate model accuracy
Identify inconsistencies in data
preparation workflows and diagnose
issues before ML models are deployed
into production
Select subsets of data to identify errors
Identify which features are contributing
to model performance relative to others
Determine if additional feature
engineering is needed to improve model
performance
Deploy data preparation workflows into production
Export data preparation workflows to a
notebook or Python code
Integrate your workflow with
SageMaker Pipelines to automate
model deployment and management
Publish created features to SageMaker
Feature Store for reuse and syndication
across teams and projects
SageMaker Data Wrangler pricing and availability
Generally Available
Priced per instance
usage
Available in all regions
where SageMaker
Studio is available
Thank you
Francesco Marelli
Senior Solutions Architect
https://www.linkedin.com/in/marellifrancesco/ - @frankmarelli

More Related Content

What's hot

Data Mesh using Microsoft Fabric
Data Mesh using Microsoft FabricData Mesh using Microsoft Fabric
Data Mesh using Microsoft Fabric
Nathan Bijnens
 
AWS Big Data Platform
AWS Big Data PlatformAWS Big Data Platform
AWS Big Data Platform
Amazon Web Services
 
AWS Secrets Manager: Best Practices for Managing, Retrieving, and Rotating Se...
AWS Secrets Manager: Best Practices for Managing, Retrieving, and Rotating Se...AWS Secrets Manager: Best Practices for Managing, Retrieving, and Rotating Se...
AWS Secrets Manager: Best Practices for Managing, Retrieving, and Rotating Se...
Amazon Web Services
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra
 
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
Cathrine Wilhelmsen
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
confluent
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)
Amazon Web Services
 
Machine Learning & Amazon SageMaker
Machine Learning & Amazon SageMakerMachine Learning & Amazon SageMaker
Machine Learning & Amazon SageMakerAmazon Web Services
 
On-premise to Microsoft Azure Cloud Migration.
 On-premise to Microsoft Azure Cloud Migration. On-premise to Microsoft Azure Cloud Migration.
On-premise to Microsoft Azure Cloud Migration.
Emtec Inc.
 
Cloud Adoption Framework Define Your Cloud Strategy and Accelerate Results
Cloud Adoption Framework Define Your Cloud Strategy and Accelerate Results Cloud Adoption Framework Define Your Cloud Strategy and Accelerate Results
Cloud Adoption Framework Define Your Cloud Strategy and Accelerate Results Amazon Web Services
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS Glue
Amazon Web Services
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
Amazon Web Services
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
Cobus Bernard
 
Introduction to Azure Synapse Webinar
Introduction to Azure Synapse WebinarIntroduction to Azure Synapse Webinar
Introduction to Azure Synapse Webinar
Peter Ward
 
Amazon EMR Masterclass
Amazon EMR MasterclassAmazon EMR Masterclass
Amazon EMR Masterclass
Amazon Web Services
 
Building-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWSBuilding-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWS
Amazon Web Services
 
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdfData & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
Chris Bingham
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
James Serra
 
AWS Cloud Adoption Framework and Workshops
AWS Cloud Adoption Framework and WorkshopsAWS Cloud Adoption Framework and Workshops
AWS Cloud Adoption Framework and Workshops
Tom Laszewski
 

What's hot (20)

Data Mesh using Microsoft Fabric
Data Mesh using Microsoft FabricData Mesh using Microsoft Fabric
Data Mesh using Microsoft Fabric
 
AWS Big Data Platform
AWS Big Data PlatformAWS Big Data Platform
AWS Big Data Platform
 
AWS Secrets Manager: Best Practices for Managing, Retrieving, and Rotating Se...
AWS Secrets Manager: Best Practices for Managing, Retrieving, and Rotating Se...AWS Secrets Manager: Best Practices for Managing, Retrieving, and Rotating Se...
AWS Secrets Manager: Best Practices for Managing, Retrieving, and Rotating Se...
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
Pipelines and Packages: Introduction to Azure Data Factory (DATA:Scotland 2019)
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)
 
Machine Learning & Amazon SageMaker
Machine Learning & Amazon SageMakerMachine Learning & Amazon SageMaker
Machine Learning & Amazon SageMaker
 
On-premise to Microsoft Azure Cloud Migration.
 On-premise to Microsoft Azure Cloud Migration. On-premise to Microsoft Azure Cloud Migration.
On-premise to Microsoft Azure Cloud Migration.
 
Cloud Adoption Framework Define Your Cloud Strategy and Accelerate Results
Cloud Adoption Framework Define Your Cloud Strategy and Accelerate Results Cloud Adoption Framework Define Your Cloud Strategy and Accelerate Results
Cloud Adoption Framework Define Your Cloud Strategy and Accelerate Results
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS Glue
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
 
Introduction to Azure Synapse Webinar
Introduction to Azure Synapse WebinarIntroduction to Azure Synapse Webinar
Introduction to Azure Synapse Webinar
 
Amazon EMR Masterclass
Amazon EMR MasterclassAmazon EMR Masterclass
Amazon EMR Masterclass
 
Building-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWSBuilding-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWS
 
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdfData & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
AWS Cloud Adoption Framework and Workshops
AWS Cloud Adoption Framework and WorkshopsAWS Cloud Adoption Framework and Workshops
AWS Cloud Adoption Framework and Workshops
 

Similar to Speed up data preparation for ML pipelines on AWS

Sederhanakan_integrasi_data_anda_dengan_AWS_Glue_handout.pdf
Sederhanakan_integrasi_data_anda_dengan_AWS_Glue_handout.pdfSederhanakan_integrasi_data_anda_dengan_AWS_Glue_handout.pdf
Sederhanakan_integrasi_data_anda_dengan_AWS_Glue_handout.pdf
Jazzy44
 
Confluent_AWS_ImmersionDay_Q42023.pdf
Confluent_AWS_ImmersionDay_Q42023.pdfConfluent_AWS_ImmersionDay_Q42023.pdf
Confluent_AWS_ImmersionDay_Q42023.pdf
Ahmed791434
 
Introduction to AWS Lake Formation.pptx
Introduction to AWS Lake Formation.pptxIntroduction to AWS Lake Formation.pptx
Introduction to AWS Lake Formation.pptx
SwathiPonugumati
 
Module 3 - QuickSight Overview
Module 3 - QuickSight OverviewModule 3 - QuickSight Overview
Module 3 - QuickSight Overview
Lam Le
 
20210608 - Desarrollo de aplicaciones en la nube
20210608 - Desarrollo de aplicaciones en la nube20210608 - Desarrollo de aplicaciones en la nube
20210608 - Desarrollo de aplicaciones en la nube
Marcia Villalba
 
Lessons from Migrating Oracle Databases to Amazon RDS or Amazon Aurora
Lessons from Migrating Oracle Databases to Amazon RDS or Amazon Aurora Lessons from Migrating Oracle Databases to Amazon RDS or Amazon Aurora
Lessons from Migrating Oracle Databases to Amazon RDS or Amazon Aurora
Datavail
 
Single View of Data
Single View of DataSingle View of Data
Single View of Data
confluent
 
Big dataandhp cforawsbrasilsummit
Big dataandhp cforawsbrasilsummitBig dataandhp cforawsbrasilsummit
Big dataandhp cforawsbrasilsummit
Amazon Web Services LATAM
 
Building Modern Streaming Analytics with Confluent on AWS
Building Modern Streaming Analytics with Confluent on AWSBuilding Modern Streaming Analytics with Confluent on AWS
Building Modern Streaming Analytics with Confluent on AWS
confluent
 
Realize Value, Reduce Costs And Optimize the Value of Your Microsoft Investme...
Realize Value, Reduce Costs And Optimize the Value of Your Microsoft Investme...Realize Value, Reduce Costs And Optimize the Value of Your Microsoft Investme...
Realize Value, Reduce Costs And Optimize the Value of Your Microsoft Investme...
Amazon Web Services
 
Realize Value of Your Microsoft Investments - AWS Transformation Day Boston 2018
Realize Value of Your Microsoft Investments - AWS Transformation Day Boston 2018Realize Value of Your Microsoft Investments - AWS Transformation Day Boston 2018
Realize Value of Your Microsoft Investments - AWS Transformation Day Boston 2018
Amazon Web Services
 
Realize Value of Your Microsoft Investments - Transformation Day Montreal 2018
Realize Value of Your Microsoft Investments - Transformation Day Montreal 2018Realize Value of Your Microsoft Investments - Transformation Day Montreal 2018
Realize Value of Your Microsoft Investments - Transformation Day Montreal 2018
Amazon Web Services
 
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
How to Architect a Serverless Cloud Data Lake for Enhanced Data AnalyticsHow to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
Informatica
 
AWS Advanced Analytics Automation Toolkit (AAA)
AWS Advanced Analytics Automation Toolkit (AAA)AWS Advanced Analytics Automation Toolkit (AAA)
AWS Advanced Analytics Automation Toolkit (AAA)
CloudHesive
 
What can you do with Serverless in 2020
What can you do with Serverless in 2020What can you do with Serverless in 2020
What can you do with Serverless in 2020
Boaz Ziniman
 
Leveraging Data Analytics in the Cloud to Support Data-Driven Decisions
Leveraging Data Analytics in the Cloud to Support Data-Driven DecisionsLeveraging Data Analytics in the Cloud to Support Data-Driven Decisions
Leveraging Data Analytics in the Cloud to Support Data-Driven Decisions
Amazon Web Services
 
Realize Value of Your Microsoft Investments - AWS Transformation Days Raleigh...
Realize Value of Your Microsoft Investments - AWS Transformation Days Raleigh...Realize Value of Your Microsoft Investments - AWS Transformation Days Raleigh...
Realize Value of Your Microsoft Investments - AWS Transformation Days Raleigh...
Amazon Web Services
 
The Future of Mainframe Is in the Cloud
The Future of Mainframe Is in the CloudThe Future of Mainframe Is in the Cloud
The Future of Mainframe Is in the Cloud
Precisely
 
AWS re:Invent 2016: Relational and NoSQL Databases on AWS: NBC, MarkLogic, an...
AWS re:Invent 2016: Relational and NoSQL Databases on AWS: NBC, MarkLogic, an...AWS re:Invent 2016: Relational and NoSQL Databases on AWS: NBC, MarkLogic, an...
AWS re:Invent 2016: Relational and NoSQL Databases on AWS: NBC, MarkLogic, an...
Amazon Web Services
 
Look Before You Leap: Migrating On-Premises Hadoop to AWS
Look Before You Leap: Migrating On-Premises Hadoop to AWSLook Before You Leap: Migrating On-Premises Hadoop to AWS
Look Before You Leap: Migrating On-Premises Hadoop to AWS
DevOps.com
 

Similar to Speed up data preparation for ML pipelines on AWS (20)

Sederhanakan_integrasi_data_anda_dengan_AWS_Glue_handout.pdf
Sederhanakan_integrasi_data_anda_dengan_AWS_Glue_handout.pdfSederhanakan_integrasi_data_anda_dengan_AWS_Glue_handout.pdf
Sederhanakan_integrasi_data_anda_dengan_AWS_Glue_handout.pdf
 
Confluent_AWS_ImmersionDay_Q42023.pdf
Confluent_AWS_ImmersionDay_Q42023.pdfConfluent_AWS_ImmersionDay_Q42023.pdf
Confluent_AWS_ImmersionDay_Q42023.pdf
 
Introduction to AWS Lake Formation.pptx
Introduction to AWS Lake Formation.pptxIntroduction to AWS Lake Formation.pptx
Introduction to AWS Lake Formation.pptx
 
Module 3 - QuickSight Overview
Module 3 - QuickSight OverviewModule 3 - QuickSight Overview
Module 3 - QuickSight Overview
 
20210608 - Desarrollo de aplicaciones en la nube
20210608 - Desarrollo de aplicaciones en la nube20210608 - Desarrollo de aplicaciones en la nube
20210608 - Desarrollo de aplicaciones en la nube
 
Lessons from Migrating Oracle Databases to Amazon RDS or Amazon Aurora
Lessons from Migrating Oracle Databases to Amazon RDS or Amazon Aurora Lessons from Migrating Oracle Databases to Amazon RDS or Amazon Aurora
Lessons from Migrating Oracle Databases to Amazon RDS or Amazon Aurora
 
Single View of Data
Single View of DataSingle View of Data
Single View of Data
 
Big dataandhp cforawsbrasilsummit
Big dataandhp cforawsbrasilsummitBig dataandhp cforawsbrasilsummit
Big dataandhp cforawsbrasilsummit
 
Building Modern Streaming Analytics with Confluent on AWS
Building Modern Streaming Analytics with Confluent on AWSBuilding Modern Streaming Analytics with Confluent on AWS
Building Modern Streaming Analytics with Confluent on AWS
 
Realize Value, Reduce Costs And Optimize the Value of Your Microsoft Investme...
Realize Value, Reduce Costs And Optimize the Value of Your Microsoft Investme...Realize Value, Reduce Costs And Optimize the Value of Your Microsoft Investme...
Realize Value, Reduce Costs And Optimize the Value of Your Microsoft Investme...
 
Realize Value of Your Microsoft Investments - AWS Transformation Day Boston 2018
Realize Value of Your Microsoft Investments - AWS Transformation Day Boston 2018Realize Value of Your Microsoft Investments - AWS Transformation Day Boston 2018





Speed up data preparation for ML pipelines on AWS

  • 1. © 2021, Amazon Web Services, Inc. or its Affiliates. Speed up data preparation for ML pipelines on AWS Francesco Marelli Senior Solutions Architect https://www.linkedin.com/in/marellifrancesco/ - @frankmarelli Data Science Milan Meetup 21 April 2021
  • 2. Customers moving from traditional data warehouse approach: data silos to data lake. Data silos: OLTP, ERP, CRM, LOB (DW Silo 1, Business Intelligence); Devices, Web, Sensors, Social (DW Silo 2, Business Intelligence). Data lake: non-relational databases, machine learning, data warehousing, log analytics, big data processing, relational databases.
  • 3. Lake House architecture on AWS: scalable data lakes; purpose-built data services; seamless data movement; unified governance; performant and cost-effective. Services: Amazon DynamoDB, Amazon SageMaker, Amazon Redshift, Amazon Elasticsearch Service, Amazon EMR, Amazon Aurora, Amazon Athena, Amazon S3.
  • 4. The AWS analytics portfolio. Data movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Streams | Kinesis Data Firehose | Managed Streaming for Apache Kafka. Analytics: Redshift, EMR (Spark & Hadoop), Athena, Elasticsearch Service, Kinesis Data Analytics, AWS Glue (Spark & Python). Data lake infrastructure & management: S3/Glacier, AWS Glue, Lake Formation. Data, visualization, engagement, & machine learning: QuickSight, SageMaker, Comprehend, Lex, Polly, Rekognition, Translate, Pinpoint, Data Exchange, + many more.
  • 5. Accelerate your predictive analytics & machine learning journey: broadest and most complete set of machine learning capabilities. AI SERVICES (vision, speech, text, search, chatbots, personalization, forecasting, fraud, contact centers, code and DevOps, industrial AI, health AI, anomaly detection): Amazon Rekognition, Amazon Polly, Amazon Transcribe (+Medical), Amazon Lex, Amazon Personalize, Amazon Forecast, Amazon Comprehend (+Medical), Amazon Textract, Amazon Kendra, Amazon CodeGuru, Amazon Fraud Detector, Amazon Translate, Amazon DevOps Guru (new), Voice ID for Amazon Connect, Contact Lens, Amazon Monitron (new), AWS Panorama + Appliance (new), Amazon Lookout for Vision (new), Amazon Lookout for Equipment (new), Amazon HealthLake (new), Amazon Lookout for Metrics (new), Amazon Transcribe for Medical, Amazon Comprehend for Medical. ML SERVICES (Amazon SageMaker, SageMaker Studio IDE): label data; aggregate & prepare data (new); store & share features (new); Auto ML; Spark/R; detect bias (new); visualize in notebooks; pick algorithm; train models; tune parameters; debug & profile (new); deploy in production; manage & monitor; CI/CD (new); human review; model management for edge devices (new); SageMaker JumpStart (new). FRAMEWORKS & INFRASTRUCTURE: Deep Learning AMIs & Containers, GPUs & CPUs, Elastic Inference, Trainium, Inferentia, FPGA, DeepGraphLibrary.
  • 6. Amazon SageMaker: built to make ML more accessible. SageMaker Studio IDE: pick algorithm; visualize in notebooks; label data; collect and prepare data; store features; check data; train models; tune parameters; deploy in production; manage and monitor; CI/CD.
  • 7. AWS Glue: Serverless Data Integration for Complex Workloads. Serverless: there is no infrastructure to maintain; allocate the needed compute power and run jobs. Cost-effective: all-in-one pricing model includes infrastructure and is 55% cheaper than other cloud data integration options. Handles complex workloads: Glue connects to hundreds of data sources and processes petabytes of data in real-time, batch, and event-driven modes. No lock-in: develop data integration pipelines in open source SparkSQL, PySpark, and Scala. Data integration for every user: development environments catered to different skill sets, with visual ETL development for data engineers, notebook-style development for data scientists, and no-code development for data analysts.
  • 8. How customers use AWS Glue: prepare data for machine learning; migrate from expensive traditional ETL solutions to gain flexibility and reduce costs; process petabytes of data both in batch and in real time using Apache Spark; build data lakes and Lake Houses for scalable data analysis; catalog data assets to make them available to AWS analytics services.
  • 9. Glue: data integration platform for building Lake Houses faster. Connect: connect to data sources (Amazon RDS, other databases, on-premises data, streaming data) using Glue Connectors. Catalog: catalog streaming data in Glue Schema Registry; catalog structured and semi-structured data in Glue Catalog; discover schema with Glue Crawlers. Transform: transform without writing code using Glue DataBrew; interactively transform data using Dev Endpoints; visually transform data using Glue Studio; easily replicate data across the Lake House with Glue Elastic Views. Lake House: data lake, NoSQL, data warehouse, log analytics, big data, relational, machine learning, SaaS.
  • 10. Glue is used to modernize on-premises ETL tools
  • 11. Glue is used to prepare raw data for Machine Learning. Sources: logs, app data, Amazon RDS, other databases, on-premises data, streaming data. Data stages: raw data, cleaned and enriched data, extracted features, training data. Steps: AWS Glue ingest; AWS Glue DataBrew transform (twice); AWS Glue transform. Notebooks: data exploration, experimentation. Glue Catalog.
  • 12. AWS Glue components. Crawlers: load and maintain the Data Catalog; infer metadata (schema, table structure); support schema evolution. Data Catalog: Apache Hive Metastore compatible; many integrated analytic services. Extract, Transform, and Load: serverless execution; Apache Spark / Python shell jobs; interactive development; auto-generate ETL code. Workflow Management: orchestrate triggers, crawlers, and jobs; build and monitor complex flows; integrated alerting.
  • 13. Unified Data Catalog with Automated Schema Discovery: breaking down data silos with a unified metadata catalog for the entire data landscape. Sources: transactional systems (OLTP, ERP, CRM), data warehouse, data lake, devices, web, sensors. Automated schema discovery and management: structured and semi-structured discovery (Glue Crawlers). No movement of data = low costs/admin; all metadata centrally available for search and query = productivity; automate data discovery = productivity; unify structured, semi-structured data = speed to insight. Data Catalog serves: machine learning, DW queries, big data processing, interactive, real-time, business intelligence.
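The schema discovery that slides 12 and 13 attribute to Glue Crawlers can be pictured with a small, self-contained sketch. This is not how Glue implements crawlers: the function names, the coarse type names, and the widening rule (integers and floats widen to `double`, any other conflict widens to `string`) are assumptions chosen only to illustrate inferring one schema from semi-structured records whose fields change over time.

```python
def infer_type(value):
    """Map a Python value to a coarse, hypothetical column type name."""
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        return "struct"
    if isinstance(value, list):
        return "array"
    return "string"


def infer_schema(records):
    """Union the fields seen across all records and widen conflicting types."""
    schema = {}
    for record in records:
        for name, value in record.items():
            if value is None:
                # A null tells us the column exists but not its type.
                schema.setdefault(name, None)
                continue
            seen = infer_type(value)
            current = schema.get(name)
            if current is None:
                schema[name] = seen
            elif current != seen:
                # Assumed widening rule: numbers widen to double, the rest to string.
                numeric = {current, seen} == {"bigint", "double"}
                schema[name] = "double" if numeric else "string"
    # Columns that were only ever null default to string.
    return {name: kind or "string" for name, kind in schema.items()}


records = [
    {"id": 1, "price": 10},
    {"id": 2, "price": 9.5, "tags": ["a"]},  # new field appears (schema evolution)
    {"id": "x3", "note": None},              # conflicting type for "id"
]
schema = infer_schema(records)
```

The second and third records add fields that the first one lacks, which is the situation the slides call schema evolution: the inferred schema is the union of everything observed so far.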
  • 14. Glue Workflows: inbuilt scheduler to orchestrate jobs. Multiple triggering mechanisms: schedule-based (e.g., time of day); event-based (e.g., job completion); on-demand (e.g., AWS Lambda). Easy to access logs and monitor progress. Example: Marketing: ad-spend by customer segment; event-based; Lambda trigger; Sales: revenue by customer segment; schedule; data based; Central: ROI by customer segment; weekly sales; data based.
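The event-based trigger on slide 14 (a job starts when other jobs complete) can be sketched in plain Python. This is only an illustration of completion-driven orchestration, not the Glue Workflows API. The job names echo the slide's example, and the dependency of the central ROI job on both the marketing and the sales jobs is an assumption, since the transcript does not preserve the diagram's arrows.

```python
def run_workflow(jobs, depends_on):
    """Run each job once every job it depends on has completed.

    jobs: dict mapping job name -> zero-argument callable
    depends_on: dict mapping job name -> set of upstream job names
    Returns the order in which the jobs ran.
    """
    completed, order = set(), []
    pending = set(jobs)
    while pending:
        # A job is ready when all of its upstream jobs have completed.
        ready = sorted(j for j in pending if depends_on.get(j, set()) <= completed)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies: %s" % sorted(pending))
        for name in ready:
            jobs[name]()
            completed.add(name)
            order.append(name)
            pending.discard(name)
    return order


log = []
jobs = {
    "marketing_ad_spend": lambda: log.append("marketing"),
    "sales_revenue": lambda: log.append("sales"),
    "central_roi": lambda: log.append("central"),
}
# Assumed dependency: the central job waits for both upstream jobs.
depends_on = {"central_roi": {"marketing_ad_spend", "sales_revenue"}}
order = run_workflow(jobs, depends_on)
```

In a managed service the scheduler also handles retries, logging, and time-based triggers; the sketch covers only the "run when upstream jobs finish" rule.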
  • 15. AWS Glue 2.0 Engine
  • 16. Glue Dynamic Frames
  • 17. Glue Transforms – Relationalize (example)
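The example for slide 17 did not survive in the transcript. As a rough stand-in, the sketch below shows the general idea behind a relationalize-style transform in plain Python: nested structures are flattened into dotted column names, and arrays are moved into child tables linked back to the parent row by a generated key. The function, the table naming (`root_orders`), and the column naming (`id`, `index`, `orders.val.sku`) are assumptions for illustration; Glue's own transform operates on DynamicFrames and its exact output layout should be checked against the Glue documentation.

```python
def relationalize(records, root="root"):
    """Flatten nested dicts into dotted columns and move lists into child tables."""
    tables = {root: []}
    next_id = 0  # generated key that links a parent row to its child rows

    def flatten(record, table, row, prefix=""):
        nonlocal next_id
        for key, value in record.items():
            column = prefix + key
            if isinstance(value, dict):
                # Nested struct: keep it in the same row under a dotted name.
                flatten(value, table, row, column + ".")
            elif isinstance(value, list):
                # Array: store a key in the parent row, one child row per element.
                next_id += 1
                list_id = next_id
                row[column] = list_id
                child_name = table + "_" + column
                child = tables.setdefault(child_name, [])
                for index, item in enumerate(value):
                    child_row = {"id": list_id, "index": index}
                    if isinstance(item, dict):
                        flatten(item, child_name, child_row, column + ".val.")
                    else:
                        child_row[column + ".val"] = item
                    child.append(child_row)
            else:
                row[column] = value

    for record in records:
        row = {}
        flatten(record, root, row)
        tables[root].append(row)
    return tables


customers = [{
    "name": "Ann",
    "address": {"city": "Milan", "zip": "20100"},
    "orders": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}]
tables = relationalize(customers)
```

Here `tables["root"]` holds one flat customer row and `tables["root_orders"]` holds one row per order, so the data can be loaded into relational targets and joined back on the generated key.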
  • 18. Glue Connectors
  • 19. Glue – Build Your Own Connector
  • 20. Glue – Clean and prepare real time and batch data
  • 21. AWS Glue Studio: new visual ETL interface. Makes it easy to author, run, and monitor AWS Glue ETL jobs. Author AWS Glue jobs visually without coding; monitor thousands of jobs through a single pane of glass; distributed processing without the learning curve; advanced transforms through code snippets.
  • 22. Demo: AWS Glue Studio
  • 23. “Our teams spend too much time on the undifferentiated, repetitive, and mundane tasks associated with data preparation.”
  • 24. Preparing data involves several complex tasks: extraction & loading; cleaning & normalization; orchestrating at scale. Needs a lot of code-based heavy lifting to work at scale.
  • 25. As much as 80% of time is spent preparing data today. Needs the right tool for the right persona.
  • 26. Challenges with traditional data preparation. Time-consuming: needs the right tools for the right persona that are integrated. Manual: needs a lot of code-based heavy lifting for it to work at scale. Siloed: often requires moving large amounts of data into silos, at times out of VPCs.
  • 28. Clean and normalize data up to 80% faster
  • 29. Data preparation made easy: built for data analysts and data scientists. Clean and normalize data: over 250 built-in transformations. Understand data quality: understand patterns and detect anomalies using profiles. Visually map data lineage: understand the steps that the data has been through. Automate at scale: save transformations and apply them to new data as it comes in.
  • 30. Demo: AWS Glue DataBrew
  • 31. What we saw in the demo: build a recipe; profile the data; run a job. Operationalize at scale: schedule jobs; use APIs/SDK; reuse recipes.
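The recipe idea from the demo (save an ordered list of transformations once, then reapply it to new data as it arrives) can be sketched without any AWS dependency. The step format and the three operations below are hypothetical and are not the DataBrew recipe schema or its built-in transformations; they only illustrate why a recipe stored as data is easy to reuse and schedule.

```python
def rename_column(row, source, target):
    """Return a copy of the row with one column renamed."""
    return {(target if key == source else key): value for key, value in row.items()}


def trim(row, column):
    """Return a copy of the row with surrounding whitespace removed from a column."""
    return {**row, column: row[column].strip()}


def fill_missing(row, column, value):
    """Return a copy of the row with a default in place of a missing column value."""
    return {**row, column: value if row.get(column) is None else row[column]}


# Hypothetical registry mapping operation names to functions.
OPERATIONS = {"rename_column": rename_column, "trim": trim, "fill_missing": fill_missing}


def apply_recipe(recipe, rows):
    """Apply every recipe step, in order, to every row."""
    result = []
    for row in rows:
        for step in recipe:
            row = OPERATIONS[step["operation"]](row, **step["params"])
        result.append(row)
    return result


# The recipe is plain data, so it can be stored, versioned, and reused.
recipe = [
    {"operation": "rename_column", "params": {"source": "cust", "target": "customer"}},
    {"operation": "trim", "params": {"column": "customer"}},
    {"operation": "fill_missing", "params": {"column": "segment", "value": "unknown"}},
]

first_batch = apply_recipe(recipe, [{"cust": "  Ann ", "segment": None}])
second_batch = apply_recipe(recipe, [{"cust": "Bob", "segment": "retail"}])
```

The same `recipe` object is applied to both batches unchanged, which mirrors the "reuse recipes" and "schedule jobs" points on the slide.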
  • 32. © 2021, Amazon Web Services, Inc. or its Affiliates. Popular use cases
  • 33. © 2021, Amazon Web Services, Inc. or its Affiliates. One-time data analysis for business reporting Amazon S3 AWS Glue DataBrew Amazon QuickSight Amazon S3 output bucket Amazon Redshift Amazon RDS Data catalog data sources Amazon Simple Storage Service (Amazon S3) Local file
  • 34. © 2021, Amazon Web Services, Inc. or its Affiliates. Amazon Simple Notification Service Amazon EventBridge Email notification AWS Lambda Amazon S3 AWS Glue DataBrew Recurring raw data feed Set up data quality rules with AWS Lambda
  • 35. Data preprocessing for machine learning. Diagram labels: Amazon S3, AWS Glue DataBrew, JupyterLab environment, Amazon S3 output bucket, model training, inference.
  • 36. Orchestrating data preparation in workflows. Diagram labels: AWS Step Functions workflow, AWS Glue DataBrew, AWS Glue Data Catalog, Amazon Redshift, crawler.
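One way to express such a workflow is an Amazon States Language definition that runs a DataBrew job and then starts a Glue crawler. The sketch below builds the definition as a Python dictionary. The job and crawler names are hypothetical, and starting the crawler through the Step Functions AWS SDK integration is an assumption; the slide does not say how the crawler is triggered.

```python
import json


def build_definition(job_name, crawler_name):
    """Return a two-step state machine: DataBrew job, then Glue crawler."""
    return {
        "Comment": "Prepare data with DataBrew, then refresh the Data Catalog",
        "StartAt": "RunDataBrewJob",
        "States": {
            "RunDataBrewJob": {
                "Type": "Task",
                # .sync makes the workflow wait until the job run finishes
                "Resource": "arn:aws:states:::databrew:startJobRun.sync",
                "Parameters": {"Name": job_name},
                "Next": "StartCrawler",
            },
            "StartCrawler": {
                "Type": "Task",
                # Assumption: crawler started via the AWS SDK service integration
                "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
                "Parameters": {"Name": crawler_name},
                "End": True,
            },
        },
    }


# Hypothetical resource names, for illustration only.
definition_json = json.dumps(
    build_definition("clean-orders-job", "orders-crawler"), indent=2
)
```

The resulting JSON string can be supplied as the definition when creating a state machine. A load into Amazon Redshift would be an additional state and is left out here.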
  • 37. AWS Data Wrangler
  • 38. AWS Data Wrangler
  • 39. AWS Data Wrangler
  • 40. Demo: AWS Data Wrangler
  • 41. © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved. SageMaker Data Wrangler: the fastest and easiest way to prepare data for machine learning. Quickly select and query data: support for data from multiple sources. Easily transform data with built-in data transformations: use built-in data transformations to convert raw data to features for machine learning. Customize data transformations: complete flexibility to bring your own custom transformations in PySpark, SQL, or Pandas. Understand data visually: quickly detect outliers or extreme values, all without writing code. Quickly estimate ML model accuracy: diagnose potential issues in data preparation workflows that could hinder ML model accuracy. Deploy data preparation workflows into production with a single click: manage all steps of the data preparation workflow through a single visual interface to quickly operationalize workflows into production settings.
  • 42. SageMaker Data Wrangler use cases. Cleanse & Explore Data: use built-in data transformations to accelerate data cleansing and exploration. Visualize & Understand Data: quickly detect outliers or extreme values within a data set without the need to write code. Enrich Data: use built-in data transformation tools to transform data into formats that can be used to build accurate ML models.
  • 43. Quickly select and query data. Select data from Amazon Athena, Amazon Redshift, AWS Lake Formation, Amazon S3, and features from SageMaker Feature Store. Write queries for data sources before importing data into SageMaker Data Wrangler. Import data in various formats, such as CSV files, Parquet files, and database tables, directly into Amazon SageMaker.
  • 44. Easily transform data. Transform your data without writing a single line of code using over 300 built-in data transformations. Built-in data transformations include convert column type, rename column, and delete column. Author custom transformations in PySpark, SQL, and Pandas.
  • 45. Understand your data visually. Intuitively understand your data with a set of pre-configured visualization templates. Pre-configured visualization templates include histograms, scatter plots, box and whisker plots, line plots, and bar charts. Interactively create and edit your own visualizations so you can quickly detect outliers or extreme values.
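To make the outlier idea concrete, the rule that a box and whisker plot draws can be written in a few lines. This sketch uses only the Python standard library and the common 1.5 × IQR convention; it illustrates the concept and is not how SageMaker Data Wrangler implements its visualizations. The sample values are made up.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside the whiskers of a box plot.

    Whiskers are placed k interquartile ranges below the first quartile
    and above the third quartile.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up sample: one reading is far from the rest.
readings = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = iqr_outliers(readings)
```

Here `outliers` contains only the value 95, which is the point a box plot would draw beyond the upper whisker.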
  • 46. Quickly estimate model accuracy. Identify inconsistencies in data preparation workflows and diagnose issues before ML models are deployed into production. Select subsets of data to identify errors. Identify which features are contributing to model performance relative to others. Determine if additional feature engineering is needed to improve model performance.
  • 47. Deploy data preparation workflows into production. Export data preparation workflows to a notebook or Python code. Integrate your workflow with SageMaker Pipelines to automate model deployment and management. Publish created features to SageMaker Feature Store for reuse and syndication across teams and projects.
  • 48. SageMaker Data Wrangler pricing and availability. Generally available. Priced per instance usage. Available in all regions where SageMaker Studio is available.
  • 49. Thank you. Francesco Marelli, Senior Solutions Architect. https://www.linkedin.com/in/marellifrancesco/ - @frankmarelli