SlideShare a Scribd company logo
1 of 40
Download to read offline
Serverless Data Lake
on AWS
Thanh Nguyen
Senior Solutions Architect
① Workshop Lab Guide >> Get Ready!
AWS Account: Glue, Athena, QuickSight
Step-by-Step Lab Guide
AWS CloudFormation: provision your AWS Cloud resources
Amazon SageMaker ETL Notebook
NYC-Taxi Create Optimized Dataset Job
Amazon SageMaker XGBoost ETL Notebook
SERVERLESS DATA LAKE ON AWS 16/4/20 2
② Workshop Lab Guide >> Practice & Learn
16/4/20SERVERLESS DATA LAKE ON AWS 3
① Workshop Lab Guide >> Get Ready!
q Data Analytics at Unicorn-Taxi-Company
q Your dataset: NYC Taxi Trips
q Simplified Raw Dataset vs. Original Raw Dataset
SERVERLESS DATA LAKE ON AWS 16/4/20 4
Data Analytics at Unicorn-Taxi Company
Create a Dataset for
Reporting and Visualization
Cleanse
Transform
Optimize for reporting queries
Help solve a
Machine Learning problem
Unicorn-Taxi’s Data Scientists
need to understand
passenger tipping behavior
SERVERLESS DATA LAKE ON AWS 16/4/20 5
Your dataset: NYC Taxi Trips
Yellow and Green taxi trips
§ Pick-up and drop-off Dates/Times
§ Pick-up and drop-off Locations
§ Trip Distances
§ Itemized Fares
§ Tip Amount
§ Driver-reported Passenger Counts
For-Hire Vehicle (FHV)
§ Pick-up and drop-off Dates/Times
§ Pick-up and drop-off Locations
§ Dispatching base
SERVERLESS DATA LAKE ON AWS 16/4/20 6
https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Your dataset: NYC Taxi Trips
Original Raw Dataset
Green and Yellow Taxi + FHV
Years 2009 to 2018
~1.6Bn rows
215 files
253 GB total
Simplified Raw Dataset
Only Yellow Taxi + few look-ups
Jan to March 2017
~2M rows
3 files
2.5 GB+ uncompressed
Ready in a publicly accessible
Amazon S3 bucket
SERVERLESS DATA LAKE ON AWS 16/4/20 7
① Workshop Lab Guide >> Get Ready!
q Navigate to the Serverless Data Lake Workshop website
SERVERLESS DATA LAKE ON AWS 16/4/20 8
① Workshop Lab Guide >> Get Ready!
Setup Prerequisite Resources using AWS CloudFormation
1. A new Data Lake Amazon S3 bucket
2. Necessary IAM Policies & Roles for AWS Glue, Amazon Athena, and Amazon SageMaker
3. An AWS Glue Development Endpoint
4. A number of named queries in Amazon Athena
Finally, the NYC Taxi Trips Raw Dataset is copied into your Amazon S3 bucket
SERVERLESS DATA LAKE ON AWS 16/4/20 9
② Workshop Lab Guide - Practice & Learn
16/4/20SERVERLESS DATA LAKE ON AWS 10
AWS Glue Crawlers and the AWS Glue Data Catalog
Crawler …
§ Samples and Classifies your data
§ Extracts MetaData
§ Infers Schema and partitioning format
§ Creates Tables in your account’s AWS Glue Data Catalog
Why Query with Amazon Athena?
• Interactive query service
• Makes it easy to analyze data in Amazon S3 using standard SQL
• Out-of-the-box integrated with AWS Glue Data Catalog
• Serverless
AWS Glue
Crawler
AWS Glue
Data Catalog
NYC Taxi
Raw Dataset
1
Concepts in Action
1. Crawler crawls Raw Dataset in
Amazon S3 Bucket
2. Crawler writes metadata into
AWS Glue Data Catalog
3. You query Raw Dataset in
Athena
4. Athena uses Schema definition to
read Raw Dataset from S3 and returns results
2
34
You
Hand-ons
2.1.1. Catalog Raw Data with an
AWS Glue Crawler
2.1.2. Explore Table Schema and
Metadata in AWS Glue Data Catalog
2.1.3. Query Raw Data with Amazon
Athena
SERVERLESS DATA LAKE ON AWS 16/4/20 14
② Workshop Lab Guide - Practice & Learn
16/4/20SERVERLESS DATA LAKE ON AWS 15
Interactive ETL Development with Glue & SageMaker
What is Amazon SageMaker?
§ ... is a fully-managed ML Platform
§ … enables you to build, train, and deploy
Machine Learning Models at any scale
§ … provides a Jupyter Notebook environment
Connect your IDE to an AWS Glue Development Endpoint.
Environment to interactively develop, debug, and test ETLcode.
AWS Glue Spark environment
Interpreter
Server
Remote
Interpreter
What is an AWS Glue Development Endpoint?
Interactive ETL Development with Glue & SageMaker
AWS Glue enables you to
q… create an Amazon SageMaker Notebook environment
q... connect it to an AWS Glue Dev Endpoint
q… and finally, access your Jupyter notebook
All without leaving the AWS Glue console
AWS Glue
Data Catalog
NYC Taxi
Raw Dataset
3
Concepts in Action
1. You author ETL code in an
Amazon SageMaker Notebook
2. ETL code runs on
AWS Glue Dev Endpoint
3. ETL code…
a. Reads Raw Dataset
b. Applies transformations
c. Writes Optimized Dataset back to
your Amazon S3 bucket
4. You use your ETL code in
notebook to create a script file
2
1
Amazon SageMaker
Notebook
NYC Taxi
Optimized Dataset
4
You
Hand-ons
2.2.1. Create an Amazon SageMaker
Notebook instance
2.2.2. Interactively author and run ETL
code in Jupyter
SERVERLESS DATA LAKE ON AWS 16/4/20 20
Transformations we’ll apply to Raw Data
§ Join NYC Trips Dataset with look-up tables (denormalization)
§ Create new Timestamp columns for pick-up & drop-off
§ Drop unnecessary columns
§ Partition the dataset
§ Convert to a columnar file format (parquet)
② Workshop Lab Guide - Practice & Learn
16/4/20SERVERLESS DATA LAKE ON AWS 23
Concepts in Action
1. Crawler crawls optimized dataset in
Amazon S3 Bucket
2. Crawler writes metadata into
AWS Glue Data Catalog
3. You query optimized dataset in Athena
4. Athena uses schema definition to read
data from Amazon S3 and returns results
NYC Taxi
Optimized Dataset
AWS Glue
Crawler
AWS Glue
Data Catalog
1
2
34
You
Hand-ons
2.3.1. Catalog Optimized Data with an
AWS Glue Crawler
2.3.2. Query Optimized Data with
Amazon Athena
SERVERLESS DATA LAKE ON AWS 16/4/20 26
② Workshop Lab Guide - Practice & Learn
16/4/20SERVERLESS DATA LAKE ON AWS 27
What is an AWS Glue Job?
An AWS Glue Job encapsulates the business logic that performs extract,
transform, and load (ETL) work.
§ A core building block in your production ETL pipeline
§ Provide your PySpark ETL script or have one automatically generated
§ Supports a rich set of built-in AWS Glue transformations
§ Jobs can be started, stopped, monitored
What is an AWS Glue trigger?
Triggers are the ‘Glue’ in your AWS Glue ETL pipeline.
§ Can be used to chain multiple AWS Glue Jobs in a series
§ Can start multiple Jobs at once
§ Can be Scheduled, On-demand, or based on Job Events
§ Can pass unique Parameters to customize AWS Glue job runs
Three ways to set up an AWS Glue ETL pipeline
• Schedule-driven
• Event-driven
• State Machine-driven
Schedule-driven AWS Glue ETL pipeline
We work our way backwards from a daily SLA deadline
New Raw
Data arrives
< 06:00 06:00
Crawl
Raw Dataset
Run
Optimize Job
06:15
Crawl
Optimized Dataset
07:00
SLA
deadline
08:00
Ready
for Reporting
07:30
15m 45m 30m 2h 30m
Event-driven AWS Glue ETL pipeline
Let Amazon CloudWatch Events and AWS Lambda drive the pipeline.
< 06:00 Start
Crawler
Start
Job or Trigger
Start
Crawler
08:00Reporting
Dataset
ready
Data arrives
in S3
Crawler
succeeds
Job
succeeds
New Raw
Data arrives
Crawl
Raw Dataset
Run
Optimize Job
Crawl
Optimized Dataset
SLA
deadline
Ready
for Reporting
State Machine-driven AWS Glue ETL pipeline
qLet AWS Step Functions drive the pipeline.
Concepts in Action
We’ll build a Schedule-driven pipeline.
1. Scheduled crawler runs on Raw Dataset
and updates AWS Glue Data Catalog
2. AWS Glue job starts based on Trigger(s)
3. Job reads Raw Dataset, applies
transformations, and writes Optimized
Dataset to your S3 bucket
4. Scheduled crawler runs on Optimized
Dataset and updates Data Catalog
NYC Taxi
Optimized Dataset
AWS Glue
Crawler for
Raw dataset
AWS Glue
Data
Catalog
1
4
AWS Glue Job to
created
Optimized Dataset
NYC Taxi
Raw Dataset
AWS Glue
Crawler for
Optimized dataset
3
2
Trigger
Hand-ons
2.4.1. Schedule AWS Glue Crawlers
2.4.2. Create an AWS Glue Job
2.4.3. Create an AWS Glue trigger to run Jobs
2.4.4. AWS Glue Job Bookmarks
2.4. Setup an AWS Glue ETL pipeline
SERVERLESS DATA LAKE ON AWS 16/4/20 35
② Workshop Lab Guide - Practice & Learn
16/4/20SERVERLESS DATA LAKE ON AWS 36
ML Problem Definition
Tip given to Taxi Drivers is substantial part of their Income
They can also serve as loose metric for Service Quality that can be useful for
taxi companies
What are influences for higher tips (e.g., Taxi starting and ending in Richer
part of cities?)
We present analysis of factors affecting Tips and use these factors to predict
Tips
ML Architecture
Results
transform()
Test Data
fit()
XGBoost-Regression
Data Splitting
Feature Engineering
Endpoint
Endpoint Configuration
Model Creation
Training Job
Amazon
SageMaker
Notebook Instance – Glue Spark
Concepts in Action
1. Launch Notebook
2. Read Optimized Dataset
3. Train the ML Model
4. Host the trained model and
perform Prediction
NYC Taxi
Optimized
AWS Glue
Data Catalog
AWS Glue Development Endpoint
SageMaker
Notebook
2
3
4
1
Hand-ons
3.1. Interactively develop a
Machine Learning model in Jupyter
SERVERLESS DATA LAKE ON AWS 16/4/20 40
Run ML Notebook interactively
• Initialize variables and import spark libraries
• Retrieve Optimized NYC Taxi trips Dataset
• Observe features against target label (tip_amount)
• Perform feature engineering
• Split feature engineered dataset into Train and Test
• Launch Spark XGBoostEstimator (Observe hyperparameters)
• Perform Prediction and observe accuracy
You made it!

More Related Content

What's hot

Advanced AWS techniques from the trenches of the Enterprise – Sourced Group
Advanced AWS techniques from the trenches of the Enterprise – Sourced GroupAdvanced AWS techniques from the trenches of the Enterprise – Sourced Group
Advanced AWS techniques from the trenches of the Enterprise – Sourced GroupAmazon Web Services
 
SRV409 Deep Dive on Microservices and Docker
SRV409 Deep Dive on Microservices and DockerSRV409 Deep Dive on Microservices and Docker
SRV409 Deep Dive on Microservices and DockerAmazon Web Services
 
Reducing Latency and Increasing Performance while Cutting Infrastructure Costs
Reducing Latency and Increasing Performance while Cutting Infrastructure CostsReducing Latency and Increasing Performance while Cutting Infrastructure Costs
Reducing Latency and Increasing Performance while Cutting Infrastructure CostsAmazon Web Services
 
AWS re:Invent 2016: Using AWS Lambda to Build Control Systems for Your AWS In...
AWS re:Invent 2016: Using AWS Lambda to Build Control Systems for Your AWS In...AWS re:Invent 2016: Using AWS Lambda to Build Control Systems for Your AWS In...
AWS re:Invent 2016: Using AWS Lambda to Build Control Systems for Your AWS In...Amazon Web Services
 
AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...
AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...
AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...Amazon Web Services
 
ENT303 Another Day, Another Billion Packets
ENT303 Another Day, Another Billion PacketsENT303 Another Day, Another Billion Packets
ENT303 Another Day, Another Billion PacketsAmazon Web Services
 
(NET307) Pinterest: The road from EC2-Classic To EC2-VPC
(NET307) Pinterest: The road from EC2-Classic To EC2-VPC(NET307) Pinterest: The road from EC2-Classic To EC2-VPC
(NET307) Pinterest: The road from EC2-Classic To EC2-VPCAmazon Web Services
 
Build a Website on AWS for Your First 10 Million Users
Build a Website on AWS for Your First 10 Million UsersBuild a Website on AWS for Your First 10 Million Users
Build a Website on AWS for Your First 10 Million UsersAmazon Web Services
 
Getting Started with Docker on AWS
Getting Started with Docker on AWSGetting Started with Docker on AWS
Getting Started with Docker on AWSAmazon Web Services
 
Real World Development: Peeling The Onion – Migrating A Monolithic Applicatio...
Real World Development: Peeling The Onion – Migrating A Monolithic Applicatio...Real World Development: Peeling The Onion – Migrating A Monolithic Applicatio...
Real World Development: Peeling The Onion – Migrating A Monolithic Applicatio...Amazon Web Services
 
AWS re:Invent 2016: Wild Rydes Takes Off – The Dawn of a New Unicorn (SVR309)
AWS re:Invent 2016: Wild Rydes Takes Off – The Dawn of a New Unicorn (SVR309)AWS re:Invent 2016: Wild Rydes Takes Off – The Dawn of a New Unicorn (SVR309)
AWS re:Invent 2016: Wild Rydes Takes Off – The Dawn of a New Unicorn (SVR309)Amazon Web Services
 
AWS re:Invent 2016: Serverless Architectural Patterns and Best Practices (ARC...
AWS re:Invent 2016: Serverless Architectural Patterns and Best Practices (ARC...AWS re:Invent 2016: Serverless Architectural Patterns and Best Practices (ARC...
AWS re:Invent 2016: Serverless Architectural Patterns and Best Practices (ARC...Amazon Web Services
 
SRV203 Getting Started with AWS Lambda and the Serverless Cloud
SRV203 Getting Started with AWS Lambda and the Serverless CloudSRV203 Getting Started with AWS Lambda and the Serverless Cloud
SRV203 Getting Started with AWS Lambda and the Serverless CloudAmazon Web Services
 
Getting Started with Serverless Architectures - August 2016 Monthly Webinar S...
Getting Started with Serverless Architectures - August 2016 Monthly Webinar S...Getting Started with Serverless Architectures - August 2016 Monthly Webinar S...
Getting Started with Serverless Architectures - August 2016 Monthly Webinar S...Amazon Web Services
 
Rackspace Best Practices for DevOps on AWS
Rackspace Best Practices for DevOps on AWSRackspace Best Practices for DevOps on AWS
Rackspace Best Practices for DevOps on AWSAmazon Web Services
 
SRV407 Deep Dive on Amazon Aurora
SRV407 Deep Dive on Amazon AuroraSRV407 Deep Dive on Amazon Aurora
SRV407 Deep Dive on Amazon AuroraAmazon Web Services
 
AWS re:Invent 2016: From EC2 to ECS: How Capital One uses Application Load Ba...
AWS re:Invent 2016: From EC2 to ECS: How Capital One uses Application Load Ba...AWS re:Invent 2016: From EC2 to ECS: How Capital One uses Application Load Ba...
AWS re:Invent 2016: From EC2 to ECS: How Capital One uses Application Load Ba...Amazon Web Services
 
AWS re:Invent 2016: Building Complex Serverless Applications (GPST404)
AWS re:Invent 2016: Building Complex Serverless Applications (GPST404)AWS re:Invent 2016: Building Complex Serverless Applications (GPST404)
AWS re:Invent 2016: Building Complex Serverless Applications (GPST404)Amazon Web Services
 
AWS re:Invent 2016: 6 Million New Registrations in 30 Days: How the Chick-fil...
AWS re:Invent 2016: 6 Million New Registrations in 30 Days: How the Chick-fil...AWS re:Invent 2016: 6 Million New Registrations in 30 Days: How the Chick-fil...
AWS re:Invent 2016: 6 Million New Registrations in 30 Days: How the Chick-fil...Amazon Web Services
 

What's hot (20)

Advanced AWS techniques from the trenches of the Enterprise – Sourced Group
Advanced AWS techniques from the trenches of the Enterprise – Sourced GroupAdvanced AWS techniques from the trenches of the Enterprise – Sourced Group
Advanced AWS techniques from the trenches of the Enterprise – Sourced Group
 
SRV409 Deep Dive on Microservices and Docker
SRV409 Deep Dive on Microservices and DockerSRV409 Deep Dive on Microservices and Docker
SRV409 Deep Dive on Microservices and Docker
 
Reducing Latency and Increasing Performance while Cutting Infrastructure Costs
Reducing Latency and Increasing Performance while Cutting Infrastructure CostsReducing Latency and Increasing Performance while Cutting Infrastructure Costs
Reducing Latency and Increasing Performance while Cutting Infrastructure Costs
 
AWS re:Invent 2016: Using AWS Lambda to Build Control Systems for Your AWS In...
AWS re:Invent 2016: Using AWS Lambda to Build Control Systems for Your AWS In...AWS re:Invent 2016: Using AWS Lambda to Build Control Systems for Your AWS In...
AWS re:Invent 2016: Using AWS Lambda to Build Control Systems for Your AWS In...
 
AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...
AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...
AWS re:Invent 2016: How Thermo Fisher Is Reducing Mass Spectrometry Experimen...
 
ENT303 Another Day, Another Billion Packets
ENT303 Another Day, Another Billion PacketsENT303 Another Day, Another Billion Packets
ENT303 Another Day, Another Billion Packets
 
(NET307) Pinterest: The road from EC2-Classic To EC2-VPC
(NET307) Pinterest: The road from EC2-Classic To EC2-VPC(NET307) Pinterest: The road from EC2-Classic To EC2-VPC
(NET307) Pinterest: The road from EC2-Classic To EC2-VPC
 
Build a Website on AWS for Your First 10 Million Users
Build a Website on AWS for Your First 10 Million UsersBuild a Website on AWS for Your First 10 Million Users
Build a Website on AWS for Your First 10 Million Users
 
Introduction to AWS X-Ray
Introduction to AWS X-RayIntroduction to AWS X-Ray
Introduction to AWS X-Ray
 
Getting Started with Docker on AWS
Getting Started with Docker on AWSGetting Started with Docker on AWS
Getting Started with Docker on AWS
 
Real World Development: Peeling The Onion – Migrating A Monolithic Applicatio...
Real World Development: Peeling The Onion – Migrating A Monolithic Applicatio...Real World Development: Peeling The Onion – Migrating A Monolithic Applicatio...
Real World Development: Peeling The Onion – Migrating A Monolithic Applicatio...
 
AWS re:Invent 2016: Wild Rydes Takes Off – The Dawn of a New Unicorn (SVR309)
AWS re:Invent 2016: Wild Rydes Takes Off – The Dawn of a New Unicorn (SVR309)AWS re:Invent 2016: Wild Rydes Takes Off – The Dawn of a New Unicorn (SVR309)
AWS re:Invent 2016: Wild Rydes Takes Off – The Dawn of a New Unicorn (SVR309)
 
AWS re:Invent 2016: Serverless Architectural Patterns and Best Practices (ARC...
AWS re:Invent 2016: Serverless Architectural Patterns and Best Practices (ARC...AWS re:Invent 2016: Serverless Architectural Patterns and Best Practices (ARC...
AWS re:Invent 2016: Serverless Architectural Patterns and Best Practices (ARC...
 
SRV203 Getting Started with AWS Lambda and the Serverless Cloud
SRV203 Getting Started with AWS Lambda and the Serverless CloudSRV203 Getting Started with AWS Lambda and the Serverless Cloud
SRV203 Getting Started with AWS Lambda and the Serverless Cloud
 
Getting Started with Serverless Architectures - August 2016 Monthly Webinar S...
Getting Started with Serverless Architectures - August 2016 Monthly Webinar S...Getting Started with Serverless Architectures - August 2016 Monthly Webinar S...
Getting Started with Serverless Architectures - August 2016 Monthly Webinar S...
 
Rackspace Best Practices for DevOps on AWS
Rackspace Best Practices for DevOps on AWSRackspace Best Practices for DevOps on AWS
Rackspace Best Practices for DevOps on AWS
 
SRV407 Deep Dive on Amazon Aurora
SRV407 Deep Dive on Amazon AuroraSRV407 Deep Dive on Amazon Aurora
SRV407 Deep Dive on Amazon Aurora
 
AWS re:Invent 2016: From EC2 to ECS: How Capital One uses Application Load Ba...
AWS re:Invent 2016: From EC2 to ECS: How Capital One uses Application Load Ba...AWS re:Invent 2016: From EC2 to ECS: How Capital One uses Application Load Ba...
AWS re:Invent 2016: From EC2 to ECS: How Capital One uses Application Load Ba...
 
AWS re:Invent 2016: Building Complex Serverless Applications (GPST404)
AWS re:Invent 2016: Building Complex Serverless Applications (GPST404)AWS re:Invent 2016: Building Complex Serverless Applications (GPST404)
AWS re:Invent 2016: Building Complex Serverless Applications (GPST404)
 
AWS re:Invent 2016: 6 Million New Registrations in 30 Days: How the Chick-fil...
AWS re:Invent 2016: 6 Million New Registrations in 30 Days: How the Chick-fil...AWS re:Invent 2016: 6 Million New Registrations in 30 Days: How the Chick-fil...
AWS re:Invent 2016: 6 Million New Registrations in 30 Days: How the Chick-fil...
 

Similar to Serverless Data Lake on AWS

Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018Amazon Web Services
 
AWS Certified Solutions Architect Professional Course S15-S18
AWS Certified Solutions Architect Professional Course S15-S18AWS Certified Solutions Architect Professional Course S15-S18
AWS Certified Solutions Architect Professional Course S15-S18Neal Davis
 
Big data and serverless - AWS UG The Netherlands
Big data and serverless - AWS UG The NetherlandsBig data and serverless - AWS UG The Netherlands
Big data and serverless - AWS UG The NetherlandsMarek Kuczynski
 
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftIntroduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftAmazon Web Services
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueAmazon Web Services
 
Getting Started with Serverless and Container Architectures
Getting Started with Serverless and Container ArchitecturesGetting Started with Serverless and Container Architectures
Getting Started with Serverless and Container ArchitecturesAmazon Web Services
 
Building Serverless ETL Pipelines
Building Serverless ETL PipelinesBuilding Serverless ETL Pipelines
Building Serverless ETL PipelinesAmazon Web Services
 
Batch Processing with Containers on AWS - June 2017 AWS Online Tech Talks
Batch Processing with Containers on AWS -  June 2017 AWS Online Tech TalksBatch Processing with Containers on AWS -  June 2017 AWS Online Tech Talks
Batch Processing with Containers on AWS - June 2017 AWS Online Tech TalksAmazon Web Services
 
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...Amazon Web Services
 
Introduction to Batch Processing on AWS
Introduction to Batch Processing on AWSIntroduction to Batch Processing on AWS
Introduction to Batch Processing on AWSAmazon Web Services
 
無伺服器架構和Containers on AWS入門
無伺服器架構和Containers on AWS入門 無伺服器架構和Containers on AWS入門
無伺服器架構和Containers on AWS入門 Amazon Web Services
 
Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Amazon Web Services
 
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018Amazon Web Services
 
Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)Sid Anand
 
Querying and Analyzing Data in Amazon S3
Querying and Analyzing Data in Amazon S3Querying and Analyzing Data in Amazon S3
Querying and Analyzing Data in Amazon S3Amazon Web Services
 

Similar to Serverless Data Lake on AWS (20)

Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018
 
AWS Certified Solutions Architect Professional Course S15-S18
AWS Certified Solutions Architect Professional Course S15-S18AWS Certified Solutions Architect Professional Course S15-S18
AWS Certified Solutions Architect Professional Course S15-S18
 
Big data and serverless - AWS UG The Netherlands
Big data and serverless - AWS UG The NetherlandsBig data and serverless - AWS UG The Netherlands
Big data and serverless - AWS UG The Netherlands
 
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftIntroduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF Loft
 
Introduction to AWS Glue
Introduction to AWS Glue Introduction to AWS Glue
Introduction to AWS Glue
 
What is AWS Glue
What is AWS GlueWhat is AWS Glue
What is AWS Glue
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS Glue
 
Big Data on AWS
Big Data on AWSBig Data on AWS
Big Data on AWS
 
Getting Started with Serverless and Container Architectures
Getting Started with Serverless and Container ArchitecturesGetting Started with Serverless and Container Architectures
Getting Started with Serverless and Container Architectures
 
AWS glue technical enablement training
AWS glue technical enablement trainingAWS glue technical enablement training
AWS glue technical enablement training
 
Building Serverless ETL Pipelines
Building Serverless ETL PipelinesBuilding Serverless ETL Pipelines
Building Serverless ETL Pipelines
 
Batch Processing with Containers on AWS - June 2017 AWS Online Tech Talks
Batch Processing with Containers on AWS -  June 2017 AWS Online Tech TalksBatch Processing with Containers on AWS -  June 2017 AWS Online Tech Talks
Batch Processing with Containers on AWS - June 2017 AWS Online Tech Talks
 
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
 
Introduction to Batch Processing on AWS
Introduction to Batch Processing on AWSIntroduction to Batch Processing on AWS
Introduction to Batch Processing on AWS
 
無伺服器架構和Containers on AWS入門
無伺服器架構和Containers on AWS入門 無伺服器架構和Containers on AWS入門
無伺服器架構和Containers on AWS入門
 
Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS
 
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
 
Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)
 
Querying and Analyzing Data in Amazon S3
Querying and Analyzing Data in Amazon S3Querying and Analyzing Data in Amazon S3
Querying and Analyzing Data in Amazon S3
 

More from Thanh Nguyen

Building a NFT Marketplace DApp
Building a NFT Marketplace DAppBuilding a NFT Marketplace DApp
Building a NFT Marketplace DAppThanh Nguyen
 
Serverless Architecture 101 ⚡
Serverless Architecture 101 ⚡Serverless Architecture 101 ⚡
Serverless Architecture 101 ⚡Thanh Nguyen
 
SmartChat WhatsApp-clone using AWS Amplify AppSync
SmartChat WhatsApp-clone using AWS Amplify AppSyncSmartChat WhatsApp-clone using AWS Amplify AppSync
SmartChat WhatsApp-clone using AWS Amplify AppSyncThanh Nguyen
 
Introduction to Ethereum Blockchain & Smart Contract
Introduction to Ethereum Blockchain & Smart ContractIntroduction to Ethereum Blockchain & Smart Contract
Introduction to Ethereum Blockchain & Smart ContractThanh Nguyen
 
Amazon AWS Free-Tier
Amazon AWS Free-TierAmazon AWS Free-Tier
Amazon AWS Free-TierThanh Nguyen
 
Rapid Software Development Process
Rapid Software Development ProcessRapid Software Development Process
Rapid Software Development ProcessThanh Nguyen
 
PMI ACP Classroom Question Paper
PMI ACP Classroom Question PaperPMI ACP Classroom Question Paper
PMI ACP Classroom Question PaperThanh Nguyen
 
PMI ACP Classroom Question Paper with Answers
PMI ACP Classroom Question Paper with AnswersPMI ACP Classroom Question Paper with Answers
PMI ACP Classroom Question Paper with AnswersThanh Nguyen
 
PMI-ACP Case Study
PMI-ACP Case StudyPMI-ACP Case Study
PMI-ACP Case StudyThanh Nguyen
 
PMI-ACP Lesson 12 Knowledge and Skills Nugget 4
PMI-ACP Lesson 12 Knowledge and Skills Nugget 4PMI-ACP Lesson 12 Knowledge and Skills Nugget 4
PMI-ACP Lesson 12 Knowledge and Skills Nugget 4Thanh Nguyen
 
PMI-ACP Lesson 12 Knowledge and Skills Nugget 3
PMI-ACP Lesson 12 Knowledge and Skills Nugget 3PMI-ACP Lesson 12 Knowledge and Skills Nugget 3
PMI-ACP Lesson 12 Knowledge and Skills Nugget 3Thanh Nguyen
 
PMI-ACP Lesson 12 Knowledge and Skills Nugget 2
PMI-ACP Lesson 12 Knowledge and Skills Nugget 2PMI-ACP Lesson 12 Knowledge and Skills Nugget 2
PMI-ACP Lesson 12 Knowledge and Skills Nugget 2Thanh Nguyen
 
PMI-ACP Lesson 12 Knowledge and Skills Nugget 1
PMI-ACP Lesson 12 Knowledge and Skills Nugget 1PMI-ACP Lesson 12 Knowledge and Skills Nugget 1
PMI-ACP Lesson 12 Knowledge and Skills Nugget 1Thanh Nguyen
 
PMI-ACP Lesson 11 Agile Value Stream Analysis
PMI-ACP Lesson 11 Agile Value Stream AnalysisPMI-ACP Lesson 11 Agile Value Stream Analysis
PMI-ACP Lesson 11 Agile Value Stream AnalysisThanh Nguyen
 
PMI-ACP Lesson 10 Agile Metrics
PMI-ACP Lesson 10 Agile MetricsPMI-ACP Lesson 10 Agile Metrics
PMI-ACP Lesson 10 Agile MetricsThanh Nguyen
 
PMI-ACP Lesson 9 Agile Risk Management
PMI-ACP Lesson 9 Agile Risk ManagementPMI-ACP Lesson 9 Agile Risk Management
PMI-ACP Lesson 9 Agile Risk ManagementThanh Nguyen
 
PMI-ACP Lesson 08 Nugget 2 Agile & Scrum - Value-Based Prioritization
PMI-ACP Lesson 08 Nugget 2 Agile & Scrum - Value-Based PrioritizationPMI-ACP Lesson 08 Nugget 2 Agile & Scrum - Value-Based Prioritization
PMI-ACP Lesson 08 Nugget 2 Agile & Scrum - Value-Based PrioritizationThanh Nguyen
 
PMI-ACP Lesson 08 Nugget 1 Agile & Scrum Value-based Prioritization
PMI-ACP Lesson 08 Nugget 1 Agile & Scrum Value-based PrioritizationPMI-ACP Lesson 08 Nugget 1 Agile & Scrum Value-based Prioritization
PMI-ACP Lesson 08 Nugget 1 Agile & Scrum Value-based PrioritizationThanh Nguyen
 
PMI-ACP Lesson 07 Soft Skills Negotiation
PMI-ACP Lesson 07 Soft Skills NegotiationPMI-ACP Lesson 07 Soft Skills Negotiation
PMI-ACP Lesson 07 Soft Skills NegotiationThanh Nguyen
 
PMI-ACP Lesson 06 Quality
PMI-ACP Lesson 06 QualityPMI-ACP Lesson 06 Quality
PMI-ACP Lesson 06 QualityThanh Nguyen
 

More from Thanh Nguyen (20)

Building a NFT Marketplace DApp
Building a NFT Marketplace DAppBuilding a NFT Marketplace DApp
Building a NFT Marketplace DApp
 
Serverless Architecture 101 ⚡
Serverless Architecture 101 ⚡Serverless Architecture 101 ⚡
Serverless Architecture 101 ⚡
 
SmartChat WhatsApp-clone using AWS Amplify AppSync
SmartChat WhatsApp-clone using AWS Amplify AppSyncSmartChat WhatsApp-clone using AWS Amplify AppSync
SmartChat WhatsApp-clone using AWS Amplify AppSync
 
Introduction to Ethereum Blockchain & Smart Contract
Introduction to Ethereum Blockchain & Smart ContractIntroduction to Ethereum Blockchain & Smart Contract
Introduction to Ethereum Blockchain & Smart Contract
 
Amazon AWS Free-Tier
Amazon AWS Free-TierAmazon AWS Free-Tier
Amazon AWS Free-Tier
 
Rapid Software Development Process
Rapid Software Development ProcessRapid Software Development Process
Rapid Software Development Process
 
PMI ACP Classroom Question Paper
PMI ACP Classroom Question PaperPMI ACP Classroom Question Paper
PMI ACP Classroom Question Paper
 
PMI ACP Classroom Question Paper with Answers
PMI ACP Classroom Question Paper with AnswersPMI ACP Classroom Question Paper with Answers
PMI ACP Classroom Question Paper with Answers
 
PMI-ACP Case Study
PMI-ACP Case StudyPMI-ACP Case Study
PMI-ACP Case Study
 
PMI-ACP Lesson 12 Knowledge and Skills Nugget 4
PMI-ACP Lesson 12 Knowledge and Skills Nugget 4PMI-ACP Lesson 12 Knowledge and Skills Nugget 4
PMI-ACP Lesson 12 Knowledge and Skills Nugget 4
 
PMI-ACP Lesson 12 Knowledge and Skills Nugget 3
PMI-ACP Lesson 12 Knowledge and Skills Nugget 3PMI-ACP Lesson 12 Knowledge and Skills Nugget 3
PMI-ACP Lesson 12 Knowledge and Skills Nugget 3
 
PMI-ACP Lesson 12 Knowledge and Skills Nugget 2
PMI-ACP Lesson 12 Knowledge and Skills Nugget 2PMI-ACP Lesson 12 Knowledge and Skills Nugget 2
PMI-ACP Lesson 12 Knowledge and Skills Nugget 2
 
PMI-ACP Lesson 12 Knowledge and Skills Nugget 1
PMI-ACP Lesson 12 Knowledge and Skills Nugget 1PMI-ACP Lesson 12 Knowledge and Skills Nugget 1
PMI-ACP Lesson 12 Knowledge and Skills Nugget 1
 
PMI-ACP Lesson 11 Agile Value Stream Analysis
PMI-ACP Lesson 11 Agile Value Stream AnalysisPMI-ACP Lesson 11 Agile Value Stream Analysis
PMI-ACP Lesson 11 Agile Value Stream Analysis
 
PMI-ACP Lesson 10 Agile Metrics
PMI-ACP Lesson 10 Agile MetricsPMI-ACP Lesson 10 Agile Metrics
PMI-ACP Lesson 10 Agile Metrics
 
PMI-ACP Lesson 9 Agile Risk Management
PMI-ACP Lesson 9 Agile Risk ManagementPMI-ACP Lesson 9 Agile Risk Management
PMI-ACP Lesson 9 Agile Risk Management
 
PMI-ACP Lesson 08 Nugget 2 Agile & Scrum - Value-Based Prioritization
PMI-ACP Lesson 08 Nugget 2 Agile & Scrum - Value-Based PrioritizationPMI-ACP Lesson 08 Nugget 2 Agile & Scrum - Value-Based Prioritization
PMI-ACP Lesson 08 Nugget 2 Agile & Scrum - Value-Based Prioritization
 
PMI-ACP Lesson 08 Nugget 1 Agile & Scrum Value-based Prioritization
PMI-ACP Lesson 08 Nugget 1 Agile & Scrum Value-based PrioritizationPMI-ACP Lesson 08 Nugget 1 Agile & Scrum Value-based Prioritization
PMI-ACP Lesson 08 Nugget 1 Agile & Scrum Value-based Prioritization
 
PMI-ACP Lesson 07 Soft Skills Negotiation
PMI-ACP Lesson 07 Soft Skills NegotiationPMI-ACP Lesson 07 Soft Skills Negotiation
PMI-ACP Lesson 07 Soft Skills Negotiation
 
PMI-ACP Lesson 06 Quality
PMI-ACP Lesson 06 QualityPMI-ACP Lesson 06 Quality
PMI-ACP Lesson 06 Quality
 

Recently uploaded

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 

Recently uploaded (20)

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 

Serverless Data Lake on AWS

  • 1. Serverless Data Lake on AWS Thanh Nguyen Senior Solutions Architect
  • 2. ① Workshop Lab Guide >> Get Ready! AWS Account: Glue, Athena, QuickSight Step-by-Step Lab Guide AWS CloudFormation: provision your AWS Cloud resources Amazon SageMaker ETL Notebook NYC-Taxi Create Optimized Dataset Job Amazon SageMaker XGBoost ETL Notebook SERVERLESS DATA LAKE ON AWS 16/4/20 2
  • 3. ② Workshop Lab Guide >> Practice & Learn 16/4/20SERVERLESS DATA LAKE ON AWS 3
  • 4. ① Workshop Lab Guide >> Get Ready! q Data Analytics at Unicorn-Taxi-Company q Your dataset: NYC Taxi Trips q Simplified Raw Dataset vs. Original Raw Dataset SERVERLESS DATA LAKE ON AWS 16/4/20 4
  • 5. Data Analytics at Unicorn-Taxi Company Create a Dataset for Reporting and Visualization Cleanse Transform Optimize for reporting queries Help solve a Machine Learning problem Unicorn-Taxi’s Data Scientists need to understand passenger tipping behavior SERVERLESS DATA LAKE ON AWS 16/4/20 5
  • 6. Your dataset: NYC Taxi Trips Yellow and Green taxi trips § Pick-up and drop-off Dates/Times § Pick-up and drop-off Locations § Trip Distances § Itemized Fares § Tip Amount § Driver-reported Passenger Counts For-Hire Vehicle (FHV) § Pick-up and drop-off Dates/Times § Pick-up and drop-off Locations § Dispatching base SERVERLESS DATA LAKE ON AWS 16/4/20 6 https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
  • 7. Your dataset: NYC Taxi Trips Original Raw Dataset Green and Yellow Taxi + FHV Years 2009 to 2018 ~1.6Bn rows 215 files 253 GB total Simplified Raw Dataset Only Yellow Taxi + few look-ups Jan to March 2017 ~2M rows 3 files 2.5 GB+ uncompressed Ready in a publicly accessible Amazon S3 bucket SERVERLESS DATA LAKE ON AWS 16/4/20 7
  • 8. ① Workshop Lab Guide >> Get Ready! q Navigate to the Serverless Data Lake Workshop website SERVERLESS DATA LAKE ON AWS 16/4/20 8
  • 9. ① Workshop Lab Guide >> Get Ready! Setup Prerequisite Resources using AWS CloudFormation 1. A new Data Lake Amazon S3 bucket 2. Necessary IAM Policies & Roles for AWS Glue, Amazon Athena, and Amazon SageMaker 3. An AWS Glue Development Endpoint 4. A number of named queries in Amazon Athena Finally, the NYC Taxi Trips Raw Dataset is copied into your Amazon S3 bucket SERVERLESS DATA LAKE ON AWS 16/4/20 9
  • 10. ② Workshop Lab Guide - Practice & Learn 16/4/20SERVERLESS DATA LAKE ON AWS 10
  • 11. AWS Glue Crawlers and the AWS Glue Data Catalog Crawler … § Samples and Classifies your data § Extracts MetaData § Infers Schema and partitioning format § Creates Tables in your account’s AWS Glue Data Catalog
  • 12. Why Query with Amazon Athena? • Interactive query service • Makes it easy to analyze data in Amazon S3 using standard SQL • Out-of-the-box integrated with AWS Glue Data Catalog • Serverless
  • 13. AWS Glue Crawler AWS Glue Data Catalog NYC Taxi Raw Dataset 1 Concepts in Action 1. Crawler crawls Raw Dataset in Amazon S3 Bucket 2. Crawler writes metadata into AWS Glue Data Catalog 3. You query Raw Dataset in Athena 4. Athena uses Schema definition to read Raw Dataset from S3 and returns results 2 34 You
  • 14. Hand-ons 2.1.1. Catalog Raw Data with an AWS Glue Crawler 2.1.2. Explore Table Schema and Metadata in AWS Glue Data Catalog 2.1.3. Query Raw Data with Amazon Athena SERVERLESS DATA LAKE ON AWS 16/4/20 14
  • 15. ② Workshop Lab Guide - Practice & Learn 16/4/20SERVERLESS DATA LAKE ON AWS 15
  • 16. Interactive ETL Development with Glue & SageMaker What is Amazon SageMaker? § ... is a fully-managed ML Platform § … enables you to build, train, and deploy Machine Learning Models at any scale § … provides a Jupyter Notebook environment
  • 17. Connect your IDE to an AWS Glue Development Endpoint. Environment to interactively develop, debug, and test ETLcode. AWS Glue Spark environment Interpreter Server Remote Interpreter What is an AWS Glue Development Endpoint?
  • 18. Interactive ETL Development with Glue & SageMaker AWS Glue enables you to q… create an Amazon SageMaker Notebook environment q... connect it to an AWS Glue Dev Endpoint q… and finally, access your Jupyter notebook All without leaving the AWS Glue console
  • 19. AWS Glue Data Catalog NYC Taxi Raw Dataset 3 Concepts in Action 1. You author ETL code in an Amazon SageMaker Notebook 2. ETL code runs on AWS Glue Dev Endpoint 3. ETL code… a. Reads Raw Dataset b. Applies transformations c. Writes Optimized Dataset back to your Amazon S3 bucket 4. You use your ETL code in notebook to create a script file 2 1 Amazon SageMaker Notebook NYC Taxi Optimized Dataset 4 You
  • 20. Hand-ons 2.2.1. Create an Amazon SageMaker Notebook instance 2.2.2. Interactively author and run ETL code in Jupyter SERVERLESS DATA LAKE ON AWS 16/4/20 20
  • 21. Transformations we’ll apply to Raw Data § Join NYC Trips Dataset with look-up tables (denormalization) § Create new Timestamp columns for pick-up & drop-off § Drop unnecessary columns § Partition the dataset § Convert to a columnar file format (parquet)
  • 22. ② Workshop Lab Guide - Practice & Learn 16/4/20SERVERLESS DATA LAKE ON AWS 23
  • 23. Concepts in Action 1. Crawler crawls optimized dataset in Amazon S3 Bucket 2. Crawler writes metadata into AWS Glue Data Catalog 3. You query optimized dataset in Athena 4. Athena uses schema definition to read data from Amazon S3 and returns results NYC Taxi Optimized Dataset AWS Glue Crawler AWS Glue Data Catalog 1 2 34 You
  • 24. Hand-ons 2.3.1. Catalog Optimized Data with an AWS Glue Crawler 2.3.2. Query Optimized Data with Amazon Athena SERVERLESS DATA LAKE ON AWS 16/4/20 26
  • 25. ② Workshop Lab Guide - Practice & Learn 16/4/20SERVERLESS DATA LAKE ON AWS 27
  • 26. What is an AWS Glue Job? An AWS Glue Job encapsulates the business logic that performs extract, transform, and load (ETL) work. § A core building block in your production ETL pipeline § Provide your PySpark ETL script or have one automatically generated § Supports a rich set of built-in AWS Glue transformations § Jobs can be started, stopped, monitored
  • 27. What is an AWS Glue trigger? Triggers are the ‘Glue’ in your AWS Glue ETL pipeline. § Can be used to chain multiple AWS Glue Jobs in a series § Can start multiple Jobs at once § Can be Scheduled, On-demand, or based on Job Events § Can pass unique Parameters to customize AWS Glue job runs
  • 28. Three ways to set up an AWS Glue ETL pipeline • Schedule-driven • Event-driven • State Machine-driven
  • 29. Schedule-driven AWS Glue ETL pipeline We work our way backwards from a daily SLA deadline New Raw Data arrives < 06:00 06:00 Crawl Raw Dataset Run Optimize Job 06:15 Crawl Optimized Dataset 07:00 SLA deadline 08:00 Ready for Reporting 07:30 15m 45m 30m 2h 30m
  • 30. Event-driven AWS Glue ETL pipeline Let Amazon CloudWatch Events and AWS Lambda drive the pipeline. < 06:00 Start Crawler Start Job or Trigger Start Crawler 08:00Reporting Dataset ready Data arrives in S3 Crawler succeeds Job succeeds New Raw Data arrives Crawl Raw Dataset Run Optimize Job Crawl Optimized Dataset SLA deadline Ready for Reporting
  • 31. State Machine-driven AWS Glue ETL pipeline qLet AWS Step Functions drive the pipeline.
  • 32. Concepts in Action We’ll build a Schedule-driven pipeline. 1. Scheduled crawler runs on Raw Dataset and updates AWS Glue Data Catalog 2. AWS Glue job starts based on Trigger(s) 3. Job reads Raw Dataset, applies transformations, and writes Optimized Dataset to your S3 bucket 4. Scheduled crawler runs on Optimized Dataset and updates Data Catalog NYC Taxi Optimized Dataset AWS Glue Crawler for Raw dataset AWS Glue Data Catalog 1 4 AWS Glue Job to created Optimized Dataset NYC Taxi Raw Dataset AWS Glue Crawler for Optimized dataset 3 2 Trigger
  • 33. Hand-ons 2.4.1. Schedule AWS Glue Crawlers 2.4.2. Create an AWS Glue Job 2.4.3. Create an AWS Glue trigger to run Jobs 2.4.4. AWS Glue Job Bookmarks 2.4. Setup an AWS Glue ETL pipeline SERVERLESS DATA LAKE ON AWS 16/4/20 35
  • 34. ② Workshop Lab Guide - Practice & Learn 16/4/20SERVERLESS DATA LAKE ON AWS 36
  • 35. ML Problem Definition Tip given to Taxi Drivers is substantial part of their Income They can also serve as loose metric for Service Quality that can be useful for taxi companies What are influences for higher tips (e.g., Taxi starting and ending in Richer part of cities?) We present analysis of factors affecting Tips and use these factors to predict Tips
  • 36. ML Architecture Results transform() Test Data fit() XGBoost-Regression Data Splitting Feature Engineering Endpoint Endpoint Configuration Model Creation Training Job Amazon SageMaker Notebook Instance – Glue Spark
  • 37. Concepts in Action 1. Launch Notebook 2. Read Optimized Dataset 3. Train the ML Model 4. Host the trained model and perform Prediction NYC Taxi Optimized AWS Glue Data Catalog AWS Glue Development Endpoint SageMaker Notebook 2 3 4 1
  • 38. Hand-ons 3.1. Interactively develop a Machine Learning model in Jupyter SERVERLESS DATA LAKE ON AWS 16/4/20 40
  • 39. Run ML Notebook interactively • Initialize variables and import spark libraries • Retrieve Optimized NYC Taxi trips Dataset • Observe features against target label (tip_amount) • Perform feature engineering • Split feature engineered dataset into Train and Test • Launch Spark XGBoostEstimator (Observe hyperparameters) • Perform Prediction and observe accuracy