© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark
HC Lo, Solutions Architect
Data Catalog & ETL - Glue & Athena
September 12, 2019
What is AWS Glue Data Catalog?
Unified metadata repository across relational databases, Amazon RDS, Amazon
Redshift, and Amazon S3…with support for more coming!
• Get a single view into your data, no matter where it is stored
• Automatically classify your data in one central list that is searchable
• Track data evolution using schema versioning
• Query your data using Amazon Athena or Amazon Redshift Spectrum
• Hive metastore compatible; can be used as an external Hive Metastore for
applications running on Amazon EMR
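As a rough illustration of the "single view" idea, the sketch below flattens Data Catalog table metadata into a name-to-columns map. It assumes the response shape of the Glue `get_tables` call (`TableList`, `StorageDescriptor`, `Columns`); the database name `weblogs` and the sample table are hypothetical, and the demonstration runs offline on a hand-written page rather than a real catalog.

```python
from typing import Dict, Iterable, List


def summarize_tables(pages: Iterable[dict]) -> Dict[str, List[str]]:
    """Map each table name to its 'column:type' strings from get_tables pages."""
    summary: Dict[str, List[str]] = {}
    for page in pages:
        for table in page.get("TableList", []):
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            summary[table["Name"]] = [f"{c['Name']}:{c['Type']}" for c in columns]
    return summary


# Against a real catalog (needs boto3 and AWS credentials; "weblogs" is a
# hypothetical database name):
#   import boto3
#   paginator = boto3.client("glue").get_paginator("get_tables")
#   print(summarize_tables(paginator.paginate(DatabaseName="weblogs")))

# Offline demonstration with a hand-written page of the same shape.
sample_pages = [{
    "TableList": [{
        "Name": "access_logs",
        "StorageDescriptor": {"Columns": [
            {"Name": "ip", "Type": "string"},
            {"Name": "status", "Type": "int"},
        ]},
    }]
}]
print(summarize_tables(sample_pages))
```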
What is a Data Lake?
Architectural pattern enabling:
• Ubiquitous storage at any scale
• Consolidated data processing
• Collaboration and analysis of data in different ways, leading to better, faster decision making
Most comprehensive
Broadest and deepest portfolio, purpose-built for builders
• Data Movement: Migration & Streaming Services
• Data Lake Infrastructure & Management: Infrastructure; Data Catalog & ETL; Security & Management
• Analytics: Data Warehousing; Big Data Processing; Interactive Query; Operational Analytics; Real-time Analytics; Serverless Data Processing
• Visualization, Engagement, & Machine Learning: Dashboards; Predictive Analytics; Digital User Engagement
Most comprehensive
Broadest and deepest portfolio, purpose-built for builders
• Data Movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams | Managed Streaming for Kafka
• Data Lake Infrastructure & Management: S3/Glacier | Glue | Lake Formation
• Analytics: Redshift | EMR (Spark & Hadoop) | Athena | Elasticsearch Service | Kinesis Data Analytics | Glue (Spark & Python)
• Visualization, Engagement, & Machine Learning: QuickSight | Pinpoint | SageMaker | Comprehend | Lex | Polly | Rekognition | Translate | Transcribe | + 11 more
Popular Customer Use Cases
Data Lake on AWS
[Diagram components: your data (on-premises data, web app data, Amazon RDS, other databases, streaming data), AWS Glue ETL, Amazon QuickSight, Amazon SageMaker]
Log Aggregation
[Diagram components: log sources (AWS service logs, web application logs, server logs), S3, Lambda, S3, Glue Data Catalog, Athena]
[Diagram labels: new-file trigger, copy to new partition, create partition on S3, update table partition, query data]
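The "update table partition" step of this pattern could be sketched as a Lambda handler that turns each new-object notification into a partition DDL statement. This is only an illustration under stated assumptions: the table name `access_logs`, the bucket name, and the Hive-style `name=value` key layout are hypothetical, partition values are not escaped, and the sketch stops at building the statement rather than copying objects or submitting the DDL to Athena.

```python
from urllib.parse import unquote_plus


def partition_ddl(record: dict, table: str) -> str:
    """Build an ALTER TABLE ... ADD PARTITION statement for one S3 event record."""
    bucket = record["s3"]["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["s3"]["object"]["key"])
    prefix, _, _filename = key.rpartition("/")
    # Assumed Hive-style layout: every "name=value" path segment is a partition column.
    pairs = [segment.split("=", 1) for segment in prefix.split("/") if "=" in segment]
    if not pairs:
        raise ValueError(f"no partition segments in key: {key}")
    spec = ", ".join(f"{name}='{value}'" for name, value in pairs)
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({spec}) "
        f"LOCATION 's3://{bucket}/{prefix}/'"
    )


def handler(event, context):
    """Lambda entry point: one DDL statement per newly created object."""
    return [partition_ddl(record, "access_logs") for record in event["Records"]]


# Offline demonstration with a hand-written event of the S3 notification shape.
sample_event = {"Records": [{"s3": {
    "bucket": {"name": "example-log-bucket"},
    "object": {"key": "logs/year%3D2019/month%3D09/day%3D12/part-0001.gz"},
}}]}
statements = handler(sample_event, None)
print(statements[0])
```

In a real deployment the returned statements would still need to be submitted to Athena (or the partition registered through the Glue API); that part is left out here.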
Log Aggregation with ETL
[Diagram components: log sources (AWS service logs, web application logs, server logs), S3, Glue crawler, Glue ETL, S3, Glue Data Catalog, Athena]
[Diagram labels: create partition on S3, update table partition, query data]
Real-Time Data Collection
[Diagram components: Kinesis, S3, Glue ETL, Glue Data Catalog, Athena]
[Diagram labels: real-time events, store partitioned in S3, trigger job, update table partition, query data]
Data Export
[Diagram components: Database Migration Service, S3, Glue ETL, Glue Data Catalog, Athena]
[Diagram labels: database migration, exported tables in S3, trigger job, update table partition, query data]
SaaS Model
[Diagram components: S3, Glue Data Catalog, Athena]
[Diagram labels: application request, hot data, warm & cold data, query data]
Data Science
[Diagram components: S3, Athena, S3, Glue ETL, Athena, SageMaker, EMR, Glue Data Catalog]
[Diagram labels: application data, enrichment, feature store]
Data Enrichment – Amazon Comprehend
[Diagram components, with steps numbered 1 to 5 on the slide: Amazon Reviews Dataset, S3, AWS Glue ETL, Comprehend, S3, Glue crawler, Glue Data Catalog, Athena, QuickSight]
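The enrichment step could look roughly like the sketch below, which adds a sentiment label to one review record. It assumes the Comprehend `detect_sentiment` call with `Text` and `LanguageCode` arguments; the field name `review_body`, the 5,000-byte clipping limit, and the stub client used for the offline run are assumptions made for illustration, not details taken from the slide.

```python
MAX_TEXT_BYTES = 5000  # assumed per-request text limit; check the service quota


def enrich_review(review: dict, comprehend, text_field: str = "review_body") -> dict:
    """Return a copy of the review with sentiment fields added."""
    text = review.get(text_field) or ""
    # Clip on a byte boundary, dropping any partial trailing character.
    clipped = text.encode("utf-8")[:MAX_TEXT_BYTES].decode("utf-8", errors="ignore")
    if not clipped:
        return {**review, "sentiment": None, "sentiment_positive": None}
    result = comprehend.detect_sentiment(Text=clipped, LanguageCode="en")
    return {
        **review,
        "sentiment": result["Sentiment"],
        "sentiment_positive": result["SentimentScore"]["Positive"],
    }


class StubComprehend:
    """Offline stand-in; a real client would be boto3.client('comprehend')."""

    def detect_sentiment(self, Text, LanguageCode):
        return {
            "Sentiment": "POSITIVE",
            "SentimentScore": {"Positive": 0.9, "Negative": 0.02,
                               "Neutral": 0.07, "Mixed": 0.01},
        }


enriched = enrich_review({"review_id": "R1", "review_body": "Great product"},
                         StubComprehend())
print(enriched)
```

Inside a Glue ETL job the same function would typically be applied per record or per batch; batching and throttling are not shown.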
Data Ingest in Parquet Format
[Diagram components, with steps numbered 1 to 5 on the slide: Amazon Connect, Kinesis Data Streams, Kinesis Data Firehose, S3, AWS Glue Data Catalog, Athena, Redshift Spectrum]
[Diagram labels: agent events, Firehose output schema, Parquet]
Analytics Reporting
[Diagram components: Glue Data Catalog, Athena, Redshift Spectrum, EMR, API, QuickSight]
Amazon Athena is an interactive query service
that makes it easy to analyze data directly in
Amazon S3 using standard SQL
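A minimal sketch of what "interactive query" means programmatically: submit SQL, poll for a terminal state, and read back how much data was scanned. It assumes the Athena `start_query_execution` and `get_query_execution` calls as exposed by boto3; the database name, output location, and the stub client used for the offline run are hypothetical.

```python
import time


def run_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Submit a query and poll until it reaches a terminal state."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
        state = execution["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return {
                "query_id": query_id,
                "state": state,
                "bytes_scanned": execution.get("Statistics", {}).get("DataScannedInBytes", 0),
            }
        time.sleep(poll_seconds)


class StubAthena:
    """Offline stand-in that reports QUEUED, RUNNING, then SUCCEEDED."""

    def __init__(self):
        self._states = iter(["QUEUED", "RUNNING", "SUCCEEDED"])

    def start_query_execution(self, **kwargs):
        self.request = kwargs
        return {"QueryExecutionId": "stub-query-1"}

    def get_query_execution(self, QueryExecutionId):
        return {"QueryExecution": {
            "Status": {"State": next(self._states)},
            "Statistics": {"DataScannedInBytes": 1048576},
        }}


stub = StubAthena()
outcome = run_query(stub, "SELECT * FROM tableName LIMIT 10", "default",
                    "s3://example-bucket/athena-results/", poll_seconds=0)
print(outcome)
```

With real credentials, `boto3.client("athena")` would take the place of `StubAthena()`; result rows are written to the output location and are not fetched here.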
Why Amazon Athena?
• Decouple storage from compute
• Serverless – no infrastructure or resources to manage
• Pay only for data scanned
• Schema on read – same data, many views
• Secure – IAM for authentication; encryption at rest and in transit
• Standards-compliant, open storage file formats
• Built on powerful, community-supported OSS solutions
Familiar Technologies Under the Covers
Presto: used for SQL queries
• In-memory distributed query engine
• ANSI SQL compatible, with extensions
(e.g., SELECT * FROM tableName)
Apache Hive: used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
(e.g., CREATE TABLE, ALTER TABLE, MSCK REPAIR)
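To make the split between DDL and SQL concrete, the sketch below generates the kind of statements named above: a partitioned CREATE EXTERNAL TABLE, an MSCK REPAIR TABLE to load partitions, and a SELECT that filters on a partition column. Table, column, and bucket names are hypothetical, and identifiers are interpolated without escaping, so treat it as an illustration rather than a safe statement builder.

```python
def create_table_ddl(table, columns, partitions, location, stored_as="PARQUET"):
    """Render a CREATE EXTERNAL TABLE statement for data already in S3."""
    cols = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    parts = ", ".join(f"`{name}` {dtype}" for name, dtype in partitions)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"STORED AS {stored_as}\n"
        f"LOCATION '{location}'"
    )


def repair_ddl(table):
    """Render the statement that discovers Hive-style partitions under the table location."""
    return f"MSCK REPAIR TABLE {table}"


ddl = create_table_ddl(
    "access_logs",
    columns=[("ip", "string"), ("status", "int")],
    partitions=[("year", "string")],
    location="s3://example-log-bucket/logs/",
)
# A query that prunes to one partition, so only that prefix is scanned.
query = "SELECT status, count(*) AS hits FROM access_logs WHERE year = '2019' GROUP BY status"

print(ddl)
print(repair_ddl("access_logs"))
print(query)
```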
Presto SQL
• ANSI SQL compliant
• Complex joins, nested queries & window
functions
• Complex data types (arrays, structs, maps)
• Presto built-in functions
• File Formats: CSV, JSON, RegEx, Parquet, Avro,
ORC, CloudTrail
• Compression: GZIP, Zlib, LZO, Snappy
• Integrated with AWS Glue Data Catalog
A Better Model
Old Methodology
• Analyst asks for a report
• Developer writes code
• Code executes on shared
cluster for several hours
• Analyst reviews report
• Analyst asks for more…
With Amazon Athena
• Analyst creates table
• Analyst iterates
• Generate final report
Simple, Quick and No Infrastructure or Developer to Manage
Simple Pricing
• DDL operations – FREE
• SQL operations – FREE
• Query concurrency – FREE
• Data scanned - $5 / TB
• Standard S3 rates for storage, requests, and data transfer
apply
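The scan-based price above works out as simple arithmetic. The sketch assumes a binary terabyte (1024^4 bytes) and ignores any per-query minimum or rounding the service may apply, so the figures are estimates only.

```python
PRICE_PER_TB_USD = 5.0      # from the slide: $5 per TB scanned
BYTES_PER_TB = 1024 ** 4    # assumption: binary terabyte


def scan_cost_usd(bytes_scanned: int) -> float:
    """Estimate the query charge for a given number of bytes scanned."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


full_scan = scan_cost_usd(BYTES_PER_TB)          # scanning a full 1 TB dataset
pruned_scan = scan_cost_usd(BYTES_PER_TB // 4)   # same query touching a quarter of it
print(full_scan, pruned_scan)
```

The second figure is why partitioning and columnar formats matter for cost: a query that reads a quarter of the bytes costs a quarter as much.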
Security and Access Control
• Encryption – SSE, SSE-KMS, CSE-KMS
    • Auto-detect source bucket KMS key
    • Destination bucket may use a separate key
• Access control
    • IAM
    • S3 ACL
    • S3 bucket policies
• Coming… Authorization with Glue Data Catalog
    • Database level
    • Table level
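The encryption choices for query results could be expressed as the `ResultConfiguration` passed when a query is started. The sketch assumes the Athena API values `SSE_S3`, `SSE_KMS`, and `CSE_KMS`, and assumes that "SSE" on the slide refers to S3-managed keys; the bucket and key ARN are hypothetical.

```python
# Slide shorthand mapped to Athena EncryptionOption values (mapping of "SSE" is assumed).
ENCRYPTION_OPTIONS = {"SSE": "SSE_S3", "SSE-KMS": "SSE_KMS", "CSE-KMS": "CSE_KMS"}


def result_configuration(output_location, encryption=None, kms_key=None):
    """Build the ResultConfiguration dict for start_query_execution."""
    config = {"OutputLocation": output_location}
    if encryption is not None:
        option = ENCRYPTION_OPTIONS[encryption]
        settings = {"EncryptionOption": option}
        if option != "SSE_S3":
            # KMS-based options need a key identifier.
            if not kms_key:
                raise ValueError(f"{encryption} requires a KMS key")
            settings["KmsKey"] = kms_key
        config["EncryptionConfiguration"] = settings
    return config


plain = result_configuration("s3://example-bucket/results/")
kms = result_configuration(
    "s3://example-bucket/results/",
    encryption="SSE-KMS",
    kms_key="arn:aws:kms:us-east-1:111122223333:key/example",  # hypothetical ARN
)
print(plain)
print(kms)
```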
Cost Monitoring
• Billing console provides spend per account
• Athena APIs are logged in CloudTrail
• Combine CloudTrail and the Athena API to get cost per IAM user
• More cost controls to come…
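One way the CloudTrail-plus-Athena combination might work: take `StartQueryExecution` events from CloudTrail, look up each query's scanned bytes through the Athena API, and sum the cost per caller. The field names `userIdentity.arn` and `responseElements.queryExecutionId`, the binary-terabyte conversion, and all sample values are assumptions for illustration; the scanned-bytes lookup is passed in as a plain dict so the sketch runs offline.

```python
from collections import defaultdict

PRICE_PER_TB_USD = 5.0      # from the pricing slide
BYTES_PER_TB = 1024 ** 4    # assumption: binary terabyte


def cost_per_user(cloudtrail_events, scanned_bytes_by_query):
    """Sum estimated query cost per caller ARN."""
    totals = defaultdict(float)
    for event in cloudtrail_events:
        if event.get("eventName") != "StartQueryExecution":
            continue
        # Assumed location of the query ID in the CloudTrail record.
        query_id = (event.get("responseElements") or {}).get("queryExecutionId")
        if query_id not in scanned_bytes_by_query:
            continue
        user = event["userIdentity"]["arn"]
        totals[user] += scanned_bytes_by_query[query_id] / BYTES_PER_TB * PRICE_PER_TB_USD
    return dict(totals)


# Hand-written sample data; real values would come from CloudTrail logs and
# from Athena get_query_execution (Statistics.DataScannedInBytes).
events = [
    {"eventName": "StartQueryExecution",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/alice"},
     "responseElements": {"queryExecutionId": "q1"}},
    {"eventName": "StartQueryExecution",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/alice"},
     "responseElements": {"queryExecutionId": "q2"}},
    {"eventName": "StartQueryExecution",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/bob"},
     "responseElements": {"queryExecutionId": "q3"}},
    {"eventName": "GetQueryResults",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/bob"},
     "responseElements": None},
]
scanned = {"q1": 2 ** 40, "q2": 2 ** 39, "q3": 2 ** 38}
report = cost_per_user(events, scanned)
print(report)
```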
LAB 2 - Guide
http://bit.ly/2md1R9z
[AWS Amazon Cloud Community]
Want more?
Add us as a LINE friend now >> Stay on top of the latest AWS news!
Thank you!

 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

AWS Glue Data Catalog & Athena for Data Lake Analytics

  • 1. Data Catalog & ETL - Glue & Athena. HC Lo, Solutions Architect. September 12, 2019. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark
  • 2. What is AWS Glue Data Catalog? Unified metadata repository across relational databases, Amazon RDS, Amazon Redshift, and Amazon S3…with support for more coming!
      • Get a single view into your data, no matter where it is stored
      • Automatically classify your data in one central list that is searchable
      • Track data evolution using schema versioning
      • Query your data using Amazon Athena or Amazon Redshift Spectrum
      • Hive metastore compatible; can be used as an external Hive Metastore for applications running on Amazon EMR
  • 3. What is a Data Lake? Architectural pattern enabling:
      • Ubiquitous storage at any scale
      • Consolidated data processing
      • Collaborating on and analyzing data in different ways, leading to better, faster decision making
  • 4. Most comprehensive: broadest and deepest portfolio, purpose-built for builders. Diagram labels: Migration & Streaming Services; Infrastructure; Data Catalog & ETL; Security & Management; Data Warehousing; Big Data Processing; Interactive Query; Operational Analytics; Real-time Analytics; Serverless Data Processing; Data Movement; Analytics; Data Lake Infrastructure & Management; Dashboards; Predictive Analytics; Visualization, Engagement, & Machine Learning; Digital User Engagement
  • 5. Most comprehensive: broadest and deepest portfolio, purpose-built for builders. Diagram labels: Data Movement; Analytics; + 11 more; Redshift; EMR (Spark & Hadoop); Athena; Elasticsearch Service; Kinesis Data Analytics; Glue (Spark & Python); S3/Glacier; Glue; Lake Formation; Visualization, Engagement, & Machine Learning; QuickSight; SageMaker; Comprehend; Lex; Polly; Rekognition; Translate; Transcribe; Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams | Managed Streaming for Kafka; Data Lake Infrastructure & Management; Pinpoint
  • 6. Popular Customer Use Cases
  • 7. Data Lake on AWS. Diagram labels: On-premises data; Web app data; Amazon RDS; Other databases; Streaming data; Your data; AWS Glue ETL; Amazon QuickSight; Amazon SageMaker
  • 8. Log Aggregation. Diagram labels: AWS Service Logs; Web Application Logs; Server Logs; S3; Athena; New File Trigger; Update table partition; Create partition on S3; Copy to new partition; Query data; S3; Lambda; Glue Data Catalog
  • 9. Log Aggregation with ETL. Diagram labels: AWS Service Logs; Web Application Logs; Server Logs; S3; Athena; Glue Crawler; Update table partition; Create partition on S3; Query data; S3; Glue ETL; Glue Data Catalog
  • 10. Real-Time Data Collection. Diagram labels: S3; Athena; Real-time events; Store partitioned in S3; Trigger Job; Update table partition; Query data; Kinesis; Glue ETL; Glue Data Catalog
  • 11. Data Export. Diagram labels: S3; Athena; Database Migration; Exported tables in S3; Trigger Job; Update table partition; Query data; Database Migration Service; Glue ETL; Glue Data Catalog
  • 12. SaaS Model. Diagram labels: S3; Athena; Query data; Hot data; Warm & cold data; Application request; Glue Data Catalog
  • 13. Data Science. Diagram labels: S3; Athena; Application Data; S3; Glue ETL; Athena; SageMaker; EMR; Enrichment; Feature Store; Glue Data Catalog
  • 14. Data Enrichment – Amazon Comprehend. Diagram labels (steps numbered 1 to 5 on the slide): S3; S3; AWS Glue ETL; Athena; Amazon Reviews Dataset; Glue Data Catalog; Comprehend; Glue Crawler; QuickSight
  • 15. Data Ingest in Parquet Format. Diagram labels (steps numbered 1 to 5 on the slide): Amazon Connect; Kinesis Data Streams; Agent Events; Kinesis Data Firehose; S3; Athena; AWS Glue Data Catalog; Firehose Output Schema; Parquet; Redshift Spectrum
  • 16. Analytics Reporting. Diagram labels: Athena; Redshift Spectrum; EMR; API; QuickSight; Glue Data Catalog
  • 17. Amazon Athena is an interactive query service that makes it easy to analyze data directly on Amazon S3 using standard SQL.
  • 18. Why Amazon Athena?
      • Decouple storage from compute
      • Serverless – no infrastructure or resources to manage
      • Pay only for data scanned
      • Schema on read – same data, many views
      • Secure – IAM for authentication; encryption at rest & in transit
      • Standards-compliant and open storage file formats
      • Built on powerful community-supported OSS solutions
  • 19. Familiar Technologies Under the Covers
      Used for SQL queries
      • In-memory distributed query engine
      • ANSI-SQL compatible with extensions (e.g. SELECT * FROM tableName)
      Used for DDL functionality
      • Complex data types
      • Multitude of formats
      • Supports data partitioning (e.g. CREATE TABLE, ALTER TABLE, MSCK REPAIR)
  • 20. Presto SQL
      • ANSI SQL compliant
      • Complex joins, nested queries & window functions
      • Complex data types (arrays, structs, maps)
      • Presto built-in functions
      • File formats: CSV, JSON, RegEx, Parquet, Avro, ORC, CloudTrail
      • Compression: GZIP, Zlib, LZO, Snappy
      • Integrated with AWS Glue Data Catalog
  • 21. A Better Model
      Old Methodology
      • Analyst asks for a report
      • Developer writes code
      • Code executes on shared cluster for several hours
      • Analyst reviews report
      • Analyst asks for more…
      With Amazon Athena
      • Analyst creates table
      • Analyst iterates
      • Generate final report
      Simple, quick, and no infrastructure or developer to manage
  • 22. Simple Pricing
      • DDL operations – FREE
      • SQL operations – FREE
      • Query concurrency – FREE
      • Data scanned – $5 / TB
      • Standard S3 rates for storage, requests, and data transfer apply
  • 23. Security and Access Control
      • Encryption – SSE, SSE-KMS, CSE-KMS
          • Auto-detect source bucket KMS key
          • Destination bucket may use separate key
      • Access control
          • IAM
          • S3 ACL
          • S3 bucket policies
      • Coming… Authorization with Glue Data Catalog
          • Database level
          • Table level
  • 24. Cost Monitoring
      • Billing console provides spend per account
      • Athena APIs are logged in CloudTrail
      • Combine CloudTrail and Athena API for per-IAM-user cost
      • More cost controls to come…
  • 25. LAB 2 - Guide: http://bit.ly/2md1R9z
  • 26. Thank you! 【AWS 亞馬遜雲端聚落】 Want more? Add us as a LINE friend now to keep up with the latest AWS news!
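
As a rough illustration of the "Data scanned – $5 / TB" line on the Simple Pricing slide (slide 22), the sketch below estimates what a query might cost from the number of bytes it scans. It is not an official calculator: the function name `estimate_scan_cost`, the choice of 2**40 bytes per TB, and the example data sizes are all assumptions made for illustration, and any per-query minimums or rounding that the service applies are not modeled.

```python
# Rough cost estimate for Athena-style "pay per data scanned" pricing.
# The $5 / TB figure comes from the slide; everything else is an assumption.

PRICE_PER_TB_USD = 5.0   # from the Simple Pricing slide
BYTES_PER_TB = 2 ** 40   # assumption: 1 TB treated as 2**40 bytes


def estimate_scan_cost(bytes_scanned: int,
                       price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Return the estimated cost in USD for scanning `bytes_scanned` bytes."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / BYTES_PER_TB * price_per_tb


if __name__ == "__main__":
    # Hypothetical sizes: a 2 TB table scanned in full, versus a query whose
    # partition filter limits the scan to roughly one tenth of the data.
    full_scan = 2 * BYTES_PER_TB
    pruned_scan = full_scan // 10
    print(f"full scan:   ${estimate_scan_cost(full_scan):.2f}")
    print(f"pruned scan: ${estimate_scan_cost(pruned_scan):.2f}")
```

Because only scanned data is billed, the partitioning shown in the use-case slides (and, more generally, anything that reduces bytes read) lowers the per-query cost in direct proportion.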