SlideShare a Scribd company logo
1 of 64
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Deep dive and best practices
for Amazon Athena
R o y H a s s o n , B u s i n e s s D e v e l o p m e n t M a n a g e r , A m a z o n A t h e n a
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
1. Review of the year
2. Use-cases seen since launch
3. Athena, Redshift Spectrum, and EMR
4. Connecting with Amazon Athena
5. Creating Tables
6. Glue Data Catalog
7. Partitioning data
8. Running queries and performance optimizations
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
We launched November 2016!
1. Rest API with support for SDK in most languages
2. CloudTrail Support
3. JDBC and ODBC drivers
4. Launched in 11 AWS regions
5. Added support for AVRO and Geospatial data and Querying
using GROK filter
6. Support for processing encrypted data using S3-SSE and CSE with
keys in AWS KMS
7. Integration with AWS Glue
8. Cloudformation support for creating Named Queries
9. Audit logging via AWS CloudTrail
10. Integration with S3 Inventory and EC2 Systems manager Inventory
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon Athena
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Integration with several partners
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
1. Review of the year
2. Use-cases seen since launch
3. Athena, Redshift Spectrum, and EMR
4. Connecting with Amazon Athena
5. Creating Tables
6. Glue Data Catalog
7. Partitioning data
8. Running queries and performance optimizations
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Use Case 1: Query engine for data lake
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
S3
Athena
AWS Glue Data Catalog
Query data
Hot data
Warn & cold dataApplication request
2: Embed SQL for direct access to data
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Timber.io
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
3: Migration from an On-premises infrastructure
• ETL
• Data Wrangling
• Machine Learning
• Interactive Querying
• Data Science
• Hive Metastore
• Glue ETL and EMR for ETL
• Athena, EMR and Redshift for
Interactive querying
• EMR for ML
• Glue Data Catalog Hosted
Metadata Catalog
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Lake on S3
On premises data
Web app data
Amazon RDS
Other databases
Streaming data
Your data
Glue ETL
AMAZON
QUICKSIGHT
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS Service Logs
Web Application Logs
Server Logs
S3
Athena
Glue Data Catalog
New File
Trigger
Update table partition
Create partition
on S3
Copy to new
partition
Query data
S3
Lambda
4: Querying Logs
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS Service Logs
Web Application Logs
Server Logs
S3
Athena
Glue
Crawler
Update table partition
Create partition
on S3
Query data
S3
Glue ETL
4a. Querying logs with ETL
Glue Data Catalog
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS Big Data Blog
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Airbnb’s StreamAlert
• Deployment is automated: simple, safe and repeatable for any AWS
account
• Easily scalable from megabytes to terabytes per day
• Minimal Infrastructure maintenance
• Infrastructure security is a default, no security expertise required
• Supports data from different environments (ex: IT, PCI, Engineering)
• Supports data from different environment types (ex: Cloud, Datacenter,
Office)
• Supports different types of data (ex: JSON, CSV, Key-Value, or Syslog)
• Supports different use-cases like security, infrastructure, compliance and
more
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Overall Architecture
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
How is Athena used in Streamalert
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
S3
Athena
Database Migration Exported tables in S3
Trigger Lambda
Update table partition
Query data
Lambda
Database Migration
Service
5. Query database backups
Glue Data Catalog
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
S3
Athena
Real-time events Store partitioned in S3
Trigger Lambda
Update table partition
Query data
Lambda
Kinesis
6. Time-series data
Glue Data Catalog
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
7. Geospatial Data
Geometry Data Types supported
• point
• line
• polygon
• multiline
• Multipolygon
Functions supported available in
Docs
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
1. Review of the year
2. Use-cases seen since launch
3. Athena, Redshift Spectrum, and EMR
4. Connecting with Amazon Athena
5. Creating Tables
6. Glue Data Catalog
7. Partitioning data
8. Running queries and performance optimizations
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Athena, Redshift Spectrum & EMR
Amazon
Redshift
Spectrum
Amazon EMR Amazon Athena
Amazon
S3
Expand your use-case to allow
processing data from a S3-based
Data lake
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
1. Review of the year
2. Use-cases seen since launch
3. How does Athena compare to Redshift Spectrum, and EMR
4. Connecting with Amazon Athena
5. Creating Tables
6. Glue Data Catalog
7. Partitioning data
8. Running queries and performance optimizations
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Connecting with Athena
1. API
2. JDBC and ODBC driver
3. Console
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Connecting to Amazon Athena - API
StartQueryExecution
client.startQueryExecution({
QueryString: ‘SELECT * FROM table_name LIMIT 100’,
ResultConfiguration: { OutputLocation: ‘s3://bucket/output/’ },
EncryptionConfiguration: { EncryptionOption: 'SSE_S3' },
QueryExecutionContext: { Database: ‘default_db’ }
}, (err, result) => {})
GetQueryResults
client.getQueryResults({
QueryExecutionId: '2ef5d590-025a-48ec-895e-6bedfe72bc95',
MaxResults: 1000,
NextToken: null
}, (err, data) => {})
Asynchronous
Query API
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
1. Review of the year
2. Use-cases seen since launch
3. How does Athena compare to Redshift Spectrum, and EMR
4. Connecting with Amazon Athena
5. Creating Tables
6. Glue Data Catalog
7. Partitioning data
8. Running queries and performance optimizations
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Creating Databases and Tables
1. Databases are logical collection of Tables
2. Database and Tables limits that you can raise
3. Many ways to write DDL Statement
1. Use the Hive DDL statement directly from the console
2. Use the Athena API
3. Use the JDBC/ODBC
4. Use the Glue Crawlers
5. Use the Glue API to create Tables
6. If using the Glue Data Catalog, You can also share metadata between EMR and
Athena
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Creating Database and Tables
S3://mydatalogs/data/object1
S3://mydatalogs/data/object2
S3://mydatalogs/data/object3
S3://mydatalogs/data/object4
CREATE EXTERNAL TABLE access_logs
(
ip_address String,
request_time Timestamp,
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY
't'
STORED AS TEXTFILE
LOCATION 's3://mydatalogs/data/'
Data Catalog
Table name: access_logs
Location:s3://mydatalogs/data
Serde information:
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Schema on write
• Create the schema
• Fit the data to the schema
Good for repetition and performance
Schema on Read
• Store raw data
• Schema is applied to the data
Good for exploration, and experimentation
Schema on Read Versus Schema on Write
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
SerDes and Compression formats
1.Regex
2.GROK
3.JSON
4.OpenCSV
5.TSV
6.Parquet
7.ORC
8.AVRO
9.Geospatial
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
1. Review of the year
2. Use-cases seen since launch
3. How does Athena compare to Redshift Spectrum, and EMR
4. Connecting with Amazon Athena
5. Creating Tables
6. Glue Data Catalog
7. Partitioning data
8. Running queries and performance optimizations
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What is the AWS Glue Data Catalog?
Unified metadata repository across relational databases, Amazon RDS, Amazon Redshift, and
Amazon S3…with support for more coming soon!
• Get a single view into your data, no matter where it is stored
• Automatically classify your data in one central list that is searchable
• Track data evolution using schema versioning
• Query your data using Amazon Athena or Amazon Redshift Spectrum
• Hive metastore compatible; can be used as an external Hive Metastore for applications
running on Amazon EMR
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Lake on Amazon S3 using AWS Glue
On premises data
Web app data
Amazon RDS
Other databases
Streaming data
Your data
Glue ETL
AMAZON
QUICKSIGHT
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What do I do to setup my Data Catalog
Its Easy!
1. Tell us where your data is
2. Tell us how often you want to check for updates
And you are done! Your Data Catalog is ready for search and querying
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Table schema
Table properties
Data statistics
Nested fields
A table in the Glue Data Catalog
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Automatically register available partitions
Table
partitions
Automatically detected partitions
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What are crawlers?
Crawlers automatically build your Data Catalog and keep it in sync
• Scan your data stored in various data stores, extract metadata and data statistics, and add
table definitions to your Data Catalog
• Classify data using built-in and custom classifiers
• You can write your own using Grok expressions
• Discover new data, extracts schema definitions
• Detect schema changes and version tables
• Detect Hive style partitions on Amazon S3
• Run ad hoc or on a schedule; serverless – only pay when crawler runs
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
How is my data classified?
Glue applies a set of classifiers to the data as it scans it and adds the metadata as Tables to the
Data Catalog.
A classifier recognizes the format of your data and generates a schema.
It returns a certainty number between 0.0 and 1.0 which helps Glue determine if there is a
match
Glue has a list of in-build classifiers that are applied with every crawl. But you can write your
own!
You can set up your crawler with an ordered set of classifiers. Crawlers invoke classifiers in the
order they were provided until a match is found.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
How are partitions detected?
Estimate schema similarity among files at each level to
handle semi-structured logs, schema evolution…
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
How can I write my own classifiers?
You can write a custom classifier by providing a grok
pattern and a classification string for the matched
schema
A grok pattern is a named set of regular expressions
(regex) that are used to match data one line at a time.
Example:
%{TIMESTAMP_ISO8601:timestamp}
[%{MESSAGEPREFIX:message_prefix}]
%{CRAWLERLOGLEVEL:loglevel} :
%{GREEDYDATA:message}
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Athena Data Catalog or Glue Data Catalog
Athena web
Service
DDL
statements
Data Catalog
Converts DDL to API calls
Athena
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
W h at h ap p en s wh en you u p g rad e to a Glu e Catalog
Athena web
Service
DDL
statements
Glue Data Catalog
Converts DDL to API calls
Athena
Glue
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Upgrade is a simple process
1.Add Glue-specific policies to your existing Athena policies
2.Add permission to Glue:ImportCatalogToGlue
3.Upgrade the catalog
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Benefits of upgrading
1. You can now share metadata between EMR (Hive, Spark and Presto), Athena and
Redshift Spectrum
2. You can use the Glue API to directly run DDL at high concurrency (table creation,
partitions)
3. Use the crawler for automatic recognition of schema and partitioning
4. Use Glue to ETL data and query in Athena
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
1. Review of the year
2. Use-cases seen since launch
3. How does Athena compare to Redshift Spectrum, and EMR
4. Connecting with Amazon Athena
5. Creating Tables
6. Glue Data Catalog
7. Partitioning data
8. Running queries and performance optimizations
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Partitioning
• Separates data files by any column
• Read only data the query needs
• Reduce amount of data scanned
• decrease query completion time
• Reduce query cost
• Define Partitions at the time of creating tables.
• Partitions are virtual columns that you can reference in your query
• Athena uses Hive-style partitioning
• Works on both text and columnar data
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Partitioning of data has cost advantages
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
But it also has an overhead
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Layout of data for partitioned tables
s3://yourBucket/pathToTable/<PARTITION_COLUMN_NAME1>=<VALUE>/<PARTITION_COLUMN_NAME2>=<VALUE>/
CREATE EXTERNAL TABLE access_logs
(
ip_address String,
request_time Timestamp
)
PARTTIONED by
(string year,
string month)
STORED AS PARQUET
LOCATION 's3://yourBucket/pathToTable/
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Layout of data for partitioned tables
CREATE EXTERNAL TABLE access_logs
(
ip_address String,
request_time Timestamp
)
PARTTIONED by
(string year,
string month)
STORED AS PARQUET
LOCATION 's3://yourBucket/pathToTable/
s3://yourBucket/pathToTable/year=2017/month=11/
Alter Table access_logs
Add Partition
year=‘2017',
month=‘11‘)
location ‘s3://yourBucket/pathToTable/year=2017/month=11/
MSCK Repair Table
GLUE API
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Layout of data for partitioned tables
CREATE EXTERNAL TABLE access_logs
(
ip_address String,
request_time Timestamp,
)
PARTTIONED by
(string year,
string month)
STORED AS PARQUET
LOCATION 's3://yourBucket/pathToTable/
s3://yourBucket/pathToTable/2017/11/
Alter Table access_logs
Add Partition
year=‘2017',
month=‘11‘)
location
‘s3://yourBucket/pathToTable/2017/11/’
MSCK Repair Table
GLUE API
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Partition loading and S3 prefixes
s3://yourBucket/pathToTable/year=2017/month=11/ - MSCK REPAIR TABLE
s3://yourBucket/pathToTable/232a-2013-26-05/hhdsy1129s/
ALTER TABLE access_log ADD PARTITION (year='2017',month=‘11‘)
location ‘s3://yourBucket/pathToTable/232a-2013-26-05/hhdsy1129s/’
1
2
3
s3://yourBucket/pathToTable/2017/11/
ALTER TABLE access_log ADD PARTITION (year=‘2017',month=‘11‘)
location ‘s3://yourBucket/pathToTable/2017/11/’
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Loading Partitions
• Use Lambda – either on a schedule or based on event
• Use MSCK – MSCK is an expensive operation
• Use Alter Table Add Partition – You can add 99 partitions at the same time.
• Use the Glue Data Catalog API/CLI – This is preferred for high concurrency
• Pre-populate at the start/end of the day – these are just key-value lookups
• Schedule the crawler to automatically detect partitions.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Choose partitions to get performance even
on text data
1. Columns that can be used as filter (e.g. date)
2. You can partition by any number of columns
3. You can partition on any column not just date
4. {Partition key 1, Partition key 2, Partition 3}  unique location in S3
5. Look at your query patterns – What data do you want to query and what do you want to
filter out
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
1. Review of the year
2. Use-cases seen since launch
3. How does Athena compare to Redshift Spectrum, and EMR
4. Connecting with Amazon Athena
5. Creating Tables
6. Glue Data Catalog
7. Partitioning data
8. Running queries and performance optimizations
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Query Performance – bigger files help
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Allow users to query raw and aggregates
Kinesis Firehose
Amazon S3
Amazon S3 Amazon Athena
“raw-time-series”
Amazon S3
“hourly
aggregations”
AWS Glue
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Query performance- ORC and Parquet
Dataset Size on S3 Query run time Data Scanned Cost
Text 1.15TB 236 seconds 1.15 TB $5.75
Parquet
(same
data)
130GB 6.78 seconds 2.51 GB $0.013
Savings/Sp
eed up
87% less
with Parquet
34 x
99% less data
scanned
99.7% savings
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Query Performance – Order by
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Query Performance - Joins
Keep the larger Table on the Left Side of the join
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Query Performance – Regex better than Like
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
THANK YOU!

More Related Content

What's hot

Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon Web Services
 
글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)
글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)
글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)Amazon Web Services Korea
 
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Amazon Web Services
 
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나Amazon Web Services Korea
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta LakeDatabricks
 
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftIntroduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftAmazon Web Services
 
Building A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWSBuilding A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWSAmazon Web Services
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueAmazon Web Services
 
효율적인 빅데이터 분석 및 처리를 위한 Glue, EMR 활용 - 김태현 솔루션즈 아키텍트, AWS :: AWS Summit Seoul 2019
효율적인 빅데이터 분석 및 처리를 위한 Glue, EMR 활용 - 김태현 솔루션즈 아키텍트, AWS :: AWS Summit Seoul 2019효율적인 빅데이터 분석 및 처리를 위한 Glue, EMR 활용 - 김태현 솔루션즈 아키텍트, AWS :: AWS Summit Seoul 2019
효율적인 빅데이터 분석 및 처리를 위한 Glue, EMR 활용 - 김태현 솔루션즈 아키텍트, AWS :: AWS Summit Seoul 2019Amazon Web Services Korea
 
Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Web Services
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarAmazon Web Services
 
Amazon EMR과 SageMaker를 이용하여 데이터를 준비하고 머신러닝 모델 개발 하기
Amazon EMR과 SageMaker를 이용하여 데이터를 준비하고 머신러닝 모델 개발 하기Amazon EMR과 SageMaker를 이용하여 데이터를 준비하고 머신러닝 모델 개발 하기
Amazon EMR과 SageMaker를 이용하여 데이터를 준비하고 머신러닝 모델 개발 하기Amazon Web Services Korea
 
AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!Chris Taylor
 

What's hot (20)

Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)
글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)
글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)
 
AWS glue technical enablement training
AWS glue technical enablement trainingAWS glue technical enablement training
AWS glue technical enablement training
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
Introduction to AWS Glue
Introduction to AWS Glue Introduction to AWS Glue
Introduction to AWS Glue
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Introduction to AWS Glue
Introduction to AWS GlueIntroduction to AWS Glue
Introduction to AWS Glue
 
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
 
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftIntroduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF Loft
 
Building A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWSBuilding A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWS
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS Glue
 
효율적인 빅데이터 분석 및 처리를 위한 Glue, EMR 활용 - 김태현 솔루션즈 아키텍트, AWS :: AWS Summit Seoul 2019
효율적인 빅데이터 분석 및 처리를 위한 Glue, EMR 활용 - 김태현 솔루션즈 아키텍트, AWS :: AWS Summit Seoul 2019효율적인 빅데이터 분석 및 처리를 위한 Glue, EMR 활용 - 김태현 솔루션즈 아키텍트, AWS :: AWS Summit Seoul 2019
효율적인 빅데이터 분석 및 처리를 위한 Glue, EMR 활용 - 김태현 솔루션즈 아키텍트, AWS :: AWS Summit Seoul 2019
 
Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - Webinar
 
AWS Simple Storage Service (s3)
AWS Simple Storage Service (s3) AWS Simple Storage Service (s3)
AWS Simple Storage Service (s3)
 
Amazon EMR과 SageMaker를 이용하여 데이터를 준비하고 머신러닝 모델 개발 하기
Amazon EMR과 SageMaker를 이용하여 데이터를 준비하고 머신러닝 모델 개발 하기Amazon EMR과 SageMaker를 이용하여 데이터를 준비하고 머신러닝 모델 개발 하기
Amazon EMR과 SageMaker를 이용하여 데이터를 준비하고 머신러닝 모델 개발 하기
 
AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!
 

Similar to Deep Dive on Amazon Athena - AWS Online Tech Talks

21st Century Analytics with Zopa
21st Century Analytics with Zopa21st Century Analytics with Zopa
21st Century Analytics with ZopaAmazon Web Services
 
STG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data OceansSTG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data OceansAmazon Web Services
 
DAT339_Replicate, Analyze, and Visualize Datasets Using AWS Database Migratio...
DAT339_Replicate, Analyze, and Visualize Datasets Using AWS Database Migratio...DAT339_Replicate, Analyze, and Visualize Datasets Using AWS Database Migratio...
DAT339_Replicate, Analyze, and Visualize Datasets Using AWS Database Migratio...Amazon Web Services
 
Migrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data LakeMigrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data LakeAmazon Web Services
 
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Amazon Web Services
 
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...Amazon Web Services
 
Migrating Massive Databases and Data Warehouses to the Cloud - ENT327 - re:In...
Migrating Massive Databases and Data Warehouses to the Cloud - ENT327 - re:In...Migrating Massive Databases and Data Warehouses to the Cloud - ENT327 - re:In...
Migrating Massive Databases and Data Warehouses to the Cloud - ENT327 - re:In...Amazon Web Services
 
Success has Many Query Engines- Tel Aviv Summit 2018
Success has Many Query Engines- Tel Aviv Summit 2018Success has Many Query Engines- Tel Aviv Summit 2018
Success has Many Query Engines- Tel Aviv Summit 2018Amazon Web Services
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSAmazon Web Services
 
Real-time Analytics using Data from IoT Devices - AWS Online Tech Talks
Real-time Analytics using Data from IoT Devices - AWS Online Tech TalksReal-time Analytics using Data from IoT Devices - AWS Online Tech Talks
Real-time Analytics using Data from IoT Devices - AWS Online Tech TalksAmazon Web Services
 
DAT317_Migrating Databases and Data Warehouses to the Cloud
DAT317_Migrating Databases and Data Warehouses to the CloudDAT317_Migrating Databases and Data Warehouses to the Cloud
DAT317_Migrating Databases and Data Warehouses to the CloudAmazon Web Services
 
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...Amazon Web Services
 
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...Amazon Web Services
 
SRV327 Replicate, Analyze, and Visualize Data Using Managed Database and Ser...
 SRV327 Replicate, Analyze, and Visualize Data Using Managed Database and Ser... SRV327 Replicate, Analyze, and Visualize Data Using Managed Database and Ser...
SRV327 Replicate, Analyze, and Visualize Data Using Managed Database and Ser...Amazon Web Services
 
Migration Planning with AWS Application Discovery Service - ENT308 - Chicago ...
Migration Planning with AWS Application Discovery Service - ENT308 - Chicago ...Migration Planning with AWS Application Discovery Service - ENT308 - Chicago ...
Migration Planning with AWS Application Discovery Service - ENT308 - Chicago ...Amazon Web Services
 
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017Amazon Web Services
 
2017 09-27 big data- how to securely implement and automate on aws (1)
2017 09-27 big data- how to securely implement and automate on aws (1)2017 09-27 big data- how to securely implement and automate on aws (1)
2017 09-27 big data- how to securely implement and automate on aws (1)REAN Cloud
 
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018Amazon Web Services
 

Similar to Deep Dive on Amazon Athena - AWS Online Tech Talks (20)

21st Century Analytics with Zopa
21st Century Analytics with Zopa21st Century Analytics with Zopa
21st Century Analytics with Zopa
 
STG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data OceansSTG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data Oceans
 
DAT339_Replicate, Analyze, and Visualize Datasets Using AWS Database Migratio...
DAT339_Replicate, Analyze, and Visualize Datasets Using AWS Database Migratio...DAT339_Replicate, Analyze, and Visualize Datasets Using AWS Database Migratio...
DAT339_Replicate, Analyze, and Visualize Datasets Using AWS Database Migratio...
 
Migrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data LakeMigrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data Lake
 
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
 
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
 
Migrating Massive Databases and Data Warehouses to the Cloud - ENT327 - re:In...
Migrating Massive Databases and Data Warehouses to the Cloud - ENT327 - re:In...Migrating Massive Databases and Data Warehouses to the Cloud - ENT327 - re:In...
Migrating Massive Databases and Data Warehouses to the Cloud - ENT327 - re:In...
 
Success has Many Query Engines- Tel Aviv Summit 2018
Success has Many Query Engines- Tel Aviv Summit 2018Success has Many Query Engines- Tel Aviv Summit 2018
Success has Many Query Engines- Tel Aviv Summit 2018
 
Big Data@Scale
 Big Data@Scale Big Data@Scale
Big Data@Scale
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWS
 
Real-time Analytics using Data from IoT Devices - AWS Online Tech Talks
Real-time Analytics using Data from IoT Devices - AWS Online Tech TalksReal-time Analytics using Data from IoT Devices - AWS Online Tech Talks
Real-time Analytics using Data from IoT Devices - AWS Online Tech Talks
 
DAT317_Migrating Databases and Data Warehouses to the Cloud
DAT317_Migrating Databases and Data Warehouses to the CloudDAT317_Migrating Databases and Data Warehouses to the Cloud
DAT317_Migrating Databases and Data Warehouses to the Cloud
 
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
 
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
 
SRV327 Replicate, Analyze, and Visualize Data Using Managed Database and Ser...
 SRV327 Replicate, Analyze, and Visualize Data Using Managed Database and Ser... SRV327 Replicate, Analyze, and Visualize Data Using Managed Database and Ser...
SRV327 Replicate, Analyze, and Visualize Data Using Managed Database and Ser...
 
Migration Planning with AWS Application Discovery Service - ENT308 - Chicago ...
Migration Planning with AWS Application Discovery Service - ENT308 - Chicago ...Migration Planning with AWS Application Discovery Service - ENT308 - Chicago ...
Migration Planning with AWS Application Discovery Service - ENT308 - Chicago ...
 
Building Data Lakes with AWS
Building Data Lakes with AWSBuilding Data Lakes with AWS
Building Data Lakes with AWS
 
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
Best Practices for Running PostgreSQL on AWS - DAT314 - re:Invent 2017
 
2017 09-27 big data- how to securely implement and automate on aws (1)
2017 09-27 big data- how to securely implement and automate on aws (1)2017 09-27 big data- how to securely implement and automate on aws (1)
2017 09-27 big data- how to securely implement and automate on aws (1)
 
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Deep Dive on Amazon Athena - AWS Online Tech Talks

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Deep dive and best practices for Amazon Athena R o y H a s s o n , B u s i n e s s D e v e l o p m e n t M a n a g e r , A m a z o n A t h e n a
  • 2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda 1. Review of the year 2. Use-cases seen since launch 3. Athena, Redshift Spectrum, and EMR 4. Connecting with Amazon Athena 5. Creating Tables 6. Glue Data Catalog 7. Partitioning data 8. Running queries and performance optimizations
  • 3. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. We launched November 2016! 1. Rest API with support for SDK in most languages 2. CloudTrail Support 3. JDBC and ODBC drivers 4. Launched in 11 AWS regions 5. Added support for AVRO and Geospatial data and Querying using GROK filter 6. Support for processing encrypted data using S3-SSE and CSE with keys in AWS KMS 7. Integration with AWS Glue 8. Cloudformation support for creating Named Queries 9. Audit logging via AWS CloudTrail 10. Integration with S3 Inventory and EC2 Systems manager Inventory
  • 4. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Athena
  • 5. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Integration with several partners
  • 6. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda 1. Review of the year 2. Use-cases seen since launch 3. Athena, Redshift Spectrum, and EMR 4. Connecting with Amazon Athena 5. Creating Tables 6. Glue Data Catalog 7. Partitioning data 8. Running queries and performance optimizations
  • 7. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Use Case 1: Query engine for data lake
  • 8. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. S3 Athena AWS Glue Data Catalog Query data Hot data Warn & cold dataApplication request 2: Embed SQL for direct access to data
  • 9. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Timber.io
  • 10. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 11. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 3: Migration from an On-premises infrastructure • ETL • Data Wrangling • Machine Learning • Interactive Querying • Data Science • Hive Metastore • Glue ETL and EMR for ETL • Athena, EMR and Redshift for Interactive querying • EMR for ML • Glue Data Catalog Hosted Metadata Catalog
  • 12. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Lake on S3 On premises data Web app data Amazon RDS Other databases Streaming data Your data Glue ETL AMAZON QUICKSIGHT
  • 13. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Service Logs Web Application Logs Server Logs S3 Athena Glue Data Catalog New File Trigger Update table partition Create partition on S3 Copy to new partition Query data S3 Lambda 4: Querying Logs
  • 14. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Service Logs Web Application Logs Server Logs S3 Athena Glue Crawler Update table partition Create partition on S3 Query data S3 Glue ETL 4a. Querying logs with ETL Glue Data Catalog
  • 15. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Big Data Blog
  • 16. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 17. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Airbnb’s StreamAlert • Deployment is automated: simple, safe and repeatable for any AWS account • Easily scalable from megabytes to terabytes per day • Minimal Infrastructure maintenance • Infrastructure security is a default, no security expertise required • Supports data from different environments (ex: IT, PCI, Engineering) • Supports data from different environment types (ex: Cloud, Datacenter, Office) • Supports different types of data (ex: JSON, CSV, Key-Value, or Syslog) • Supports different use-cases like security, infrastructure, compliance and more
  • 18. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Overall Architecture
  • 19. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. How is Athena used in Streamalert
  • 20. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. S3 Athena Database Migration Exported tables in S3 Trigger Lambda Update table partition Query data Lambda Database Migration Service 5. Query database backups Glue Data Catalog
  • 21. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. S3 Athena Real-time events Store partitioned in S3 Trigger Lambda Update table partition Query data Lambda Kinesis 6. Time-series data Glue Data Catalog
  • 22. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 7. Geospatial Data Geometry Data Types supported • point • line • polygon • multiline • Multipolygon Functions supported available in Docs
  • 23. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda 1. Review of the year 2. Use-cases seen since launch 3. Athena, Redshift Spectrum, and EMR 4. Connecting with Amazon Athena 5. Creating Tables 6. Glue Data Catalog 7. Partitioning data 8. Running queries and performance optimizations
  • 24. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Athena, Redshift Spectrum & EMR Amazon Redshift Spectrum Amazon EMR Amazon Athena Amazon S3 Expand your use-case to allow processing data from a S3-based Data lake
  • 25. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda 1. Review of the year 2. Use-cases seen since launch 3. How does Athena compare to Redshift Spectrum, and EMR 4. Connecting with Amazon Athena 5. Creating Tables 6. Glue Data Catalog 7. Partitioning data 8. Running queries and performance optimizations
  • 26. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Connecting with Athena 1. API 2. JDBC and ODBC driver 3. Console
  • 27. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Connecting to Amazon Athena - API StartQueryExecution client.startQueryExecution({ QueryString: ‘SELECT * FROM table_name LIMIT 100’, ResultConfiguration: { OutputLocation: ‘s3://bucket/output/’ }, EncryptionConfiguration: { EncryptionOption: 'SSE_S3' }, QueryExecutionContext: { Database: ‘default_db’ } }, (err, result) => {}) GetQueryResults client.getQueryResults({ QueryExecutionId: '2ef5d590-025a-48ec-895e-6bedfe72bc95', MaxResults: 1000, NextToken: null }, (err, data) => {}) Asynchronous Query API
  • 28. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda 1. Review of the year 2. Use-cases seen since launch 3. How does Athena compare to Redshift Spectrum, and EMR 4. Connecting with Amazon Athena 5. Creating Tables 6. Glue Data Catalog 7. Partitioning data 8. Running queries and performance optimizations
  • 29. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Creating Databases and Tables 1. Databases are logical collection of Tables 2. Database and Tables limits that you can raise 3. Many ways to write DDL Statement 1. Use the Hive DDL statement directly from the console 2. Use the Athena API 3. Use the JDBC/ODBC 4. Use the Glue Crawlers 5. Use the Glue API to create Tables 6. If using the Glue Data Catalog, You can also share metadata between EMR and Athena
  • 30. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Creating Database and Tables S3://mydatalogs/data/object1 S3://mydatalogs/data/object2 S3://mydatalogs/data/object3 S3://mydatalogs/data/object4 CREATE EXTERNAL TABLE access_logs ( ip_address String, request_time Timestamp, ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' STORED AS TEXTFILE LOCATION 's3://mydatalogs/data/' Data Catalog Table name: access_logs Location:s3://mydatalogs/data Serde information:
  • 31. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Schema on write • Create the schema • Fit the data to the schema Good for repetition and performance Schema on Read • Store raw data • Schema is applied to the data Good for exploration, and experimentation Schema on Read Versus Schema on Write
  • 32. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. SerDes and Compression formats 1.Regex 2.GROK 3.JSON 4.OpenCSV 5.TSV 6.Parquet 7.ORC 8.AVRO 9.Geospatial
  • 33. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda 1. Review of the year 2. Use-cases seen since launch 3. How does Athena compare to Redshift Spectrum, and EMR 4. Connecting with Amazon Athena 5. Creating Tables 6. Glue Data Catalog 7. Partitioning data 8. Running queries and performance optimizations
  • 34. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is the AWS Glue Data Catalog? Unified metadata repository across relational databases, Amazon RDS, Amazon Redshift, and Amazon S3…with support for more coming soon! • Get a single view into your data, no matter where it is stored • Automatically classify your data in one central list that is searchable • Track data evolution using schema versioning • Query your data using Amazon Athena or Amazon Redshift Spectrum • Hive metastore compatible; can be used as an external Hive Metastore for applications running on Amazon EMR
  • 35. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Lake on Amazon S3 using AWS Glue On premises data Web app data Amazon RDS Other databases Streaming data Your data Glue ETL AMAZON QUICKSIGHT
  • 36. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What do I do to setup my Data Catalog Its Easy! 1. Tell us where your data is 2. Tell us how often you want to check for updates And you are done! Your Data Catalog is ready for search and querying
  • 37. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Table schema Table properties Data statistics Nested fields A table in the Glue Data Catalog
  • 38. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Automatically register available partitions Table partitions Automatically detected partitions
  • 39. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What are crawlers? Crawlers automatically build your Data Catalog and keep it in sync • Scan your data stored in various data stores, extract metadata and data statistics, and add table definitions to your Data Catalog • Classify data using built-in and custom classifiers • You can write your own using Grok expressions • Discover new data, extracts schema definitions • Detect schema changes and version tables • Detect Hive style partitions on Amazon S3 • Run ad hoc or on a schedule; serverless – only pay when crawler runs
  • 40. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. How is my data classified? Glue applies a set of classifiers to the data as it scans it and adds the metadata as Tables to the Data Catalog. A classifier recognizes the format of your data and generates a schema. It returns a certainty number between 0.0 and 1.0 which helps Glue determine if there is a match Glue has a list of in-build classifiers that are applied with every crawl. But you can write your own! You can set up your crawler with an ordered set of classifiers. Crawlers invoke classifiers in the order they were provided until a match is found.
  • 41. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. How are partitions detected? Estimate schema similarity among files at each level to handle semi-structured logs, schema evolution…
  • 42. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. How can I write my own classifiers? You can write a custom classifier by providing a grok pattern and a classification string for the matched schema A grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time. Example: %{TIMESTAMP_ISO8601:timestamp} [%{MESSAGEPREFIX:message_prefix}] %{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}
  • 43. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Athena Data Catalog or Glue Data Catalog Athena web Service DDL statements Data Catalog Converts DDL to API calls Athena
  • 44. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. W h at h ap p en s wh en you u p g rad e to a Glu e Catalog Athena web Service DDL statements Glue Data Catalog Converts DDL to API calls Athena Glue
  • 45. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Upgrade is a simple process 1.Add Glue-specific policies to your existing Athena policies 2.Add permission to Glue:ImportCatalogToGlue 3.Upgrade the catalog
  • 46. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Benefits of upgrading 1. You can now share metadata between EMR (Hive, Spark and Presto), Athena and Redshift Spectrum 2. You can use the Glue API to directly run DDL at high concurrency (table creation, partitions) 3. Use the crawler for automatic recognition of schema and partitioning 4. Use Glue to ETL data and query in Athena
  • 47. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda 1. Review of the year 2. Use-cases seen since launch 3. How does Athena compare to Redshift Spectrum, and EMR 4. Connecting with Amazon Athena 5. Creating Tables 6. Glue Data Catalog 7. Partitioning data 8. Running queries and performance optimizations
  • 48. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Partitioning • Separates data files by any column • Read only data the query needs • Reduce amount of data scanned • decrease query completion time • Reduce query cost • Define Partitions at the time of creating tables. • Partitions are virtual columns that you can reference in your query • Athena uses Hive-style partitioning • Works on both text and columnar data
  • 49. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Partitioning of data has cost advantages
  • 50. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. But it also has an overhead
  • 51. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Layout of data for partitioned tables s3://yourBucket/pathToTable/<PARTITION_COLUMN_NAME1>=<VALUE>/<PARTITION_COLUMN_NAME2>=<VALUE>/ CREATE EXTERNAL TABLE access_logs ( ip_address String, request_time Timestamp ) PARTTIONED by (string year, string month) STORED AS PARQUET LOCATION 's3://yourBucket/pathToTable/
  • 52. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Layout of data for partitioned tables CREATE EXTERNAL TABLE access_logs ( ip_address String, request_time Timestamp ) PARTTIONED by (string year, string month) STORED AS PARQUET LOCATION 's3://yourBucket/pathToTable/ s3://yourBucket/pathToTable/year=2017/month=11/ Alter Table access_logs Add Partition year=‘2017', month=‘11‘) location ‘s3://yourBucket/pathToTable/year=2017/month=11/ MSCK Repair Table GLUE API
  • 53. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Layout of data for partitioned tables CREATE EXTERNAL TABLE access_logs ( ip_address String, request_time Timestamp, ) PARTTIONED by (string year, string month) STORED AS PARQUET LOCATION 's3://yourBucket/pathToTable/ s3://yourBucket/pathToTable/2017/11/ Alter Table access_logs Add Partition year=‘2017', month=‘11‘) location ‘s3://yourBucket/pathToTable/2017/11/’ MSCK Repair Table GLUE API
  • 54. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Partition loading and S3 prefixes s3://yourBucket/pathToTable/year=2017/month=11/ - MSCK REPAIR TABLE s3://yourBucket/pathToTable/232a-2013-26-05/hhdsy1129s/ ALTER TABLE access_log ADD PARTITION (year='2017',month=‘11‘) location ‘s3://yourBucket/pathToTable/232a-2013-26-05/hhdsy1129s/’ 1 2 3 s3://yourBucket/pathToTable/2017/11/ ALTER TABLE access_log ADD PARTITION (year=‘2017',month=‘11‘) location ‘s3://yourBucket/pathToTable/2017/11/’
  • 55. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Loading Partitions • Use Lambda – either on a schedule or based on event • Use MSCK – MSCK is an expensive operation • Use Alter Table Add Partition – You can add 99 partitions at the same time. • Use the Glue Data Catalog API/CLI – This is preferred for high concurrency • Pre-populate at the start/end of the day – these are just key-value lookups • Schedule the crawler to automatically detect partitions.
  • 56. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Choose partitions to get performance even on text data 1. Columns that can be used as filter (e.g. date) 2. You can partition by any number of columns 3. You can partition on any column not just date 4. {Partition key 1, Partition key 2, Partition 3}  unique location in S3 5. Look at your query patterns – What data do you want to query and what do you want to filter out
  • 57. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda 1. Review of the year 2. Use-cases seen since launch 3. How does Athena compare to Redshift Spectrum, and EMR 4. Connecting with Amazon Athena 5. Creating Tables 6. Glue Data Catalog 7. Partitioning data 8. Running queries and performance optimizations
  • 58. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Query Performance – bigger files help
  • 59. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Allow users to query raw and aggregates Kinesis Firehose Amazon S3 Amazon S3 Amazon Athena “raw-time-series” Amazon S3 “hourly aggregations” AWS Glue
  • 60. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Query performance- ORC and Parquet Dataset Size on S3 Query run time Data Scanned Cost Text 1.15TB 236 seconds 1.15 TB $5.75 Parquet (same data) 130GB 6.78 seconds 2.51 GB $0.013 Savings/Sp eed up 87% less with Parquet 34 x 99% less data scanned 99.7% savings
  • 61. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Query Performance – Order by
  • 62. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Query Performance - Joins Keep the larger Table on the Left Side of the join
  • 63. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Query Performance – Regex better than Like
  • 64. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. THANK YOU!