SlideShare a Scribd company logo
1 of 74
Download to read offline
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Interactive query service to analyze data in Amazon S3 using standard SQL
Amazon Athena
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
• Amazon Athena Overview
• Athena Design Patterns
• Athena in Action
• Create External Tables
• Data Formats and SerDes
• Partitions
• Optimize and Secure
• Workgroups
• Views
• CTAS
• Monitoring & Auditing
• Summary
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
1. Introduction to Amazon Athena
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon Athena
Serverless Distributed (SQL) Query Engine for Amazon S3
Distributed SQL Query Engine Amazon S3
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Why Amazon Athena ?
• Decouple storage from compute
• Serverless – No infrastructure or resources to manage
• Pay only for data scanned
• Schema on read – Same data, many views
• Secure – IAM/SAMLv2 for authentication; Encryption at rest
& in transit
• Standard compliant and open storage file formats
• Built on powerful community supported OSS solutions
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Simple Pricing
• DDL operations – FREE
• SQL operations – FREE
• Query concurrency – FREE
• Data scanned - $5 / TB
• Standard S3 rates for storage, requests, and data
transfers apply
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Runs standard SQL
• Uses Presto with ANSI SQL support
• Works with standard data formats
• CSV
• Apache Weblogs
• JSON
• Parquet
• ORC
• Handles complex queries
• Large Joins
• Window functions
• Arrays
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Fast Performance for Large Data Sets
• Fast, ad-hoc queries
• Executes queries in parallel
• No provisioning extra resources for complex queries
• Scales automatically
• Amazon Athena Federated Query
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Storage
Query
Engine Scale-up
Scale-out
How to handle Large Data Sets?
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Storage
Storage Interface Layer
Storage
Compute
Decouple storage from compute
Compute
Scale-out
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Storage Interface Layer
Scale-out
Storage
Scale-out
Compute
Distributed Processing Framework
Distributed File System (DFS)
ex) HDFS, Amazon S3, …
ex) Hadoop MapReduce,
Apache Spark, Flink
Apache Hive, Presto
Amazon Athena, …
Distributed Data Processing System
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
2. Athena Design Patterns
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
1: Ad-hoc use-case
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
S3
Athena
AWS Glue Data Catalog
Query data
Hot data
Warm & cold data
Application request
2: SaaS use-case
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS service logs
Application logs
Data sourced from external
vendors
S3
Athena
Update table partition
Query data
S3
Athena CTAS and INSERT INTO
to ETL
3: ETL and query use-case
Glue Data Catalog
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
S3 Athena
Query data
S3
4: Data science exploration and feature engineering
Glue Data Catalog
Raw Data
Transformed data
SageMaker
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Comparison of SQL Processing engines
Data Structure Semi Semi Semi Full
Languages API/SQL SQL SQL SQL
Data Store
S3 (Glue),
S3/HDFS (Spark)
S3/HDFS S3 Local
Use case Transformation
SQL Queries for
S3/HDFS
Serverless SQL
Queries for S3
Fully Featured
SQL Database
Performance
Glue Amazon Athena Amazon Redshift
Amazon EMR/ Amazon EMR
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
3. Athena in Action
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Create External Tables
• Use Apache Hive DDL to create table
• Run DDL statements using the Athena console
• Via a JDBC driver, using SQL workbench
• Using the Athena create table wizard
• Create “external” table in DDL
• Creates a view of the data
• Deleting table doesn’t delete the data in S3
• Schema-on-read
• Projects your schema onto data at query execution time
• No need for data loading or transformation
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Navigate to ’Saved Queries’ to get DDL
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Create External Table using DDL
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
New Flights Parquet Table Created
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Save Query Results
• When you run an Athena query for the first time, an S3 bucket called
"aws-athena-query-results-<account_id>" is created on your account,
where "<account_id>" is replaced with your AWS account ID.
• The results of all Athena queries are stored in this S3 bucket using the
year, month, and day the query was run, the hexadecimal Athena
query ID, and the region the query was run in the following format:
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Athena
S3
$ aws s3 ls s3://ads/impressions/
PRE dt=2009-04-12-13-00/
PRE dt=2009-04-12-13-05/
PRE dt=2009-04-12-13-10/
PRE dt=2009-04-12-13-15/
PRE dt=2009-04-12-13-20/
PRE dt=2009-04-12-14-00/
PRE dt=2009-04-12-14-05/
PRE dt=2009-04-12-14-10/
PRE dt=2009-04-12-14-15/
PRE dt=2009-04-12-14-20/
PRE dt=2009-04-12-15-00/
PRE dt=2009-04-12-15-05/
…..
-------------------------------------------
2009-04-12-13-20 ap3HcVKAWfXtgIPu6WpuUfAfL0DQEc
2009-04-12-13-20 17uchtodoS9kdeQP1x0XThKl5IuRsV
2009-04-12-13-20 JOUf1SCtRwviGw8sVcghqE5h0nkgtp
2009-04-12-13-20 NQ2XP0J0dvVbCXJ0pb4XvqJ5A4QxxH
2009-04-12-13-20 fFAItiBMsgqro9kRdIwbeX60SROaxr
2009-04-12-13-20 V4og4R9W6G3QjHHwF7gI1cSqig5D1G
2009-04-12-13-20 hPEPtBwk45msmwWTxPVVo1kVu4v11b
2009-04-12-13-20 v0SkfxegheD90gp31UCr6FplnKpx6i
2009-04-12-13-20 1iD9odVgOIi4QWkwHMcOhmwTkWDKfj
2009-04-12-13-20 b31tJiIA25CK8eDHQrHnbcknfSndUk
SELECT dt, impressionId
FROM impressions
WHERE dt < '2009-04-12-14-00’
AND dt >='2009-04-12-13-00’
ORDER BY dt DESC
LIMIT 100;
How to execute query?
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
SELECT dt, impressionId
FROM impressions
WHERE dt < '2009-04-12-14-00’
AND dt >='2009-04-12-13-00’
ORDER BY dt DESC
LIMIT 100;
Data Schema & Location
CREATE EXTERNAL TABLE impressions (
requestBeginTime timestamp,
adId string,
impressionId string,
referrer string,
userAgent string,
ip varchar(15),
number integer,
modelId string,
requestEndTime timestamp,
timers struct<modelLookup:string,
requestTime:float>,
hostname string,
sessionId string)
PARTITIONED BY (dt string)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe’
LOCATION 's3://ads/tables/impressions/';
Athena
S3
Schema-On Read
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
3.1 Tuning and Design Patterns
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Formats and SerDes
• Current data formats and SerDes
• Standard
• CSV, TSV
• JSON
• Apache web logs
• Custom-Delimited Files
• Open source columnar formats
• Parquet
• ORC
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Formats in Create Table
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Column Data Types
• Column Data Types
• Primitives ( Integer, Double, Date, String, Decimal, Varchar…)
• ARRAY
• MAP
• STRUCT
• Nested JSON
• Query using structures and array of structures
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Example
CREATE EXTERNAL TABLE impressions (
requestBeginTime timestamp,
adId string,
impressionId string,
referrer string,
userAgent string,
ip varchar(15),
number integer,
modelId string,
requestEndTime timestamp,
timers struct<modelLookup:string,
requestTime:float>,
hostname string,
sessionId string)
PARTITIONED BY (dt string)
ROW FORMAT serde
'org.apache.hive.hcatalog.data.JsonSerDe’
LOCATION 's3://ads/tables/impressions/';
Athena
S3
Primitives
Struct
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
3.2 Partitions
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Partitions
• Partition on any column in table
• Limits the amount of data each query scans
• Common practice
• Multi-level time based partition
• Example
• PARTITIONED BY (year STRING, month STRING, day STRING)
• Performance and cost savings on queries filtering by day, month and
year
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Example
CREATE EXTERNAL TABLE impressions (
requestBeginTime timestamp,
adId string,
impressionId string,
referrer string,
userAgent string,
ip varchar(15),
number integer,
modelId string,
requestEndTime timestamp,
timers struct<modelLookup:string,
requestTime:float>,
hostname string,
sessionId string)
PARTITIONED BY (dt string)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe’
LOCATION 's3://ads/tables/impressions/';
Athena
S3
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Partitions in Create Table
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
How does partitioning work?
• Flight dataset in Parquet
• Partitioned by year
• Group by query
• Top 10 airports with the most
departures since 2000
• Where year >= 2000
• Athena doesn’t have to scan data in
partitions before year 2000
• Amount of data scanned affects pricing
• 1 min 6s, 3.11GB for unpartitioned
• 37.37 s, 2.25GB for CSV
• 9.32 secs, 0.04GB for Parquet
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
How does partitioning work?
CSV: 37.37 s, 2.25GB No partitions: 1min 6s, 3.11GB
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Using Partition Projection
• Partition values and locations are calculated from configuration
• Partition pruning gathers metadata and "prunes" it to only the partitions
that apply to your query
• Allows Athena to avoid calling GetPartitions
To use partition projection, you specify the ranges of partition values
and projection types for each partition column in the table properties
in the AWS Glue Data Catalog or in your external Hive metastore.
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Catalog (Meta Store)
table 이름 데이터 위치
flight_csv s3://athena-examples/flight/ 2005
flight_csv s3://athena-examples/flight/ 2006
flight_csv s3://athena-examples/flight/ 2007
flight_csv s3://athena-examples/flight/ 2008
….
2016
flight_csv s3://athena-examples/flight/
파티션(year)
S3
$ aws s3 ls s3://athena-examples/flight/
PRE year=2005/
PRE year=2006/
PRE year=2007/
PRE year=2008/
PRE year=2009/
PRE year=2010/
PRE year=2011/
PRE year=2012/
PRE year=2013/
PRE year=2014/
PRE year=2015/
PRE year=2016/
S3의 디렉터리
구조에 맞게
파티션 정보 생성
새로운 파티션을 MetaStore에 추가하기
Data Catalog (Meta Store) – Schema, Location
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Athena
S3
Meta Store
(1) read all data
(2) filtering
Athena
S3
(1) read data
located in specific
partitions
(3) execute query (2) execute query
• Partitions ✓
• Scan할 데이터 양이 적음
• I/O 속도 향상
비용 감소
Meta Store
• Partitions ✗
• Scan할 데이터 양이 많음
• I/O 속도 저하
비용 증가
SELECT origin, count(*) as total_departures
FROM flight_csv
WHERE year >= ‘2000’
GROUP BY origin
ORDER BY total_departures DESC
LIMIT 10
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Partition by Running MSCK Repair
• If your data is already partitioned in Hive format
• Use MSCK repair table command.
• To sync all new data in S3 with Hive metastore
• Example:
• msck repair table flights_csv
• show partitions flights_csv
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Partition by Manually Adding Partitions
• If data is not partitioned in Hive format
• Cannot use MSCK repair table command
• Use ALTER TABLE ADD PARTITION to add each partition manually
• Look at automating adding partitions
• Example:
– ALTER TABLE elb_logs_raw_native_part ADD PARTITION
(year='2015',month='01',day='01’) location 's3://athena-
examples/elb/raw/2015/01/01/'
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Catalog (Meta Store)
table 이름 데이터 위치
flight_csv s3://athena-examples/flight/ 2005
flight_csv s3://athena-examples/flight/ 2006
flight_csv s3://athena-examples/flight/ 2007
flight_csv s3://athena-examples/flight/ 2008
….
2016
flight_csv s3://athena-examples/flight/
파티션(year)
S3
$ aws s3 ls s3://athena-examples/flight/
PRE year=2005/
PRE year=2006/
PRE year=2007/
PRE year=2008/
PRE year=2009/
PRE year=2010/
PRE year=2011/
PRE year=2012/
PRE year=2013/
PRE year=2014/
PRE year=2015/
PRE year=2016/
S3의 디렉터리
구조에 맞게
파티션 정보 생성
새로운 파티션을 MetaStore에 추가하기
Data Catalog (Meta Store) – Schema, Location
ALTER TABLE … ADD PARTITIONS …
MSCK REPAIR …
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Convert to Columnar Formats - Parquet/ORC
• Convert to Parquet/ORC, use EMR
• Create EMR cluster with Hive or Spark
• In the step section of create cluster
• Specify script in S3, Hive or PySpark
• Point to the input data in text/CSV/JSON
• Create output data in Parquet/ORC in S3
• Use Athena CTAS queries to create a new table with Parquet data from a source
table in a different format.
• CREATE TABLE new_table
WITH (
format = 'Parquet', parquet_compression = 'SNAPPY’)
AS SELECT * F
ROM old_table;
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Different workloads - OLTP vs. OLAP
OLTP
• Online transaction
processing
• Lots of small operations
involving whole rows
OLAP
• Online analytical processing
• Few large operations
involving subset of all
columns
Assumption: I/O is expensive (memory, disk, network, …)
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
A0 B0 C0 A1 B1 C1 A2 B2 C2
A3 B3 C3 A4 B4 C4 A5 B5 C5
Row-wise vs. Columnar
Row-wise
• Horizontal partitioning
• OLTP ✓, OLAP ✗
Columnar
• Vertical partitioning
• OLTP ✗, OLAP ✓
• Free projection pushdown
• Compression opportunities
Columnar
A0 B0 C0
A1 B1 C1
A2 B2 C2
A3 B3 C3
A4 B4 C4
A5 B5 C5
Row 0
Row 1
Row 2
Row 3
Row 4
Row 5
Col A Col B Col C
A0 A1 A2 A3 A4 A5 B0 B1 B2
B3 B4 B5 C0 C1 C2 C3 C4 C5
Row-wise
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
3.3 Optimize and Secure
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Best Practices
• Partition your data
• Divides your table into parts and keeps the related data together based on column
values such as date, country, region, etc
• Bucket your data
• You can specify one or more columns containing rows that you want to group together,
and put those rows into multiple buckets.
• Compress and split files
• Can speed up your queries significantly
• Optimize File Sizes
• Ensuring that your file formats are splittable helps with parallelism
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Best Practices Continued
• Optimize columnar data store generation
• Apache Parquet and Apache ORC
• Optimize Queries
• ORDER BY
• JOIN
• GROUP BY
• LIKE
• Only include the columns that you need
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Access Control for Athena
• Access Control
• AWS Identity and Access Management (IAM)
• Access Control Lists (ACLs)
• Amazon S3 bucket policies.
• IAM policies for S3
• IAM is natively integrated with S3
• Grant IAM users fine-grained control to S3
• Restrict users from querying it using Athena
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Access Control Continued
• To run queries in Athena, you must have the appropriate permissions for the
following:
• Athena API actions including additional actions for Athena workgroups
• Amazon S3 locations where the underlying data to query is stored.
• Metadata and resources that you store in the AWS Glue Data Catalog,
such as databases and tables, including additional actions for encrypted
metadata.
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
3.4 Workgroups
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Athena Workgroups
Athena Workgroups are used to isolate
queries between different teams, workloads
or applications, and to set limits on amount
of data each query or the entire workgroup
can process
Workload Isolation Query Metrics Cost Controls
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Workgroups – Workload Isolation
Unique query output
location per
Workgroup
Encrypt results with
unique AWS KMS key
per Workgroup
Collect and publish
aggregated metrics per
Workgroup to AWS
CloudWatch
Use Workgroup settings
eliminating need to
configure individual
users
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Workgroups – Metric Reporting
Total bytes scanned
per Workgroup
Total failed queries
per Workgroup
Total successful
queries per
Workgroup
Total query execution
time per Workgroup
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Workgroups – Cost Controls
• Per query data scanned threshold; exceeding, will cancel query
• Trigger alarms to notify of increasing usage and cost
• Disable Workgroup when all queries exceed a maximum threshold
Any Athena metric: successful/failed & total queries, query run time, etc.
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Setting Up Workgroups
1. Decide which workgroups to create. For example, you can decide the following:
• Who can run queries in each workgroup, and who owns workgroup
configuration. This determines IAM policies you create.
2. Create workgroups as needed, and add tags to them.
• Open the Athena console, choose the Workgroup:<workgroup_name> tab,
and then choose Create workgroup.
3. Create IAM policies for your users, groups, or roles to enable their access to
workgroups.
4. Specify a location in Amazon S3 for query results and encryption settings
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Default Workgroup
By default, if you have not created any workgroups, all queries in your account run in
the primary workgroup
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
3.5 Views
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
View
• A view in Amazon Athena is a logical, not a physical table.
• The query that defines a view runs each time the view is referenced in a query.
• Views do not contain any data and do not write data.
• Instead, the query specified by the view runs each time you reference
the view by another query.
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Create a View
• Creates a new view from a specified SELECT query.
• The view is a logical table that can be referenced by future queries.
• Flow:
• Create View:
• Write View:
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
3.6 CTAS
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
CREATE TABLE AS SELECT (CTAS)
• Creates a new table in Athena from the results of a SELECT statement
from another query
• Athena stores data files created by the CTAS statement in a specified
location in Amazon S3
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
CREATE TABLE new_table
WITH (
external_location =
's3://my_athena_results/my_parquet_stas_table/',
format = 'Parquet',
parquet_compression = 'SNAPPY'
)
AS SELECT *
FROM old_table;
See more examples in
https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
CTAS Example
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Use CTAS queries to:
• Create tables from query results in one step, without repeatedly
querying raw data sets. This makes it easier to work with raw data
sets.
• Transform query results into other storage formats, such as Parquet
and ORC. This improves query performance and reduces query costs
in Athena.
• Create copies of existing tables that contain only the data you need.
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Processing: ETL vs. ELT
Source1
Source2 Target
Source1
Source2
Staging
tables
Final
tables
Target
(MPP database)
Extract Transform Load
E → T → L
Extract & Load Transform
E → L → T
Source3
Source3
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
3.7 Monitoring & Auditing
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Workgroups
• Separate users, teams, applications, or workloads
• Enable queries on Requester Pays buckets in Amazon S3
• In order to:
• Set limits on amount of data each query or the entire workgroup can
process
• Track costs
Ø Decide which workgroups to create. For example, you can decide the following:
• Who can run queries in each workgroup, and who owns workgroup
configuration. This determines IAM policies you create.
Ø Create workgroups as needed, and add tags to them.
• Open the Athena console, choose the Workgroup:<workgroup_name> tab, and
then choose Create workgroup.
Ø Create IAM policies for your users, groups, or roles to enable their access to workgroups.
Ø Specify a location in Amazon S3 for query results and encryption settings
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
CloudTrail
Monitor Athena with AWS CloudTrail
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Monitoring Athena Queries with CloudWatch Metrics
Use CloudWatch Events with Athena:
• EngineExecutionTime – in milliseconds
• ProcessedBytes – the total amount of data scanned per DML query
• QueryPlanningTime – in milliseconds
• QueryQueueTime – in milliseconds
• ServiceProcessingTime – in milliseconds
• TotalExecutionTime – in milliseconds, for DDL and DML queries
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Monitoring Athena Queries with CloudWatch Events
• Use simple rules to match events and route
them to one or more target functions or
streams.
• Respond to operational changes and takes
corrective action as necessary, by
• Sending messages to respond to the environment
• Activating functions
• Making changes
• Capturing state information
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
4. Summary
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Key Benefits of Athena
• Fast ad-hoc queries
• Query directly against data in S3
• Use standard ANSI SQL
• No infrastructure to setup and manage
• Use IAM for security
• JDBC driver for BI tools and SQL clients
• Pay per query, depending on data scanned
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Getting Started
Tutorials
• Amazon Athena in Action: Workshop
• Amazon Athena Getting Started Cookbook
Examples
• Data visualization and anomaly detection using Amazon Athena and
Pandas from Amazon SageMaker (2020-09-25)
• Prepare data for model-training and invoke machine learning models
with Amazon Athena (2019-11-26)
Tools
• AWS Data Wrangler: Pandas on AWS
• PyAthena: a python client for Amazon Athena
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Questions?

More Related Content

What's hot

Amazon의 머신러닝 솔루션: Fraud Detection & Predictive Maintenance - 남궁영환 (AWS 데이터 사이...
Amazon의 머신러닝 솔루션: Fraud Detection & Predictive Maintenance - 남궁영환 (AWS 데이터 사이...Amazon의 머신러닝 솔루션: Fraud Detection & Predictive Maintenance - 남궁영환 (AWS 데이터 사이...
Amazon의 머신러닝 솔루션: Fraud Detection & Predictive Maintenance - 남궁영환 (AWS 데이터 사이...Amazon Web Services Korea
 
[AWS Migration Workshop] 데이터센터의 SAP를 AWS로 마이그레이션 하기
[AWS Migration Workshop]  데이터센터의 SAP를 AWS로 마이그레이션 하기[AWS Migration Workshop]  데이터센터의 SAP를 AWS로 마이그레이션 하기
[AWS Migration Workshop] 데이터센터의 SAP를 AWS로 마이그레이션 하기Amazon Web Services Korea
 
SRV327 Replicate, Analyze, and Visualize Data Using Managed Database and Ser...
 SRV327 Replicate, Analyze, and Visualize Data Using Managed Database and Ser... SRV327 Replicate, Analyze, and Visualize Data Using Managed Database and Ser...
SRV327 Replicate, Analyze, and Visualize Data Using Managed Database and Ser...Amazon Web Services
 
20190806 AWS Black Belt Online Seminar AWS Glue
20190806 AWS Black Belt Online Seminar AWS Glue20190806 AWS Black Belt Online Seminar AWS Glue
20190806 AWS Black Belt Online Seminar AWS GlueAmazon Web Services Japan
 
20200128 AWS Black Belt Online Seminar Amazon Forecast
20200128 AWS Black Belt Online Seminar Amazon Forecast20200128 AWS Black Belt Online Seminar Amazon Forecast
20200128 AWS Black Belt Online Seminar Amazon ForecastAmazon Web Services Japan
 
How to Build a Data Lake in Amazon S3 & Amazon Glacier - AWS Online Tech Talks
How to Build a Data Lake in Amazon S3 & Amazon Glacier - AWS Online Tech TalksHow to Build a Data Lake in Amazon S3 & Amazon Glacier - AWS Online Tech Talks
How to Build a Data Lake in Amazon S3 & Amazon Glacier - AWS Online Tech TalksAmazon Web Services
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueAmazon Web Services
 
AWS Batch를 통한 손쉬운 일괄 처리 작업 관리하기 - 윤석찬 (AWS 테크에반젤리스트)
AWS Batch를 통한 손쉬운 일괄 처리 작업 관리하기 - 윤석찬 (AWS 테크에반젤리스트)AWS Batch를 통한 손쉬운 일괄 처리 작업 관리하기 - 윤석찬 (AWS 테크에반젤리스트)
AWS Batch를 통한 손쉬운 일괄 처리 작업 관리하기 - 윤석찬 (AWS 테크에반젤리스트)Amazon Web Services Korea
 
Migrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data LakeMigrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data LakeAmazon Web Services
 
AWS CodeStar 및 Cloud9을 통한 서버리스(Serverless) 앱 개발 길잡이 - 윤석찬 (AWS 테크에반젤리스트)
AWS CodeStar 및 Cloud9을 통한 서버리스(Serverless) 앱 개발 길잡이 - 윤석찬 (AWS 테크에반젤리스트)AWS CodeStar 및 Cloud9을 통한 서버리스(Serverless) 앱 개발 길잡이 - 윤석찬 (AWS 테크에반젤리스트)
AWS CodeStar 및 Cloud9을 통한 서버리스(Serverless) 앱 개발 길잡이 - 윤석찬 (AWS 테크에반젤리스트)Amazon Web Services Korea
 
20191002 AWS Black Belt Online Seminar Amazon EC2 Auto Scaling and AWS Auto S...
20191002 AWS Black Belt Online Seminar Amazon EC2 Auto Scaling and AWS Auto S...20191002 AWS Black Belt Online Seminar Amazon EC2 Auto Scaling and AWS Auto S...
20191002 AWS Black Belt Online Seminar Amazon EC2 Auto Scaling and AWS Auto S...Amazon Web Services Japan
 
Best practices for Running Spark jobs on Amazon EMR with Spot Instances | AWS...
Best practices for Running Spark jobs on Amazon EMR with Spot Instances | AWS...Best practices for Running Spark jobs on Amazon EMR with Spot Instances | AWS...
Best practices for Running Spark jobs on Amazon EMR with Spot Instances | AWS...Amazon Web Services
 
London Microservices Meetup: Lessons learnt adopting microservices
London Microservices  Meetup: Lessons learnt adopting microservicesLondon Microservices  Meetup: Lessons learnt adopting microservices
London Microservices Meetup: Lessons learnt adopting microservicesCobus Bernard
 
AWSome Day Cork | Technical Track
AWSome Day Cork | Technical TrackAWSome Day Cork | Technical Track
AWSome Day Cork | Technical TrackAmazon Web Services
 
엔터프라이즈를 위한 머신러닝 그리고 AWS (김일호 솔루션즈 아키텍트, AWS) :: AWS Techforum 2018
엔터프라이즈를 위한 머신러닝 그리고 AWS (김일호 솔루션즈 아키텍트, AWS) :: AWS Techforum 2018엔터프라이즈를 위한 머신러닝 그리고 AWS (김일호 솔루션즈 아키텍트, AWS) :: AWS Techforum 2018
엔터프라이즈를 위한 머신러닝 그리고 AWS (김일호 솔루션즈 아키텍트, AWS) :: AWS Techforum 2018Amazon Web Services Korea
 
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018Amazon Web Services
 
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...Amazon Web Services
 

What's hot (20)

Amazon의 머신러닝 솔루션: Fraud Detection & Predictive Maintenance - 남궁영환 (AWS 데이터 사이...
Amazon의 머신러닝 솔루션: Fraud Detection & Predictive Maintenance - 남궁영환 (AWS 데이터 사이...Amazon의 머신러닝 솔루션: Fraud Detection & Predictive Maintenance - 남궁영환 (AWS 데이터 사이...
Amazon의 머신러닝 솔루션: Fraud Detection & Predictive Maintenance - 남궁영환 (AWS 데이터 사이...
 
[AWS Migration Workshop] 데이터센터의 SAP를 AWS로 마이그레이션 하기
[AWS Migration Workshop]  데이터센터의 SAP를 AWS로 마이그레이션 하기[AWS Migration Workshop]  데이터센터의 SAP를 AWS로 마이그레이션 하기
[AWS Migration Workshop] 데이터센터의 SAP를 AWS로 마이그레이션 하기
 
SRV327 Replicate, Analyze, and Visualize Data Using Managed Database and Ser...
 SRV327 Replicate, Analyze, and Visualize Data Using Managed Database and Ser... SRV327 Replicate, Analyze, and Visualize Data Using Managed Database and Ser...
SRV327 Replicate, Analyze, and Visualize Data Using Managed Database and Ser...
 
20190806 AWS Black Belt Online Seminar AWS Glue
20190806 AWS Black Belt Online Seminar AWS Glue20190806 AWS Black Belt Online Seminar AWS Glue
20190806 AWS Black Belt Online Seminar AWS Glue
 
20200128 AWS Black Belt Online Seminar Amazon Forecast
20200128 AWS Black Belt Online Seminar Amazon Forecast20200128 AWS Black Belt Online Seminar Amazon Forecast
20200128 AWS Black Belt Online Seminar Amazon Forecast
 
How to Build a Data Lake in Amazon S3 & Amazon Glacier - AWS Online Tech Talks
How to Build a Data Lake in Amazon S3 & Amazon Glacier - AWS Online Tech TalksHow to Build a Data Lake in Amazon S3 & Amazon Glacier - AWS Online Tech Talks
How to Build a Data Lake in Amazon S3 & Amazon Glacier - AWS Online Tech Talks
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS Glue
 
Best of AWS re:Invent 2017
Best of AWS re:Invent 2017Best of AWS re:Invent 2017
Best of AWS re:Invent 2017
 
AWS Batch를 통한 손쉬운 일괄 처리 작업 관리하기 - 윤석찬 (AWS 테크에반젤리스트)
AWS Batch를 통한 손쉬운 일괄 처리 작업 관리하기 - 윤석찬 (AWS 테크에반젤리스트)AWS Batch를 통한 손쉬운 일괄 처리 작업 관리하기 - 윤석찬 (AWS 테크에반젤리스트)
AWS Batch를 통한 손쉬운 일괄 처리 작업 관리하기 - 윤석찬 (AWS 테크에반젤리스트)
 
Migrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data LakeMigrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data Lake
 
AWS CodeStar 및 Cloud9을 통한 서버리스(Serverless) 앱 개발 길잡이 - 윤석찬 (AWS 테크에반젤리스트)
AWS CodeStar 및 Cloud9을 통한 서버리스(Serverless) 앱 개발 길잡이 - 윤석찬 (AWS 테크에반젤리스트)AWS CodeStar 및 Cloud9을 통한 서버리스(Serverless) 앱 개발 길잡이 - 윤석찬 (AWS 테크에반젤리스트)
AWS CodeStar 및 Cloud9을 통한 서버리스(Serverless) 앱 개발 길잡이 - 윤석찬 (AWS 테크에반젤리스트)
 
20191002 AWS Black Belt Online Seminar Amazon EC2 Auto Scaling and AWS Auto S...
20191002 AWS Black Belt Online Seminar Amazon EC2 Auto Scaling and AWS Auto S...20191002 AWS Black Belt Online Seminar Amazon EC2 Auto Scaling and AWS Auto S...
20191002 AWS Black Belt Online Seminar Amazon EC2 Auto Scaling and AWS Auto S...
 
How Amazon uses AWS Analytics
How Amazon uses AWS AnalyticsHow Amazon uses AWS Analytics
How Amazon uses AWS Analytics
 
Best practices for Running Spark jobs on Amazon EMR with Spot Instances | AWS...
Best practices for Running Spark jobs on Amazon EMR with Spot Instances | AWS...Best practices for Running Spark jobs on Amazon EMR with Spot Instances | AWS...
Best practices for Running Spark jobs on Amazon EMR with Spot Instances | AWS...
 
London Microservices Meetup: Lessons learnt adopting microservices
London Microservices  Meetup: Lessons learnt adopting microservicesLondon Microservices  Meetup: Lessons learnt adopting microservices
London Microservices Meetup: Lessons learnt adopting microservices
 
AWSome Day Cork | Technical Track
AWSome Day Cork | Technical TrackAWSome Day Cork | Technical Track
AWSome Day Cork | Technical Track
 
엔터프라이즈를 위한 머신러닝 그리고 AWS (김일호 솔루션즈 아키텍트, AWS) :: AWS Techforum 2018
엔터프라이즈를 위한 머신러닝 그리고 AWS (김일호 솔루션즈 아키텍트, AWS) :: AWS Techforum 2018엔터프라이즈를 위한 머신러닝 그리고 AWS (김일호 솔루션즈 아키텍트, AWS) :: AWS Techforum 2018
엔터프라이즈를 위한 머신러닝 그리고 AWS (김일호 솔루션즈 아키텍트, AWS) :: AWS Techforum 2018
 
2020 re:Cap
2020 re:Cap2020 re:Cap
2020 re:Cap
 
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
 
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
 

Similar to Introduction to Amazon Athena

Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...
Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...
Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...javier ramirez
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaAmazon Web Services
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaAmazon Web Services
 
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftBuilding a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftAmazon Web Services
 
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018Amazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 
在 AWS 上構建無服務器分析
在 AWS 上構建無服務器分析在 AWS 上構建無服務器分析
在 AWS 上構建無服務器分析Amazon Web Services
 
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift SpectrumModernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift SpectrumAmazon Web Services
 
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...Amazon Web Services
 
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdfBuilding+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdfSasikumarPalanivel3
 
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdfBuilding+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdfsaidbilgen
 
Success has Many Query Engines- Tel Aviv Summit 2018
Success has Many Query Engines- Tel Aviv Summit 2018Success has Many Query Engines- Tel Aviv Summit 2018
Success has Many Query Engines- Tel Aviv Summit 2018Amazon Web Services
 
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Amazon Web Services
 
Replicate and Manage Data Using Managed Databases and Serverless Technologies
Replicate and Manage Data Using Managed Databases and Serverless Technologies Replicate and Manage Data Using Managed Databases and Serverless Technologies
Replicate and Manage Data Using Managed Databases and Serverless Technologies Amazon Web Services
 
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Amazon Web Services
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveCobus Bernard
 
Data Transformation Patterns in AWS - AWS Online Tech Talks
Data Transformation Patterns in AWS - AWS Online Tech TalksData Transformation Patterns in AWS - AWS Online Tech Talks
Data Transformation Patterns in AWS - AWS Online Tech TalksAmazon Web Services
 
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018Amazon Web Services
 
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day -  Como gerenciar seu catálogo de dados e processo de transform...Immersion Day -  Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...Amazon Web Services LATAM
 

Similar to Introduction to Amazon Athena (20)

Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...
Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...
Consulta cualquier fuente de datos usando SQL con Amazon Athena y sus consult...
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftBuilding a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
 
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
在 AWS 上構建無服務器分析
在 AWS 上構建無服務器分析在 AWS 上構建無服務器分析
在 AWS 上構建無服務器分析
 
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift SpectrumModernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
 
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
 
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdfBuilding+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
 
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdfBuilding+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
 
Success has Many Query Engines- Tel Aviv Summit 2018
Success has Many Query Engines- Tel Aviv Summit 2018Success has Many Query Engines- Tel Aviv Summit 2018
Success has Many Query Engines- Tel Aviv Summit 2018
 
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
 
Replicate and Manage Data Using Managed Databases and Serverless Technologies
Replicate and Manage Data Using Managed Databases and Serverless Technologies Replicate and Manage Data Using Managed Databases and Serverless Technologies
Replicate and Manage Data Using Managed Databases and Serverless Technologies
 
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
 
Big Data@Scale
 Big Data@Scale Big Data@Scale
Big Data@Scale
 
Data Transformation Patterns in AWS - AWS Online Tech Talks
Data Transformation Patterns in AWS - AWS Online Tech TalksData Transformation Patterns in AWS - AWS Online Tech Talks
Data Transformation Patterns in AWS - AWS Online Tech Talks
 
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018
 
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day -  Como gerenciar seu catálogo de dados e processo de transform...Immersion Day -  Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...
 

More from Sungmin Kim

End-to-End Machine Learning with Amazon SageMaker
End-to-End Machine Learning with Amazon SageMakerEnd-to-End Machine Learning with Amazon SageMaker
End-to-End Machine Learning with Amazon SageMakerSungmin Kim
 
1시간만에 머신러닝 개념 따라 잡기
1시간만에 머신러닝 개념 따라 잡기1시간만에 머신러닝 개념 따라 잡기
1시간만에 머신러닝 개념 따라 잡기Sungmin Kim
 
AWS re:Invent 2020 Awesome AI/ML Services
AWS re:Invent 2020 Awesome AI/ML ServicesAWS re:Invent 2020 Awesome AI/ML Services
AWS re:Invent 2020 Awesome AI/ML ServicesSungmin Kim
 
Starup을 위한 AWS AI/ML 서비스 활용 방법
Starup을 위한 AWS AI/ML 서비스 활용 방법Starup을 위한 AWS AI/ML 서비스 활용 방법
Starup을 위한 AWS AI/ML 서비스 활용 방법Sungmin Kim
 
Choose Right Stream Storage: Amazon Kinesis Data Streams vs MSK
Choose Right Stream Storage: Amazon Kinesis Data Streams vs MSKChoose Right Stream Storage: Amazon Kinesis Data Streams vs MSK
Choose Right Stream Storage: Amazon Kinesis Data Streams vs MSKSungmin Kim
 
Octember on AWS (Revised Edition)
Octember on AWS (Revised Edition)Octember on AWS (Revised Edition)
Octember on AWS (Revised Edition)Sungmin Kim
 
Realtime Analytics on AWS
Realtime Analytics on AWSRealtime Analytics on AWS
Realtime Analytics on AWSSungmin Kim
 
Amazon Athena 사용 팁
Amazon Athena 사용 팁Amazon Athena 사용 팁
Amazon Athena 사용 팁Sungmin Kim
 
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...Sungmin Kim
 
Databases & Analytics AWS re:invent 2019 Recap
Databases & Analytics AWS re:invent 2019 RecapDatabases & Analytics AWS re:invent 2019 Recap
Databases & Analytics AWS re:invent 2019 RecapSungmin Kim
 
AI/ML re:invent 2019 recap at Delivery Hero Korea
AI/ML re:invent 2019 recap at Delivery Hero KoreaAI/ML re:invent 2019 recap at Delivery Hero Korea
AI/ML re:invent 2019 recap at Delivery Hero KoreaSungmin Kim
 

More from Sungmin Kim (12)

End-to-End Machine Learning with Amazon SageMaker
End-to-End Machine Learning with Amazon SageMakerEnd-to-End Machine Learning with Amazon SageMaker
End-to-End Machine Learning with Amazon SageMaker
 
1시간만에 머신러닝 개념 따라 잡기
1시간만에 머신러닝 개념 따라 잡기1시간만에 머신러닝 개념 따라 잡기
1시간만에 머신러닝 개념 따라 잡기
 
AWS re:Invent 2020 Awesome AI/ML Services
AWS re:Invent 2020 Awesome AI/ML ServicesAWS re:Invent 2020 Awesome AI/ML Services
AWS re:Invent 2020 Awesome AI/ML Services
 
Starup을 위한 AWS AI/ML 서비스 활용 방법
Starup을 위한 AWS AI/ML 서비스 활용 방법Starup을 위한 AWS AI/ML 서비스 활용 방법
Starup을 위한 AWS AI/ML 서비스 활용 방법
 
Choose Right Stream Storage: Amazon Kinesis Data Streams vs MSK
Choose Right Stream Storage: Amazon Kinesis Data Streams vs MSKChoose Right Stream Storage: Amazon Kinesis Data Streams vs MSK
Choose Right Stream Storage: Amazon Kinesis Data Streams vs MSK
 
Octember on AWS (Revised Edition)
Octember on AWS (Revised Edition)Octember on AWS (Revised Edition)
Octember on AWS (Revised Edition)
 
Realtime Analytics on AWS
Realtime Analytics on AWSRealtime Analytics on AWS
Realtime Analytics on AWS
 
Amazon Athena 사용 팁
Amazon Athena 사용 팁Amazon Athena 사용 팁
Amazon Athena 사용 팁
 
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V...
 
Databases & Analytics AWS re:invent 2019 Recap
Databases & Analytics AWS re:invent 2019 RecapDatabases & Analytics AWS re:invent 2019 Recap
Databases & Analytics AWS re:invent 2019 Recap
 
Octember on AWS
Octember on AWSOctember on AWS
Octember on AWS
 
AI/ML re:invent 2019 recap at Delivery Hero Korea
AI/ML re:invent 2019 recap at Delivery Hero KoreaAI/ML re:invent 2019 recap at Delivery Hero Korea
AI/ML re:invent 2019 recap at Delivery Hero Korea
 

Recently uploaded

Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts ServiceCall Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Servicejennyeacort
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Data Warehouse , Data Cube Computation
Data Warehouse   , Data Cube ComputationData Warehouse   , Data Cube Computation
Data Warehouse , Data Cube Computationsit20ad004
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 

Recently uploaded (20)

Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts ServiceCall Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Data Warehouse , Data Cube Computation
Data Warehouse   , Data Cube ComputationData Warehouse   , Data Cube Computation
Data Warehouse , Data Cube Computation
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 

Introduction to Amazon Athena

  • 1. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Interactive query service to analyze data in Amazon S3 using standard SQL Amazon Athena
  • 2. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda • Amazon Athena Overview • Athena Design Patterns • Athena in Action • Create External Tables • Data Formats and SerDes • Partitions • Optimize and Secure • Workgroups • Views • CTAS • Monitoring & Auditing • Summary
  • 3. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 1. Introduction to Amazon Athena
  • 4. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Athena Serverless Distributed (SQL) Query Engine for Amazon S3 Distributed SQL Query Engine Amazon S3
  • 5. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Why Amazon Athena ? • Decouple storage from compute • Serverless – No infrastructure or resources to manage • Pay only for data scanned • Schema on read – Same data, many views • Secure – IAM/SAMLv2 for authentication; Encryption at rest & in transit • Standard compliant and open storage file formats • Built on powerful community supported OSS solutions
  • 6. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Simple Pricing • DDL operations – FREE • SQL operations – FREE • Query concurrency – FREE • Data scanned - $5 / TB • Standard S3 rates for storage, requests, and data transfers apply
  • 7. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Runs standard SQL • Uses Presto with ANSI SQL support • Works with standard data formats • CSV • Apache Weblogs • JSON • Parquet • ORC • Handles complex queries • Large Joins • Window functions • Arrays
  • 8. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Fast Performance for Large Data Sets • Fast, ad-hoc queries • Executes queries in parallel • No provisioning extra resources for complex queries • Scales automatically • Amazon Athena Federated Query
  • 9. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Storage Query Engine Scale-up Scale-out How to handle Large Data Sets?
  • 10. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Storage Storage Interface Layer Storage Compute Decouple storage from compute Compute Scale-out
  • 11. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Storage Interface Layer Scale-out Storage Scale-out Compute Distributed Processing Framework Distributed File System (DFS) ex) HDFS, Amazon S3, … ex) Hadoop MapReduce, Apache Spark, Flink Apache Hive, Presto Amazon Athena, … Distributed Data Processing System
  • 12. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 2. Athena Design Patterns
  • 13. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 1: Ad-hoc use-case
  • 14. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. S3 Athena AWS Glue Data Catalog Query data Hot data Warm & cold data Application request 2: SaaS use-case
  • 15. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS service logs Application logs Data sourced from external vendors S3 Athena Update table partition Query data S3 Athena CTAS and INSERT INTO to ETL 3: ETL and query use-case Glue Data Catalog
  • 16. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. S3 Athena Query data S3 4: Data science exploration and feature engineering Glue Data Catalog Raw Data Transformed data SageMaker
  • 17. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Comparison of SQL Processing engines Data Structure Semi Semi Semi Full Languages API/SQL SQL SQL SQL Data Store S3 (Glue), S3/HDFS (Spark) S3/HDFS S3 Local Use case Transformation SQL Queries for S3/HDFS Serverless SQL Queries for S3 Fully Featured SQL Database Performance Glue Amazon Athena Amazon Redshift Amazon EMR/ Amazon EMR
  • 18. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 3. Athena in Action
  • 19. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Create External Tables • Use Apache Hive DDL to create table • Run DDL statements using the Athena console • Via a JDBC driver, using SQL workbench • Using the Athena create table wizard • Create “external” table in DDL • Creates a view of the data • Deleting table doesn’t delete the data in S3 • Schema-on-read • Projects your schema onto data at query execution time • No need for data loading or transformation
  • 20. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Navigate to ’Saved Queries’ to get DDL
  • 21. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Create External Table using DDL
  • 22. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. New Flights Parquet Table Created
  • 23. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Save Query Results • When you run an Athena query for the first time, an S3 bucket called "aws-athena-query-results-<account_id>" is created on your account, where "<account_id>" is replaced with your AWS account ID. • The results of all Athena queries are stored in this S3 bucket using the year, month, and day the query was run, the hexadecimal Athena query ID, and the region the query was run in the following format:
  • 24. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Athena S3 $ aws s3 ls s3://ads/impressions/ PRE dt=2009-04-12-13-00/ PRE dt=2009-04-12-13-05/ PRE dt=2009-04-12-13-10/ PRE dt=2009-04-12-13-15/ PRE dt=2009-04-12-13-20/ PRE dt=2009-04-12-14-00/ PRE dt=2009-04-12-14-05/ PRE dt=2009-04-12-14-10/ PRE dt=2009-04-12-14-15/ PRE dt=2009-04-12-14-20/ PRE dt=2009-04-12-15-00/ PRE dt=2009-04-12-15-05/ ….. ------------------------------------------- 2009-04-12-13-20 ap3HcVKAWfXtgIPu6WpuUfAfL0DQEc 2009-04-12-13-20 17uchtodoS9kdeQP1x0XThKl5IuRsV 2009-04-12-13-20 JOUf1SCtRwviGw8sVcghqE5h0nkgtp 2009-04-12-13-20 NQ2XP0J0dvVbCXJ0pb4XvqJ5A4QxxH 2009-04-12-13-20 fFAItiBMsgqro9kRdIwbeX60SROaxr 2009-04-12-13-20 V4og4R9W6G3QjHHwF7gI1cSqig5D1G 2009-04-12-13-20 hPEPtBwk45msmwWTxPVVo1kVu4v11b 2009-04-12-13-20 v0SkfxegheD90gp31UCr6FplnKpx6i 2009-04-12-13-20 1iD9odVgOIi4QWkwHMcOhmwTkWDKfj 2009-04-12-13-20 b31tJiIA25CK8eDHQrHnbcknfSndUk SELECT dt, impressionId FROM impressions WHERE dt < '2009-04-12-14-00’ AND dt >='2009-04-12-13-00’ ORDER BY dt DESC LIMIT 100; How to execute query?
  • 25. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. SELECT dt, impressionId FROM impressions WHERE dt < '2009-04-12-14-00’ AND dt >='2009-04-12-13-00’ ORDER BY dt DESC LIMIT 100; Data Schema & Location CREATE EXTERNAL TABLE impressions ( requestBeginTime timestamp, adId string, impressionId string, referrer string, userAgent string, ip varchar(15), number integer, modelId string, requestEndTime timestamp, timers struct<modelLookup:string, requestTime:float>, hostname string, sessionId string) PARTITIONED BY (dt string) ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe’ LOCATION 's3://ads/tables/impressions/'; Athena S3 Schema-On Read
  • 26. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 3.1 Tuning and Design Patterns
  • 27. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Formats and SerDes • Current data formats and SerDes • Standard • CSV, TSV • JSON • Apache web logs • Custom-Delimited Files • Open source columnar formats • Parquet • ORC
  • 28. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Formats in Create Table
  • 29. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Column Data Types • Column Data Types • Primitives ( Integer, Double, Date, String, Decimal, Varchar…) • ARRAY • MAP • STRUCT • Nested JSON • Query using structures and array of structures
  • 30. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Example CREATE EXTERNAL TABLE impressions ( requestBeginTime timestamp, adId string, impressionId string, referrer string, userAgent string, ip varchar(15), number integer, modelId string, requestEndTime timestamp, timers struct<modelLookup:string, requestTime:float>, hostname string, sessionId string) PARTITIONED BY (dt string) ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe’ LOCATION 's3://ads/tables/impressions/'; Athena S3 Primitives Struct
  • 31. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 3.2 Partitions
  • 32. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Partitions • Partition on any column in table • Limits the amount of data each query scans • Common practice • Multi-level time based partition • Example • PARTITIONED BY (year STRING, month STRING, day STRING) • Performance and cost savings on queries filtering by day, month and year
  • 33. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Example CREATE EXTERNAL TABLE impressions ( requestBeginTime timestamp, adId string, impressionId string, referrer string, userAgent string, ip varchar(15), number integer, modelId string, requestEndTime timestamp, timers struct<modelLookup:string, requestTime:float>, hostname string, sessionId string) PARTITIONED BY (dt string) ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe’ LOCATION 's3://ads/tables/impressions/'; Athena S3
  • 34. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Partitions in Create Table
  • 35. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. How does partitioning work? • Flight dataset in Parquet • Partitioned by year • Group by query • Top 10 airports with the most departures since 2000 • Where year >= 2000 • Athena doesn’t have to scan data in partitions before year 2000 • Amount of data scanned affects pricing • 1 min 6s, 3.11GB for unpartitioned • 37.37 s, 2.25GB for CSV • 9.32 secs, 0.04GB for Parquet
  • 36. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. How does partitioning work? CSV: 37.37 s, 2.25GB No partitions: 1min 6s, 3.11GB
  • 37. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Using Partition Projection • Partition values and locations are calculated from configuration • Partition pruning gathers metadata and "prunes" it to only the partitions that apply to your query • Allows Athena to avoid calling GetPartitions To use partition projection, you specify the ranges of partition values and projection types for each partition column in the table properties in the AWS Glue Data Catalog or in your external Hive metastore.
  • 38. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Catalog (Meta Store) table 이름 데이터 위치 flight_csv s3://athena-examples/flight/ 2005 flight_csv s3://athena-examples/flight/ 2006 flight_csv s3://athena-examples/flight/ 2007 flight_csv s3://athena-examples/flight/ 2008 …. 2016 flight_csv s3://athena-examples/flight/ 파티션(year) S3 $ aws s3 ls s3://athena-examples/flight/ PRE year=2005/ PRE year=2006/ PRE year=2007/ PRE year=2008/ PRE year=2009/ PRE year=2010/ PRE year=2011/ PRE year=2012/ PRE year=2013/ PRE year=2014/ PRE year=2015/ PRE year=2016/ S3의 디렉터리 구조에 맞게 파티션 정보 생성 새로운 파티션을 MetaStore에 추가하기 Data Catalog (Meta Store) – Schema, Location
  • 39. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Athena S3 Meta Store (1) read all data (2) filtering Athena S3 (1) read data located in specific partitions (3) execute query (2) execute query • Partitions ✓ • Scan할 데이터 양이 적음 • I/O 속도 향상 비용 감소 Meta Store • Partitions ✗ • Scan할 데이터 양이 많음 • I/O 속도 저하 비용 증가 SELECT origin, count(*) as total_departures FROM flight_csv WHERE year >= ‘2000’ GROUP BY origin ORDER BY total_departures DESC LIMIT 10
  • 40. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Partition by Running MSCK Repair • If your data is already partitioned in Hive format • Use MSCK repair table command. • To sync all new data in S3 with Hive metastore • Example: • msck repair table flights_csv • show partitions flights_csv
  • 41. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Partition by Manually Adding Partitions • If data is not partitioned in Hive format • Cannot use MSCK repair table command • Use ALTER TABLE ADD PARTITION to add each partition manually • Look at automating adding partitions • Example: – ALTER TABLE elb_logs_raw_native_part ADD PARTITION (year='2015',month='01',day='01’) location 's3://athena- examples/elb/raw/2015/01/01/'
  • 42. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Catalog (Meta Store) table 이름 데이터 위치 flight_csv s3://athena-examples/flight/ 2005 flight_csv s3://athena-examples/flight/ 2006 flight_csv s3://athena-examples/flight/ 2007 flight_csv s3://athena-examples/flight/ 2008 …. 2016 flight_csv s3://athena-examples/flight/ 파티션(year) S3 $ aws s3 ls s3://athena-examples/flight/ PRE year=2005/ PRE year=2006/ PRE year=2007/ PRE year=2008/ PRE year=2009/ PRE year=2010/ PRE year=2011/ PRE year=2012/ PRE year=2013/ PRE year=2014/ PRE year=2015/ PRE year=2016/ S3의 디렉터리 구조에 맞게 파티션 정보 생성 새로운 파티션을 MetaStore에 추가하기 Data Catalog (Meta Store) – Schema, Location ALTER TABLE … ADD PARTITIONS … MSCK REPAIR …
  • 43. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Convert to Columnar Formats - Parquet/ORC • Convert to Parquet/ORC, use EMR • Create EMR cluster with Hive or Spark • In the step section of create cluster • Specify script in S3, Hive or PySpark • Point to the input data in text/CSV/JSON • Create output data in Parquet/ORC in S3 • Use Athena CTAS queries to create a new table with Parquet data from a source table in a different format. • CREATE TABLE new_table WITH ( format = 'Parquet', parquet_compression = 'SNAPPY’) AS SELECT * F ROM old_table;
  • 44. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Different workloads - OLTP vs. OLAP OLTP • Online transaction processing • Lots of small operations involving whole rows OLAP • Online analytical processing • Few large operations involving subset of all columns Assumption: I/O is expensive (memory, disk, network, …)
  • 45. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. A0 B0 C0 A1 B1 C1 A2 B2 C2 A3 B3 C3 A4 B4 C4 A5 B5 C5 Row-wise vs. Columnar Row-wise • Horizontal partitioning • OLTP ✓, OLAP ✗ Columnar • Vertical partitioning • OLTP ✗, OLAP ✓ • Free projection pushdown • Compression opportunities Columnar A0 B0 C0 A1 B1 C1 A2 B2 C2 A3 B3 C3 A4 B4 C4 A5 B5 C5 Row 0 Row 1 Row 2 Row 3 Row 4 Row 5 Col A Col B Col C A0 A1 A2 A3 A4 A5 B0 B1 B2 B3 B4 B5 C0 C1 C2 C3 C4 C5 Row-wise
  • 46. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 3.3 Optimize and Secure
  • 47. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Best Practices • Partition your data • Divides your table into parts and keeps the related data together based on column values such as date, country, region, etc • Bucket your data • You can specify one or more columns containing rows that you want to group together, and put those rows into multiple buckets. • Compress and split files • Can speed up your queries significantly • Optimize File Sizes • Ensuring that your file formats are splittable helps with parallelism
  • 48. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Best Practices Continued • Optimize columnar data store generation • Apache Parquet and Apache ORC • Optimize Queries • ORDER BY • JOIN • GROUP BY • LIKE • Only include the columns that you need
  • 49. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Access Control for Athena • Access Control • AWS Identity and Access Management (IAM) • Access Control Lists (ACLs) • Amazon S3 bucket policies. • IAM policies for S3 • IAM is natively integrated with S3 • Grant IAM users fine-grained control to S3 • Restrict users from querying it using Athena
  • 50. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Access Control Continued • To run queries in Athena, you must have the appropriate permissions for the following: • Athena API actions including additional actions for Athena workgroups • Amazon S3 locations where the underlying data to query is stored. • Metadata and resources that you store in the AWS Glue Data Catalog, such as databases and tables, including additional actions for encrypted metadata.
  • 51. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 3.4 Workgroups
  • 52. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Athena Workgroups Athena Workgroups are used to isolate queries between different teams, workloads or applications, and to set limits on amount of data each query or the entire workgroup can process Workload Isolation Query Metrics Cost Controls
  • 53. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Workgroups – Workload Isolation Unique query output location per Workgroup Encrypt results with unique AWS KMS key per Workgroup Collect and publish aggregated metrics per Workgroup to AWS CloudWatch Use Workgroup settings eliminating need to configure individual users
  • 54. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Workgroups – Metric Reporting Total bytes scanned per Workgroup Total failed queries per Workgroup Total successful queries per Workgroup Total query execution time per Workgroup
  • 55. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Workgroups – Cost Controls • Per query data scanned threshold; exceeding, will cancel query • Trigger alarms to notify of increasing usage and cost • Disable Workgroup when all queries exceed a maximum threshold Any Athena metric: successful/failed & total queries, query run time, etc.
  • 56. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Setting Up Workgroups 1. Decide which workgroups to create. For example, you can decide the following: • Who can run queries in each workgroup, and who owns workgroup configuration. This determines IAM policies you create. 2. Create workgroups as needed, and add tags to them. • Open the Athena console, choose the Workgroup:<workgroup_name> tab, and then choose Create workgroup. 3. Create IAM policies for your users, groups, or roles to enable their access to workgroups. 4. Specify a location in Amazon S3 for query results and encryption settings
  • 57. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Default Workgroup By default, if you have not created any workgroups, all queries in your account run in the primary workgroup
  • 58. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 3.5 Views
  • 59. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. View • A view in Amazon Athena is a logical, not a physical table. • The query that defines a view runs each time the view is referenced in a query. • Views do not contain any data and do not write data. • Instead, the query specified by the view runs each time you reference the view by another query.
  • 60. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Create a View • Creates a new view from a specified SELECT query. • The view is a logical table that can be referenced by future queries. • Flow: • Create View: • Write View:
  • 61. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 3.6 CTAS
  • 62. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. CREATE TABLE AS SELECT (CTAS) • Creates a new table in Athena from the results of a SELECT statement from another query • Athena stores data files created by the CTAS statement in a specified location in Amazon S3
  • 63. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. CREATE TABLE new_table WITH ( external_location = 's3://my_athena_results/my_parquet_stas_table/', format = 'Parquet', parquet_compression = 'SNAPPY' ) AS SELECT * FROM old_table; See more examples in https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html CTAS Example
  • 64. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Use CTAS queries to: • Create tables from query results in one step, without repeatedly querying raw data sets. This makes it easier to work with raw data sets. • Transform query results into other storage formats, such as Parquet and ORC. This improves query performance and reduces query costs in Athena. • Create copies of existing tables that contain only the data you need.
  • 65. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Processing: ETL vs. ELT Source1 Source2 Target Source1 Source2 Staging tables Final tables Target (MPP database) Extract Transform Load E → T → L Extract & Load Transform E → L → T Source3 Source3
  • 66. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 3.7 Monitoring & Auditing
  • 67. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Workgroups • Separate users, teams, applications, or workloads • Enable queries on Requester Pays buckets in Amazon S3 • In order to: • Set limits on amount of data each query or the entire workgroup can process • Track costs Ø Decide which workgroups to create. For example, you can decide the following: • Who can run queries in each workgroup, and who owns workgroup configuration. This determines IAM policies you create. Ø Create workgroups as needed, and add tags to them. • Open the Athena console, choose the Workgroup:<workgroup_name> tab, and then choose Create workgroup. Ø Create IAM policies for your users, groups, or roles to enable their access to workgroups. Ø Specify a location in Amazon S3 for query results and encryption settings
  • 68. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. CloudTrail Monitor Athena with AWS CloudTrail
  • 69. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Monitoring Athena Queries with CloudWatch Metrics Use CloudWatch Events with Athena: • EngineExecutionTime – in milliseconds • ProcessedBytes – the total amount of data scanned per DML query • QueryPlanningTime – in milliseconds • QueryQueueTime – in milliseconds • ServiceProcessingTime – in milliseconds • TotalExecutionTime – in milliseconds, for DDL and DML queries
  • 70. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Monitoring Athena Queries with CloudWatch Events • Use simple rules to match events and route them to one or more target functions or streams. • Respond to operational changes and takes corrective action as necessary, by • Sending messages to respond to the environment • Activating functions • Making changes • Capturing state information
  • 71. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 4. Summary
  • 72. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Key Benefits of Athena • Fast ad-hoc queries • Query directly against data in S3 • Use standard ANSI SQL • No infrastructure to setup and manage • Use IAM for security • JDBC driver for BI tools and SQL clients • Pay per query, depending on data scanned
  • 73. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Getting Started Tutorials • Amazon Athena in Action: Workshop • Amazon Athena Getting Started Cookbook Examples • Data visualization and anomaly detection using Amazon Athena and Pandas from Amazon SageMaker (2020-09-25) • Prepare data for model-training and invoke machine learning models with Amazon Athena (2019-11-26) Tools • AWS Data Wrangler: Pandas on AWS • PyAthena: a python client for Amazon Athena
  • 74. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Questions?