Introduction to Amazon Athena

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Interactive query service to analyze data in Amazon S3 using standard SQL
Amazon Athena

Agenda
• Amazon Athena Overview
• Athena Design Patterns
• Athena in Action
• Create External Tables
• Data Formats and SerDes
• Partitions
• Optimize and Secure
• Workgroups
• Views
• CTAS
• Monitoring & Auditing
• Summary

1. Introduction to Amazon Athena

Amazon Athena
Serverless Distributed (SQL) Query Engine for Amazon S3
Distributed SQL Query Engine Amazon S3

Why Amazon Athena ?
• Decouple storage from compute
• Serverless – No infrastructure or resources to manage
• Pay only for data scanned
• Schema on read – Same data, many views
• Secure – IAM/SAMLv2 for authentication; Encryption at rest
& in transit
• Standard compliant and open storage file formats
• Built on powerful community supported OSS solutions

Simple Pricing
• DDL operations – FREE
• SQL operations – FREE
• Query concurrency – FREE
• Data scanned - $5 / TB
• Standard S3 rates for storage, requests, and data
transfers apply

Runs standard SQL
• Uses Presto with ANSI SQL support
• Works with standard data formats
• CSV
• Apache Weblogs
• JSON
• Parquet
• ORC
• Handles complex queries
• Large Joins
• Window functions
• Arrays

Fast Performance for Large Data Sets
• Fast, ad-hoc queries
• Executes queries in parallel
• No provisioning extra resources for complex queries
• Scales automatically
• Amazon Athena Federated Query

Storage
Query
Engine Scale-up
Scale-out
How to handle Large Data Sets?

Storage
Storage Interface Layer
Storage
Compute
Decouple storage from compute
Compute
Scale-out

Storage Interface Layer
Scale-out
Storage
Scale-out
Compute
Distributed Processing Framework
Distributed File System (DFS)
ex) HDFS, Amazon S3, …
ex) Hadoop MapReduce,
Apache Spark, Flink
Apache Hive, Presto
Amazon Athena, …
Distributed Data Processing System

2. Athena Design Patterns

1: Ad-hoc use-case

S3
Athena
AWS Glue Data Catalog
Query data
Hot data
Warm & cold data
Application request
2: SaaS use-case

AWS service logs
Application logs
Data sourced from external
vendors
S3
Athena
Update table partition
Query data
S3
Athena CTAS and INSERT INTO
to ETL
3: ETL and query use-case
Glue Data Catalog

S3 Athena
Query data
S3
4: Data science exploration and feature engineering
Glue Data Catalog
Raw Data
Transformed data
SageMaker

Comparison of SQL Processing engines
Data Structure Semi Semi Semi Full
Languages API/SQL SQL SQL SQL
Data Store
S3 (Glue),
S3/HDFS (Spark)
S3/HDFS S3 Local
Use case Transformation
SQL Queries for
S3/HDFS
Serverless SQL
Queries for S3
Fully Featured
SQL Database
Performance
Glue Amazon Athena Amazon Redshift
Amazon EMR/ Amazon EMR

3. Athena in Action

Create External Tables
• Use Apache Hive DDL to create table
• Run DDL statements using the Athena console
• Via a JDBC driver, using SQL workbench
• Using the Athena create table wizard
• Create “external” table in DDL
• Creates a view of the data
• Deleting table doesn’t delete the data in S3
• Schema-on-read
• Projects your schema onto data at query execution time
• No need for data loading or transformation

Navigate to ’Saved Queries’ to get DDL

Create External Table using DDL

New Flights Parquet Table Created

Save Query Results
• When you run an Athena query for the first time, an S3 bucket called
"aws-athena-query-results-<account_id>" is created on your account,
where "<account_id>" is replaced with your AWS account ID.
• The results of all Athena queries are stored in this S3 bucket using the
year, month, and day the query was run, the hexadecimal Athena
query ID, and the region the query was run in the following format:

Athena
S3
$ aws s3 ls s3://ads/impressions/
PRE dt=2009-04-12-13-00/
PRE dt=2009-04-12-13-05/
PRE dt=2009-04-12-13-10/
PRE dt=2009-04-12-13-15/
PRE dt=2009-04-12-13-20/
PRE dt=2009-04-12-14-00/
PRE dt=2009-04-12-14-05/
PRE dt=2009-04-12-14-10/
PRE dt=2009-04-12-14-15/
PRE dt=2009-04-12-14-20/
PRE dt=2009-04-12-15-00/
PRE dt=2009-04-12-15-05/
…..
-------------------------------------------
2009-04-12-13-20 ap3HcVKAWfXtgIPu6WpuUfAfL0DQEc
2009-04-12-13-20 17uchtodoS9kdeQP1x0XThKl5IuRsV
2009-04-12-13-20 JOUf1SCtRwviGw8sVcghqE5h0nkgtp
2009-04-12-13-20 NQ2XP0J0dvVbCXJ0pb4XvqJ5A4QxxH
2009-04-12-13-20 fFAItiBMsgqro9kRdIwbeX60SROaxr
2009-04-12-13-20 V4og4R9W6G3QjHHwF7gI1cSqig5D1G
2009-04-12-13-20 hPEPtBwk45msmwWTxPVVo1kVu4v11b
2009-04-12-13-20 v0SkfxegheD90gp31UCr6FplnKpx6i
2009-04-12-13-20 1iD9odVgOIi4QWkwHMcOhmwTkWDKfj
2009-04-12-13-20 b31tJiIA25CK8eDHQrHnbcknfSndUk
SELECT dt, impressionId
FROM impressions
WHERE dt < '2009-04-12-14-00’
AND dt >='2009-04-12-13-00’
ORDER BY dt DESC
LIMIT 100;
How to execute query?

SELECT dt, impressionId
FROM impressions
WHERE dt < '2009-04-12-14-00’
AND dt >='2009-04-12-13-00’
ORDER BY dt DESC
LIMIT 100;
Data Schema & Location
CREATE EXTERNAL TABLE impressions (
requestBeginTime timestamp,
adId string,
impressionId string,
referrer string,
userAgent string,
ip varchar(15),
number integer,
modelId string,
requestEndTime timestamp,
timers struct<modelLookup:string,
requestTime:float>,
hostname string,
sessionId string)
PARTITIONED BY (dt string)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe’
LOCATION 's3://ads/tables/impressions/';
Athena
S3
Schema-On Read

3.1 Tuning and Design Patterns

Data Formats and SerDes
• Current data formats and SerDes
• Standard
• CSV, TSV
• JSON
• Apache web logs
• Custom-Delimited Files
• Open source columnar formats
• Parquet
• ORC

Data Formats in Create Table

Column Data Types
• Column Data Types
• Primitives ( Integer, Double, Date, String, Decimal, Varchar…)
• ARRAY
• MAP
• STRUCT
• Nested JSON
• Query using structures and array of structures

Example
adId string,
referrer string,
userAgent string,
ip varchar(15),
number integer,
modelId string,
requestTime:float>,
hostname string,
sessionId string)
ROW FORMAT serde
'org.apache.hive.hcatalog.data.JsonSerDe’
Athena
S3
Primitives
Struct

3.2 Partitions

Partitions
• Partition on any column in table
• Limits the amount of data each query scans
• Common practice
• Multi-level time based partition
• Example
• PARTITIONED BY (year STRING, month STRING, day STRING)
• Performance and cost savings on queries filtering by day, month and
year

Example
adId string,
referrer string,
userAgent string,
ip varchar(15),
number integer,
modelId string,
requestTime:float>,
hostname string,
sessionId string)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe’
Athena
S3

Partitions in Create Table

How does partitioning work?
• Flight dataset in Parquet
• Partitioned by year
• Group by query
• Top 10 airports with the most
departures since 2000
• Where year >= 2000
• Athena doesn’t have to scan data in
partitions before year 2000
• Amount of data scanned affects pricing
• 1 min 6s, 3.11GB for unpartitioned
• 37.37 s, 2.25GB for CSV
• 9.32 secs, 0.04GB for Parquet

How does partitioning work?
CSV: 37.37 s, 2.25GB No partitions: 1min 6s, 3.11GB

Using Partition Projection
• Partition values and locations are calculated from configuration
• Partition pruning gathers metadata and "prunes" it to only the partitions
that apply to your query
• Allows Athena to avoid calling GetPartitions
To use partition projection, you specify the ranges of partition values
and projection types for each partition column in the table properties
in the AWS Glue Data Catalog or in your external Hive metastore.

Data Catalog (Meta Store)
table 이름 데이터 위치
flight_csv s3://athena-examples/flight/ 2005
….
2016
flight_csv s3://athena-examples/flight/
파티션(year)
S3
$ aws s3 ls s3://athena-examples/flight/
PRE year=2005/
PRE year=2006/
PRE year=2007/
PRE year=2008/
PRE year=2009/
PRE year=2010/
PRE year=2011/
PRE year=2012/
PRE year=2013/
PRE year=2014/
PRE year=2015/
PRE year=2016/
S3의 디렉터리
구조에 맞게
파티션 정보 생성
새로운 파티션을 MetaStore에 추가하기
Data Catalog (Meta Store) – Schema, Location

Athena
S3
Meta Store
(1) read all data
(2) filtering
Athena
S3
(1) read data
located in specific
partitions
(3) execute query (2) execute query
• Partitions ✓
• Scan할 데이터 양이 적음
• I/O 속도 향상
비용 감소
Meta Store
• Partitions ✗
• Scan할 데이터 양이 많음
• I/O 속도 저하
비용 증가
SELECT origin, count(*) as total_departures
FROM flight_csv
WHERE year >= ‘2000’
GROUP BY origin
ORDER BY total_departures DESC
LIMIT 10

Partition by Running MSCK Repair
• If your data is already partitioned in Hive format
• Use MSCK repair table command.
• To sync all new data in S3 with Hive metastore
• Example:
• msck repair table flights_csv
• show partitions flights_csv

Partition by Manually Adding Partitions
• If data is not partitioned in Hive format
• Cannot use MSCK repair table command
• Use ALTER TABLE ADD PARTITION to add each partition manually
• Look at automating adding partitions
• Example:
– ALTER TABLE elb_logs_raw_native_part ADD PARTITION
(year='2015',month='01',day='01’) location 's3://athena-
examples/elb/raw/2015/01/01/'

Data Catalog (Meta Store)
table 이름 데이터 위치
….
2016
flight_csv s3://athena-examples/flight/
파티션(year)
S3
$ aws s3 ls s3://athena-examples/flight/
PRE year=2005/
PRE year=2006/
PRE year=2007/
PRE year=2008/
PRE year=2009/
PRE year=2010/
PRE year=2011/
PRE year=2012/
PRE year=2013/
PRE year=2014/
PRE year=2015/
PRE year=2016/
S3의 디렉터리
구조에 맞게
파티션 정보 생성
새로운 파티션을 MetaStore에 추가하기
Data Catalog (Meta Store) – Schema, Location
ALTER TABLE … ADD PARTITIONS …
MSCK REPAIR …

Convert to Columnar Formats - Parquet/ORC
• Convert to Parquet/ORC, use EMR
• Create EMR cluster with Hive or Spark
• In the step section of create cluster
• Specify script in S3, Hive or PySpark
• Point to the input data in text/CSV/JSON
• Create output data in Parquet/ORC in S3
• Use Athena CTAS queries to create a new table with Parquet data from a source
table in a different format.
• CREATE TABLE new_table
WITH (
format = 'Parquet', parquet_compression = 'SNAPPY’)
AS SELECT * F
ROM old_table;

Different workloads - OLTP vs. OLAP
OLTP
• Online transaction
processing
• Lots of small operations
involving whole rows
OLAP
• Online analytical processing
• Few large operations
involving subset of all
columns
Assumption: I/O is expensive (memory, disk, network, …)

A0 B0 C0 A1 B1 C1 A2 B2 C2
A3 B3 C3 A4 B4 C4 A5 B5 C5
Row-wise vs. Columnar
Row-wise
• Horizontal partitioning
• OLTP ✓, OLAP ✗
Columnar
• Vertical partitioning
• OLTP ✗, OLAP ✓
• Free projection pushdown
• Compression opportunities
Columnar
A0 B0 C0
A1 B1 C1
A2 B2 C2
A3 B3 C3
A4 B4 C4
A5 B5 C5
Row 0
Row 1
Row 2
Row 3
Row 4
Row 5
Col A Col B Col C
A0 A1 A2 A3 A4 A5 B0 B1 B2
B3 B4 B5 C0 C1 C2 C3 C4 C5
Row-wise

3.3 Optimize and Secure

Best Practices
• Partition your data
• Divides your table into parts and keeps the related data together based on column
values such as date, country, region, etc
• Bucket your data
• You can specify one or more columns containing rows that you want to group together,
and put those rows into multiple buckets.
• Compress and split files
• Can speed up your queries significantly
• Optimize File Sizes
• Ensuring that your file formats are splittable helps with parallelism

Best Practices Continued
• Optimize columnar data store generation
• Apache Parquet and Apache ORC
• Optimize Queries
• ORDER BY
• JOIN
• GROUP BY
• LIKE
• Only include the columns that you need

Access Control for Athena
• Access Control
• AWS Identity and Access Management (IAM)
• Access Control Lists (ACLs)
• Amazon S3 bucket policies.
• IAM policies for S3
• IAM is natively integrated with S3
• Grant IAM users fine-grained control to S3
• Restrict users from querying it using Athena

Access Control Continued
• To run queries in Athena, you must have the appropriate permissions for the
following:
• Athena API actions including additional actions for Athena workgroups
• Amazon S3 locations where the underlying data to query is stored.
• Metadata and resources that you store in the AWS Glue Data Catalog,
such as databases and tables, including additional actions for encrypted
metadata.

3.4 Workgroups

Athena Workgroups
Athena Workgroups are used to isolate
queries between different teams, workloads
or applications, and to set limits on amount
of data each query or the entire workgroup
can process
Workload Isolation Query Metrics Cost Controls

Workgroups – Workload Isolation
Unique query output
location per
Workgroup
Encrypt results with
unique AWS KMS key
per Workgroup
Collect and publish
aggregated metrics per
Workgroup to AWS
CloudWatch
Use Workgroup settings
eliminating need to
configure individual
users

Workgroups – Metric Reporting
Total bytes scanned
per Workgroup
Total failed queries
per Workgroup
Total successful
queries per
Workgroup
Total query execution
time per Workgroup

Workgroups – Cost Controls
• Per query data scanned threshold; exceeding, will cancel query
• Trigger alarms to notify of increasing usage and cost
• Disable Workgroup when all queries exceed a maximum threshold
Any Athena metric: successful/failed & total queries, query run time, etc.

Setting Up Workgroups
1. Decide which workgroups to create. For example, you can decide the following:
• Who can run queries in each workgroup, and who owns workgroup
configuration. This determines IAM policies you create.
2. Create workgroups as needed, and add tags to them.
• Open the Athena console, choose the Workgroup:<workgroup_name> tab,
and then choose Create workgroup.
3. Create IAM policies for your users, groups, or roles to enable their access to
workgroups.
4. Specify a location in Amazon S3 for query results and encryption settings

Default Workgroup
By default, if you have not created any workgroups, all queries in your account run in
the primary workgroup

3.5 Views

View
• A view in Amazon Athena is a logical, not a physical table.
• The query that defines a view runs each time the view is referenced in a query.
• Views do not contain any data and do not write data.
• Instead, the query specified by the view runs each time you reference
the view by another query.

Create a View
• Creates a new view from a specified SELECT query.
• The view is a logical table that can be referenced by future queries.
• Flow:
• Create View:
• Write View:

3.6 CTAS

CREATE TABLE AS SELECT (CTAS)
• Creates a new table in Athena from the results of a SELECT statement
from another query
• Athena stores data files created by the CTAS statement in a specified
location in Amazon S3

CREATE TABLE new_table
WITH (
external_location =
's3://my_athena_results/my_parquet_stas_table/',
format = 'Parquet',
parquet_compression = 'SNAPPY'
)
AS SELECT *
FROM old_table;
See more examples in
https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
CTAS Example

Use CTAS queries to:
• Create tables from query results in one step, without repeatedly
querying raw data sets. This makes it easier to work with raw data
sets.
• Transform query results into other storage formats, such as Parquet
and ORC. This improves query performance and reduces query costs
in Athena.
• Create copies of existing tables that contain only the data you need.

Data Processing: ETL vs. ELT
Source1
Source2 Target
Source1
Source2
Staging
tables
Final
tables
Target
(MPP database)
Extract Transform Load
E → T → L
Extract & Load Transform
E → L → T
Source3
Source3

3.7 Monitoring & Auditing

Workgroups
• Separate users, teams, applications, or workloads
• Enable queries on Requester Pays buckets in Amazon S3
• In order to:
• Set limits on amount of data each query or the entire workgroup can
process
• Track costs
Ø Decide which workgroups to create. For example, you can decide the following:
• Who can run queries in each workgroup, and who owns workgroup
configuration. This determines IAM policies you create.
Ø Create workgroups as needed, and add tags to them.
• Open the Athena console, choose the Workgroup:<workgroup_name> tab, and
then choose Create workgroup.
Ø Create IAM policies for your users, groups, or roles to enable their access to workgroups.
Ø Specify a location in Amazon S3 for query results and encryption settings

CloudTrail
Monitor Athena with AWS CloudTrail

Monitoring Athena Queries with CloudWatch Metrics
Use CloudWatch Events with Athena:
• EngineExecutionTime – in milliseconds
• ProcessedBytes – the total amount of data scanned per DML query
• QueryPlanningTime – in milliseconds
• QueryQueueTime – in milliseconds
• ServiceProcessingTime – in milliseconds
• TotalExecutionTime – in milliseconds, for DDL and DML queries

Monitoring Athena Queries with CloudWatch Events
• Use simple rules to match events and route
them to one or more target functions or
streams.
• Respond to operational changes and takes
corrective action as necessary, by
• Sending messages to respond to the environment
• Activating functions
• Making changes
• Capturing state information

4. Summary

Key Benefits of Athena
• Fast ad-hoc queries
• Query directly against data in S3
• Use standard ANSI SQL
• No infrastructure to setup and manage
• Use IAM for security
• JDBC driver for BI tools and SQL clients
• Pay per query, depending on data scanned

Getting Started
Tutorials
• Amazon Athena in Action: Workshop
• Amazon Athena Getting Started Cookbook
Examples
• Data visualization and anomaly detection using Amazon Athena and
Pandas from Amazon SageMaker (2020-09-25)
• Prepare data for model-training and invoke machine learning models
with Amazon Athena (2019-11-26)
Tools
• AWS Data Wrangler: Pandas on AWS
• PyAthena: a python client for Amazon Athena

Questions?

Introduction to Amazon Athena

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Introduction to Amazon Athena

Similar to Introduction to Amazon Athena (20)

More from Sungmin Kim

More from Sungmin Kim (12)

Recently uploaded

Recently uploaded (20)

Introduction to Amazon Athena