Amazon Athena overview
Software Engineer at EPAM
Raman Maskalenka
November 30
© 2019 EPAM Systems, Inc.
Table of contents
• Amazon Athena overview
• Supported data types
• Technologies under the hood
• Simple use case
• Integration with other services
• Things to consider while using Athena
Amazon Athena Overview
A serverless, interactive, highly available SQL query service
• Serverless
  • No infrastructure to set up
  • Zero spin-up time
  • Transparent upgrades
• Interactive
  • Fast query execution
  • Descriptive error messages
• Highly available
  • Athena uses warm compute pools across multiple Availability Zones
  • Your data is stored in S3, which is also designed for high availability
• Core effective
  • Queries are parallelized automatically
  • Results are streamed to the console
  • Tuned for performance
• Uses ANSI SQL
  • Supports complex joins, nested queries, and window functions
  • Supports complex data types (arrays, structs)
  • Supports partitioning by almost any key, except datetime timestamps
• Cost effective
  • Pay per query
  • $5 per TB scanned
Supported data types
• Text files (CSV, raw)
• Apache web logs, TSV
• JSON (simple, nested)
• Compressed files
• Apache Parquet and Apache ORC
Technologies under the hood
Presto
Originally created by Facebook to run interactive analytic queries on large amounts of data.
• In-memory distributed query engine, ANSI SQL compatible with extensions
• Used by Athena for SQL queries
Apache Hive
Data warehouse software project built on top of Apache Hadoop that provides data query and analysis. It allows running SQL queries over distributed data.
• Used by Athena for data definition language (DDL) functionality
• Supports complex data types and multiple formats
• Supports partitioning
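To make the Hive DDL side more concrete, here is a minimal sketch that assembles a partitioned, regex-based `CREATE EXTERNAL TABLE` statement as a string. The helper name `build_create_table_ddl`, the table `access_logs`, its columns, and the bucket `s3://example-bucket/logs/` are all hypothetical and do not come from the deck. The sketch assumes the Hive `RegexSerDe` with string-typed columns, and a regex that contains no quotes or backslashes, so no escaping is attempted.

```python
def build_create_table_ddl(table, columns, partition_columns, input_regex, location):
    """Return a Hive-style CREATE EXTERNAL TABLE statement as a string.

    columns and partition_columns are lists of (name, type) pairs.
    Assumption: input_regex contains no quotes or backslashes.
    """
    cols = ",\n  ".join(f"`{name}` {type_}" for name, type_ in columns)
    parts = ", ".join(f"`{name}` {type_}" for name, type_ in partition_columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'\n"
        f"WITH SERDEPROPERTIES ('input.regex' = '{input_regex}')\n"
        f"LOCATION '{location}'"
    )


# Hypothetical example: space-separated log lines, partitioned by date.
ddl = build_create_table_ddl(
    table="access_logs",
    columns=[("host", "string"), ("method", "string"), ("path", "string")],
    partition_columns=[("dt", "string")],
    input_regex="([^ ]*) ([^ ]*) ([^ ]*)",  # one capture group per column
    location="s3://example-bucket/logs/",
)
print(ddl)
```

The statement only describes how to read the files under the given location; the data itself stays in S3 unchanged.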
Simple use case
Integration with other services
Things to consider while using Athena
PROS
• Data is not transformed in S3
• You can write complex regexes for table creation
• You don't pay for data transformation
• You can store your data in a compressed format to lower costs
• Rich access control (IAM, ACLs, S3 bucket policies)
• Integrates with many business intelligence (BI) tools
CONS
• Canceled queries are still charged for the data scanned
• The data scanned per query is rounded up to the nearest MB, with a 10 MB minimum
• Total query cost consists of the S3 data read charges plus the Athena rate for data scanned
• Not all Hive DDL statements are supported by Athena
• Hive and Presto transactions are not supported by Athena
• User-defined functions and stored procedures are not supported
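The pricing rules from this deck ($5 per TB scanned, rounding up to the nearest MB, 10 MB minimum per query) can be sketched as a small estimator. The function names are hypothetical, and the sketch assumes binary units (1 MB = 1024² bytes, 1 TB = 1024² MB). It covers only the Athena scan charge, not the S3 read charges mentioned above.

```python
PRICE_PER_TB_USD = 5.0     # from the deck: $5 per TB scanned
MIN_BILLED_MB = 10         # from the deck: 10 MB minimum per query
BYTES_PER_MB = 1024 ** 2   # assumption: binary megabytes
MB_PER_TB = 1024 ** 2      # assumption: binary terabytes


def billed_megabytes(bytes_scanned: int) -> int:
    """Round the scanned bytes up to whole MB and apply the 10 MB minimum."""
    rounded_up = (bytes_scanned + BYTES_PER_MB - 1) // BYTES_PER_MB
    return max(rounded_up, MIN_BILLED_MB)


def estimate_query_cost_usd(bytes_scanned: int) -> float:
    """Estimate the Athena scan charge for a single query, in US dollars."""
    return billed_megabytes(bytes_scanned) * PRICE_PER_TB_USD / MB_PER_TB


# A query scanning 1 TB versus the same data compressed to a quarter of the size.
print(estimate_query_cost_usd(1024 ** 4))
print(estimate_query_cost_usd(1024 ** 4 // 4))
```

Because the charge depends on the bytes scanned, storing data compressed reduces the bill in proportion to the reduction in size.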