Querying Data Pipeline with AWS Athena

The Data Pipeline team at Demonware (Activision) has to route large amounts of data from various sources to many destinations every day.

Our team always wanted to be able to query processed data for debugging and analytical purposes, but creating large data warehouses was never our priority, since it usually happens downstream.

AWS Athena is a completely serverless query service that doesn't require any infrastructure setup or complex provisioning. We just needed to save some of our data streams to AWS S3 and define a schema. Just a few simple steps, but in the end we were able to write complex SQL queries against gigabytes of data and get results in seconds.

In this presentation I want to show multiple ways to stream your data to AWS S3, explain some of the underlying tech, show how to define a schema, and finally share some of the best practices we applied.


1. Querying Data Pipeline with AWS Athena. Yaroslav Tkachenko, Senior Software Engineer
2. Our main titles
3. [Architecture diagram] Game clients and game servers (events, metrics, telemetry, etc.) feed the "Pipes" Data Pipeline, which routes data to archiving, the data warehouse, analytics, and external services.
4. The Problem. Now: Stream Processing, Data Warehouse
5. The Problem. Now: Stream Processing, Data Warehouse
    • Not an option: want to query historical data
    • Need raw data
    • Don't want to support complex infrastructure
    • Retention is usually short
6. AWS Athena. Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
7. The Solution
8. Building blocks (AWS Athena)
    • AWS S3 - persisting streaming data
    • Schema definition - Apache Hive [1] DDL for describing schemas
    • Query language - Presto [2] (ANSI-compatible) SQL for querying
    [1] Apache Hive - distributed data warehouse software
    [2] Presto - distributed SQL query engine
9. AWS S3
10. AWS S3: Supported formats and compression
    • Supported formats: JSON, CSV, Avro, Parquet, ORC, Apache Web Server logs, Logstash Grok, CloudTrail
    • Supported compression: Snappy, Zlib, GZIP, LZO
11. AWS S3: S3?
12. Streaming data to S3 (AWS S3)
    Non-AWS pipelines:
    • Kafka -> Kafka Connect, Secor
    • ActiveMQ, RabbitMQ, etc. -> Camel
    • ??? -> Custom Consumer (a minimal consumer sketch follows below)
    AWS pipelines:
    • Lambda <- SQS
    • Kinesis Stream -> Lambda -> Kinesis Firehose
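    For the "Custom Consumer" case, here is a minimal sketch (not from the deck) of what such a consumer might look like, assuming a Kafka topic named events, a bucket named events-archive, and the kafka-python and boto3 libraries; it buffers messages to avoid producing lots of small files and flushes gzipped, newline-delimited JSON batches to S3:

    import gzip
    import time
    import uuid

    import boto3
    from kafka import KafkaConsumer  # kafka-python

    s3 = boto3.client("s3")
    # topic name and broker address are hypothetical
    consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092")

    BUCKET = "events-archive"  # hypothetical bucket
    BATCH_SIZE = 10_000        # buffer to avoid lots of small S3 objects

    buffer = []
    for message in consumer:
        buffer.append(message.value.decode("utf-8"))
        if len(buffer) >= BATCH_SIZE:
            key = f"{time.strftime('%Y/%m/%d')}/{uuid.uuid4()}.json.gz"
            body = gzip.compress("\n".join(buffer).encode("utf-8"))
            s3.put_object(Bucket=BUCKET, Key=key, Body=body)
            buffer = []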
13. Kinesis Pipeline (AWS S3)
    1. Kinesis Stream as the input
    2. Lambda to forward to Firehose and transform (optional)
    3. Kinesis Firehose as a buffer (by size or time), with compression and another optional transformation (using Lambda)
    A sketch of the forwarding Lambda in step 2 follows below.
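    A minimal forwarding Lambda might look like this sketch (assumptions: a Firehose delivery stream named events-to-s3 and no transformation step); it decodes the base64-encoded Kinesis records and pushes them to Firehose in one batch:

    import base64

    import boto3

    firehose = boto3.client("firehose")

    def handler(event, context):
        # Kinesis delivers records base64-encoded inside the Lambda event
        records = [
            {"Data": base64.b64decode(r["kinesis"]["data"])}
            for r in event["Records"]
        ]
        # put_record_batch accepts at most 500 records per call;
        # real code would split larger batches and retry failed records
        firehose.put_record_batch(
            DeliveryStreamName="events-to-s3",  # hypothetical delivery stream name
            Records=records,
        )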
14. Schema definition
15. Schema definition. Apache Hive Data Definition Language (DDL) is used for describing tables and databases: ALTER DATABASE SET DBPROPERTIES, ALTER TABLE ADD PARTITION, ALTER TABLE DROP PARTITION, ALTER TABLE RENAME PARTITION, ALTER TABLE SET LOCATION, ALTER TABLE SET TBLPROPERTIES, CREATE DATABASE, CREATE TABLE, DESCRIBE TABLE, DROP DATABASE, DROP TABLE, MSCK REPAIR TABLE, SHOW COLUMNS, SHOW CREATE TABLE, SHOW DATABASES, SHOW PARTITIONS, SHOW TABLES, SHOW TBLPROPERTIES, VALUES
16. Schema definition
    CREATE EXTERNAL TABLE table_name (
      id STRING,
      data STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    LOCATION 's3://bucket-name/'
    tblproperties ("parquet.compress"="SNAPPY")
17. Schema definition. Supported types: TINYINT, SMALLINT, INT, BIGINT; BOOLEAN; DOUBLE, DECIMAL; STRING; BINARY; TIMESTAMP; DATE (not supported for Parquet); VARCHAR; ARRAY, MAP, STRUCT
18. Schema definition. Example event:
    {
      "metadata": {
        "client_id": "21353253123",
        "timestamp": 1497996200,
        "category_id": "1"
      },
      "payload": "g3ng0943g93490gn3094"
    }
19. Schema definition
    CREATE EXTERNAL TABLE events (
      metadata struct<client_id:string,
                      timestamp:timestamp,
                      category_id:string>,
      payload string
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://events/'
20. Query language
21. Query language. Presto SQL is used for querying data:
    SELECT [ ALL | DISTINCT ] select_expression [, ...]
    [ FROM from_item [, ...] ]
    [ WHERE condition ]
    [ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
    [ HAVING condition ]
    [ UNION [ ALL | DISTINCT ] union_query ]
    [ ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ] [, ...] ]
    [ LIMIT [ count | ALL ] ]
22. Query language
    SELECT data FROM events where headers.user_id = 123 order by headers.timestamp limit 10;

    SELECT os, COUNT(*) count FROM cloudfront_logs
    WHERE date BETWEEN date '2014-07-05' AND date '2014-08-05'
    GROUP BY os;

    SELECT customer.c_name, lineitem.l_quantity, orders.o_totalprice
    FROM lineitem, orders, customer
    WHERE lineitem.l_orderkey = orders.o_orderkey AND customer.c_custkey = orders.o_custkey;
23. Best practices
24. Best practices. Why? Good performance → Low cost!
25. Best practices: Partitioning
    • Find a good partitioning field, like a date, version, user, etc.
    • Update Athena with the partitioning schema (use PARTITIONED BY in DDL) and metadata
    • You can create partitions manually or let Athena handle them (but that requires a certain structure)
    • But there is no magic! You have to use partitioning fields in queries (like regular fields), otherwise no partitioning is applied
26. Best practices: Partitioning
    CREATE EXTERNAL TABLE events …
    PARTITIONED BY (year string, month string, day string)

    1) SELECT data FROM events WHERE event_id = '98632765';
    2) SELECT data FROM events WHERE event_id = '98632765' AND year = '2017' AND month = '06' AND day = '21';
27. Best practices: Partitioning. Manual partitioning:
    • s3://events/2017/06/20/1.parquet
    • s3://events/2017/06/20/2.parquet
    • s3://events/2017/06/20/3.parquet
    • s3://events/2017/06/21/1.parquet
    • s3://events/2017/06/21/2.parquet
    • s3://events/2017/06/21/3.parquet
    • s3://events/2017/06/22/1.parquet
    • …
    ALTER TABLE events ADD PARTITION (date='2017-06-20') location 's3://events/2017/06/20/'
    ...
28. Best practices: Partitioning. Automatic partitioning:
    • s3://events/date=2017-06-20/1.parquet
    • s3://events/date=2017-06-20/2.parquet
    • s3://events/date=2017-06-20/3.parquet
    • s3://events/date=2017-06-21/1.parquet
    • s3://events/date=2017-06-21/2.parquet
    • s3://events/date=2017-06-21/3.parquet
    • s3://events/date=2017-06-22/1.parquet
    • …
    MSCK REPAIR TABLE events
    (a sketch of producing this key layout follows below)
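    One way to produce that Hive-style date=YYYY-MM-DD layout (a sketch, not from the deck; the bucket name is an assumption, and newline-delimited JSON is used only for brevity where Parquet would be preferable) is to build the partition prefix into the S3 key when flushing a batch; once new prefixes appear, MSCK REPAIR TABLE registers them as partitions:

    import gzip
    import json
    import uuid
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")

    def write_batch(events, bucket="events"):  # hypothetical bucket
        # key ends up like date=2017-06-20/<uuid>.json.gz
        date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
        key = f"date={date}/{uuid.uuid4()}.json.gz"
        body = gzip.compress("\n".join(json.dumps(e) for e in events).encode("utf-8"))
        s3.put_object(Bucket=bucket, Key=key, Body=body)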
29. Best practices: Performance tips
    • Use binary formats like Parquet! (a small conversion sketch follows below)
    • Don't forget about compression
    • Only include the columns that you need
    • LIMIT is amazing!
    • For more SQL optimizations look at Presto best practices
    • Avoid a lot of small files
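    To illustrate the first two tips, a small conversion sketch (not from the deck; file names are placeholders and a recent pyarrow is assumed) that turns newline-delimited JSON events into a single Snappy-compressed Parquet file:

    import json

    import pyarrow as pa
    import pyarrow.parquet as pq

    # read newline-delimited JSON events...
    with open("events.json") as f:
        rows = [json.loads(line) for line in f]

    # ...and rewrite them as one Snappy-compressed, columnar Parquet file
    table = pa.Table.from_pylist(rows)
    pq.write_table(table, "events.parquet", compression="snappy")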
30. The dilemma: volume of data vs. number and size of files (buffering) vs. time to index.
    Given a certain data volume, you want as few files as possible, with files as large as possible, appearing in S3 as soon as possible. It's really hard. You have to give up something.
31. Possible solutions?
    • Don't give up anything! Have two separate pipelines, one with long retention (bigger files) and another one with short retention (smaller files, fast time to index). Cons? Double the size.
    • Give up on the number and size of files. But! Periodically merge small files in the background (see the merge sketch below). Cons? Lots of moving parts and slower queries against fresh data.
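    For the second option, the background merge job could be as simple as this sketch (an assumption, not from the deck: the small files share one schema and have already been copied locally from a single S3 partition):

    import glob

    import pyarrow as pa
    import pyarrow.parquet as pq

    # collect the small files of one partition and rewrite them as a single larger file,
    # which is cheaper and faster for Athena to scan
    small_files = sorted(glob.glob("date=2017-06-20/*.parquet"))
    merged = pa.concat_tables([pq.read_table(path) for path in small_files])
    pq.write_table(merged, "date=2017-06-20/merged.parquet", compression="snappy")
    # after uploading merged.parquet back to S3, the original small objects can be removed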
32. Demo. Amazon Product Reviews: http://jmcauley.ucsd.edu/data/amazon/
33. Summary
    • AWS Athena is great, right?!
    • Think about the file structure, formats, compression, etc.
    • Streaming data to S3 is probably the hardest task
    • Don't forget to optimize: use partitioning, look at Presto SQL optimization tricks, etc.
    • Good performance means low cost
34. Questions? @sap1ens, https://www.demonware.net/careers
