Querying Data Pipeline with AWS Athena

Yaroslav Tkachenko, Principal Software Engineer at Goldsky
Yaroslav Tkachenko, Senior Software Engineer
Our main titles
Game clients
“Pipes”
Data Pipeline
Game servers
(events, metrics, telemetry, etc.)
Archiving
Data warehouse
Analytics
External services
The Problem
Now
Stream Processing / Data Warehouse
● Not an option - want to query historical data
● Need raw data
● Don’t want to support complex infrastructure
● Retention is usually short
Amazon Athena is an interactive query
service that makes it easy to analyze
data in Amazon S3 using standard SQL.
Athena is serverless, so there is no
infrastructure to manage, and you pay
only for the queries that you run.
AWS Athena
The Solution
• AWS S3 - persisting streaming data
• Schema definition - Apache Hive [1] DDL for describing schemas
• Query language - Presto [2] (ANSI-compatible) SQL for querying
[1] Apache Hive - distributed data warehouse software
[2] Presto - distributed SQL query engine
Building blocks
AWS Athena
AWS S3
• JSON
• CSV
• Avro
• Parquet
• ORC
• Apache Web Server logs
• Logstash Grok
• CloudTrail
Supported formats
AWS S3
• Snappy
• Zlib
• GZIP
• LZO
Supported compression
AWS S3
S3?
Non-AWS pipelines:
• Kafka -> Kafka Connect, Secor
• ActiveMQ, RabbitMQ, etc. -> Camel
• ??? -> Custom Consumer
Streaming data to S3
AWS S3
AWS pipelines:
• Lambda <- SQS
• Kinesis Stream -> Lambda -> Kinesis Firehose
Kinesis Pipeline
AWS S3
1. Kinesis Stream as an input
2. Lambda to forward to Firehose and transform (optional)
3. Kinesis Firehose as a buffer (by size or time), with compression and another optional transformation (using Lambda)
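Step 2 of the pipeline above can be sketched roughly as follows. This is a minimal illustration, not the pipeline from the talk: the function names, the `transform` step, and the delivery stream name `events-to-s3` are all hypothetical. It assumes the standard Lambda event shape for Kinesis (base64-encoded `data` under each record) and uses the boto3 Firehose `put_record_batch` call, which accepts up to 500 records per request.

```python
import base64
import json

FIREHOSE_BATCH_LIMIT = 500  # PutRecordBatch accepts at most 500 records per call


def transform(raw: bytes) -> bytes:
    """Optional transformation: re-serialize as one compact JSON object per line."""
    event = json.loads(raw)
    return (json.dumps(event, separators=(",", ":")) + "\n").encode("utf-8")


def forward_records(event: dict, firehose, stream_name: str) -> int:
    """Decode Kinesis records from a Lambda event and forward them to Firehose."""
    payloads = [
        transform(base64.b64decode(record["kinesis"]["data"]))
        for record in event["Records"]
    ]
    # Send in chunks so that no single request exceeds the batch limit.
    for start in range(0, len(payloads), FIREHOSE_BATCH_LIMIT):
        batch = payloads[start:start + FIREHOSE_BATCH_LIMIT]
        firehose.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=[{"Data": data} for data in batch],
        )
    return len(payloads)


def handler(event, context):
    """Lambda entry point; boto3 is imported lazily so the module loads without it."""
    import boto3
    # "events-to-s3" is a hypothetical delivery stream name.
    return forward_records(event, boto3.client("firehose"), "events-to-s3")
```

A production version would also inspect `FailedPutCount` in each response and retry the rejected records; that is left out here to keep the sketch short.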
Schema definition
Apache Hive Data Definition Language (DDL) is used for describing tables and databases:
Schema definition
ALTER DATABASE SET DBPROPERTIES
ALTER TABLE ADD PARTITION
ALTER TABLE DROP PARTITION
ALTER TABLE RENAME PARTITION
ALTER TABLE SET LOCATION
ALTER TABLE SET TBLPROPERTIES
CREATE DATABASE
CREATE TABLE
DESCRIBE TABLE
DROP DATABASE
DROP TABLE
MSCK REPAIR TABLE
SHOW COLUMNS
SHOW CREATE TABLE
SHOW DATABASES
SHOW PARTITIONS
SHOW TABLES
SHOW TBLPROPERTIES
VALUES
CREATE EXTERNAL TABLE table_name (
  id STRING,
  data STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION 's3://bucket-name/'
TBLPROPERTIES ("parquet.compress"="SNAPPY")
Schema definition
TINYINT, SMALLINT, INT, BIGINT
BOOLEAN
DOUBLE, DECIMAL
STRING
BINARY
TIMESTAMP
DATE (not supported for Parquet)
VARCHAR
ARRAY, MAP, STRUCT
Schema definition
{
  "metadata": {
    "client_id": "21353253123",
    "timestamp": 1497996200,
    "category_id": "1"
  },
  "payload": "g3ng0943g93490gn3094"
}
Schema definition
CREATE EXTERNAL TABLE events (
  metadata struct<client_id:string,
                  timestamp:timestamp,
                  category_id:string
                 >,
  payload string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://events/'
Schema definition
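One practical detail when writing JSON for a table like this: the JSON SerDe reads one record per line, so objects should be stored compact rather than pretty-printed. A minimal sketch of producing such an object body, reusing the sample event from the slides (the helper names are hypothetical, and GZIP is just one of the supported compression options):

```python
import gzip
import json


def to_json_lines(events) -> bytes:
    """Serialize events as newline-delimited JSON: one compact object per line."""
    lines = [json.dumps(e, separators=(",", ":"), sort_keys=True) for e in events]
    return ("\n".join(lines) + "\n").encode("utf-8")


def to_gzip_object(events) -> bytes:
    """Compress the JSON lines with GZIP before uploading to S3."""
    return gzip.compress(to_json_lines(events))


sample = {
    "metadata": {"client_id": "21353253123", "timestamp": 1497996200, "category_id": "1"},
    "payload": "g3ng0943g93490gn3094",
}
body = to_gzip_object([sample, sample])  # bytes ready to be stored as one S3 object
```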
Query language
Presto SQL is used for querying data:
Query language
SELECT [ ALL | DISTINCT ] select_expression [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition ]
[ UNION [ ALL | DISTINCT ] union_query ]
[ ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ]
[ LIMIT [ count | ALL ] ]
SELECT data FROM events WHERE headers.user_id = 123 ORDER BY headers.timestamp LIMIT 10;
SELECT os, COUNT(*) count FROM cloudfront_logs
WHERE date BETWEEN date '2014-07-05' AND date '2014-08-05' GROUP BY os;
SELECT customer.c_name, lineitem.l_quantity, orders.o_totalprice
FROM lineitem, orders, customer
WHERE lineitem.l_orderkey = orders.o_orderkey AND customer.c_custkey = orders.o_custkey;
Schema definition
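Queries like the ones above can also be submitted programmatically. The following is a rough sketch using the boto3 Athena client calls `start_query_execution`, `get_query_execution`, and `get_query_results`; the `run_query` helper itself, the database name, and the results bucket are assumptions for illustration. It returns only the first page of results.

```python
import time


def run_query(athena, sql: str, database: str, output_location: str,
              poll_seconds: float = 1.0) -> list:
    """Submit a query, poll until it finishes, and return the first page of rows."""
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll for a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"query {query_id} finished in state {state}")

    result = athena.get_query_results(QueryExecutionId=query_id)
    # Each row is a list of cells; for SELECT queries the first row holds column names.
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
```

With boto3 installed and credentials configured, a call might look like `run_query(boto3.client("athena"), "SELECT ...", "default", "s3://my-athena-results/")`, where the database and bucket names are placeholders.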
Best practices
Good performance → Low cost!
Why?
Best practices
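The link between performance and cost follows from how Athena bills: by the amount of data a query scans. Anything that lets a query read fewer bytes (partitioning, columnar formats, compression) makes it both faster and cheaper. A back-of-the-envelope sketch, where the per-terabyte rate is an assumed placeholder rather than a quoted price, and any per-query minimum charge is ignored:

```python
TB = 1024 ** 4


def query_cost(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Estimate query cost from bytes scanned (the rate is an assumed placeholder)."""
    return bytes_scanned / TB * usd_per_tb


full_scan = query_cost(2 * TB)        # scanning every partition of a 2 TB table
one_day = query_cost(2 * TB // 365)   # scanning a single daily partition of it
```

Under these assumptions the single-partition query costs a small fraction of the full scan, which is the motivation for the partitioning advice that follows.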
• Find a good partitioning field, such as a date, version, or user
• Update Athena with the partitioning schema (use PARTITIONED BY in the DDL) and metadata
• You can create partitions manually or let Athena handle them (but that requires a certain structure)
• But there is no magic! You have to use partitioning fields in queries (like regular fields), otherwise no partitioning is applied
Partitioning
Best practices
CREATE EXTERNAL TABLE events …
PARTITIONED BY (year string, month string, day string)
1) SELECT data FROM events WHERE event_id = '98632765';
2) SELECT data FROM events WHERE event_id = '98632765' AND year = '2017' AND month = '06' AND day = '21';
Partitioning
Best practices
• s3://events/2017/06/20/1.parquet
• s3://events/2017/06/20/2.parquet
• s3://events/2017/06/20/3.parquet
• s3://events/2017/06/21/1.parquet
• s3://events/2017/06/21/2.parquet
• s3://events/2017/06/21/3.parquet
• s3://events/2017/06/22/1.parquet
• …
Manual partitioning:
• ALTER TABLE events ADD PARTITION (date='2017-06-20') LOCATION 's3://events/2017/06/20/'
• ...
Partitioning
Best practices
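With the manual layout, every new day needs its own ALTER TABLE statement, so in practice these are usually generated. A minimal sketch that builds the statements for a date range, following the path layout and partition column shown above (the helper name and its parameters are hypothetical):

```python
from datetime import date, timedelta


def add_partition_statements(table: str, bucket: str, start: date, end: date) -> list:
    """Build one ALTER TABLE ... ADD PARTITION statement per day in [start, end]."""
    statements = []
    day = start
    while day <= end:
        # Matches the s3://bucket/YYYY/MM/DD/ layout used in the manual example.
        location = f"s3://{bucket}/{day:%Y/%m/%d}/"
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION (date='{day.isoformat()}') "
            f"LOCATION '{location}'"
        )
        day += timedelta(days=1)
    return statements


statements = add_partition_statements("events", "events", date(2017, 6, 20), date(2017, 6, 22))
```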
• s3://events/date=2017-06-20/1.parquet
• s3://events/date=2017-06-20/2.parquet
• s3://events/date=2017-06-20/3.parquet
• s3://events/date=2017-06-21/1.parquet
• s3://events/date=2017-06-21/2.parquet
• s3://events/date=2017-06-21/3.parquet
• s3://events/date=2017-06-22/1.parquet
• …
Automatic partitioning:
• MSCK REPAIR TABLE events
Partitioning
Best practices
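For automatic partitioning to work, the writer has to place each file under a key=value prefix. A small sketch of deriving such a key from an event timestamp, assuming epoch seconds interpreted as UTC (the helper name is hypothetical):

```python
from datetime import datetime, timezone


def hive_partition_key(epoch_seconds: int, file_name: str) -> str:
    """Build an S3 key under a date=YYYY-MM-DD prefix from an event timestamp (UTC)."""
    day = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).date()
    return f"date={day.isoformat()}/{file_name}"


# The sample event timestamp from the schema slides falls on 2017-06-20 (UTC).
key = hive_partition_key(1497996200, "1.parquet")
```

Choosing UTC here is a design decision: whichever time zone is used, it has to match what queries assume when they filter on the partition column.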
• Use binary formats like Parquet!
• Don’t forget about compression
• Only include the columns that you need
• LIMIT is amazing!
• For more SQL optimizations look at Presto best practices
• Avoid a lot of small files:
Performance tips
Best practices
Volume of data
The dilemma
Number and size of files (buffering)
Time to index
For a given data volume, you want as few files as possible, each as large as possible, appearing in S3 as soon as possible. That is really hard: you have to give up something.
Possible solutions?
• Don’t give up anything! Have two separate pipelines, one with long retention (bigger files) and another one with short retention (smaller files, fast time to index). Cons? Double the storage size.
• Give up on number of files and size. But! Periodically merge small files in the background. Cons? Lots of moving parts and slower queries against fresh data.
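The planning half of the background merge can be sketched as below. It only decides which small files to combine; actually reading and rewriting the files (for example, Parquet) is out of scope. The function name and the greedy strategy are assumptions for illustration, not the approach used in the talk.

```python
def plan_merges(files, target_bytes):
    """Greedily group (name, size) pairs into batches of roughly target_bytes each."""
    batches, current, current_size = [], [], 0
    for name, size in sorted(files):
        # Start a new batch once adding this file would exceed the target size.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


small_files = [("a", 40), ("b", 40), ("c", 40), ("d", 200)]
batches = plan_merges(small_files, target_bytes=100)
```

A file that is already larger than the target ends up alone in its batch, so a merge job can skip single-file batches entirely.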
Demo
Amazon Product Reviews
http://jmcauley.ucsd.edu/data/amazon/
• AWS Athena is great, right?!
• Think about the file structure, formats, compression, etc.
• Streaming data to S3 is probably the hardest task
• Don’t forget to optimize - use partitioning, look at Presto SQL optimization tricks, etc.
• Good performance means low cost
Summary
Questions?
@sap1ens
https://www.demonware.net/careers
Querying Data Pipeline with AWS Athena

  • 1. Querying Data Pipeline with AWS Athena Yaroslav Tkachenko, Senior Software Engineer
  • 4. Game clients “Pipes” Data Pipeline Game servers (events, metrics, telemetry, etc.) Archiving Data warehouse Analytics External services
  • 6. The Problem Now Stream Processing Data Warehouse ● Not an option - want to query historical data ● Need raw data ● Don’t want to support complex infrastructure ● Retention is usually short
  • 7. Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. AWS Athena AWS Athena
  • 9. • AWS S3 - persisting streaming data • Schema definition - Apache Hive [1] DDL for describing schemas • Query language - Presto [2] (ANSI-compatible) SQL for querying [1] Apache Hive - distributed data warehouse software [2] Presto - distributed SQL query engine Building blocks AWS Athena
  • 11. • JSON • CSV • Avro • Parquet • ORC • Apache Web Server logs • Logstash Grok • CloudTrail Supported formats AWS S3 • Snappy • Zlib • GZIP • LZO Supported compression
  • 13. Non-AWS pipelines: • Kafka -> Kafka Connect, Secor • ActiveMQ, RabbitMQ, etc. -> Camel • ??? -> Custom Consumer Streaming data to S3 AWS S3 AWS pipelines: • Lambda <- SQS • Kinesis Stream -> Lambda -> Kinesis Firehose
  • 14. Kinesis Pipeline AWS S3 1. Kinesis Stream as an input 2. Lambda to forward to Firehose and transform (optional) 3. Kinesis Firehose as a buffer (size or time), compression and another transformation (optional, using Lambda)
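Step 2 of the pipeline above could look roughly like the following Lambda handler. This is a minimal sketch, not the code from the talk: the delivery stream name `events-to-s3`, the `transform` hook, and the injected `firehose` client are assumptions made for illustration. With boto3 the client would be `boto3.client("firehose")`, whose `put_record_batch` call accepts at most 500 records per request.

```python
import base64
import json

DELIVERY_STREAM = "events-to-s3"  # hypothetical Firehose delivery stream name
MAX_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def transform(event: dict) -> dict:
    """Optional transformation hook; this sketch passes the event through."""
    return event


def to_firehose_records(kinesis_event: dict) -> list:
    """Decode Kinesis records and re-encode each one as a single JSON line."""
    records = []
    for record in kinesis_event["Records"]:
        raw = base64.b64decode(record["kinesis"]["data"])
        event = transform(json.loads(raw))
        # One JSON object per line, so a JSON SerDe can tell records apart.
        records.append({"Data": (json.dumps(event) + "\n").encode("utf-8")})
    return records


def handler(kinesis_event: dict, context, firehose) -> int:
    """Forward a Kinesis batch to Firehose and return the number of records sent."""
    records = to_firehose_records(kinesis_event)
    for start in range(0, len(records), MAX_BATCH):
        firehose.put_record_batch(
            DeliveryStreamName=DELIVERY_STREAM,
            Records=records[start:start + MAX_BATCH],
        )
    return len(records)
```

A deployed Lambda handler takes only `(event, context)` and would create the client once at module level; it is passed in here so the sketch can run without AWS credentials. Retrying records reported through `FailedPutCount` in the response is left out.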
  • 16. Apache Hive Data Definition Language (DDL) is used for describing tables and databases: Schema definition ALTER DATABASE SET DBPROPERTIES ALTER TABLE ADD PARTITION ALTER TABLE DROP PARTITION ALTER TABLE RENAME PARTITION ALTER TABLE SET LOCATION ALTER TABLE SET TBLPROPERTIES CREATE DATABASE CREATE TABLE DESCRIBE TABLE DROP DATABASE DROP TABLE MSCK REPAIR TABLE SHOW COLUMNS SHOW CREATE TABLE SHOW DATABASES SHOW PARTITIONS SHOW TABLES SHOW TBLPROPERTIES VALUES
  • 17. CREATE EXTERNAL TABLE table_name ( id STRING, data STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' LOCATION 's3://bucket-name/' tblproperties ("parquet.compress"="SNAPPY") Schema definition
  • 18. TINYINT, SMALLINT, INT, BIGINT BOOLEAN DOUBLE, DECIMAL STRING BINARY TIMESTAMP DATE (not supported for Parquet) VARCHAR ARRAY, MAP, STRUCT Schema definition
  • 19. { "metadata": { "client_id": "21353253123", "timestamp": 1497996200, "category_id": "1" }, "payload": "g3ng0943g93490gn3094" } Schema definition
  • 20. CREATE EXTERNAL TABLE events ( metadata struct<client_id:string, timestamp:timestamp, category_id:string >, payload string ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION 's3://events/' Schema definition
  • 22. Presto SQL is used for querying data: Query language SELECT [ ALL | DISTINCT ] select_expression [, ...] [ FROM from_item [, ...] ] [ WHERE condition ] [ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ] [ HAVING condition ] [ UNION [ ALL | DISTINCT ] union_query ] [ ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ] [ LIMIT [ count | ALL ] ]
  • 23. SELECT data FROM events WHERE headers.user_id = 123 ORDER BY headers.timestamp LIMIT 10; SELECT os, COUNT(*) count FROM cloudfront_logs WHERE date BETWEEN date '2014-07-05' AND date '2014-08-05' GROUP BY os; SELECT customer.c_name, lineitem.l_quantity, orders.o_totalprice FROM lineitem, orders, customer WHERE lineitem.l_orderkey = orders.o_orderkey AND customer.c_custkey = orders.o_custkey; Schema definition
  • 25. Good performance → Low cost! Why? Best practices
  • 26. • Find a good partitioning field, such as a date, version, or user • Update Athena with the partitioning schema (use PARTITIONED BY in DDL) and metadata • You can create partitions manually or let Athena handle them (but that requires a certain structure) • But there is no magic! You have to use partitioning fields in queries (like regular fields), otherwise no partitioning is applied Partitioning Best practices
  • 27. CREATE EXTERNAL TABLE events … PARTITIONED BY (year string, month string, day string) 1) SELECT data FROM events WHERE event_id = '98632765'; 2) SELECT data FROM events WHERE event_id = '98632765' AND year = '2017' AND month = '06' AND day = '21'; Partitioning Best practices
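The difference between the two queries comes down to how many objects have to be read. The sketch below imitates that with a plain dictionary of object keys and sizes. The keys, the sizes, and the `bytes_scanned` helper are made up for illustration; this is a toy model of partition pruning, not a description of how Athena works internally.

```python
# Hypothetical inventory: object key -> size in bytes, laid out as year/month/day.
OBJECTS = {
    "2017/06/20/1.parquet": 120_000_000,
    "2017/06/20/2.parquet": 110_000_000,
    "2017/06/21/1.parquet": 130_000_000,
    "2017/06/21/2.parquet": 90_000_000,
    "2017/06/22/1.parquet": 100_000_000,
}


def bytes_scanned(objects: dict, year=None, month=None, day=None) -> int:
    """Sum the sizes of objects whose partition matches the given filters.

    A filter left as None matches every value, which mirrors a query that
    does not mention that partition column at all.
    """
    total = 0
    for key, size in objects.items():
        y, m, d, _ = key.split("/")
        if year not in (None, y) or month not in (None, m) or day not in (None, d):
            continue
        total += size
    return total


full_scan = bytes_scanned(OBJECTS)                   # query 1: no partition columns
pruned = bytes_scanned(OBJECTS, "2017", "06", "21")  # query 2: a single day
print(full_scan, pruned)
```

With this made-up inventory, the filtered query touches two of the five objects instead of all of them.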
  • 28. • s3://events/2017/06/20/1.parquet • s3://events/2017/06/20/2.parquet • s3://events/2017/06/20/3.parquet • s3://events/2017/06/21/1.parquet • s3://events/2017/06/21/2.parquet • s3://events/2017/06/21/3.parquet • s3://events/2017/06/22/1.parquet • … Manual partitioning: • ALTER TABLE events ADD PARTITION (date='2017-06-20') location 's3://events/2017/06/20/' • ... Partitioning Best practices
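Writing one ALTER TABLE statement per day by hand gets tedious, so the statements are usually generated. A minimal sketch, assuming the `YYYY/MM/DD` layout and the `date` partition column shown above; the `add_partition_statements` helper is hypothetical, and actually submitting the statements to Athena is left out.

```python
from datetime import date, timedelta


def add_partition_statements(table: str, bucket: str, start: date, end: date) -> list:
    """Build one ALTER TABLE ... ADD PARTITION statement per day, end inclusive."""
    statements = []
    day = start
    while day <= end:
        # The partition value is an ISO date; the location uses the slash layout.
        location = f"s3://{bucket}/{day:%Y/%m/%d}/"
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION (date='{day.isoformat()}') "
            f"location '{location}'"
        )
        day += timedelta(days=1)
    return statements


for statement in add_partition_statements(
    "events", "events", date(2017, 6, 20), date(2017, 6, 22)
):
    print(statement)
```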
  • 29. • s3://events/date=2017-06-20/1.parquet • s3://events/date=2017-06-20/2.parquet • s3://events/date=2017-06-20/3.parquet • s3://events/date=2017-06-21/1.parquet • s3://events/date=2017-06-21/2.parquet • s3://events/date=2017-06-21/3.parquet • s3://events/date=2017-06-22/1.parquet • … Automatic partitioning: • MSCK REPAIR TABLE events Partitioning Best practices
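For MSCK REPAIR TABLE to find partitions, whatever writes the files has to produce the `key=value` prefix shown above. A minimal sketch of deriving that prefix from an event timestamp; the `hive_style_key` helper and the choice of UTC for the day boundary are assumptions.

```python
from datetime import datetime, timezone


def hive_style_key(bucket: str, epoch_seconds: int, file_name: str) -> str:
    """Return an S3 URI whose prefix encodes the partition as date=YYYY-MM-DD."""
    day = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).date()
    return f"s3://{bucket}/date={day.isoformat()}/{file_name}"


# 1497996200 is the timestamp from the sample event; it falls on 2017-06-20 (UTC).
print(hive_style_key("events", 1497996200, "1.parquet"))
```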
  • 30. • Use binary formats like Parquet! • Don’t forget about compression • Only include the columns that you need • LIMIT is amazing! • For more SQL optimizations look at Presto best practices • Avoid a lot of small files: Performance tips Best practices
  • 31. Volume of data The dilemma Number and size of files (buffering) Time to index Given a certain data volume, you want as few files as possible, with file sizes as large as possible, appearing in S3 as soon as possible. It’s really hard. You have to give up something.
  • 32. Possible solutions? • Don’t give up anything! Have two separate pipelines, one with long retention (bigger files) and another one with short retention (smaller files, fast time to index). Cons? Double the storage size. • Give up on number of files and size. But! Periodically merge small files in the background. Cons? Lots of moving parts and slower queries against fresh data.
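The second option, merging small files in the background, needs some rule for deciding which files go into which merged file. One possible rule is a greedy grouping up to a target size, sketched below. The `plan_merges` helper, the file names, and the sizes are hypothetical, and the actual read, concatenate, and rewrite step is not shown.

```python
def plan_merges(files: list, target_bytes: int) -> list:
    """Greedily group (key, size) pairs into batches of roughly target_bytes.

    Files are taken in the given order; a batch is closed as soon as adding
    the next file would push it past the target. A file that is already at
    or above the target ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for key, size in files:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical small files produced by a short buffering interval.
small_files = [("a.parquet", 40), ("b.parquet", 50), ("c.parquet", 30), ("d.parquet", 200)]
plan = plan_merges(small_files, target_bytes=100)
print(plan)
```

Batches containing a single file can be skipped by the merge job, since rewriting them gains nothing.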
  • 34. • AWS Athena is great, right?! • Think about the file structure, formats, compression, etc. • Streaming data to S3 is probably the hardest task • Don’t forget to optimize - use partitioning, look at Presto SQL optimization tricks, etc. • Good performance means low cost Summary