Amazon Athena Workshop
26 January 2017
Agenda
Wi-Fi: DaHouseGuest, Pass: JustDoit!
Feedback Form: goo.gl/T9BZvy
Labs: github.com/doitintl/athena-workshop
Q & A
Breaks: 11:30 | 13:00 - 13:45 | 15:00
Facilities & Organization
DoIT International confidential │ Do not distribute
About us..
Vadim Solovey
CTO
Shahar Frank
Software Engineering Lead
Workshop Agenda
● Module 1
○ Introduction to AWS Athena
○ Demo
● Module 2
○ Interacting with AWS Athena
○ Lab 2
● Module 3
○ Supported Formats and SerDes
○ Lab 3
● Module 4
○ Partitioning Data
○ Lab 4
● Module 5
○ Converting to columnar formats
○ Lab 5
● Module 6
○ Athena Security
● Module 7
○ Service Limits
● Module 8
○ Comparison to Google BigQuery
○ Demo
[1] AWS Athena
[1] Introduction
Understanding Purpose & Use-Cases
[1] Challenges
Organizations struggle to analyze data without heavy investment and long deployment times
● Significant amount of effort required to analyze data on S3
● Users often have access to only aggregated data sets
● Managing Hadoop or data warehouse requires expertise
[1] Introducing AWS Athena
Athena is an interactive query service that makes it easy to
analyze data directly from AWS S3 using Standard SQL
[1] AWS Athena Overview
Easy to use
1. Log in to the console
2. Create a table (either by following a wizard or by typing a Hive DDL statement)
3. Start querying
[1] AWS Athena is Highly Available
High Availability Features
● You connect to a service endpoint or log into a console
● Athena uses warm compute pools across multiple availability zones
● Your data is in Amazon S3 which has 99.999999999% durability
[1] Querying Data Directly from Amazon S3
Direct access to your data without hassles
● No loading of data
● No ETL required
● No additional storage required
● Query of data in raw format
[1] Use ANSI SQL
Use of skills you probably already have
● Start by writing standard ANSI SQL
● Support for complex joins, nested queries & window functions
● Support for complex data types (arrays, structs)
● Support for partitioning of data by any key:
○ e.g. date, time, custom keys
○ Or customer-year-month-day-hour
[1] AWS Athena Overview
Amazon Athena is a serverless way to query your data that lives on S3 using SQL
Features:
● Serverless, with zero spin-up time and transparent upgrades
● Data can be stored in CSV, JSON, ORC, Parquet and even Apache web logs format
○ AVRO (coming soon)
● Compression is supported out of the box
● Queries cost $5 per terabyte of data scanned with a 10 MB minimum per query
Additional Information:
● Not a general purpose database
● Usually used by Data Analysts to run interactive queries over large datasets
● Currently available in us-east-1 (N. Virginia) and us-west-2 (Oregon)
[1] Underlying Technologies
Presto (originating from Facebook)
● Used for SQL queries
● In-memory distributed query engine, ANSI SQL compatible with extensions
Hive (originating from Hadoop project)
● Used for DDL functionality
● Complex data types
● Multitude of formats
● Supports data partitioning
[1] Presto vs. Hive Architecture
[1] Use Cases
Athena complements Amazon Redshift and Amazon EMR
AWS Athena
[2] Interacting with AWS Athena
Develop, Execute and Visualize Queries
[2] Interacting with AWS Athena
Amazon Athena is a serverless way to query your data that lives on S3 using SQL
Web User Interface:
● Run queries and examine results
● Manage databases and tables
● Save queries and share across organization for re-use
● Query History
JDBC Driver:
● Programmatic way to access AWS Athena
○ SQL Workbench, JetBrains DataGrip, sqlline
○ Your own app
AWS QuickSight:
● Visualize Athena data with charts, pivots and dashboards.
Hands On
Lab 2
Interacting with AWS Athena
Data Formats
[3] Supported Formats and SerDes
Efficient Data Storage
[3] Data and Compression Formats
The data formats presently supported are
● CSV
● TSV
● Parquet (Snappy is default compression)
● ORC (Zlib is default compression)
● JSON
● Apache Web Server logs (RegexSerDe)
● Custom Delimiters
Compression Formats
● Currently, Snappy, Zlib, and GZIP are the supported compression formats.
● LZO is not supported as of today
[3] CSV Example
CREATE EXTERNAL TABLE `mydb.yellow_trips`(
`vendor_id` string,
`pickup_datetime` timestamp,
`dropoff_datetime` timestamp,
`pickup_longitude` float,
`pickup_latitude` float,
`dropoff_longitude` float,
`dropoff_latitude` float,
`................` .....)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://nyc-yellow-trips/csv/'
[3] Parquet Example
CREATE EXTERNAL TABLE `mydb.yellow_trips`(
`vendor_id` string,
`pickup_datetime` timestamp,
`dropoff_datetime` timestamp,
`pickup_longitude` float,
`pickup_latitude` float,
`dropoff_longitude` float,
`dropoff_latitude` float,
`................` .....)
STORED AS PARQUET
LOCATION 's3://nyc-yellow-trips/parquet/'
tblproperties ("parquet.compress"="SNAPPY");
[3] ORC Example
CREATE EXTERNAL TABLE `mydb.yellow_trips`(
`vendor_id` string,
`pickup_datetime` timestamp,
`dropoff_datetime` timestamp,
`pickup_longitude` float,
`pickup_latitude` float,
`dropoff_longitude` float,
`dropoff_latitude` float,
`................` .....)
STORED AS ORC
LOCATION 's3://nyc-yellow-trips/orc/'
tblproperties ("orc.compress"="ZLIB");
[3] RegEx Serde (Apache Log Example)
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
Date DATE, Time STRING, Location STRING,
Bytes INT, RequestIP STRING, Method STRING,
Host STRING, Uri STRING, Status INT, Referrer STRING,
os STRING, Browser STRING, BrowserVersion STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^(]+[(]([^;]+).*%20([^/]+)[/](.*)$")
LOCATION 's3://athena-examples/cloudfront/plaintext/';
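Before creating a RegexSerDe table it can help to try the pattern locally. The sketch below is one way to do that in Python, with the pattern written as a raw string that uses `\s+` between fields. The sample log line, the `parse_line` helper and the lower-case column names are made up for illustration and are not taken from the workshop labs.

```python
import re

# The same pattern as the "input.regex" SerDe property, as a Python raw string.
CLOUDFRONT_PATTERN = re.compile(
    r"^(?!#)([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+"
    r"([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+"
    r"[^(]+[(]([^;]+).*%20([^/]+)[/](.*)$"
)

# One name per capture group, in the order of the table columns.
COLUMNS = [
    "date", "time", "location", "bytes", "request_ip", "method", "host",
    "uri", "status", "referrer", "os", "browser", "browser_version",
]


def parse_line(line):
    """Return a column -> value dict, or None when the line does not match."""
    match = CLOUDFRONT_PATTERN.match(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


# Hypothetical sample line, invented for this example only.
SAMPLE = (
    "2014-07-05 20:00:00 LHR3 4260 10.0.0.15 GET example.cloudfront.net "
    "/test-image-1.jpeg 200 - "
    "Mozilla/5.0%20(Android;%20U;%20Linux)%20Gecko/2009040821%20IE/3.0.9"
)

row = parse_line(SAMPLE)
print(row["os"], row["browser"], row["browser_version"])

# Lines starting with '#' are rejected by the (?!#) lookahead.
print(parse_line("#Version: 1.0"))
```

A line that does not match yields `None` here; checking a handful of real log lines this way is a cheap sanity test before pointing a table at the bucket.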
[3] Comparing Formats
PARQUET
● Columnar format
● Schema segregation into footer
● Column major format
● All data is pushed to the leaf
● Integrated compression and indexes
● Support for predicate pushdown
ORC
● Apache Top Level Project
● Schema segregation into footer
● Column major format with stripes
● Integrated compression, indexes and stats
● Support for predicate pushdown
[3] Converting to Parquet or ORC format
● You can use Hive CTAS to convert data:
CREATE TABLE new_key_value_store
STORED AS PARQUET
AS SELECT c1, c2, c3, .., cN FROM noncolumnartable
SORT BY key
● You can also use Spark to convert the files to Parquet or ORC
● 20 lines of PySpark code running on EMR [1]
○ Converts 1TB of text data into 130GB of Parquet with Snappy compression
○ Approx. cost is $5
[1] https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
[3] Pay By the Query ($5 per TB scanned)
● You are paying by the amount of scanned data
● Means to save on cost
○ Compress
○ Convert to columnar format
○ Use partitioning
● Free: DDL queries, failed queries
Dataset | Size on S3 | Query Runtime | Data Scanned | Cost
Logs stored as CSV | 1TB | 237s | 1.15TB | $5.75
Logs stored as PARQUET | 130GB | 5.13s | 2.69GB | $0.013
Savings | 87% less | 34x faster | 99% less | 99.7% cheaper
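The cost column follows directly from the amount of data scanned. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^12 bytes, 10 MB = 10^7 bytes) and the $5 per TB price with the 10 MB per-query minimum quoted earlier; the function and constant names are illustrative:

```python
PRICE_PER_TB_USD = 5.0             # price quoted in the slides
MIN_BYTES_PER_QUERY = 10 * 10**6   # 10 MB minimum per query (decimal MB assumed)
BYTES_PER_TB = 10**12              # decimal terabyte assumed


def query_cost_usd(bytes_scanned):
    """Estimated cost of one successful query, given the bytes it scanned."""
    billable = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable / BYTES_PER_TB * PRICE_PER_TB_USD


csv_cost = query_cost_usd(1.15 * 10**12)     # CSV row: 1.15TB scanned
parquet_cost = query_cost_usd(2.69 * 10**9)  # Parquet row: 2.69GB scanned
tiny_cost = query_cost_usd(1024)             # billed at the 10 MB minimum

print(f"CSV:     ${csv_cost:.2f}")
print(f"Parquet: ${parquet_cost:.3f}")
print(f"Tiny:    ${tiny_cost:.5f}")
```

DDL statements and failed queries are free according to the slide, so they are not modeled here.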
Hands On
Lab 3
Formats & SerDes
AWS Athena
[4] Partitioning Data
To improve performance and reduce cost
[4] Partitioning Data
By partitioning your data, you can restrict the amount of data scanned by each query, thus improving
performance and reducing cost
Benefits of Data Partitioning:
● Partitions limit the scope of data being scanned during the query
● Improves performance
● Reduces query cost
● You can partition your data by any key
Common Practice:
● Based on time, often leading with a multi-level partitioning scheme
○ YEAR -> MONTH -> DAY -> HOUR
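A time-based scheme like this is usually reflected in the S3 key layout. The sketch below builds a Hive-style `key=value` prefix for a timestamp; the bucket name, the `partition_prefix` helper and the partition key names (`year`, `month`, `day`, `hour`) are assumptions chosen for illustration.

```python
from datetime import datetime


def partition_prefix(base, ts):
    """Build a Hive-style prefix: <base>/year=YYYY/month=MM/day=DD/hour=HH/."""
    return (
        f"{base.rstrip('/')}/"
        f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"
    )


# Hypothetical bucket and timestamp.
prefix = partition_prefix("s3://my-example-bucket/logs", datetime(2017, 1, 26, 11, 30))
print(prefix)
```

Zero-padded values keep the prefixes sorted chronologically, which also keeps string range filters on the partition keys well behaved.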
[4] Data already partitioned and stored on S3
$ aws s3 ls s3://elasticmapreduce/samples/hive-ads/tables/impressions/
PRE dt=2009-04-12-13-00/
PRE dt=2009-04-12-13-05/
PRE dt=2009-04-12-13-10/
PRE dt=2009-04-12-13-15/
PRE dt=2009-04-12-13-20/
PRE dt=2009-04-12-14-00/
PRE dt=2009-04-12-14-05/
CREATE EXTERNAL TABLE impressions (
... ...)
PARTITIONED BY (dt string)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/' ;
-- load partitions into Athena
MSCK REPAIR TABLE impressions;
-- Run sample query
SELECT dt, impressionid FROM impressions WHERE dt < '2009-04-12-14-00' AND dt >= '2009-04-12-13-00';
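To see why the `dt` filter reduces the data scanned, the selection can be modeled in a few lines. This is only an illustration of the idea, not how Athena prunes partitions internally; the partition values are the ones listed under the `impressions/` prefix above, and the helper name is hypothetical.

```python
# Partition values as listed under the impressions/ prefix.
PARTITIONS = [
    "2009-04-12-13-00", "2009-04-12-13-05", "2009-04-12-13-10",
    "2009-04-12-13-15", "2009-04-12-13-20", "2009-04-12-14-00",
    "2009-04-12-14-05",
]


def partitions_to_scan(partitions, lower_inclusive, upper_exclusive):
    """Keep partitions with lower_inclusive <= dt < upper_exclusive.

    dt is a string column, so the comparison is lexicographic. It matches
    chronological order only because the values are zero-padded.
    """
    return [p for p in partitions if lower_inclusive <= p < upper_exclusive]


selected = partitions_to_scan(PARTITIONS, "2009-04-12-13-00", "2009-04-12-14-00")
print(len(selected), "of", len(PARTITIONS), "partitions match the filter")
```

Only the objects under the matching prefixes need to be read, which is where the cost and runtime savings come from.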
[4] Data is not partitioned
aws s3 ls s3://athena-examples/elb/plaintext/ --recursive
2016-11-23 17:54:46 11789573 elb/plaintext/2015/01/01/part-r-00000-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:46 8776899 elb/plaintext/2015/01/01/part-r-00001-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:46 9309800 elb/plaintext/2015/01/01/part-r-00002-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47 9412570 elb/plaintext/2015/01/01/part-r-00003-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47 10725938 elb/plaintext/2015/01/01/part-r-00004-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:46 9439710 elb/plaintext/2015/01/01/part-r-00005-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47 0 elb/plaintext/2015/01/01_$folder$
2016-11-23 17:54:47 9012723 elb/plaintext/2015/01/02/part-r-00006-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47 7571816 elb/plaintext/2015/01/02/part-r-00007-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47 9673393 elb/plaintext/2015/01/02/part-r-00008-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:48 11979218 elb/plaintext/2015/01/02/part-r-00009-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:48 9546833 elb/plaintext/2015/01/02/part-r-00010-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
ALTER TABLE elb_logs_raw_native_part ADD PARTITION (year='2015',month='01',day='01') location 's3://athena-examples/elb/plaintext/2015/01/01/'
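When the layout is `YYYY/MM/DD/` rather than `key=value`, each partition has to be added explicitly, and writing the statements by hand gets tedious. A minimal sketch that derives one `ALTER TABLE ... ADD PARTITION` statement per distinct day prefix from a list of object keys; the helper name is hypothetical and the file names are shortened versions of the listing above.

```python
import re

# Matches keys such as elb/plaintext/2015/01/01/<file>; keys that do not
# follow the YYYY/MM/DD/<file> layout (e.g. "_$folder$" markers) are ignored.
KEY_PATTERN = re.compile(
    r"^(?P<prefix>.*/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/)[^/]+$"
)


def add_partition_statements(table, bucket, keys):
    """Return one ADD PARTITION statement per distinct day prefix."""
    seen = set()
    statements = []
    for key in keys:
        match = KEY_PATTERN.match(key)
        if match is None:
            continue
        prefix = match.group("prefix")
        if prefix in seen:
            continue
        seen.add(prefix)
        year, month, day = match.group("year", "month", "day")
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION "
            f"(year='{year}',month='{month}',day='{day}') "
            f"LOCATION 's3://{bucket}/{prefix}';"
        )
    return statements


# Shortened, illustrative keys based on the listing.
KEYS = [
    "elb/plaintext/2015/01/01/part-r-00000.txt",
    "elb/plaintext/2015/01/01/part-r-00001.txt",
    "elb/plaintext/2015/01/01_$folder$",
    "elb/plaintext/2015/01/02/part-r-00006.txt",
]

statements = add_partition_statements("elb_logs_raw_native_part", "athena-examples", KEYS)
for statement in statements:
    print(statement)
```

The generated statements can then be run one by one in the console or over JDBC.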
[5] AWS Athena
[5] Converting to Columnar Formats
Apache Parquet & ORC
[5] Converting to Columnar Formats (batch data)
Your Amazon Athena query performance improves if you convert your data into open source columnar
formats such as Apache Parquet or ORC.
The process for converting to columnar formats using an EMR cluster is as follows:
● Create an EMR cluster with Hive installed.
● In the step section of the cluster create statement, you can specify a script stored in Amazon S3,
which points to your input data and creates output data in the columnar format in an Amazon S3
location. In this example, the cluster auto-terminates.
[5] Converting to Columnar Formats (streaming data)
Your Amazon Athena query performance improves if you convert your data into open source columnar
formats such as Apache Parquet or ORC.
The process for converting to columnar formats using an EMR cluster is as follows:
● Create an EMR cluster with Spark installed.
● Run a Spark Streaming job that reads the data from a Kinesis stream and writes Parquet files to S3.
AWS Athena
[6] Athena Security
Authorization and Access
[6] Athena Security
Amazon offers three ways to control data access:
● AWS Identity and Access Management policies
● Access Control Lists
● Amazon S3 bucket policies
Users control who can access data on S3. It's possible to fine-tune security to allow different
people to see different sets of data and also to grant access to other users' data.
AWS Athena
[7] Service Limits
Know your limits and mitigate the risk
[7] Service Limits
You can request a limit increase by contacting AWS Support.
● Currently, you can submit only one query at a time, and an account can have at most 5 (five)
concurrent queries.
● Query timeout: 30 minutes
● Number of databases: 100
● Number of tables: 100 per database
● Number of partitions: 20k per table
● You may encounter a limit for Amazon S3 buckets per account, which is 100.
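The partition limit interacts with the partitioning scheme chosen earlier. A quick back-of-the-envelope check, using the 20k-per-table figure from the list above; the function names are illustrative:

```python
MAX_PARTITIONS_PER_TABLE = 20_000  # limit quoted in the slide


def partitions_needed(days, partitions_per_day):
    """Partitions created when every day adds a fixed number of partitions."""
    return days * partitions_per_day


def days_until_limit(partitions_per_day, limit=MAX_PARTITIONS_PER_TABLE):
    """Whole days of data that fit under the per-table partition limit."""
    return limit // partitions_per_day


hourly_one_year = partitions_needed(365, 24)  # YEAR -> MONTH -> DAY -> HOUR
hourly_days = days_until_limit(24)
daily_days = days_until_limit(1)

print(hourly_one_year, hourly_days, daily_days)
```

Under these numbers an hourly scheme uses 8,760 partitions per year and reaches the limit after 833 days, so a coarser key or a limit increase may be needed for long-lived tables.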
[7] Known Limitations
The following are known limitations in Amazon Athena
● User-defined functions (UDF or UDAFs) are not supported.
● Stored procedures are not supported.
● Currently, Athena does not support any transactions found in Hive or Presto. For a full list of
keywords not supported, see Unsupported DDL.
● LZO is not supported. Use Snappy instead.
[7] Avoid Surprises
Use backticks if table or column names begin with an underscore. For example:
CREATE TABLE myUnderScoreTable (
`_id` string,
`_index` string,
...
For the LOCATION clause, use a trailing slash
USE
s3://path_to_bucket/
DO NOT USE
s3://path_to_bucket
s3://path_to_bucket/*
s3://path_to_bucket/mySpecialFile.dat
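These LOCATION rules are easy to check before submitting a DDL statement. The sketch below is a simple string check only; it does not contact S3 and cannot tell whether the prefix actually exists. The function name and messages are made up for illustration.

```python
def location_problems(location):
    """Return a list of problems with a LOCATION value; empty means it looks fine."""
    problems = []
    if not location.startswith("s3://"):
        problems.append("must start with s3://")
    if "*" in location:
        problems.append("wildcards are not allowed")
    if not location.endswith("/"):
        problems.append("must end with a trailing slash")
    return problems


CANDIDATES = [
    "s3://path_to_bucket/",
    "s3://path_to_bucket",
    "s3://path_to_bucket/*",
    "s3://path_to_bucket/mySpecialFile.dat",
]

for candidate in CANDIDATES:
    print(candidate, "->", location_problems(candidate) or "OK")
```

Only the first candidate passes; the file path fails because it does not end with a slash, which is the same reason a bare bucket path fails.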
AWS Athena
[8] Comparing to Google BigQuery
Know your limits and mitigate the risk
Google BigQuery
• Serverless Analytical Columnar Database based on Google Dremel
• Data:
• Native Tables
• External Tables (*SV, JSON, AVRO files stored in Google Cloud Storage bucket)
• Ingestion:
• File Imports
• Streaming API (up to 100K records/sec per table)
• Federated Tables (files in bucket, Bigtable table or Google Spreadsheet)
• ANSI SQL 2011
• Priced at $5/TB of scanned data + storage + streaming (if used)
• Cost Optimization - partitioning, limit queried columns, 24-hour cache, cold data.
Summary
Feature \ Product | AWS Athena | Google BigQuery
Data Formats | *SV, JSON, PARQUET/z, ORC/z | External (*SV, JSON, AVRO) / Native
ANSI SQL Support | Yes* | Yes*
DDL Support | Only CREATE/ALTER/DROP | CREATE/UPDATE/DELETE (w/ quotas)
Underlying Technology | FB Presto | Google Dremel
Caching | No | Yes
Cold Data Pricing | S3 Lifecycle Policy | 50% discount after 90 days of inactivity
User Defined Functions | No | Yes
Data Partitioning | On Any Key | By DAY
Pricing | $5/TB (scanned) plus S3 ops | $5/TB (scanned) less cached data
Test Drive Summary
Query Type | AWS Athena (GB/time) | Google BigQuery (GB/time) | t.diff %
[1] LOOKUP | 48MB (4.1s) | 130GB (2.0s) | -51%
[2] LOOKUP & AGGR | 331MB (4.35s) | 13.4GB (2.7s) | -48%
[3] GROUP/ORDER BY | 5.74GB (8.85s) | 8.26GB (5.4s) | -27%
[4] TEXT FUNCTIONS | 606MB (11.3s) | 13.6GB (2.4s) | -470%
[5] JSON FUNCTIONS | 29MB (17.8s) | 63.9GB (8.9s) | -100%
[6] REGEX FUNCTIONS | (1.3s) | 5.45GB (1.9s) | +31%
[7] FEDERATED DATA | 133GB (19.4s) | 133GB (36.4s) | +47%
What Athena does better than BigQuery?
Advantages:
• Can be faster than BigQuery, especially with federated/external tables
• Ability to use regex to define a schema (query files without needing to change the format)
• Can be faster and cheaper than BigQuery when using a partitioned/columnar format
• Tables can be partitioned on any column
Issues:
• It’s not easy to convert data between formats
• Doesn’t support DML, i.e. no insert/update/delete
• No built-in ingestion
What BigQuery does better than Athena?
• It has native table support giving it better performance and more features
• It’s easy to manipulate data, insert/update records and write query results back to a table
• Querying native tables is very fast
• Easy to convert non-columnar formats into a native table for columnar queries
• Supports UDFs, although they will be available in the future for Athena
• Supports nested tables (nested and repeated fields)
Remember to complete
your evaluations ;-)
https://goo.gl/T9BZvy

More Related Content

What's hot

An intro to Azure Data Lake
An intro to Azure Data LakeAn intro to Azure Data Lake
An intro to Azure Data Lake
Rick van den Bosch
 
Azure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data FlowsAzure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data Flows
Thomas Sykes
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the Cloud
Mark Kromer
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
Adam Doyle
 
1- Introduction of Azure data factory.pptx
1- Introduction of Azure data factory.pptx1- Introduction of Azure data factory.pptx
1- Introduction of Azure data factory.pptx
BRIJESH KUMAR
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
Databricks
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
Kujambu Murugesan
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Microsoft Purview
Microsoft PurviewMicrosoft Purview
Microsoft Purview
Mohammed Chaaraoui
 
Azure purview
Azure purviewAzure purview
Azure purview
Shafqat Turza
 
Databricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With Data
Databricks
 
Cloud Scale Analytics Pitch Deck
Cloud Scale Analytics Pitch DeckCloud Scale Analytics Pitch Deck
Cloud Scale Analytics Pitch Deck
Nicholas Vossburg
 
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Databricks
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
Amazon Web Services
 
FLiP Into Trino
FLiP Into TrinoFLiP Into Trino
FLiP Into Trino
Timothy Spann
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
Databricks
 
snowpro (1).pdf
snowpro (1).pdfsnowpro (1).pdf
snowpro (1).pdf
suniltiwari160300
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
Databricks
 
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Databricks
 

What's hot (20)

An intro to Azure Data Lake
An intro to Azure Data LakeAn intro to Azure Data Lake
An intro to Azure Data Lake
 
Azure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data FlowsAzure Data Factory V2; The Data Flows
Azure Data Factory V2; The Data Flows
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the Cloud
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
1- Introduction of Azure data factory.pptx
1- Introduction of Azure data factory.pptx1- Introduction of Azure data factory.pptx
1- Introduction of Azure data factory.pptx
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Microsoft Purview
Microsoft PurviewMicrosoft Purview
Microsoft Purview
 
Azure purview
Azure purviewAzure purview
Azure purview
 
Databricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With Data
 
Cloud Scale Analytics Pitch Deck
Cloud Scale Analytics Pitch DeckCloud Scale Analytics Pitch Deck
Cloud Scale Analytics Pitch Deck
 
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
FLiP Into Trino
FLiP Into TrinoFLiP Into Trino
FLiP Into Trino
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
snowpro (1).pdf
snowpro (1).pdfsnowpro (1).pdf
snowpro (1).pdf
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
 
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...
 

Viewers also liked

Google Cloud Spanner Preview
Google Cloud Spanner PreviewGoogle Cloud Spanner Preview
Google Cloud Spanner Preview
DoiT International
 
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQLAnnouncing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Amazon Web Services
 
Google BigQuery 101 & What’s New
Google BigQuery 101 & What’s NewGoogle BigQuery 101 & What’s New
Google BigQuery 101 & What’s New
DoiT International
 
An overview of Amazon Athena
An overview of Amazon AthenaAn overview of Amazon Athena
An overview of Amazon Athena
Julien SIMON
 
AWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL QueriesAWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL Queries
DoiT International
 
Big Data answers in seconds with Amazon Athena
Big Data answers in seconds with Amazon AthenaBig Data answers in seconds with Amazon Athena
Big Data answers in seconds with Amazon Athena
Julien SIMON
 
Webinar: Fighting Fraud with Graph Databases
Webinar: Fighting Fraud with Graph DatabasesWebinar: Fighting Fraud with Graph Databases
Webinar: Fighting Fraud with Graph Databases
DataStax
 
2015 Internet Trends Report
2015 Internet Trends Report2015 Internet Trends Report
2015 Internet Trends Report
IQbal KHan
 
K8S in prod
K8S in prodK8S in prod
AWS Black Belt Online Seminar 2017 Amazon Athena
AWS Black Belt Online Seminar 2017 Amazon AthenaAWS Black Belt Online Seminar 2017 Amazon Athena
AWS Black Belt Online Seminar 2017 Amazon Athena
Amazon Web Services Japan
 
AWS Cyber Security Best Practices
AWS Cyber Security Best PracticesAWS Cyber Security Best Practices
AWS Cyber Security Best Practices
DoiT International
 
Aws Atlanta meetup Amazon Athena
Aws Atlanta meetup Amazon AthenaAws Atlanta meetup Amazon Athena
Aws Atlanta meetup Amazon Athena
Adam Book
 
الفيلم أداة للتدريس - التجربة الشخصية أثناء دراسة الماجستير
الفيلم أداة للتدريس - التجربة الشخصية أثناء دراسة الماجستيرالفيلم أداة للتدريس - التجربة الشخصية أثناء دراسة الماجستير
الفيلم أداة للتدريس - التجربة الشخصية أثناء دراسة الماجستير
Prof. Sherif Shaheen
 
Superfunds Magazine - Ready to take on the world
Superfunds Magazine - Ready to take on the worldSuperfunds Magazine - Ready to take on the world
Superfunds Magazine - Ready to take on the world
Chloe Tilley
 
Ensayo blogger def 2
Ensayo blogger def 2Ensayo blogger def 2
Ensayo blogger def 2AldoMaGe
 
Посібник "Конспекти уроків у 1 семестрі"
Посібник "Конспекти уроків у 1 семестрі"Посібник "Конспекти уроків у 1 семестрі"
Посібник "Конспекти уроків у 1 семестрі"
sveta7940
 

Viewers also liked (16)

Google Cloud Spanner Preview
Google Cloud Spanner PreviewGoogle Cloud Spanner Preview
Google Cloud Spanner Preview
 
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQLAnnouncing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
 
Google BigQuery 101 & What’s New
Google BigQuery 101 & What’s NewGoogle BigQuery 101 & What’s New
Google BigQuery 101 & What’s New
 
An overview of Amazon Athena
An overview of Amazon AthenaAn overview of Amazon Athena
An overview of Amazon Athena
 
AWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL QueriesAWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL Queries
 
Big Data answers in seconds with Amazon Athena
Big Data answers in seconds with Amazon AthenaBig Data answers in seconds with Amazon Athena
Big Data answers in seconds with Amazon Athena
 
Webinar: Fighting Fraud with Graph Databases
Webinar: Fighting Fraud with Graph DatabasesWebinar: Fighting Fraud with Graph Databases
Webinar: Fighting Fraud with Graph Databases
 
2015 Internet Trends Report
2015 Internet Trends Report2015 Internet Trends Report
2015 Internet Trends Report
 
K8S in prod
K8S in prodK8S in prod
K8S in prod
 
AWS Black Belt Online Seminar 2017 Amazon Athena
AWS Black Belt Online Seminar 2017 Amazon AthenaAWS Black Belt Online Seminar 2017 Amazon Athena
AWS Black Belt Online Seminar 2017 Amazon Athena
 
AWS Cyber Security Best Practices
AWS Cyber Security Best PracticesAWS Cyber Security Best Practices
AWS Cyber Security Best Practices
 
Aws Atlanta meetup Amazon Athena
Aws Atlanta meetup Amazon AthenaAws Atlanta meetup Amazon Athena
Aws Atlanta meetup Amazon Athena
 
الفيلم أداة للتدريس - التجربة الشخصية أثناء دراسة الماجستير
الفيلم أداة للتدريس - التجربة الشخصية أثناء دراسة الماجستيرالفيلم أداة للتدريس - التجربة الشخصية أثناء دراسة الماجستير
الفيلم أداة للتدريس - التجربة الشخصية أثناء دراسة الماجستير
 
Superfunds Magazine - Ready to take on the world
Superfunds Magazine - Ready to take on the worldSuperfunds Magazine - Ready to take on the world
Superfunds Magazine - Ready to take on the world
 
Ensayo blogger def 2
Ensayo blogger def 2Ensayo blogger def 2
Ensayo blogger def 2
 
Посібник "Конспекти уроків у 1 семестрі"
Посібник "Конспекти уроків у 1 семестрі"Посібник "Конспекти уроків у 1 семестрі"
Посібник "Конспекти уроків у 1 семестрі"
 

Similar to Amazon Athena Hands-On Workshop

NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQLNEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
Amazon Web Services
 
Cloud architectural patterns and Microsoft Azure tools
Cloud architectural patterns and Microsoft Azure toolsCloud architectural patterns and Microsoft Azure tools
Cloud architectural patterns and Microsoft Azure tools
Pushkar Chivate
 
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
Amazon Web Services
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
Amazon Web Services
 
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
Amazon Web Services
 
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
Amazon Web Services
 
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
Amazon Web Services
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
Amazon Web Services
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
Amazon Web Services
 
Taking SharePoint to the Cloud
Taking SharePoint to the CloudTaking SharePoint to the Cloud
Taking SharePoint to the Cloud
Aaron Saikovski
 
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Amazon Web Services
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
Amazon Web Services Korea
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
Amazon Web Services
 
Los Angeles AWS Users Group - Athena Deep Dive
Los Angeles AWS Users Group - Athena Deep DiveLos Angeles AWS Users Group - Athena Deep Dive
Los Angeles AWS Users Group - Athena Deep Dive
Kevin Epstein
 
AWS Analytics
AWS AnalyticsAWS Analytics
AWS Analytics
Amazon Web Services
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
Amazon Web Services
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
SolidQ
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
Amazon Web Services
 
Processing and Analytics
Processing and AnalyticsProcessing and Analytics
Processing and Analytics
Amazon Web Services
 
Map Services on Amazon AWS, Microsoft Azure and Google Cloud Platform
Map Services on Amazon AWS, Microsoft Azure and Google Cloud PlatformMap Services on Amazon AWS, Microsoft Azure and Google Cloud Platform
Map Services on Amazon AWS, Microsoft Azure and Google Cloud Platform
문기 박
 

Amazon Athena Hands-On Workshop

  • 2. Agenda 1 2 3 4 5 Wi-Fi: DaHouseGuest, Pass: JustDoit! Feedback Form: goo.gl/T9BZvy Labs: github.com/doitintl/athena-workshop 2 Q & A Breaks: 11:30 | 13:00 - 13:45 | 15:00 Facilities & Organization
  • 3. DoIT International confidential │ Do not distribute About us.. Vadim Solovey CTO Shahar Frank Software Engineering Lead
  • 7. Workshop Agenda ● Module 1 ○ Introduction to AWS Athena ○ Demo ● Module 2 ○ Interacting with AWS Athena ○ Lab 2 ● Module 3 ○ Supported Formats and SerDes ○ Lab 3 ● Module 4 ○ Partitioning Data ○ Lab 4 ● Module 5 ○ Converting to columnar formats ○ Lab 5 ● Module 6 ○ Athena Security ● Module 7 ○ Service Limits ● Module 8 ○ Comparison to Google BigQuery ○ Demo
  • 8. [1] AWS Athena [1] Introduction Understanding Purpose & Use-Cases
  • 9. [1] Challenges Organizations struggle to analyze data without heavy investment and long deployment times ● Analyzing data on S3 requires significant effort ● Users often have access only to aggregated data sets ● Managing Hadoop or a data warehouse requires expertise
  • 10. [1] Introducing AWS Athena Athena is an interactive query service that makes it easy to analyze data directly from AWS S3 using Standard SQL
  • 11. [1] AWS Athena Overview Easy to use 1. Log in to the console 2. Create a table (either by following a wizard or by typing a Hive DDL statement) 3. Start querying
  • 12. [1] AWS Athena is Highly Available High Availability Features ● You connect to a service endpoint or log into a console ● Athena uses warm compute pools across multiple availability zones ● Your data is in Amazon S3 which has 99.999999999% durability
  • 13. [1] Querying Data Directly from Amazon S3 Direct access to your data without hassles ● No loading of data ● No ETL required ● No additional storage required ● Query of data in raw format
  • 14. [1] Use ANSI SQL Use of skills you probably already have ● Start with writing Standard ANSI SQL syntax ● Support for complex joins, nested queries & window functions ● Support for complex data types (arrays, structs) ● Support for partitioning of data by any key: ○ e.g. date, time, custom keys ○ Or customer-year-month-day-hour
  • 15. [1] AWS Athena Overview Amazon Athena is a serverless way to query your data that lives on S3 using SQL Features: ● Serverless, with zero spin-up time and transparent upgrades ● Data can be stored in CSV, JSON, ORC, Parquet and even Apache web log format ○ AVRO (coming soon) ● Compression is supported out of the box ● Queries cost $5 per terabyte of data scanned, with a 10 MB minimum per query Additional Information: ● Not a general-purpose database ● Usually used by data analysts to run interactive queries over large datasets ● Currently available in us-east-1 (N. Virginia) and us-west-2 (Oregon)
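The pricing rule on this slide ($5 per terabyte scanned, 10 MB minimum per query) can be sketched as a small estimator. This is an illustration only: the function name `estimate_query_cost` and the choice of binary units (1 TB = 1024^4 bytes) are assumptions, not something the slides specify.

```python
PRICE_PER_TB_USD = 5.0            # from the slide: $5 per terabyte scanned
MIN_BILLED_BYTES = 10 * 1024**2   # from the slide: 10 MB minimum per query
BYTES_PER_TB = 1024**4            # assumption: binary units


def estimate_query_cost(bytes_scanned: int) -> float:
    """Estimate the charge in USD for a single query (hypothetical helper)."""
    # Queries that scan less than the minimum are billed as if they scanned it.
    billed_bytes = max(bytes_scanned, MIN_BILLED_BYTES)
    return billed_bytes / BYTES_PER_TB * PRICE_PER_TB_USD


if __name__ == "__main__":
    for scanned in (1, 10 * 1024**2, 1024**3, 1024**4):
        print(f"{scanned:>15} bytes -> ${estimate_query_cost(scanned):.6f}")
```

The practical consequence is that very small lookups all cost the same, so the savings from compression, columnar formats and partitioning only show up on queries that scan more than the minimum.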
  • 16. [1] Underlying Technologies Presto (originating from Facebook) ● Used for SQL queries ● In-memory distributed query engine ● ANSI SQL compatible, with extensions Hive (originating from the Hadoop project) ● Used for DDL functionality ● Complex data types ● Multitude of formats ● Supports data partitioning
  • 17. [1] Presto vs. Hive Architecture
  • 18. [1] Use Cases Athena complements Amazon Redshift and Amazon EMR
  • 19. AWS Athena [2] Interacting with AWS Athena Develop, Execute and Visualize Queries
  • 20. [2] Interacting with AWS Athena Amazon Athena is a serverless way to query your data that lives on S3 using SQL Web User Interface: ● Run queries and examine results ● Manage databases and tables ● Save queries and share them across the organization for re-use ● Query History JDBC Driver: ● Programmatic way to access AWS Athena ○ SQL Workbench, JetBrains DataGrip, sqlline ○ Your own app AWS QuickSight: ● Visualize Athena data with charts, pivots and dashboards.
  • 21. Hands On Lab 2 Interacting with AWS Athena
  • 22. Data Formats [3] Supported Formats and SerDes Efficient Data Storage
  • 23. [3] Data and Compression Formats The data formats presently supported are ● CSV ● TSV ● Parquet (Snappy is default compression) ● ORC (Zlib is default compression) ● JSON ● Apache Web Server logs (RegexSerDe) ● Custom Delimiters Compression Formats ● Currently, Snappy, Zlib, and GZIP are the supported compression formats. ● LZO is not supported as of today
  • 24. [3] CSV Example
      CREATE EXTERNAL TABLE `mydb.yellow_trips`(
        `vendor_id` string,
        `pickup_datetime` timestamp,
        `dropoff_datetime` timestamp,
        `pickup_longitude` float,
        `pickup_latitude` float,
        `dropoff_longitude` float,
        `dropoff_latitude` float,
        `................` .....)
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        ESCAPED BY '\\'
        LINES TERMINATED BY '\n'
      LOCATION 's3://nyc-yellow-trips/csv/'
  • 25. [3] Parquet Example
      CREATE EXTERNAL TABLE `mydb.yellow_trips`(
        `vendor_id` string,
        `pickup_datetime` timestamp,
        `dropoff_datetime` timestamp,
        `pickup_longitude` float,
        `pickup_latitude` float,
        `dropoff_longitude` float,
        `dropoff_latitude` float,
        `................` .....)
      STORED AS PARQUET
      LOCATION 's3://nyc-yellow-trips/parquet/'
      tblproperties ("parquet.compress"="SNAPPY");
  • 26. [3] ORC Example
      CREATE EXTERNAL TABLE `mydb.yellow_trips`(
        `vendor_id` string,
        `pickup_datetime` timestamp,
        `dropoff_datetime` timestamp,
        `pickup_longitude` float,
        `pickup_latitude` float,
        `dropoff_longitude` float,
        `dropoff_latitude` float,
        `................` .....)
      STORED AS ORC
      LOCATION 's3://nyc-yellow-trips/orc/'
      tblproperties ("orc.compress"="ZLIB");
  • 27. [3] RegEx SerDe (Apache Log Example)
      CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
        Date DATE, Time STRING, Location STRING, Bytes INT, RequestIP STRING,
        Method STRING, Host STRING, Uri STRING, Status INT, Referrer STRING,
        os STRING, Browser STRING, BrowserVersion STRING)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
      WITH SERDEPROPERTIES (
        "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^(]+[(]([^;]+).*%20([^/]+)[/](.*)$")
      LOCATION 's3://athena-examples/cloudfront/plaintext/';
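To see how the RegexSerDe pattern on this slide splits a log line into the 13 table columns, the same expression can be tried locally before creating the table. The sketch below assumes the whitespace escapes in the slide's pattern are `\s+`; the sample log line, the `COLUMNS` list and the `parse_line` helper are made up for illustration and are not taken from the workshop data.

```python
import re

# Column names in the order of the capture groups (taken from the DDL above).
COLUMNS = [
    "date", "time", "location", "bytes", "requestip", "method", "host",
    "uri", "status", "referrer", "os", "browser", "browserversion",
]

# Ten whitespace-separated fields, then OS, browser and browser version
# extracted from the URL-encoded user-agent string.
LOG_PATTERN = re.compile(
    r"^(?!#)" + r"([^ ]+)\s+" * 10 + r"[^(]+[(]([^;]+).*%20([^/]+)[/](.*)$"
)

# Hypothetical sample line, for illustration only.
SAMPLE = (
    "2014-07-05 20:00:00 LHR3 4260 10.0.0.15 GET "
    "eabcd12345678.cloudfront.net /test-image-1.jpeg 200 - "
    "Mozilla/5.0%20(Android;%20U;%20Windows%20NT%205.1;%20en-US;"
    "%20rv:1.9.0.9)%20Gecko/2009040821%20IE/3.0.9"
)


def parse_line(line: str):
    """Return a dict of column -> raw string, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # header lines starting with '#' are skipped by (?!#)
    return dict(zip(COLUMNS, match.groups()))


if __name__ == "__main__":
    print(parse_line(SAMPLE))
    print(parse_line("#Version: 1.0"))
```

Lines that do not match the pattern come back as `None` here; checking a handful of real log lines this way is a cheap way to catch a pattern mismatch before running queries.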
  • 28. [3] Comparing Formats PARQUET ● Columnar format ● Schema segregation into footer ● Column major format ● All data is pushed to the leaf ● Integrated compression and indexes ● Support for predicate pushdown ORC ● Apache Top Level Project ● Schema segregation into footer ● Column major format with stripes ● Integrated compression and indexes and stats ● Support for predicate pushdown
  • 30. [3] Converting to Parquet or ORC format ● You can use Hive CTAS to convert data: CREATE TABLE new_key_value_store STORED AS PARQUET AS SELECT c1, c2, c3, .., cN FROM noncolumnartable SORT BY key ● You can also use Spark to convert the files to Parquet or ORC ● 20 lines of PySpark code running on EMR [1] ○ Converts 1TB of text data into 130GB of Parquet with Snappy compression ○ Approx. cost is $5 [1] https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
  • 31. [3] Pay By the Query ($5 per TB scanned) ● You pay for the amount of data scanned ● Ways to save on cost ○ Compress ○ Convert to columnar format ○ Use partitioning ● Free: DDL queries, failed queries
      Dataset                | Size on S3 | Query Runtime | Data Scanned | Cost
      Logs stored as CSV     | 1TB        | 237s          | 1.15TB       | $5.75
      Logs stored as PARQUET | 130GB      | 5.13s         | 2.69GB       | $0.013
      Savings                | 87% less   | 34x faster    | 99% less     | 99.7% cheaper
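The cost column in the table above follows directly from the data-scanned column at $5 per TB. A minimal sketch of that arithmetic, assuming binary units (1 TB = 1024 GB) and a hypothetical helper name `scan_cost_usd`:

```python
PRICE_PER_TB_USD = 5.0   # from the slide
GB_PER_TB = 1024         # assumption: binary units


def scan_cost_usd(gb_scanned: float) -> float:
    """Cost of a query that scans the given number of gigabytes."""
    return gb_scanned / GB_PER_TB * PRICE_PER_TB_USD


csv_cost = scan_cost_usd(1.15 * GB_PER_TB)  # CSV query scans 1.15 TB
parquet_cost = scan_cost_usd(2.69)          # Parquet query scans 2.69 GB
savings = 1 - parquet_cost / csv_cost       # fraction of the CSV cost saved

if __name__ == "__main__":
    print(f"CSV:     ${csv_cost:.2f}")
    print(f"Parquet: ${parquet_cost:.3f}")
    print(f"Savings: {savings:.2%}")
```

Under these assumptions the two costs round to the table's $5.75 and $0.013, and the saving lands within a fraction of a percentage point of the slide's 99.7% figure.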
  • 33. AWS Athena [4] Partitioning Data To improve performance and reduce cost
  • 34. [4] Partitioning Data By partitioning your data, you can restrict the amount of data scanned by each query, thus improving performance and reducing cost Benefits of Data Partitioning: ● Partitions limit the scope of data scanned by a query ● Improves performance ● Reduces query cost ● You can partition your data by any key Common Practice: ● Partition by time, often with a multi-level partitioning scheme ○ YEAR -> MONTH -> DAY -> HOUR
  • 35. [4] Data already partitioned and stored on S3
      $ aws s3 ls s3://elasticmapreduce/samples/hive-ads/tables/impressions/
      PRE dt=2009-04-12-13-00/
      PRE dt=2009-04-12-13-05/
      PRE dt=2009-04-12-13-10/
      PRE dt=2009-04-12-13-15/
      PRE dt=2009-04-12-13-20/
      PRE dt=2009-04-12-14-00/
      PRE dt=2009-04-12-14-05/

      CREATE EXTERNAL TABLE impressions ( ... ...)
      PARTITIONED BY (dt string)
      ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
      LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/';

      -- Load partitions into Athena
      MSCK REPAIR TABLE impressions;

      -- Run sample query
      SELECT dt, impressionid FROM impressions
      WHERE dt < '2009-04-12-14-00' AND dt >= '2009-04-12-13-00';
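The effect of the `WHERE dt ...` predicate in the sample query can be illustrated without Athena: only partitions whose key falls inside the range need to be read. The sketch below is a simplified model of partition pruning, not Athena's implementation; `PARTITIONS` copies the `dt` values from the listing above and `prune` is a hypothetical helper.

```python
# Partition keys taken from the S3 listing on the slide.
PARTITIONS = [
    "2009-04-12-13-00",
    "2009-04-12-13-05",
    "2009-04-12-13-10",
    "2009-04-12-13-15",
    "2009-04-12-13-20",
    "2009-04-12-14-00",
    "2009-04-12-14-05",
]


def prune(partitions, lower_inclusive, upper_exclusive):
    """Keep only partitions matching: dt >= lower AND dt < upper.

    Plain string comparison is enough because the keys are zero-padded
    and fixed-width, so lexicographic order equals chronological order.
    """
    return [dt for dt in partitions if lower_inclusive <= dt < upper_exclusive]


selected = prune(PARTITIONS, "2009-04-12-13-00", "2009-04-12-14-00")

if __name__ == "__main__":
    print(f"scanning {len(selected)} of {len(PARTITIONS)} partitions")
    for dt in selected:
        print(f"  dt={dt}/")
```

With this listing, five of the seven prefixes fall inside the range, so the two 14:00 partitions would not be read at all.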
  • 36. [4] Data is not partitioned
      $ aws s3 ls s3://athena-examples/elb/plaintext/ --recursive
      2016-11-23 17:54:46 11789573 elb/plaintext/2015/01/01/part-r-00000-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
      2016-11-23 17:54:46 8776899 elb/plaintext/2015/01/01/part-r-00001-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
      2016-11-23 17:54:46 9309800 elb/plaintext/2015/01/01/part-r-00002-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
      2016-11-23 17:54:47 9412570 elb/plaintext/2015/01/01/part-r-00003-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
      2016-11-23 17:54:47 10725938 elb/plaintext/2015/01/01/part-r-00004-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
      2016-11-23 17:54:46 9439710 elb/plaintext/2015/01/01/part-r-00005-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
      2016-11-23 17:54:47 0 elb/plaintext/2015/01/01_$folder$
      2016-11-23 17:54:47 9012723 elb/plaintext/2015/01/02/part-r-00006-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
      2016-11-23 17:54:47 7571816 elb/plaintext/2015/01/02/part-r-00007-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
      2016-11-23 17:54:47 9673393 elb/plaintext/2015/01/02/part-r-00008-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
      2016-11-23 17:54:48 11979218 elb/plaintext/2015/01/02/part-r-00009-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
      2016-11-23 17:54:48 9546833 elb/plaintext/2015/01/02/part-r-00010-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt

      ALTER TABLE elb_logs_raw_native_part ADD PARTITION (year='2015',month='01',day='01')
      location 's3://athena-examples/elb/plaintext/2015/01/01/'
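When the S3 layout uses plain `YYYY/MM/DD/` prefixes like the listing above, each day needs its own `ALTER TABLE ... ADD PARTITION` statement. A small generator can produce them for a date range. This is a sketch: the helper name `add_partition_statements` is hypothetical, and it only builds the statement text; it does not submit anything to Athena.

```python
from datetime import date, timedelta


def add_partition_statements(table, base_location, start, end):
    """Build one ADD PARTITION statement per day from start to end (inclusive).

    base_location is expected to end with a slash, e.g.
    's3://athena-examples/elb/plaintext/'.
    """
    statements = []
    day = start
    while day <= end:
        year, month, dom = f"{day:%Y}", f"{day:%m}", f"{day:%d}"
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION "
            f"(year='{year}',month='{month}',day='{dom}') "
            f"location '{base_location}{year}/{month}/{dom}/'"
        )
        day += timedelta(days=1)
    return statements


if __name__ == "__main__":
    for stmt in add_partition_statements(
        "elb_logs_raw_native_part",
        "s3://athena-examples/elb/plaintext/",
        date(2015, 1, 1),
        date(2015, 1, 2),
    ):
        print(stmt)
```

The first generated statement is the same one shown on the slide; the second covers the `2015/01/02/` prefix from the listing.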
  • 37. [5] AWS Athena [5] Converting to Columnar Formats Apache Parquet & ORC
  • 38. [5] Converting to Columnar Formats (batch data) Your Amazon Athena query performance improves if you convert your data into open source columnar formats such as Apache Parquet or ORC. The process for converting to columnar formats using an EMR cluster is as follows: ● Create an EMR cluster with Hive installed. ● In the step section of the cluster create statement, you can specify a script stored in Amazon S3, which points to your input data and creates output data in the columnar format in an Amazon S3 location. In this example, the cluster auto-terminates.
  • 39. [5] Converting to Columnar Formats (streaming data) Your Amazon Athena query performance improves if you convert your data into open source columnar formats such as Apache Parquet or ORC. The process for converting to columnar formats using an EMR cluster is as follows: ● Create an EMR cluster with Spark ● Run Spark Streaming Job reading the data from Kinesis Stream and writing Parquet files on S3
  • 40. AWS Athena [6] Athena Security Authorization and Access
  • 41. [6] Athena Security Amazon offers three ways to control data access: ● AWS Identity and Access Management policies ● Access Control Lists ● Amazon S3 bucket policies Users control who can access their data on S3. It’s possible to fine-tune security so that different people see different sets of data, and to grant access to other users’ data.
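As an illustration of the first option (IAM policies), the sketch below builds a policy document that allows read-only access to a single prefix of a bucket, which is one way to let different people see different sets of data. The helper name `read_only_prefix_policy` and the bucket/prefix values are assumptions for the example; a real setup would likely need further permissions (for instance for the location where query results are written), which are omitted here.

```python
import json


def read_only_prefix_policy(bucket: str, prefix: str) -> dict:
    """Return an IAM policy document granting read access to one S3 prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read objects under the prefix only.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
            {
                # List the bucket, restricted to keys under the prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }


if __name__ == "__main__":
    policy = read_only_prefix_policy("nyc-yellow-trips", "parquet")
    print(json.dumps(policy, indent=2))
```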
  • 42. AWS Athena [7] Service Limits Know your limits and mitigate the risk
  • 43. [7] Service Limits You can request a limit increase by contacting AWS Support. ● Currently, you can only submit one query at a time and you can only have 5 (five) concurrent queries at one time per account. ● Query timeout: 30 minutes ● Number of databases: 100 ● Tables: 100 per database ● Number of partitions: 20k per table ● You may encounter the Amazon S3 limit of 100 buckets per account.
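The 20k-partitions-per-table limit interacts with the partitioning scheme chosen earlier: the finer the granularity, the sooner a table reaches the limit. A quick back-of-the-envelope check, using a hypothetical helper `days_until_limit` and assuming partitions are added at a constant rate per day:

```python
MAX_PARTITIONS_PER_TABLE = 20_000  # from the slide


def days_until_limit(partitions_per_day: int) -> int:
    """Whole days of data a table can hold before hitting the partition limit."""
    return MAX_PARTITIONS_PER_TABLE // partitions_per_day


if __name__ == "__main__":
    for label, per_day in (("daily", 1), ("hourly", 24), ("5-minute", 288)):
        days = days_until_limit(per_day)
        print(f"{label:>9}: {days} days (~{days / 365:.1f} years)")
```

Under that assumption, hourly partitions exhaust the limit in a little over two years, and 5-minute partitions (like the `dt=2009-04-12-13-05` keys in the partitioning module) in a couple of months, so coarse partitions or a limit increase request may be needed for long-lived tables.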
  • 44. [7] Known Limitations The following are known limitations in Amazon Athena ● User-defined functions (UDFs or UDAFs) are not supported. ● Stored procedures are not supported. ● Currently, Athena does not support any transactions found in Hive or Presto. For a full list of keywords not supported, see Unsupported DDL. ● LZO is not supported. Use Snappy instead.
  • 45. [7] Avoid Surprises
      Use backticks if table or column names begin with an underscore. For example:
        CREATE TABLE myUnderScoreTable ( `_id` string, `_index` string, ...
      For the LOCATION clause, use a trailing slash.
      USE
        s3://path_to_bucket/
      DO NOT USE
        s3://path_to_bucket
        s3://path_to_bucket/*
        s3://path_to_bucket/mySpecialFile.dat
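The LOCATION rules on this slide are easy to check mechanically before submitting DDL. The following is a minimal sketch of such a check; `is_valid_location` is a hypothetical helper that encodes only the rules listed above (S3 URI, trailing slash, no wildcard) and does not verify that the bucket or prefix exists.

```python
def is_valid_location(location: str) -> bool:
    """Check a LOCATION value against the rules listed on the slide."""
    return (
        location.startswith("s3://")
        and len(location) > len("s3://")   # must name a bucket
        and location.endswith("/")         # folder-style prefix, not a file
        and "*" not in location            # wildcards are not allowed
    )


if __name__ == "__main__":
    for candidate in (
        "s3://path_to_bucket/",
        "s3://path_to_bucket",
        "s3://path_to_bucket/*",
        "s3://path_to_bucket/mySpecialFile.dat",
    ):
        verdict = "USE" if is_valid_location(candidate) else "DO NOT USE"
        print(f"{verdict:>10}  {candidate}")
```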
  • 46. AWS Athena [8] Comparing to Google BigQuery Know your limits and mitigate the risk
  • 47. Google BigQuery
      • Serverless Analytical Columnar Database based on Google Dremel
      • Data:
        • Native Tables
        • External Tables (*SV, JSON, AVRO files stored in Google Cloud Storage bucket)
      • Ingestion:
        • File Imports
        • Streaming API (up to 100K records/sec per table)
        • Federated Tables (files in bucket, Bigtable table or Google Spreadsheet)
      • ANSI SQL 2011
      • Priced at $5/TB of scanned data + storage + streaming (if used)
      • Cost Optimization - partitioning, limit queried columns, 24-hour cache, cold data.
  • 48. Summary
      Feature                | AWS Athena                   | Google BigQuery
      Data Formats           | *SV, JSON, PARQUET/z, ORC/z  | External (*SV, JSON, AVRO) / Native
      ANSI SQL Support       | Yes*                         | Yes*
      DDL Support            | Only CREATE/ALTER/DROP       | CREATE/UPDATE/DELETE (w/ quotas)
      Underlying Technology  | FB Presto                    | Google Dremel
      Caching                | No                           | Yes
      Cold Data Pricing      | S3 Lifecycle Policy          | 50% discount after 90 days of inactivity
      User Defined Functions | No                           | Yes
      Data Partitioning      | On Any Key                   | By DAY
      Pricing                | $5/TB (scanned) plus S3 ops  | $5/TB (scanned) less cached data
  • 49. Test Drive Summary
      Query Type          | AWS Athena (GB/time) | Google BigQuery (GB/time) | t.diff %
      [1] LOOKUP          | 48MB (4.1s)          | 130GB (2.0s)              | -51%
      [2] LOOKUP & AGGR   | 331MB (4.35s)        | 13.4GB (2.7s)             | -48%
      [3] GROUP/ORDER BY  | 5.74GB (8.85s)       | 8.26GB (5.4s)             | -27%
      [4] TEXT FUNCTIONS  | 606MB (11.3s)        | 13.6GB (2.4s)             | -470%
      [5] JSON FUNCTIONS  | 29MB (17.8s)         | 63.9GB (8.9s)             | -100%
      [6] REGEX FUNCTIONS | (1.3s)               | 5.45GB (1.9s)             | +31%
      [7] FEDERATED DATA  | 133GB (19.4s)        | 133GB (36.4s)             | +47%
  • 50. What does Athena do better than BigQuery? Advantages: • Can be faster than BigQuery, especially with federated/external tables • Ability to use regex to define a schema (query files without needing to change the format) • Can be faster and cheaper than BigQuery when using a partitioned/columnar format • Tables can be partitioned on any column Issues: • It’s not easy to convert data between formats • Doesn’t support DML, i.e. no insert/update/delete • No built-in ingestion
  • 51. What does BigQuery do better than Athena? • It has native table support, giving it better performance and more features • It’s easy to manipulate data, insert/update records and write query results back to a table • Querying native tables is very fast • Easy to convert non-columnar formats into a native table for columnar queries • Supports UDFs (planned for Athena in the future) • Supports nested tables (nested and repeated fields)
  • 52. Remember to complete your evaluations ;-) https://goo.gl/T9BZvy