SlideShare a Scribd company logo
©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved
Amazon Athena
Julien Simon"
Principal Technical Evangelist
julsimon@amazon.fr
@julsimon
Big Data the way it should be
Questions
(not data!)
Answers
We shouldn’t have to
care about how this
really works !
Data
We shouldn’t have to
mess with this at all
Want to build it yourself? You need to master this
•  Planning capacity for storage and compute
•  Handling different data formats, structured and
unstructured (CSV, JSON, Parquet, Avro, etc.)
•  Learning complex programming models and
languages (Map Reduce, Spark, Scala, etc.)
•  Keeping costs under control
•  Availability, performance, security and a few more
Amazon Athena
•  New service announced at re:Invent 2016
•  Run read-only SQL queries on S3 data
•  No data load, no indexing, no nothing
•  No infrastructure to create, manage or scale
•  Availability: us-east-1, us-east-2, us-west-2
•  Pricing: $5 per Terabyte scanned 

AWS re:Invent 2016: Introducing Athena (BDA303) https://www.youtube.com/watch?v=DxAuj_Ky5aw
Athena queries
•  Service based on Presto (already available in Amazon EMR)
•  Table creation: Apache Hive Data Definition Language
–  CREATE EXTERNAL TABLE
•  ANSI SQL operators and functions: what Presto supports
•  Unsupported operations
–  User-defined functions (UDF or UDAFs)
–  Stored procedures
–  Any transaction found in Hive or Presto 
https://prestodb.io 
http://docs.aws.amazon.com/athena/latest/ug/known-limitations.html
Data formats supported by Athena
•  Unstructured
–  Apache logs, with customizable regular expression
•  Semi-structured
–  delimiter-separated values (CSV, OpenCSV)
–  Tab-separated values (TSV)
–  JSON
•  Structured
–  Apache Parquet https://parquet.apache.org/
–  Apache ORC https://orc.apache.org/ 
–  Apache Avro https://avro.apache.org/ 
•  Compression : Snappy, Zlib, GZIP, LZO
•  Partitioning 
•  Encryption : integration with KMS
Data partitioning
•  Partitioning reduces the amount of scanned data
–  Better performance
–  Cost optimization
•  Data may be already partitioned in S3
–  CREATE EXTERNAL TABLE table_name(…) PARTITIONED BY (...) 
–  MSCK REPAIR TABLE table_name
•  Data can also be partitioned at table creation time
–  CREATE EXTERNAL TABLE table_name(…)
–  ALTER TABLE table_name ADD PARTITION …
Running queries on Athena
•  AWS Console (quite cool, actually)
–  Wizard for schema definition and table creation
–  Saved queries
–  Query history
–  Multiple queries in parallel

•  JDBC driver
–  SQL Workbench/J 
–  Java application
http://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html
Using columnar formats for fun and profit
•  Apache Parquet
•  Apache ORC
•  Ditto: better performance & cost optimization
•  You can convert your data to a columnar format with an
Amazon EMR cluster
•  More information and tutorial at
http://docs.aws.amazon.com/athena/latest/ug/convert-to-
columnar.html
Demo #1
Athena vs Redshift: data set"
Caveat: this isn’t a huge data set and it doesn’t have any joins"

•  1 table
•  1 billion lines of “e-commerce sales” (43GB)
•  CSV format, 10 columns
•  1000 files in S3, compressed to 12GB (bzip2)
Lastname, Firstname,Gender,State,Age,DayOfYear,Hour,Minutes,Items,Basket
YESTRAMSKI,KEELEY,F,Missouri,36,35,12,21,2,167
MAYOU,SCOTTIE,M,Arkansas,85,258,11,21,9,106
PFARR,SIDNEY,M,Indiana,59,146,22,21,3,163
RENZONI,ALLEN,M,Montana,31,227,13,49,10,106
CUMMINS,NICKY,M,Tennessee,50,362,1,33,1,115
THIMMESCH,BRIAN,M,Washington,29,302,20,41,2,95
Athena vs Redshift: table creation
CREATE EXTERNAL TABLE athenatest.sales (
lastname STRING,
firstname STRING,
gender STRING,
state STRING,
age INT,
day INT,
hour INT,
minutes INT,
items INT,
basket INT 
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://jsimon-redshift-demo-us/data/'
CREATE TABLE sales(
lastname VARCHAR(32) NOT NULL,
firstname VARCHAR(32) NOT NULL,
gender VARCHAR(1) NOT NULL,
state VARCHAR(32) NOT NULL,
age INT NOT NULL,
day INT NOT NULL,
hour INT NOT NULL,
minutes INT NOT NULL,
items INT NOT NULL,
basket INT NOT NULL)
DISTKEY(state)
COMPOUND SORTKEY (lastname,firstname);

COPY sales FROM 's3://jsimon-redshift-demo-us/data/'
REGION 'us-east-1' CREDENTIALS … DELIMITER ',' bzip2
COMPUPDATE ON;
Athena vs Redshift
•  Redshift
–  Start 4-node cluster: dc1.large, 160GB SSD, low I/O, $0.25/hr
–  Start 4-node cluster: dc1.8xlarge, 2.56TB SSD, v.high I/O, $4.80/hr
–  Create table
–  Load data from S3 (COPY operation)
–  Run some queries
•  Athena
–  Create table
–  Run the same queries
Athena vs Redshift: start your engines!
•  Athena
–  Initialization : < 5s (table creation)
–  Cost: $0.0025 for a full scan (12GB + a few thousand S3 requests)
–  Unlimited storage in S3
•  Redshift (dc1.large)
–  Initialization: 6mn (create cluster) + 40mn (data load)
–  $1/hr ($0.36 with 3-yr, 100% upfront RIs)
–  Maximum storage: 640GB (about 2TB with compression)
•  Redshift (dc1.8xlarge)
–  Initialization: 6mn (create cluster) + 4mn (data load)
–  $19.20/hr ($6 with 3-yr, 100% upfront RIs)
–  Maximum storage: 10TB (about 30TB with compression)
Athena vs Redshift: SQL queries
; Q1: count sales (1 full scan)
SELECT count(*) FROM sales
; Q2: average basket per gender (1 full scan)
SELECT gender, avg(basket) FROM sales GROUP BY gender;
; Q3: 5-day intervals when women spend most
SELECT floor(day/5.00)*5, sum(basket) AS spend FROM sales WHERE gender='F' GROUP
BY floor(day/5.00)*5 ORDER BY spend DESC LIMIT 10;
; Q4: top 10 states where women spend most in December (1 full scan)
SELECT state, sum(basket) AS spend FROM sales WHERE gender='F' AND day>=334 GROUP
BY state ORDER BY spend DESC LIMIT 10;
; Q5: list the top 10000 female customers in the top 10 states (2 full scans)
SELECT lastname, firstname, spend FROM (
SELECT lastname, firstname, sum(basket) AS spend FROM sales WHERE gender='F'
AND state IN(
SELECT state FROM sales WHERE day>=334 GROUP BY state
ORDER BY sum(basket) DESC LIMIT 10
) AND day >=334 GROUP BY lastname,firstname
) WHERE spend >=500 ORDER BY spend DESC LIMIT 10000;
Identical queries
on both systems
Athena vs Redshift: SQL queries"
YMMV, standard disclaimer applies J
; Q1: count sales
Athena: 15-17s, Redshift: 2-3s, Redshift 8xl: <1s
; Q2: average basket per gender
Athena: 20-22s, Redshift: 15-17s, Redshift 8xl: 4-5s
; Q3: 5-day intervals when women spend most
Athena: 20-22s, Redshift: 20-22s, Redshift 8xl: 4-5s
; Q4: top 10 states where women spend most in December
Athena: 22-25s, Redshift: 10-12s, Redshift 8xl: 2-3s
(courtesy of the ‘state’ distribution key)
; Q5: list the top 10000 female customers in the top 10 states
Athena: 38-40s, Redshift: 34-36s, Redshift 8xl: 7-9s
Demo #2
GDELT Data set
•  Global Database of Events, Language and Tone Database 
–  300 categories of political & diplomatic activities around the world
–  Georeferenced to the city
–  Dating back to January 1, 1979
–  http://www.gdeltproject.org/
–  http://blog.julien.org/2017/03/exploring-gdelt-data-set-with-amazon.html 
•  1582 CSV files in S3 (146 GB)
•  1 table (+ reference tables), 58 columns, 448M lines
•  https://aws.amazon.com/public-datasets/gdelt/
Using columnar formats for fun and profit
•  Hive makes it easy to convert from CSV to Parquet
https://docs.aws.amazon.com/athena/latest/ug/convert-to-columnar.html 
•  Large request 
–  CSV uncompressed : 26 seconds, 136GB scanned, $0.13
–  Parquet compressed : 4 seconds, 2.2GB scanned, $0.002
Athena in a nutshell
•  Run SQL queries on S3 data
•  No infrastructure
•  Multiple input formats supported
•  Pretty fast!
•  A simple, very cost-efficient option for ad-hoc
analysis
©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved
Julien Simon
julsimon@amazon.fr
@julsimon 
Your feedback 
is important to us!

More Related Content

What's hot

AWS Summit London 2014 | Partners & Solutions Track | What's New at AWS?
AWS Summit London 2014 | Partners & Solutions Track | What's New at AWS?AWS Summit London 2014 | Partners & Solutions Track | What's New at AWS?
AWS Summit London 2014 | Partners & Solutions Track | What's New at AWS?
Amazon Web Services
 

What's hot (20)

AWS Webinar-Introducing Amazon ElastiCache for Redis
AWS Webinar-Introducing Amazon ElastiCache for RedisAWS Webinar-Introducing Amazon ElastiCache for Redis
AWS Webinar-Introducing Amazon ElastiCache for Redis
 
AWS Summit London 2014 | Improving Availability and Lowering Costs (300)
AWS Summit London 2014 | Improving Availability and Lowering Costs (300)AWS Summit London 2014 | Improving Availability and Lowering Costs (300)
AWS Summit London 2014 | Improving Availability and Lowering Costs (300)
 
Introduction to AWS Batch
Introduction to AWS BatchIntroduction to AWS Batch
Introduction to AWS Batch
 
Scalable Deep Learning on AWS with Apache MXNet
Scalable Deep Learning on AWS with Apache MXNetScalable Deep Learning on AWS with Apache MXNet
Scalable Deep Learning on AWS with Apache MXNet
 
Big Data answers in seconds with Amazon Athena
Big Data answers in seconds with Amazon AthenaBig Data answers in seconds with Amazon Athena
Big Data answers in seconds with Amazon Athena
 
AWS Webcast - High Availability SQL Server with Amazon RDS
AWS Webcast - High Availability SQL Server with Amazon RDSAWS Webcast - High Availability SQL Server with Amazon RDS
AWS Webcast - High Availability SQL Server with Amazon RDS
 
Deep Dive on Amazon RDS (May 2016)
Deep Dive on Amazon RDS (May 2016)Deep Dive on Amazon RDS (May 2016)
Deep Dive on Amazon RDS (May 2016)
 
AWS Summit London 2014 | Partners & Solutions Track | What's New at AWS?
AWS Summit London 2014 | Partners & Solutions Track | What's New at AWS?AWS Summit London 2014 | Partners & Solutions Track | What's New at AWS?
AWS Summit London 2014 | Partners & Solutions Track | What's New at AWS?
 
Introduction to Amazon EC2 Spot
Introduction to Amazon EC2 SpotIntroduction to Amazon EC2 Spot
Introduction to Amazon EC2 Spot
 
AWS Innovate: Running Databases in AWS- Russell Nash
AWS Innovate: Running Databases in AWS- Russell NashAWS Innovate: Running Databases in AWS- Russell Nash
AWS Innovate: Running Databases in AWS- Russell Nash
 
AWS re:Invent 2016: Case Study: How Monsanto Uses Amazon EFS with Their Large...
AWS re:Invent 2016: Case Study: How Monsanto Uses Amazon EFS with Their Large...AWS re:Invent 2016: Case Study: How Monsanto Uses Amazon EFS with Their Large...
AWS re:Invent 2016: Case Study: How Monsanto Uses Amazon EFS with Their Large...
 
This One Weird API Request Will Save You Thousands
This One Weird API Request Will Save You ThousandsThis One Weird API Request Will Save You Thousands
This One Weird API Request Will Save You Thousands
 
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
 
AWS Summit London 2014 | Scaling on AWS for the First 10 Million Users (200)
AWS Summit London 2014 | Scaling on AWS for the First 10 Million Users (200)AWS Summit London 2014 | Scaling on AWS for the First 10 Million Users (200)
AWS Summit London 2014 | Scaling on AWS for the First 10 Million Users (200)
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Introduction to Amazon EC2 Spot
Introduction to Amazon EC2 Spot Introduction to Amazon EC2 Spot
Introduction to Amazon EC2 Spot
 
Introduction on Amazon EC2
 Introduction on Amazon EC2 Introduction on Amazon EC2
Introduction on Amazon EC2
 
Deep Learning with AWS (November 2016)
Deep Learning with AWS (November 2016)Deep Learning with AWS (November 2016)
Deep Learning with AWS (November 2016)
 
AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ...
AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ...AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ...
AWS re:Invent 2016: ElastiCache Deep Dive: Best Practices and Usage Patterns ...
 
Building prediction models with Amazon Redshift and Amazon ML
Building prediction models with  Amazon Redshift and Amazon MLBuilding prediction models with  Amazon Redshift and Amazon ML
Building prediction models with Amazon Redshift and Amazon ML
 

Similar to Amazon Athena (April 2017)

AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
AWS Summit Tel Aviv - Enterprise Track - Data WarehouseAWS Summit Tel Aviv - Enterprise Track - Data Warehouse
AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Amazon Web Services
 

Similar to Amazon Athena (April 2017) (20)

Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017
Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017
Amazon Athena, w/ benchmark against Redshift - Pop-up Loft TLV 2017
 
An overview of Amazon Athena
An overview of Amazon AthenaAn overview of Amazon Athena
An overview of Amazon Athena
 
Leveraging Amazon Redshift for Your Data Warehouse
Leveraging Amazon Redshift for Your Data WarehouseLeveraging Amazon Redshift for Your Data Warehouse
Leveraging Amazon Redshift for Your Data Warehouse
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
Best Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWSBest Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWS
 
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftBest Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
 
Masterclass - Redshift
Masterclass - RedshiftMasterclass - Redshift
Masterclass - Redshift
 
Scaling an invoicing SaaS from zero to over 350k customers
Scaling an invoicing SaaS from zero to over 350k customersScaling an invoicing SaaS from zero to over 350k customers
Scaling an invoicing SaaS from zero to over 350k customers
 
Processing and Analytics
Processing and AnalyticsProcessing and Analytics
Processing and Analytics
 
Powering Interactive Data Analysis at Pinterest by Amazon Redshift
Powering Interactive Data Analysis at Pinterest by Amazon RedshiftPowering Interactive Data Analysis at Pinterest by Amazon Redshift
Powering Interactive Data Analysis at Pinterest by Amazon Redshift
 
AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
AWS Summit Tel Aviv - Enterprise Track - Data WarehouseAWS Summit Tel Aviv - Enterprise Track - Data Warehouse
AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
 
Leveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data WarehouseLeveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data Warehouse
 
Migrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar SeriesMigrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar Series
 
(DAT204) NoSQL? No Worries: Build Scalable Apps on AWS NoSQL Services
(DAT204) NoSQL? No Worries: Build Scalable Apps on AWS NoSQL Services(DAT204) NoSQL? No Worries: Build Scalable Apps on AWS NoSQL Services
(DAT204) NoSQL? No Worries: Build Scalable Apps on AWS NoSQL Services
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 
Data & Analytics - Session 2 - Introducing Amazon Redshift
Data & Analytics - Session 2 - Introducing Amazon RedshiftData & Analytics - Session 2 - Introducing Amazon Redshift
Data & Analytics - Session 2 - Introducing Amazon Redshift
 
Best Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon RedshiftBest Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon Redshift
 
SPL_ALL_EN.pptx
SPL_ALL_EN.pptxSPL_ALL_EN.pptx
SPL_ALL_EN.pptx
 
Redshift overview
Redshift overviewRedshift overview
Redshift overview
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS Cloud
 

More from Julien SIMON

More from Julien SIMON (20)

An introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging FaceAn introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging Face
 
Reinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face TransformersReinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face Transformers
 
Building NLP applications with Transformers
Building NLP applications with TransformersBuilding NLP applications with Transformers
Building NLP applications with Transformers
 
Building Machine Learning Models Automatically (June 2020)
Building Machine Learning Models Automatically (June 2020)Building Machine Learning Models Automatically (June 2020)
Building Machine Learning Models Automatically (June 2020)
 
Starting your AI/ML project right (May 2020)
Starting your AI/ML project right (May 2020)Starting your AI/ML project right (May 2020)
Starting your AI/ML project right (May 2020)
 
Scale Machine Learning from zero to millions of users (April 2020)
Scale Machine Learning from zero to millions of users (April 2020)Scale Machine Learning from zero to millions of users (April 2020)
Scale Machine Learning from zero to millions of users (April 2020)
 
An Introduction to Generative Adversarial Networks (April 2020)
An Introduction to Generative Adversarial Networks (April 2020)An Introduction to Generative Adversarial Networks (April 2020)
An Introduction to Generative Adversarial Networks (April 2020)
 
AIM410R1 Deep learning applications with TensorFlow, featuring Fannie Mae (De...
AIM410R1 Deep learning applications with TensorFlow, featuring Fannie Mae (De...AIM410R1 Deep learning applications with TensorFlow, featuring Fannie Mae (De...
AIM410R1 Deep learning applications with TensorFlow, featuring Fannie Mae (De...
 
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
 
AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...
AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...
AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...
 
A pragmatic introduction to natural language processing models (October 2019)
A pragmatic introduction to natural language processing models (October 2019)A pragmatic introduction to natural language processing models (October 2019)
A pragmatic introduction to natural language processing models (October 2019)
 
Building smart applications with AWS AI services (October 2019)
Building smart applications with AWS AI services (October 2019)Building smart applications with AWS AI services (October 2019)
Building smart applications with AWS AI services (October 2019)
 
Build, train and deploy ML models with SageMaker (October 2019)
Build, train and deploy ML models with SageMaker (October 2019)Build, train and deploy ML models with SageMaker (October 2019)
Build, train and deploy ML models with SageMaker (October 2019)
 
The Future of AI (September 2019)
The Future of AI (September 2019)The Future of AI (September 2019)
The Future of AI (September 2019)
 
Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)
 
Train and Deploy Machine Learning Workloads with AWS Container Services (July...
Train and Deploy Machine Learning Workloads with AWS Container Services (July...Train and Deploy Machine Learning Workloads with AWS Container Services (July...
Train and Deploy Machine Learning Workloads with AWS Container Services (July...
 
Optimize your Machine Learning Workloads on AWS (July 2019)
Optimize your Machine Learning Workloads on AWS (July 2019)Optimize your Machine Learning Workloads on AWS (July 2019)
Optimize your Machine Learning Workloads on AWS (July 2019)
 
Deep Learning on Amazon Sagemaker (July 2019)
Deep Learning on Amazon Sagemaker (July 2019)Deep Learning on Amazon Sagemaker (July 2019)
Deep Learning on Amazon Sagemaker (July 2019)
 
Automate your Amazon SageMaker Workflows (July 2019)
Automate your Amazon SageMaker Workflows (July 2019)Automate your Amazon SageMaker Workflows (July 2019)
Automate your Amazon SageMaker Workflows (July 2019)
 
Build, train and deploy ML models with Amazon SageMaker (May 2019)
Build, train and deploy ML models with Amazon SageMaker (May 2019)Build, train and deploy ML models with Amazon SageMaker (May 2019)
Build, train and deploy ML models with Amazon SageMaker (May 2019)
 

Recently uploaded

Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Peter Udo Diehl
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 

Recently uploaded (20)

Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT Professionals
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 

Amazon Athena (April 2017)

  • 1. ©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved Amazon Athena Julien Simon" Principal Technical Evangelist julsimon@amazon.fr @julsimon
  • 2. Big Data the way it should be Questions (not data!) Answers We shouldn’t have to care about how this really works ! Data We shouldn’t have to mess with this at all
  • 3. Want to build it yourself? You need to master this •  Planning capacity for storage and compute •  Handling different data formats, structured and unstructured (CSV, JSON, Parquet, Avro, etc.) •  Learning complex programming models and languages (Map Reduce, Spark, Scala, etc.) •  Keeping costs under control •  Availability, performance, security and a few more
  • 4.
  • 5. Amazon Athena •  New service announced at re:Invent 2016 •  Run read-only SQL queries on S3 data •  No data load, no indexing, no nothing •  No infrastructure to create, manage or scale •  Availability: us-east-1, us-east-2, us-west-2 •  Pricing: $5 per Terabyte scanned AWS re:Invent 2016: Introducing Athena (BDA303) https://www.youtube.com/watch?v=DxAuj_Ky5aw
  • 6. Athena queries •  Service based on Presto (already available in Amazon EMR) •  Table creation: Apache Hive Data Definition Language –  CREATE EXTERNAL TABLE •  ANSI SQL operators and functions: what Presto supports •  Unsupported operations –  User-defined functions (UDF or UDAFs) –  Stored procedures –  Any transaction found in Hive or Presto https://prestodb.io http://docs.aws.amazon.com/athena/latest/ug/known-limitations.html
  • 7. Data formats supported by Athena •  Unstructured –  Apache logs, with customizable regular expression •  Semi-structured –  delimiter-separated values (CSV, OpenCSV) –  Tab-separated values (TSV) –  JSON •  Structured –  Apache Parquet https://parquet.apache.org/ –  Apache ORC https://orc.apache.org/ –  Apache Avro https://avro.apache.org/ •  Compression : Snappy, Zlib, GZIP, LZO •  Partitioning •  Encryption : integration with KMS
  • 8. Data partitioning •  Partitioning reduces the amount of scanned data –  Better performance –  Cost optimization •  Data may be already partitioned in S3 –  CREATE EXTERNAL TABLE table_name(…) PARTITIONED BY (...) –  MSCK REPAIR TABLE table_name •  Data can also be partitioned at table creation time –  CREATE EXTERNAL TABLE table_name(…) –  ALTER TABLE table_name ADD PARTITION …
  • 9. Running queries on Athena •  AWS Console (quite cool, actually) –  Wizard for schema definition and table creation –  Saved queries –  Query history –  Multiple queries in parallel •  JDBC driver –  SQL Workbench/J –  Java application http://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html
  • 10. Using columnar formats for fun and profit •  Apache Parquet •  Apache ORC •  Ditto: better performance & cost optimization •  You can convert your data to a columnar format with an Amazon EMR cluster •  More information and tutorial at http://docs.aws.amazon.com/athena/latest/ug/convert-to- columnar.html
  • 12. Athena vs Redshift: data set" Caveat: this isn’t a huge data set and it doesn’t have any joins" •  1 table •  1 billion lines of “e-commerce sales” (43GB) •  CSV format, 10 columns •  1000 files in S3, compressed to 12GB (bzip2) Lastname, Firstname,Gender,State,Age,DayOfYear,Hour,Minutes,Items,Basket YESTRAMSKI,KEELEY,F,Missouri,36,35,12,21,2,167 MAYOU,SCOTTIE,M,Arkansas,85,258,11,21,9,106 PFARR,SIDNEY,M,Indiana,59,146,22,21,3,163 RENZONI,ALLEN,M,Montana,31,227,13,49,10,106 CUMMINS,NICKY,M,Tennessee,50,362,1,33,1,115 THIMMESCH,BRIAN,M,Washington,29,302,20,41,2,95
  • 13. Athena vs Redshift: table creation CREATE EXTERNAL TABLE athenatest.sales ( lastname STRING, firstname STRING, gender STRING, state STRING, age INT, day INT, hour INT, minutes INT, items INT, basket INT ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'serialization.format' = ',', 'field.delim' = ',' ) LOCATION 's3://jsimon-redshift-demo-us/data/' CREATE TABLE sales( lastname VARCHAR(32) NOT NULL, firstname VARCHAR(32) NOT NULL, gender VARCHAR(1) NOT NULL, state VARCHAR(32) NOT NULL, age INT NOT NULL, day INT NOT NULL, hour INT NOT NULL, minutes INT NOT NULL, items INT NOT NULL, basket INT NOT NULL) DISTKEY(state) COMPOUND SORTKEY (lastname,firstname); COPY sales FROM 's3://jsimon-redshift-demo-us/data/' REGION 'us-east-1' CREDENTIALS … DELIMITER ',' bzip2 COMPUPDATE ON;
  • 14. Athena vs Redshift •  Redshift –  Start 4-node cluster: dc1.large, 160GB SSD, low I/O, $0.25/hr –  Start 4-node cluster: dc1.8xlarge, 2.56TB SSD, v.high I/O, $4.80/hr –  Create table –  Load data from S3 (COPY operation) –  Run some queries •  Athena –  Create table –  Run the same queries
  • 15. Athena vs Redshift: start your engines! •  Athena –  Initialization : < 5s (table creation) –  Cost: $0.0025 for a full scan (12GB + a few thousand S3 requests) –  Unlimited storage in S3 •  Redshift (dc1.large) –  Initialization: 6mn (create cluster) + 40mn (data load) –  $1/hr ($0.36 with 3-yr, 100% upfront RIs) –  Maximum storage: 640GB (about 2TB with compression) •  Redshift (dc1.8xlarge) –  Initialization: 6mn (create cluster) + 4mn (data load) –  $19.20/hr ($6 with 3-yr, 100% upfront RIs) –  Maximum storage: 10TB (about 30TB with compression)
  • 16. Athena vs Redshift: SQL queries ; Q1: count sales (1 full scan) SELECT count(*) FROM sales ; Q2: average basket per gender (1 full scan) SELECT gender, avg(basket) FROM sales GROUP BY gender; ; Q3: 5-day intervals when women spend most SELECT floor(day/5.00)*5, sum(basket) AS spend FROM sales WHERE gender='F' GROUP BY floor(day/5.00)*5 ORDER BY spend DESC LIMIT 10; ; Q4: top 10 states where women spend most in December (1 full scan) SELECT state, sum(basket) AS spend FROM sales WHERE gender='F' AND day>=334 GROUP BY state ORDER BY spend DESC LIMIT 10; ; Q5: list the top 10000 female customers in the top 10 states (2 full scans) SELECT lastname, firstname, spend FROM ( SELECT lastname, firstname, sum(basket) AS spend FROM sales WHERE gender='F' AND state IN( SELECT state FROM sales WHERE day>=334 GROUP BY state ORDER BY sum(basket) DESC LIMIT 10 ) AND day >=334 GROUP BY lastname,firstname ) WHERE spend >=500 ORDER BY spend DESC LIMIT 10000; Identical queries on both systems
  • 17. Athena vs Redshift: SQL queries" YMMV, standard disclaimer applies J ; Q1: count sales Athena: 15-17s, Redshift: 2-3s, Redshift 8xl: <1s ; Q2: average basket per gender Athena: 20-22s, Redshift: 15-17s, Redshift 8xl: 4-5s ; Q3: 5-day intervals when women spend most Athena: 20-22s, Redshift: 20-22s, Redshift 8xl: 4-5s ; Q4: top 10 states where women spend most in December Athena: 22-25s, Redshift: 10-12s, Redshift 8xl: 2-3s (courtesy of the ‘state’ distribution key) ; Q5: list the top 10000 female customers in the top 10 states Athena: 38-40s, Redshift: 34-36s, Redshift 8xl: 7-9s
  • 19. GDELT Data set •  Global Database of Events, Language and Tone Database –  300 categories of political & diplomatic activities around the world –  Georeferenced to the city –  Dating back to January 1, 1979 –  http://www.gdeltproject.org/ –  http://blog.julien.org/2017/03/exploring-gdelt-data-set-with-amazon.html •  1582 CSV files in S3 (146 GB) •  1 table (+ reference tables), 58 columns, 448M lines •  https://aws.amazon.com/public-datasets/gdelt/
  • 20. Using columnar formats for fun and profit •  Hive makes it easy to convert from CSV to Parquet https://docs.aws.amazon.com/athena/latest/ug/convert-to-columnar.html •  Large request –  CSV uncompressed : 26 seconds, 136GB scanned, $0.13 –  Parquet compressed : 4 seconds, 2.2GB scanned, $0.002
  • 21. Athena in a nutshell •  Run SQL queries on S3 data •  No infrastructure •  Multiple input formats supported •  Pretty fast! •  A simple, very cost-efficient option for ad-hoc analysis
  • 22. ©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved Julien Simon julsimon@amazon.fr @julsimon Your feedback is important to us!