SlideShare a Scribd company logo
1 of 42
Amazon Athena
Kevin Epstein
What to expect…
• Overview of Amazon Athena
• Key Features
• Customer Examples
• Troubleshooting Query Errors
• Q&A
Challenges customers faced
• Significant amount of work required to analyze data
in Amazon S3
• Users often only have access to aggregated data
sets
• Managing a Hadoop cluster or
data warehouse requires expertise
Introducing Amazon Athena
Amazon Athena is an interactive query
service that makes it easy to analyze
data directly from Amazon S3 using
Standard SQL
Athena is Serverless
• No Infrastructure or
administration
• Zero Spin up time
• Transparent upgrades
Athena is easy!
• Log into the Console
• Create a table
– Type in a Hive DDL Statement
– Use the console Add Table
wizard
• Start querying
Athena is Highly Available
• You connect to a service endpoint or log into the
console
• Athena uses warm compute pools across multiple
Availability Zones
• Your data is in Amazon S3, which is also highly available
and designed for 99.999999999% durability
Query data directly from S3
• No loading of data
• Query data in its raw format
– Text, CSV, JSON, weblogs, AWS service logs
– Convert to an optimized form like ORC or Parquet for the
best performance and lowest cost
• No ETL required
• Stream data directly from Amazon S3
• Take advantage of Amazon S3 durability and availability
Athena is ANSI SQL Compliant
• Start writing ANSI SQL
• Support for complex joins, nested
queries & window functions
• Support for complex data types
(arrays, structs)
• Support for partitioning of data by any
key
– (date, time, custom keys)
– e.g., Year, Month, Day, Hour or
Customer Key, Date
Athena’s internals are familiar
Used for SQL
Queries
In-memory distributed query engine
ANSI-SQL compatible with extensions
Used for DDL
functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
Multiple formats supported
• Text files, e.g., CSV, raw logs
• Apache Web Logs, TSV files
• JSON (simple, nested)
• Compressed files
• Columnar formats such as Apache Parquet & Apache ORC,
& AVRO
Athena is Fast
• Tuned for performance
• Automatically parallelizes queries
• Results are streamed to console
• Results also stored in S3
• Improve Query performance
– Compress your data
– Use columnar formats
Athena is cost effective
• Pay per query
• $5 per TB scanned from S3
• DDL Queries and failed queries are free
• Save by using compression, columnar formats, partitions
Who is Athena for?
• Any one looking to process data stored in Amazon S3
– Data coming IOT Devices, Apache Logs, Omniture logs,
CF logs, Application Logs
• Anyone who knows SQL
– Both developers or Analysts
• Ad-hoc exploration of data and data discovery
• Customers looking to build a data lake on Amazon S3
An example pipeline
An example pipeline
Ad-hoc access to raw data using SQL
An example pipeline
Ad-hoc access to data using Athena
Athena can query
aggregated datasets as well
Re-visiting the challenges
Significant amount of work required to analyze data in Amazon
S3
No ETL required. No loading of data. Query data where it lives
Users often only have access to aggregated data sets
Query data at whatever granularity you want
Managing a Hadoop cluster or data warehouse requires
expertise
No infrastructure to manage
Accessing Athena
Use the JDBC Driver
Athena & QuickSight
+ =
Setting up QuickSight with Athena
Setting up QuickSight with Athena
Setting up QuickSight with Athena
Setting up QuickSight with Athena
Setting up QuickSight with Athena
Programatic JDBC Access
/* Setup the driver */
Properties info = new Properties();
info.put("user", "AWSAccessKey");
info.put("password", "AWSSecretAccessKey");
info.put("s3_staging_dir", "s3://S3 Bucket Location/");
Class.forName("com.amazonaws.athena.jdbc.AthenaDriver");
Connection connection = DriverManager.getConnection(
"jdbc:awsathena://athena.us-east-1.amazonaws.com:443/", info);
Create a table & execute a query
/* Create a table */
Statement statement = connection.createStatement();
ResultSet queryResults = statement.executeQuery("CREATE EXTERNAL TABLE
tableName ( Col1 String ) LOCATION ‘s3://bucket/tableLocation");
/* Execute a Query */
Statement statement = connection.createStatement();
ResultSet queryResults = statement.executeQuery("SELECT * FROM
cloudfront_logs");
Creating Tables & Querying Data
Creating Tables – The basics
Create Table Statements (or DDL) are written
in Hive
• High degree of flexibility
• Schema on Read
• Hive is SQL like but allows other concepts such “external tables”
and partitioning of data
• Data formats supported – JSON, TXT, CSV, TSV, Parquet and
ORC (via Serdes)
• Data in stored in Amazon S3
• Metadata is stored in an a metadata store
Athena’s internal metastore
• Stores Metadata
– Table definition, column names,
partitions
• Highly available and durable
• Requires no management
• Access via DDL statements
• Similar to a Hive Metastore
Running a query from the console
Run time and
data scanned
Converting to ORC & PARQUET
• You can use Hive CTAS to convert data
– CREATE TABLE new_key_value_store
– STORED AS PARQUET
– AS
– SELECT col_1, col2, col3 FROM noncolumartable
– SORT BY new_key, key_value_pair;
• You can also use Spark to convert the file into PARQUET / ORC
• 20 lines of Pyspark code, running on EMR
– Converts 1TB of text data into 130 GB of Parquet with snappy conversion
– Total cost $5
https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
Pay by the query - $5/TB scanned
• Pay by the amount of data scanned per
query
• Ways to save costs
– Compress
– Convert to Columnar format
– Use partitioning
• Free: DDL Queries, Failed Queries
Dataset Size on Amazon S3 Query Run time Data Scanned Cost
Logs stored as Text files 1 TB 237 seconds 1.15TB $5.75
Logs stored in Apache
Parquet format*
130 GB 5.13 seconds 2.69 GB $0.013
Savings 87% less with Parquet 34x faster 99% less data scanned 99.7% cheaper
Athena complements other services
Tips & Tricks
Can’t see data after creating a table?
• Verify that the input LOCATION is correct on S3
– Buckets should be specified as s3://name/ or s3://name/subfolder/
– s3://us-east-1.amazonaws.com/bucket/path will throw an error s3://bucket/path/
• Did the table have partitions ?
– MSCK Repair Table for
– s3://mybucket/athena/inputdata/year=2016/data.csv
– s3://mybucket/athena/inputdata/year=2015/data.csv
– s3://mybucket/athena/inputdata/year=2014/data.csv
– Alter Table Add Partition for
– s3://mybucket/athena/inputdata/2016/data.csv
– s3://mybucket/athena/inputdata/2015/data.csv
– s3://mybucket/athena/inputdata/2014/data.csv
– ALTER TABLE Employee ADD
– PARTITION (year=2016) LOCATION s3://mybucket/athena/inputdata/2016/'
– PARTITION (year=2015) LOCATION s3://mybucket/athena/inputdata/2015/'
– PARTITION (year=2014) LOCATION s3://mybucket/athena/inputdata/2014/'
Did you correctly define your
partitions?
Reading JSON data
• Make sure you are using the right Serde
– Native JSON Serde org.apache.hive.hcatalog.data.JsonSerDe
– OpenX SerDe (org.openx.data.jsonserde.JsonSerDe)
• Make sure JSON record is a single line
• Generate your data in case-insensitive columns
• Provide an option to ignore malformed records
CREATE EXTERNAL TABLE json (
a string,
b int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
LOCATION 's3://bucket/path/';
Access denied errors?
• Check the IAM Policy
– Refer to the Getting Started Documentation
• Check the bucket ACL
– Both the read bucket and write bucket
Table Names
• Use backticks if table names begin with an underscore.
For example:
CREATE TABLE myUnderScoreTable (
`_id` string,
`_index`string,
• Table name can only have underscores as special
characters
Demo

More Related Content

What's hot

Pragmatic Cloud Security Automation
Pragmatic Cloud Security AutomationPragmatic Cloud Security Automation
Pragmatic Cloud Security AutomationCloudVillage
 
Using Splunk/ELK for auditing AWS/GCP/Azure security posture
Using Splunk/ELK for auditing AWS/GCP/Azure security postureUsing Splunk/ELK for auditing AWS/GCP/Azure security posture
Using Splunk/ELK for auditing AWS/GCP/Azure security postureJose Hernandez
 
Enterprise grade firewall and ssl termination to ac by will stevens
Enterprise grade firewall and ssl termination to ac by will stevensEnterprise grade firewall and ssl termination to ac by will stevens
Enterprise grade firewall and ssl termination to ac by will stevensbuildacloud
 
Managing your secrets in a cloud environment
Managing your secrets in a cloud environmentManaging your secrets in a cloud environment
Managing your secrets in a cloud environmentTaswar Bhatti
 
What’s New at Cloudflare: New Product Launches
What’s New at Cloudflare: New Product LaunchesWhat’s New at Cloudflare: New Product Launches
What’s New at Cloudflare: New Product LaunchesCloudflare
 
EKS security best practices
EKS security best practicesEKS security best practices
EKS security best practicesJohn Varghese
 
«Real Time» Web Applications with SignalR in ASP.NET
«Real Time» Web Applications with SignalR in ASP.NET«Real Time» Web Applications with SignalR in ASP.NET
«Real Time» Web Applications with SignalR in ASP.NETAlessandro Giorgetti
 
Better Deployments with Sub Environments Using Spring Cloud and Netflix Ribbon
Better Deployments with Sub Environments Using Spring Cloud and Netflix RibbonBetter Deployments with Sub Environments Using Spring Cloud and Netflix Ribbon
Better Deployments with Sub Environments Using Spring Cloud and Netflix RibbonVMware Tanzu
 
What You Should Know Before The Next DDoS Attack
What You Should Know Before The Next DDoS AttackWhat You Should Know Before The Next DDoS Attack
What You Should Know Before The Next DDoS AttackCloudflare
 
y3dips hacking priv8 network
y3dips hacking priv8 networky3dips hacking priv8 network
y3dips hacking priv8 networkidsecconf
 
Reach: Solving AWS Networking Problems Faster
Reach: Solving AWS Networking Problems FasterReach: Solving AWS Networking Problems Faster
Reach: Solving AWS Networking Problems FasterDanLuhring
 
Securing Internal Applications with Cloudflare Access
Securing Internal Applications with Cloudflare AccessSecuring Internal Applications with Cloudflare Access
Securing Internal Applications with Cloudflare AccessCloudflare
 
Azure Service Endpoints vs. Private Links
Azure Service Endpoints vs. Private LinksAzure Service Endpoints vs. Private Links
Azure Service Endpoints vs. Private LinksMatthias Güntert
 
Monitoring Highly Dynamic and Distributed Systems with NGINX Amplify
Monitoring Highly Dynamic and Distributed Systems with NGINX AmplifyMonitoring Highly Dynamic and Distributed Systems with NGINX Amplify
Monitoring Highly Dynamic and Distributed Systems with NGINX AmplifyNGINX, Inc.
 
Virus Bulletin 2012
Virus Bulletin 2012Virus Bulletin 2012
Virus Bulletin 2012Cloudflare
 
Cloudflare Load Balancing for Monitoring Origin Server Health and Automatic F...
Cloudflare Load Balancing for Monitoring Origin Server Health and Automatic F...Cloudflare Load Balancing for Monitoring Origin Server Health and Automatic F...
Cloudflare Load Balancing for Monitoring Origin Server Health and Automatic F...Cloudflare
 
AWS temporary credentials challenges in prevention detection mitigation
AWS temporary credentials   challenges in prevention detection mitigationAWS temporary credentials   challenges in prevention detection mitigation
AWS temporary credentials challenges in prevention detection mitigationJohn Varghese
 
Tips for Optimizing Web Performance
Tips for Optimizing Web PerformanceTips for Optimizing Web Performance
Tips for Optimizing Web PerformanceThousandEyes
 

What's hot (19)

Pragmatic Cloud Security Automation
Pragmatic Cloud Security AutomationPragmatic Cloud Security Automation
Pragmatic Cloud Security Automation
 
Using Splunk/ELK for auditing AWS/GCP/Azure security posture
Using Splunk/ELK for auditing AWS/GCP/Azure security postureUsing Splunk/ELK for auditing AWS/GCP/Azure security posture
Using Splunk/ELK for auditing AWS/GCP/Azure security posture
 
Enterprise grade firewall and ssl termination to ac by will stevens
Enterprise grade firewall and ssl termination to ac by will stevensEnterprise grade firewall and ssl termination to ac by will stevens
Enterprise grade firewall and ssl termination to ac by will stevens
 
Managing your secrets in a cloud environment
Managing your secrets in a cloud environmentManaging your secrets in a cloud environment
Managing your secrets in a cloud environment
 
What’s New at Cloudflare: New Product Launches
What’s New at Cloudflare: New Product LaunchesWhat’s New at Cloudflare: New Product Launches
What’s New at Cloudflare: New Product Launches
 
EKS security best practices
EKS security best practicesEKS security best practices
EKS security best practices
 
«Real Time» Web Applications with SignalR in ASP.NET
«Real Time» Web Applications with SignalR in ASP.NET«Real Time» Web Applications with SignalR in ASP.NET
«Real Time» Web Applications with SignalR in ASP.NET
 
Better Deployments with Sub Environments Using Spring Cloud and Netflix Ribbon
Better Deployments with Sub Environments Using Spring Cloud and Netflix RibbonBetter Deployments with Sub Environments Using Spring Cloud and Netflix Ribbon
Better Deployments with Sub Environments Using Spring Cloud and Netflix Ribbon
 
What You Should Know Before The Next DDoS Attack
What You Should Know Before The Next DDoS AttackWhat You Should Know Before The Next DDoS Attack
What You Should Know Before The Next DDoS Attack
 
y3dips hacking priv8 network
y3dips hacking priv8 networky3dips hacking priv8 network
y3dips hacking priv8 network
 
Reach: Solving AWS Networking Problems Faster
Reach: Solving AWS Networking Problems FasterReach: Solving AWS Networking Problems Faster
Reach: Solving AWS Networking Problems Faster
 
Securing Internal Applications with Cloudflare Access
Securing Internal Applications with Cloudflare AccessSecuring Internal Applications with Cloudflare Access
Securing Internal Applications with Cloudflare Access
 
Azure Service Endpoints vs. Private Links
Azure Service Endpoints vs. Private LinksAzure Service Endpoints vs. Private Links
Azure Service Endpoints vs. Private Links
 
Monitoring Highly Dynamic and Distributed Systems with NGINX Amplify
Monitoring Highly Dynamic and Distributed Systems with NGINX AmplifyMonitoring Highly Dynamic and Distributed Systems with NGINX Amplify
Monitoring Highly Dynamic and Distributed Systems with NGINX Amplify
 
Virus Bulletin 2012
Virus Bulletin 2012Virus Bulletin 2012
Virus Bulletin 2012
 
SignalR
SignalRSignalR
SignalR
 
Cloudflare Load Balancing for Monitoring Origin Server Health and Automatic F...
Cloudflare Load Balancing for Monitoring Origin Server Health and Automatic F...Cloudflare Load Balancing for Monitoring Origin Server Health and Automatic F...
Cloudflare Load Balancing for Monitoring Origin Server Health and Automatic F...
 
AWS temporary credentials challenges in prevention detection mitigation
AWS temporary credentials   challenges in prevention detection mitigationAWS temporary credentials   challenges in prevention detection mitigation
AWS temporary credentials challenges in prevention detection mitigation
 
Tips for Optimizing Web Performance
Tips for Optimizing Web PerformanceTips for Optimizing Web Performance
Tips for Optimizing Web Performance
 

Similar to Los Angeles AWS Users Group - Athena Deep Dive

使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料Amazon Web Services
 
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQLNEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQLAmazon Web Services
 
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQLAnnouncing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQLAmazon Web Services
 
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.Amazon Web Services
 
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.Amazon Web Services
 
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017Amazon Web Services
 
Serverlesss Big Data Analytics with Amazon Athena and Quicksight
Serverlesss Big Data Analytics with Amazon Athena and QuicksightServerlesss Big Data Analytics with Amazon Athena and Quicksight
Serverlesss Big Data Analytics with Amazon Athena and QuicksightAmazon Web Services
 
Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Web Services
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)Amazon Web Services Korea
 
Denver AWS Users' Group meeting - September 2017
Denver AWS Users' Group meeting - September 2017Denver AWS Users' Group meeting - September 2017
Denver AWS Users' Group meeting - September 2017David McDaniel
 
Querying and Analyzing Data in Amazon S3
Querying and Analyzing Data in Amazon S3Querying and Analyzing Data in Amazon S3
Querying and Analyzing Data in Amazon S3Amazon Web Services
 
Aws Atlanta meetup Amazon Athena
Aws Atlanta meetup Amazon AthenaAws Atlanta meetup Amazon Athena
Aws Atlanta meetup Amazon AthenaAdam Book
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWSAmazon Web Services
 
What's New with Big Data Analytics
What's New with Big Data AnalyticsWhat's New with Big Data Analytics
What's New with Big Data AnalyticsAmazon Web Services
 
Big Data answers in seconds with Amazon Athena
Big Data answers in seconds with Amazon AthenaBig Data answers in seconds with Amazon Athena
Big Data answers in seconds with Amazon AthenaJulien SIMON
 

Similar to Los Angeles AWS Users Group - Athena Deep Dive (20)

使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
 
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQLNEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQLAnnouncing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
 
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
 
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
 
Serverlesss Big Data Analytics with Amazon Athena and Quicksight
Serverlesss Big Data Analytics with Amazon Athena and QuicksightServerlesss Big Data Analytics with Amazon Athena and Quicksight
Serverlesss Big Data Analytics with Amazon Athena and Quicksight
 
Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
 
Denver AWS Users' Group meeting - September 2017
Denver AWS Users' Group meeting - September 2017Denver AWS Users' Group meeting - September 2017
Denver AWS Users' Group meeting - September 2017
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Querying and Analyzing Data in Amazon S3
Querying and Analyzing Data in Amazon S3Querying and Analyzing Data in Amazon S3
Querying and Analyzing Data in Amazon S3
 
Aws Atlanta meetup Amazon Athena
Aws Atlanta meetup Amazon AthenaAws Atlanta meetup Amazon Athena
Aws Atlanta meetup Amazon Athena
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
What's New with Big Data Analytics
What's New with Big Data AnalyticsWhat's New with Big Data Analytics
What's New with Big Data Analytics
 
Big Data answers in seconds with Amazon Athena
Big Data answers in seconds with Amazon AthenaBig Data answers in seconds with Amazon Athena
Big Data answers in seconds with Amazon Athena
 
What is Amazon Athena
What is Amazon AthenaWhat is Amazon Athena
What is Amazon Athena
 

Recently uploaded

Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxMarkSteadman7
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....rightmanforbloodline
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...caitlingebhard1
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
API Governance and Monetization - The evolution of API governance
API Governance and Monetization -  The evolution of API governanceAPI Governance and Monetization -  The evolution of API governance
API Governance and Monetization - The evolution of API governanceWSO2
 

Recently uploaded (20)

Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptx
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
API Governance and Monetization - The evolution of API governance
API Governance and Monetization -  The evolution of API governanceAPI Governance and Monetization -  The evolution of API governance
API Governance and Monetization - The evolution of API governance
 

Los Angeles AWS Users Group - Athena Deep Dive

  • 2. What to expect… • Overview of Amazon Athena • Key Features • Customer Examples • Troubleshooting Query Errors • Q&A
  • 3. Challenges customers faced • Significant amount of work required to analyze data in Amazon S3 • Users often only have access to aggregated data sets • Managing a Hadoop cluster or data warehouse requires expertise
  • 4. Introducing Amazon Athena Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using Standard SQL
  • 5. Athena is Serverless • No Infrastructure or administration • Zero Spin up time • Transparent upgrades
  • 6. Athena is easy! • Log into the Console • Create a table – Type in a Hive DDL Statement – Use the console Add Table wizard • Start querying
  • 7. Athena is Highly Available • You connect to a service endpoint or log into the console • Athena uses warm compute pools across multiple Availability Zones • Your data is in Amazon S3, which is also highly available and designed for 99.999999999% durability
  • 8. Query data directly from S3 • No loading of data • Query data in its raw format – Text, CSV, JSON, weblogs, AWS service logs – Convert to an optimized form like ORC or Parquet for the best performance and lowest cost • No ETL required • Stream data directly from Amazon S3 • Take advantage of Amazon S3 durability and availability
  • 9. Athena is ANSI SQL Compliant • Start writing ANSI SQL • Support for complex joins, nested queries & window functions • Support for complex data types (arrays, structs) • Support for partitioning of data by any key – (date, time, custom keys) – e.g., Year, Month, Day, Hour or Customer Key, Date
  • 10. Athena’s internals are familiar Used for SQL Queries In-memory distributed query engine ANSI-SQL compatible with extensions Used for DDL functionality • Complex data types • Multitude of formats • Supports data partitioning
  • 11. Multiple formats supported • Text files, e.g., CSV, raw logs • Apache Web Logs, TSV files • JSON (simple, nested) • Compressed files • Columnar formats such as Apache Parquet & Apache ORC, & AVRO
  • 12. Athena is Fast • Tuned for performance • Automatically parallelizes queries • Results are streamed to console • Results also stored in S3 • Improve Query performance – Compress your data – Use columnar formats
  • 13. Athena is cost effective • Pay per query • $5 per TB scanned from S3 • DDL Queries and failed queries are free • Save by using compression, columnar formats, partitions
  • 14. Who is Athena for? • Any one looking to process data stored in Amazon S3 – Data coming IOT Devices, Apache Logs, Omniture logs, CF logs, Application Logs • Anyone who knows SQL – Both developers or Analysts • Ad-hoc exploration of data and data discovery • Customers looking to build a data lake on Amazon S3
  • 16. An example pipeline Ad-hoc access to raw data using SQL
  • 17. An example pipeline Ad-hoc access to data using Athena Athena can query aggregated datasets as well
  • 18. Re-visiting the challenges Significant amount of work required to analyze data in Amazon S3 No ETL required. No loading of data. Query data where it lives Users often only have access to aggregated data sets Query data at whatever granularity you want Managing a Hadoop cluster or data warehouse requires expertise No infrastructure to manage
  • 20. Use the JDBC Driver
  • 22. Setting up QuickSight with Athena
  • 23. Setting up QuickSight with Athena
  • 24. Setting up QuickSight with Athena
  • 25. Setting up QuickSight with Athena
  • 26. Setting up QuickSight with Athena
  • 27. Programatic JDBC Access /* Setup the driver */ Properties info = new Properties(); info.put("user", "AWSAccessKey"); info.put("password", "AWSSecretAccessKey"); info.put("s3_staging_dir", "s3://S3 Bucket Location/"); Class.forName("com.amazonaws.athena.jdbc.AthenaDriver"); Connection connection = DriverManager.getConnection( "jdbc:awsathena://athena.us-east-1.amazonaws.com:443/", info);
  • 28. Create a table & execute a query /* Create a table */ Statement statement = connection.createStatement(); ResultSet queryResults = statement.executeQuery("CREATE EXTERNAL TABLE tableName ( Col1 String ) LOCATION ‘s3://bucket/tableLocation"); /* Execute a Query */ Statement statement = connection.createStatement(); ResultSet queryResults = statement.executeQuery("SELECT * FROM cloudfront_logs");
  • 29. Creating Tables & Querying Data
  • 30. Creating Tables – The basics Create Table Statements (or DDL) are written in Hive • High degree of flexibility • Schema on Read • Hive is SQL like but allows other concepts such “external tables” and partitioning of data • Data formats supported – JSON, TXT, CSV, TSV, Parquet and ORC (via Serdes) • Data in stored in Amazon S3 • Metadata is stored in an a metadata store
  • 31. Athena’s internal metastore • Stores Metadata – Table definition, column names, partitions • Highly available and durable • Requires no management • Access via DDL statements • Similar to a Hive Metastore
  • 32. Running a query from the console Run time and data scanned
  • 33. Converting to ORC & PARQUET • You can use Hive CTAS to convert data – CREATE TABLE new_key_value_store – STORED AS PARQUET – AS – SELECT col_1, col2, col3 FROM noncolumartable – SORT BY new_key, key_value_pair; • You can also use Spark to convert the file into PARQUET / ORC • 20 lines of Pyspark code, running on EMR – Converts 1TB of text data into 130 GB of Parquet with snappy conversion – Total cost $5 https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
  • 34. Pay by the query - $5/TB scanned • Pay by the amount of data scanned per query • Ways to save costs – Compress – Convert to Columnar format – Use partitioning • Free: DDL Queries, Failed Queries Dataset Size on Amazon S3 Query Run time Data Scanned Cost Logs stored as Text files 1 TB 237 seconds 1.15TB $5.75 Logs stored in Apache Parquet format* 130 GB 5.13 seconds 2.69 GB $0.013 Savings 87% less with Parquet 34x faster 99% less data scanned 99.7% cheaper
  • 37. Can’t see data after creating a table? • Verify that the input LOCATION is correct on S3 – Buckets should be specified as s3://name/ or s3://name/subfolder/ – s3://us-east-1.amazonaws.com/bucket/path will throw an error s3://bucket/path/ • Did the table have partitions ? – MSCK Repair Table for – s3://mybucket/athena/inputdata/year=2016/data.csv – s3://mybucket/athena/inputdata/year=2015/data.csv – s3://mybucket/athena/inputdata/year=2014/data.csv – Alter Table Add Partition for – s3://mybucket/athena/inputdata/2016/data.csv – s3://mybucket/athena/inputdata/2015/data.csv – s3://mybucket/athena/inputdata/2014/data.csv – ALTER TABLE Employee ADD – PARTITION (year=2016) LOCATION s3://mybucket/athena/inputdata/2016/' – PARTITION (year=2015) LOCATION s3://mybucket/athena/inputdata/2015/' – PARTITION (year=2014) LOCATION s3://mybucket/athena/inputdata/2014/'
  • 38. Did you correctly define your partitions?
  • 39. Reading JSON data • Make sure you are using the right Serde – Native JSON Serde org.apache.hive.hcatalog.data.JsonSerDe – OpenX SerDe (org.openx.data.jsonserde.JsonSerDe) • Make sure JSON record is a single line • Generate your data in case-insensitive columns • Provide an option to ignore malformed records CREATE EXTERNAL TABLE json ( a string, b int ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true') LOCATION 's3://bucket/path/';
  • 40. Access denied errors? • Check the IAM Policy – Refer to the Getting Started Documentation • Check the bucket ACL – Both the read bucket and write bucket
  • 41. Table Names • Use backticks if table names begin with an underscore. For example: CREATE TABLE myUnderScoreTable ( `_id` string, `_index`string, • Table name can only have underscores as special characters
  • 42. Demo