
Amazon Athena Capabilities and Use Cases Overview

Need to start querying data instantly? Amazon Athena is an interactive query service that makes it easy to run queries on data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you can start analyzing your data immediately.

In this presentation, we will show you how easy Amazon Athena makes it to query your data stored in S3.



  1. Amazon Athena: Prajakta Damle, Roy Hasson and Abhishek Sinha
  2. Amazon Athena: Prajakta Damle, Roy Hasson and Abhishek Sinha
  3. What to Expect from the Session 1. Product walk-through of Amazon Athena and AWS Glue 2. Top 3 use cases 3. Demos 4. Hands-on labs: http://amazonathenahandson.s3-website-us-east-1.amazonaws.com/ 5. Q&A
  4. Emerging Analytics Architecture
  5. Emerging Analytics Architecture. Ingestion: Amazon Kinesis (streaming ingestion), Amazon Kinesis Firehose (real-time data streaming), AWS Snowball (petabyte-scale data import), AWS Snowmobile (exabyte-scale data import). Storage: Amazon S3 (exabyte-scale object storage), AWS Glue Data Catalog (Hive-compatible metastore). Serverless compute and data processing: Amazon Athena (interactive query), Amazon EMR (managed Hadoop & Spark), Amazon Redshift + Spectrum (petabyte-scale data warehousing, fast at exabyte scale), Amazon Elasticsearch (real-time log analytics & search), AWS Glue (ETL & data catalog), AWS Lambda (trigger-based code execution), Amazon AI (ML/DL services). Visualization: Amazon QuickSight (fast, easy-to-use cloud BI), analytic notebooks (Jupyter, Zeppelin, Hue).
  6. Building a Data Lake on AWS. [Diagram: Kinesis Firehose for ingestion, the Athena query service, and AWS Glue]
  7. Emerging Analytics Architecture (same diagram as slide 5)
  8. Amazon Athena (launched at re:Invent 2016) • Serverless query service for querying data in S3 using standard SQL, with no infrastructure to manage • No data loading required; query directly from Amazon S3 • Standard ANSI SQL with support for joins, JSON, and window functions • Support for multiple data formats, including text, CSV, TSV, JSON, Avro, ORC, and Parquet; can also query data encrypted with KMS • $5/TB scanned: pay per query, only when you run queries, based on the data scanned; if you compress your data, you pay less and your queries run faster
  9. Athena is Serverless • No infrastructure or administration • Zero spin-up time • Transparent upgrades
  10. Amazon Athena is Easy To Use • Console • JDBC • BI tools • Query builders • API
  11. Query Data Directly from Amazon S3 • No loading of data • Query data in its raw format • Text, CSV, JSON, weblogs, AWS service logs • Convert to an optimized form like ORC or Parquet for the best performance and lowest cost • Stream data directly from Amazon S3 • Take advantage of Amazon S3 durability and availability
  12. Use ANSI SQL • Start writing ANSI SQL • Support for complex joins, nested queries & window functions • Support for complex data types (arrays, structs) • Support for partitioning of data by any key (date, time, or custom keys, e.g. year, month, day, hour, or customer key and date); a runnable sketch follows below
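     Because Athena speaks standard ANSI SQL, window functions and partition filters work as-is. A minimal sketch of submitting such a query with boto3; the table is the access_logs example shown later in this deck, and the results bucket is a hypothetical name:

        import boto3

        athena = boto3.client("athena", region_name="us-east-1")

        # Rank request paths per referrer host with a window function.
        # Filtering on partition columns (year/month/day) prunes the S3
        # objects Athena must scan.
        query = """
        SELECT referrer_host,
               request_path,
               COUNT(*) AS hits,
               RANK() OVER (PARTITION BY referrer_host ORDER BY COUNT(*) DESC) AS rnk
        FROM access_logs
        WHERE year = '2017' AND month = '05' AND day = '01'
        GROUP BY referrer_host, request_path
        """

        response = athena.start_query_execution(
            QueryString=query,
            QueryExecutionContext={"Database": "default"},
            ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
        )
        print(response["QueryExecutionId"])

     Filtering on the partition columns restricts which S3 prefixes Athena reads, which lowers both run time and the per-query cost.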
  13. Familiar Technologies Under the Covers. Presto: used for SQL queries; an in-memory distributed query engine, ANSI-SQL compatible with extensions. Hive: used for DDL functionality; complex data types, a multitude of formats, support for data partitioning.
  14. Amazon Athena Supports Multiple Data Formats • Text files, e.g. CSV, raw logs • Apache web logs, TSV files • JSON (simple, nested) • Compressed files • Columnar formats such as Apache Parquet & Apache ORC • Avro
  15. Emerging Analytics Architecture (same diagram as slide 5)
  16. AWS Glue automates the undifferentiated heavy lifting of ETL. Discover: automatically discover and categorize your data, making it immediately searchable and queryable across data sources. Develop: generate code to clean, enrich, and reliably move data between various data sources; you can also use your favorite tools to build ETL jobs. Deploy: run your jobs on a serverless, fully managed, scale-out environment, with no compute resources to provision or manage.
  17. AWS Glue: Components. Data Catalog: Hive Metastore compatible with enhanced functionality; crawlers automatically extract metadata and create tables; integrated with Amazon Athena and Amazon Redshift Spectrum. Job Execution: run jobs on a serverless Spark platform; flexible scheduling; handles dependency resolution, monitoring, and alerting. Job Authoring: auto-generates ETL code; built on open frameworks (Python and Spark); developer-centric editing, debugging, and sharing.
  18. Use Case 1: Ad-hoc analysis
  19. Ad-hoc querying. [Diagram: Amazon S3 → Athena, ad-hoc query]
  20. Ad-hoc queries. [Diagram: Amazon S3 → Amazon EMR for ETL/analytics; Athena for ad-hoc queries; Amazon QuickSight connecting over JDBC]
  21. Ad-hoc analysis on historical data. [Diagram: Amazon Kinesis Stream → Firehose → Amazon S3; EMR cluster for ETL; copy into Amazon Redshift; Athena for ad-hoc queries]
  22. Creating Databases and Tables. Three ways: 1. Use Hive DDL statements directly from the console or JDBC 2. Use crawlers 3. Use the API
  23. Creating Tables - Concepts • CREATE TABLE statements (DDL) are written in Hive, giving a high degree of flexibility • Schema on read • Hive is SQL-like but adds concepts such as external tables and partitioning of data • Data formats supported: JSON, TXT, CSV, TSV, Parquet, and ORC (via SerDes) • Data is stored in Amazon S3 • Metadata is stored in the Athena data catalog or the AWS Glue Data Catalog
  24. Schema on Read Versus Schema on Write. Schema on write: create a schema → transform your data into the schema → load the data and query it; good for repetition and performance. Schema on read: store raw data → create the schema on the fly; good for exploration and experimentation.
  25. Example
      CREATE EXTERNAL TABLE access_logs (
        ip_address String,
        request_time Timestamp,
        request_method String,
        request_path String,
        request_protocol String,
        response_code String,
        response_size String,
        referrer_host String,
        user_agent String
      )
      PARTITIONED BY (year STRING, month STRING, day STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE
      LOCATION 's3://YOUR-S3-BUCKET-NAME/access-log-processed/'
      EXTERNAL = the table is only a view over the data; when you delete the table, the underlying data is not deleted.
  26. Example (same DDL as slide 25, highlighting the LOCATION clause). LOCATION = where the data is stored; in Athena this is mandated to be in Amazon S3.
  27. Creating Tables – Parquet
      CREATE EXTERNAL TABLE db_name.taxi_rides_parquet (
        vendorid STRING,
        pickup_datetime TIMESTAMP,
        dropoff_datetime TIMESTAMP,
        ratecode INT,
        passenger_count INT,
        trip_distance DOUBLE,
        fare_amount DOUBLE,
        total_amount DOUBLE,
        payment_type INT
      )
      PARTITIONED BY (YEAR INT, MONTH INT, TYPE STRING)
      STORED AS PARQUET
      LOCATION 's3://serverless-analytics/canonical/NY-Pub'
      TBLPROPERTIES ('has_encrypted_data'='true');
  28. Creating Tables – Nested JSON
      CREATE EXTERNAL TABLE IF NOT EXISTS fix_messages (
        `bodyLength` int,
        `defaultAppVerID` string,
        `encryptMethod` int,
        `msgSeqNum` int,
        `msgType` string,
        `resetSeqNumFlag` string,
        `securityRequestID` string,
        `securityRequestResult` int,
        `securityXML` struct<version:int,
                             header:struct<assetClass:string,
                                           tierLevelISIN:int,
                                           useCase:string>>
      )
      ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
      WITH SERDEPROPERTIES ('serialization.format' = '1')
      LOCATION 's3://my_bucket/fix/'
      TBLPROPERTIES ('has_encrypted_data'='false');
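     Once fix_messages exists, nested struct fields are addressed with plain dot notation in standard SQL. A minimal sketch of running such a query via boto3; the output bucket is an assumption:

        import boto3

        athena = boto3.client("athena")

        # Dot notation drills into the securityXML struct defined above.
        query = """
        SELECT msgType,
               securityXML.version           AS sec_version,
               securityXML.header.assetClass AS asset_class,
               securityXML.header.useCase    AS use_case
        FROM fix_messages
        WHERE securityRequestResult = 0
        """

        athena.start_query_execution(
            QueryString=query,
            QueryExecutionContext={"Database": "default"},
            ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
        )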
  29. CSV SerDe
      LazySimpleSerDe: does not support removing quote characters from fields, but supports different primitive types.
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
      WITH SERDEPROPERTIES (
        'serialization.format' = ',',
        'field.delim' = ',',
        'colelction.delim' = '|',
        'mapkey.delim' = ':',
        'escape.delim' = '\\'
      )
      OpenCSVSerde: supports quoted fields, but all fields must be defined as type STRING.
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
      WITH SERDEPROPERTIES (
        "separatorChar" = ",",
        "quoteChar" = "`",
        "escapeChar" = "\\"
      )
  30. Grok SerDe
      CREATE EXTERNAL TABLE `mygroktable` (
        `SYSLOGBASE` string,
        `queue_id` string,
        `syslog_message` string
      )
      ROW FORMAT SERDE 'com.amazonaws.glue.serde.GrokSerDe'
      WITH SERDEPROPERTIES (
        'input.grokCustomPatterns' = 'POSTFIX_QUEUEID [0-9A-F]{10,11}',
        'input.format' = '%{SYSLOGBASE} %{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}'
      )
      STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
      LOCATION 's3://mybucket/groksample';
  31. Athena Cookbook
  32. AWS Big Data Blog
  33. Demo 1
  34. Data Catalog & Crawlers
  35. Glue Data Catalog: bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single categorized list that is searchable
  36. Glue Data Catalog: manage table metadata through a Hive metastore API or Hive SQL; supported by tools like Hive, Presto, and Spark. Extensions added by Glue: • Search over metadata for data discovery • Connection info: JDBC URLs, credentials • Classification for identifying and parsing files • Versioning of table metadata as schemas evolve and other metadata is updated. Populate the catalog using Hive DDL, bulk import, or automatically through crawlers.
  37. Glue Data Catalog: Crawlers. Crawlers automatically discover new data and extract schema definitions • Detect schema changes and version tables • Detect Hive-style partitions on Amazon S3 • Built-in classifiers for popular types; custom classifiers via Grok expressions • Run ad hoc or on a schedule; serverless, so you pay only while a crawler runs. Crawlers automatically build your Data Catalog and keep it in sync.
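     Crawlers can be defined in the Glue console or programmatically. A minimal sketch with boto3; the crawler name, IAM role, database, S3 path, and schedule are all assumptions for illustration:

        import boto3

        glue = boto3.client("glue")

        # Role, database, and path are placeholders; substitute your own.
        glue.create_crawler(
            Name="app-logs-crawler",
            Role="AWSGlueServiceRole-Default",
            DatabaseName="logs_db",
            Targets={"S3Targets": [{"Path": "s3://my-bucket/app-logs/"}]},
            Schedule="cron(0 2 * * ? *)",  # nightly; omit to run ad hoc
        )

        glue.start_crawler(Name="app-logs-crawler")

     Omitting Schedule leaves the crawler to run on demand, which fits the pay-only-while-running model described above.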
  38. Data Catalog: Table details. [Console view: table schema, table properties, data statistics, nested fields]
  39. Data Catalog: Version control. [Console view: list of table versions; compare schema versions]
  40. Data Catalog: Detecting partitions. [Diagram: an S3 bucket hierarchy (month=Nov → date=10, date=15, … → file 1 … file N) mapped to a table definition with partition columns month and date (type str) plus data columns (e.g. int, float); the crawler estimates schema similarity among files at each level (sim=.99, .95, .93) to handle semi-structured logs, schema evolution, etc.]
  41. Data Catalog: Automatic partition detection. [Console view: available table partitions are registered automatically]
  42. Running Queries is Simple. [Console view: run time and data scanned are shown for each query]
  43. Demo 2
  44. Use Case 2: Data Lake Analytics
  45. Data Lake Architecture
  46. How is it different? Lay out data for better performance and lower cost: 1. Convert data to columnar formats 2. Partition the data 3. Optimize file sizes 4. Customers may also run ETL to clean up the data or create views
  47. Benefits of conversion • You pay by the amount of data scanned per query • Ways to save costs: compress, convert to a columnar format, use partitioning • Free: DDL queries and failed queries. Example:
      Logs stored as text files: 1 TB on Amazon S3; query run time 237 seconds; 1.15 TB scanned; cost $5.75.
      Logs stored in Apache Parquet format*: 130 GB on Amazon S3; query run time 5.13 seconds; 2.69 GB scanned; cost $0.013.
      Savings: 87% less storage with Parquet, 34x faster, 99% less data scanned, 99.7% cheaper.
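     One common way to realize these savings is to rewrite raw text or JSON as compressed, partitioned Parquet with Spark, either on EMR or as an AWS Glue job. A sketch under assumed bucket paths:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("logs-to-parquet").getOrCreate()

        # Read raw JSON logs (hypothetical path) and rewrite them as
        # Snappy-compressed Parquet, partitioned by year/month so Athena
        # can prune partitions and scan far less data per query.
        logs = spark.read.json("s3://my-bucket/raw-logs/")

        (logs.write
             .mode("overwrite")
             .partitionBy("year", "month")
             .option("compression", "snappy")
             .parquet("s3://my-bucket/curated/logs_parquet/"))

     Parquet is columnar and splittable, so Athena reads only the columns and partitions a query actually touches.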
  48. Job authoring in AWS Glue. You have choices on how to get started: • Python code generated by AWS Glue • Connect a notebook or IDE to AWS Glue • Bring existing code into AWS Glue. You can use Glue for data conversion and ETL.
  49. Job authoring: Automatic code generation. 1. Customize the mappings 2. Glue generates a transformation graph and Python code 3. Connect your notebook to development endpoints to customize your code
  50. Job authoring: ETL code • Human-readable, editable, and portable PySpark code • Flexible: Glue's ETL library simplifies manipulating complex, semi-structured data • Customizable: use native PySpark, import custom libraries, and/or leverage Glue's libraries • Collaborative: share code snippets via GitHub, reuse code across jobs
  51. Job Authoring: Glue Dynamic Frames. Like Spark's DataFrames, but better for: • Cleaning and (re)structuring semi-structured data sets, e.g. JSON, Avro, Apache logs • No upfront schema needed: infers the schema on the fly, enabling transformations in a single pass • Easy to handle the unexpected: tracks new fields and inconsistent, changing data types with choices (e.g. integer or string), and automatically marks and separates error records
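     In a Glue ETL script this looks roughly like the sketch below: read a DynamicFrame from the Data Catalog, then resolve a column that arrived with inconsistent types. Database, table, and column names are assumptions:

        from awsglue.context import GlueContext
        from pyspark.context import SparkContext

        glue_context = GlueContext(SparkContext.getOrCreate())

        # Assumed catalog entries created by a prior crawler run.
        dyf = glue_context.create_dynamic_frame.from_catalog(
            database="logs_db", table_name="app_logs"
        )

        # If a column arrived as both int and string across files, cast it
        # one way instead of failing the job.
        dyf = dyf.resolveChoice(specs=[("response_code", "cast:int")])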
  52. Job Authoring: Glue transforms. ResolveChoice(): resolve an ambiguous column by projecting one type, casting, or separating it into distinct columns. ApplyMapping(): map source columns to target columns. Adaptive and flexible.
  53. Job authoring: Relationalize() transform. Flattens a semi-structured schema (nested structs and arrays) into a relational schema with primary and foreign keys. • Transforms and adds new columns, types, and tables on the fly • Tracks keys and foreign keys across runs • SQL on the relational schema is orders of magnitude faster than JSON processing
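     A sketch of the transform; the catalog names and staging path are hypothetical:

        from awsglue.context import GlueContext
        from pyspark.context import SparkContext

        glue_context = GlueContext(SparkContext.getOrCreate())
        dyf = glue_context.create_dynamic_frame.from_catalog(
            database="logs_db", table_name="fix_messages"  # assumed catalog entries
        )

        # Flatten nested fields into one or more relational tables; arrays
        # become child tables keyed back to the root. The staging path holds
        # intermediate files.
        tables = dyf.relationalize("root", "s3://my-bucket/tmp/")
        root = tables.select("root")
        print(tables.keys())  # e.g. root plus one table per nested array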
  54. Job authoring: Glue transformations • Prebuilt transformations: click and add to your job with simple configuration • Spigot writes sample data from a DynamicFrame to S3 in JSON format • Expanding: more transformations to come
  55. Job authoring: Write your own scripts • Import custom libraries required by your code • Convert to a Spark DataFrame for complex SQL-based ETL • Convert back to a Glue DynamicFrame for semi-structured processing and AWS Glue connectors (see the sketch below)
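     A sketch of that round trip; the catalog names are assumptions:

        from awsglue.context import GlueContext
        from awsglue.dynamicframe import DynamicFrame
        from pyspark.context import SparkContext

        glue_context = GlueContext(SparkContext.getOrCreate())
        spark = glue_context.spark_session

        dyf = glue_context.create_dynamic_frame.from_catalog(
            database="logs_db", table_name="app_logs"  # assumed catalog entries
        )

        # Convert to a Spark DataFrame for complex SQL-based ETL...
        df = dyf.toDF()
        df.createOrReplaceTempView("app_logs")
        daily = spark.sql(
            "SELECT request_path, COUNT(*) AS hits FROM app_logs GROUP BY request_path"
        )

        # ...then back to a DynamicFrame for Glue writers and connectors.
        daily_dyf = DynamicFrame.fromDF(daily, glue_context, "daily_hits")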
  56. Job Authoring: Leveraging the community. No need to start from scratch. Use Glue samples stored on GitHub to share, reuse, and contribute: https://github.com/awslabs/aws-glue-samples • Migration scripts to import existing Hive Metastore data into the AWS Glue Data Catalog • Examples of how to use Dynamic Frames and the Relationalize() transform • Examples of how to use arbitrary PySpark code with Glue's Python ETL library. Download Glue's Python ETL library to start developing code in your IDE: https://github.com/awslabs/aws-glue-libs
  57. Data Partitioning – Benefits • Separates data files by any column • Reads only the files the query needs • Reduces the amount of data scanned • Decreases query completion time • Reduces query cost
  58. Data Partitioning – S3 • Prefer Hive-compatible partition naming, [column_name=column_value], e.g. s3://athena-examples/logs/year=2017/month=5/ • Simple partition naming is also supported, e.g. s3://athena-examples/logs/2017/5/ (a sketch below shows how Hive-style layouts load partitions in one statement)
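     With Hive-compatible naming, Athena can register every partition found under the table's LOCATION in a single pass; with simple naming, each partition must be added explicitly, as the next slide shows. A minimal sketch via boto3 (the results bucket is hypothetical):

        import boto3

        athena = boto3.client("athena")

        # Hive-style keys (year=2017/month=5) let MSCK REPAIR TABLE discover
        # and register all partitions under the table's LOCATION at once.
        athena.start_query_execution(
            QueryString="MSCK REPAIR TABLE logs",
            QueryExecutionContext={"Database": "default"},
            ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
        )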
  59. Data Partitioning – Data Catalog
      ALTER TABLE app_logs ADD PARTITION (year='2015', month='01', day='01')
        LOCATION 's3://athena-examples/app/plaintext/year=2015/month=01/day=01/';
      ALTER TABLE elb_logs ADD PARTITION (year='2015', month='01', day='01')
        LOCATION 's3://athena-examples/elb/plaintext/2015/01/01/';
      ALTER TABLE orders DROP PARTITION (dt='2014-05-14', country='IN'), PARTITION (dt='2014-05-15', country='IN');
      ALTER TABLE customers PARTITION (zip='98040', state='WA')
        SET LOCATION 's3://athena-examples/new_customers/zip=98040/state=WA';
  60. File sizes matter
  61. Data Lake Architecture
  62. Use Case 3: Embedding Athena
  63. Athena API • Asynchronous interaction model: initiate a query, get a query ID, retrieve results • Named queries: save queries and reuse them • Paginated result set: max page size is currently 1,000 • Column data and metadata: name, type, precision, nullable • Query status: state, start and end times • Query statistics: data scanned and execution time
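     Putting those bullets together, a minimal sketch of the asynchronous flow with boto3; the query, database, and results bucket are assumptions:

        import time
        import boto3

        athena = boto3.client("athena")

        # 1. Initiate a query and get back a query execution ID.
        qid = athena.start_query_execution(
            QueryString="SELECT COUNT(*) FROM access_logs",
            QueryExecutionContext={"Database": "default"},
            ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
        )["QueryExecutionId"]

        # 2. Poll the query status until it reaches a terminal state.
        while True:
            execution = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
            state = execution["Status"]["State"]
            if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                break
            time.sleep(1)

        # Query statistics: data scanned and execution time.
        stats = execution["Statistics"]
        print(stats["DataScannedInBytes"], stats["EngineExecutionTimeInMillis"])

        # 3. Retrieve the paginated result set (pages of up to 1,000 rows;
        # the first row returned is the header row).
        paginator = athena.get_paginator("get_query_results")
        for page in paginator.paginate(QueryExecutionId=qid):
            for row in page["ResultSet"]["Rows"]:
                print([col.get("VarCharValue") for col in row["Data"]])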
  64. Hands-On Labs • http://amazonathenahandson.s3-website-us-east-1.amazonaws.com/
  65. Thank You!
