
Serverless Big Data Analytics with Amazon Athena and QuickSight


  1. Serverless Big Data Analytics with Amazon Athena and QuickSight. Ian Robinson, Specialist SA, Analytics, EMEA. 10 April 2018. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. Amazon Athena
     Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL.
  3. Serverless and Easy to Use
     No infrastructure or administration
       • Warm compute pools across multiple AZs
       • Data in S3, for HA and high durability
     Zero spin-up time
       • Connect to a service endpoint
       • Start querying
  4. Familiar OSS Technology
     [Presto] Used for SQL queries
       • In-memory distributed query engine
       • ANSI SQL compatible, with extensions
     [Apache Hive] Used for DDL functionality
       • Complex data types
       • Multitude of formats
       • Supports data partitioning
       • EXTERNAL tables: no impact on underlying data
  5. ANSI SQL
       • Complex joins, nested queries, window functions
       • Complex data types (arrays, structs)
       • Partitioning of data by any key (date, time, custom keys), e.g. Year, Month, Day, Hour, Customer Key, Date
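     The partitioning bullet above can be made concrete with a small sketch. It assumes Hive-style `key=value` path segments, which Athena can map to partition columns; the bucket name, the `customer_key` column, and the function name are hypothetical and not taken from the slides.

```python
from datetime import datetime


def partition_prefix(base: str, event_time: datetime, customer_key: str) -> str:
    """Build a Hive-style partitioned prefix made of key=value path segments."""
    segments = [
        f"year={event_time.year:04d}",
        f"month={event_time.month:02d}",
        f"day={event_time.day:02d}",
        f"hour={event_time.hour:02d}",
        f"customer_key={customer_key}",  # hypothetical custom key
    ]
    return base.rstrip("/") + "/" + "/".join(segments) + "/"


# Hypothetical bucket and key values, for illustration only.
prefix = partition_prefix("s3://example-bucket/logs", datetime(2018, 4, 10, 9), "c42")
print(prefix)
# s3://example-bucket/logs/year=2018/month=04/day=10/hour=09/customer_key=c42/
```

     A query that filters on `year`, `month`, or `customer_key` then only needs to read objects under the matching prefixes.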
  6. Open Data Formats
       • Text files, e.g. CSV, TSV, custom delimiter
       • Apache Web Logs, CloudTrail logs
       • JSON (simple, nested), Avro
       • Columnar formats, e.g. Apache Parquet and Apache ORC
       • Logstash Grok for unstructured text files
       • Compressed files (Snappy, Zlib, GZIP, and LZO)
       • Encrypted data (SSE-S3, SSE-KMS, CSE-KMS)
       • Use large (128 MB to 1 GB) compressed files
  7. Pay Per Query: $5 per TB of Data Scanned
       • Ways to save costs
           • Compress
           • Convert to columnar format
           • Use partitioning
       • Free: DDL queries, failed queries

     Dataset                                 Size on Amazon S3       Query run time   Data scanned            Cost
     Logs stored as text files               1 TB                    237 seconds      1.15 TB                 $5.75
     Logs stored in Apache Parquet format*   130 GB                  5.13 seconds     2.69 GB                 $0.013
     Savings                                 87% less with Parquet   34x faster       99% less data scanned   99.7% cheaper
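     The cost column follows directly from the $5 per TB rate. A minimal sketch of the arithmetic, assuming 1 TB = 1024 GB and ignoring any rounding or minimum charge the service may apply per query (the function and constant names are illustrative):

```python
PRICE_PER_TB_USD = 5.0  # from the slide: $5 per TB of data scanned
GB_PER_TB = 1024        # assumption: binary units


def query_cost_usd(scanned_tb: float) -> float:
    """Cost of one query given the amount of data it scanned, in TB."""
    return scanned_tb * PRICE_PER_TB_USD


text_cost = query_cost_usd(1.15)                 # logs stored as text files
parquet_cost = query_cost_usd(2.69 / GB_PER_TB)  # logs stored as Parquet

print(f"text: ${text_cost:.2f}, parquet: ${parquet_cost:.3f}")
# text: $5.75, parquet: $0.013
```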
  8. Apache Parquet and Apache ORC
       • Columnar formats
           • Store data in columns, not rows
       • Support for predicate pushdown
           • Filter data where it lives
       • Schema segregated into footer
       • Integrated compression and indexes
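     To illustrate why columnar layout and predicate pushdown reduce the data scanned, here is a toy model in plain Python. It is not the Parquet or ORC implementation: the "row group" dictionaries, the min/max statistics, and the sample fares are all made up for the sketch.

```python
def make_row_group(rows):
    """Pivot a list of row dicts into per-column lists plus min/max statistics."""
    columns = {name: [row[name] for row in rows] for name in rows[0]}
    stats = {name: (min(values), max(values)) for name, values in columns.items()}
    return {"columns": columns, "stats": stats}


def scan(row_groups, select, filter_column, minimum):
    """Return `select` values where filter_column >= minimum.

    Row groups whose recorded maximum is below `minimum` are skipped
    without touching their column data (the "pushdown" part), and only
    the two columns involved are ever read (the "columnar" part).
    """
    selected, groups_read = [], 0
    for group in row_groups:
        _, group_max = group["stats"][filter_column]
        if group_max < minimum:
            continue  # no row in this group can match
        groups_read += 1
        keys = group["columns"][filter_column]
        values = group["columns"][select]
        selected.extend(v for k, v in zip(keys, values) if k >= minimum)
    return selected, groups_read


# Made-up sample data: two row groups of trip records.
row_groups = [
    make_row_group([{"year": 2016, "fare": 9.5}, {"year": 2016, "fare": 12.0}]),
    make_row_group([{"year": 2017, "fare": 7.25}, {"year": 2018, "fare": 30.0}]),
]
fares, groups_read = scan(row_groups, select="fare", filter_column="year", minimum=2018)
print(fares, groups_read)  # [30.0] 1
```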
  9. Analytics Value Stream
     [Diagram: data, then COLLECT, STORE, PROCESS/ANALYZE, CONSUME, then answers; the span from data to answers is labelled "time to first answer"]
  10. Agile Analytics
       • Experiment
       • Invest in promising experiments
       • Fail fast
       • React quickly
  11. Serverless Analytics
       • Amazon S3: Highly durable object storage
       • AWS Glue: Data catalog and managed ETL
       • Amazon Athena: Serverless interactive SQL queries
       • Amazon QuickSight: Business analytics service
  12. Example: NYC Transportation
     [Diagram components: Raw S3 Data, ETL Job, Canonical Data, Data Catalog, Amazon Athena, Amazon QuickSight; arrow labels: describes, describes, uses]
  13. Three Sources of Raw Data
  14. Use Glue to Crawl and ETL the Source Data
     [Diagram labels, in extraction order: Taxi csv; Limo csv; Taxi ETL Job; 1.6 GB; 94.8 MB; Limo ETL Job; 220.3 MB; 18 MB; Canonical Data (parquet); Data Catalog; use]
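     The idea of mapping differently shaped source files onto one canonical schema can be sketched in plain Python. This is not a Glue job and does not write Parquet; it only shows the column-mapping step. All column names, the canonical field list, and the sample rows are hypothetical.

```python
import csv
import io

# Hypothetical per-source column names mapped onto one canonical schema.
COLUMN_MAPS = {
    "taxi": {"pickup_datetime": "pickup_time", "fare_amount": "fare"},
    "limo": {"Pickup_date": "pickup_time"},
}
CANONICAL_FIELDS = ["vehicle_type", "pickup_time", "fare"]


def to_canonical(source: str, csv_text: str) -> list:
    """Convert one source's CSV text into records with the canonical fields."""
    mapping = COLUMN_MAPS[source]
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {field: None for field in CANONICAL_FIELDS}
        record["vehicle_type"] = source
        for raw_name, canonical_name in mapping.items():
            record[canonical_name] = row.get(raw_name)
        records.append(record)
    return records


# Made-up sample rows; values stay as strings, as read from CSV.
taxi_csv = "pickup_datetime,fare_amount\n2018-04-10 09:00:00,12.5\n"
limo_csv = "Pickup_date\n2018-04-10 10:30:00\n"
canonical = to_canonical("taxi", taxi_csv) + to_canonical("limo", limo_csv)
for record in canonical:
    print(record)
```

     Fields that a source does not provide (here, `fare` for the limo rows) are left as `None`, so every record has the same shape regardless of origin.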
  15. Start Querying with Amazon Athena
       • Run Glue crawler to create canonical table definition
       • Run some simple queries
     [Diagram components: Canonical Data, Data Catalog, Amazon Athena; arrow labels: describes, uses]
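     Running a simple query programmatically might look like the sketch below. It assumes a boto3 Athena client is passed in as `athena`; the database name, table name, and output location in the usage comment are hypothetical, and only the first page of results is read.

```python
import time


def run_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Start an Athena query, wait for it to finish, and return the result rows."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Query {query_id} ended in state {state}")

    # First page of results only; each cell is returned as a string or None.
    results = athena.get_query_results(QueryExecutionId=query_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in results["ResultSet"]["Rows"]
    ]


# Usage (requires AWS credentials; all names below are hypothetical):
#   import boto3
#   rows = run_query(
#       boto3.client("athena"),
#       "SELECT vehicle_type, COUNT(*) FROM canonical GROUP BY vehicle_type",
#       database="nyc_transport",
#       output_location="s3://example-bucket/athena-results/",
#   )
```

     Passing the client in as a parameter keeps the function easy to exercise with a stub in place of the real service.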
  16. Visualise Your Data Lake with Amazon QuickSight
     [Diagram components: Canonical Data, Data Catalog, Amazon QuickSight]
     Collaborate, Share, and Publish
