AWS CLOUD 2017 - Introduction to Fast Data Querying and Processing with Amazon Athena and Glue (김상필, Solutions Architect)



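  1. 1. Introduction to Fast Data Querying and Processing with Amazon Athena and Glue. 김상필, Solutions Architect
  2. 2. Agenda • Introduction to Amazon Athena, a serverless interactive query service • Introduction to AWS Glue, a fully managed ETL service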
  3. 3. AWS big data analytics architecture: Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers & insights
  4. 4. AWS services across the Collect, Store, and Analyze stages: AWS Data Pipeline, AWS Database Migration Service, Amazon EMR, Amazon Glacier, Amazon S3, Amazon Kinesis, AWS Direct Connect, Amazon Machine Learning, Amazon Redshift, Amazon DynamoDB, AWS IoT, AWS Snowball, Amazon QuickSight, Amazon Athena, Amazon EC2, Amazon Elasticsearch Service, AWS Lambda, AWS Glue
  5. 5. Introduction to Amazon Athena
  6. 6. Existing challenges • A significant amount of work is required to analyze data in Amazon S3 • Users often only have access to aggregated data sets • Managing a Hadoop cluster or data warehouse requires expertise
  7. 7. What is Amazon Athena? Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL
  8. 8. Amazon Athena features. Serverless • No infrastructure or administration • Zero spin-up time • Transparent upgrades. Highly available • Connect to a service endpoint or log in to the console • Uses warm compute pools across multiple AZs • Your data is in Amazon S3. Easy to use • Log in to the console • Create a table: type in a Hive DDL statement, or use the console Add Table wizard • Start querying
  9. 9. Query data in Amazon S3 directly • No loading of data • Query data in its raw format • Text, CSV, JSON, weblogs, AWS service logs • Convert to an optimized format like ORC or Parquet for the best performance and lowest cost • No ETL required • Stream data directly from Amazon S3 • Take advantage of Amazon S3 durability and availability
  10. 10. Use ANSI SQL • Start writing ANSI SQL • Support for complex joins, nested queries & window functions • Support for complex data types (arrays, structs) • Support for partitioning of data by any key (date, time, custom keys), e.g., Year, Month, Day, Hour or Customer Key, Date
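The partitioning bullet above can be made concrete with a small sketch. It assumes the Hive-style `key=value` folder layout and a hypothetical bucket name (`example-bucket`); it only illustrates how a date filter narrows the set of S3 prefixes that need to be read, and is not Athena's own pruning logic.

```python
from datetime import date, timedelta
from typing import List


def partition_prefix(base: str, day: date) -> str:
    """Build a Hive-style partition prefix, e.g. .../year=2017/month=03/day=05/."""
    return f"{base}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


def prefixes_for_range(base: str, start: date, end: date) -> List[str]:
    """List the partition prefixes covered by the inclusive date range [start, end]."""
    days = (end - start).days
    return [partition_prefix(base, start + timedelta(days=i)) for i in range(days + 1)]


# Hypothetical bucket and path, used only for illustration.
BASE = "s3://example-bucket/logs"
needed = prefixes_for_range(BASE, date(2017, 3, 4), date(2017, 3, 6))
for prefix in needed:
    print(prefix)
```

A query filtered to these three days would only need the three listed prefixes instead of the whole `logs/` tree, which is why partitioning reduces the amount of data scanned.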
  11. 11. Built on familiar technologies. Presto: used for SQL queries • In-memory distributed query engine • ANSI-SQL compatible with extensions. Hive: used for DDL functionality • Complex data types • Multitude of formats • Supports data partitioning
  12. 12. Data formats supported by Amazon Athena • Text files, e.g., CSV, raw logs • Apache web logs, TSV files • JSON (simple, nested) • Compressed files • Columnar formats such as Apache Parquet & Apache ORC • Avro support: coming soon
  13. 13. Amazon Athena is fast • Tuned for performance • Automatically parallelizes queries • Results are streamed to the console • Results are also stored in S3 • Improve query performance • Compress your data • Use columnar formats
  14. 14. Amazon Athena is cost-effective • Pay per query • $5 per TB scanned from S3 • DDL queries and failed queries are free • Save by using compression, columnar formats, and partitions
  15. 15. Example data analytics pipeline
  16. 16. Example data analytics pipeline: ad-hoc access to raw data using SQL
  17. 17. Example data analytics pipeline: ad-hoc access to data using Athena. Athena can query aggregated datasets as well
  18. 18. How the existing challenges are addressed • Significant amount of work required to analyze data in Amazon S3 → No ETL required. No loading of data. Query data where it lives • Users often only have access to aggregated data sets → Query data at whatever granularity you want • Managing a Hadoop cluster or data warehouse requires expertise → No infrastructure to manage
  19. 19. Accessing Amazon Athena
  20. 20. Simple Query editor with key bindings
  21. 21. Autocomplete functionality
  22. 22. Catalog
  23. 23. Tables and columns
  24. 24. Can also see a detailed view in the catalog tab
  25. 25. You can also check the properties. Note the location.
  26. 26. JDBC driver support
  27. 27. Athena access through Amazon QuickSight. QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises sources, including Amazon Athena, Amazon RDS, Amazon S3, and Amazon Redshift
  28. 28. Creating tables and querying data
  29. 29. Creating tables • Create Table statements (DDL) are written in Hive • High degree of flexibility • Schema on read • Hive is SQL-like but allows other concepts such as "external tables" and partitioning of data • Data formats supported: JSON, TXT, CSV, TSV, Parquet and ORC (via SerDes) • Data is stored in Amazon S3 • Metadata is stored in a metadata store
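As a rough illustration of the table-creation slide, the sketch below assembles a Hive-style `CREATE EXTERNAL TABLE` statement for delimited text data. The table name, columns, and S3 location are hypothetical, and the helper function is not part of any AWS SDK; it only builds the DDL text that one could paste into the Athena query editor.

```python
from typing import List, Tuple

Column = Tuple[str, str]  # (column name, Hive type)


def create_external_table_ddl(
    table: str,
    columns: List[Column],
    partitions: List[Column],
    location: str,
    delimiter: str = ",",
) -> str:
    """Return Hive DDL for an external, partitioned table over delimited text files."""
    cols = ",\n".join(f"  {name} {hive_type}" for name, hive_type in columns)
    parts = ", ".join(f"{name} {hive_type}" for name, hive_type in partitions)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n{cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table definition, for illustration only.
ddl = create_external_table_ddl(
    table="weblogs",
    columns=[("request_time", "string"), ("url", "string"), ("status", "int")],
    partitions=[("year", "string"), ("month", "string")],
    location="s3://example-bucket/weblogs/",
)
print(ddl)
```

Because the table is external and the schema is applied on read, the statement only records metadata; the files under the S3 location are left untouched.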
  30. 30. Athena's internal metadata store • Stores metadata • Table definitions, column names, partitions • Highly available and durable • Requires no management • Access via DDL statements • Similar to a Hive Metastore
  31. 31. Running a simple query: run time and data scanned
  32. 32. Apache Parquet and Apache ORC: columnar formats. Parquet • Columnar format • Schema segregated into footer • Column-major format • All data is pushed to the leaf • Integrated compression and indexes • Support for predicate pushdown. ORC • Apache top-level project • Schema segregated into footer • Column-major with stripes • Integrated compression, indexes, and stats • Support for predicate pushdown
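A toy model may help show why a column-major layout reduces the data a query has to read. The sketch below uses plain JSON strings as stand-ins for files; it is not the Parquet or ORC format, and it ignores the compression, indexes, and statistics that the real formats add on top.

```python
import json
from typing import Dict, List

# Hypothetical log records, for illustration only.
rows = [
    {"ts": "2017-03-05T10:00:00", "url": "/index.html", "status": 200, "bytes": 5120},
    {"ts": "2017-03-05T10:00:01", "url": "/about.html", "status": 404, "bytes": 312},
    {"ts": "2017-03-05T10:00:02", "url": "/index.html", "status": 200, "bytes": 5120},
]

# Row-major layout: every record is stored whole, one JSON document per line.
row_layout: List[str] = [json.dumps(r) for r in rows]

# Column-major layout: all values of one column are stored together.
column_layout: Dict[str, str] = {
    name: json.dumps([r[name] for r in rows]) for name in rows[0]
}


def bytes_read_row_major() -> int:
    """Any query must read every full record, whichever columns it needs."""
    return sum(len(line.encode("utf-8")) for line in row_layout)


def bytes_read_column_major(columns: List[str]) -> int:
    """A query reads only the columns it references."""
    return sum(len(column_layout[c].encode("utf-8")) for c in columns)


print("row-major bytes:", bytes_read_row_major())
print("column-major bytes (status only):", bytes_read_column_major(["status"]))
```

A query such as counting requests by `status` touches one column out of four, so the column-major read is a fraction of the row-major one even before compression is applied.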
  33. 33. Cost per query: $5 per TB scanned • Pay by the amount of data scanned per query • Ways to save costs: compress, convert to a columnar format, use partitioning • Free: DDL queries, failed queries

      Dataset                                 Size on Amazon S3       Query run time   Data scanned            Cost
      Logs stored as text files               1 TB                    237 seconds      1.15 TB                 $5.75
      Logs stored in Apache Parquet format*   130 GB                  5.13 seconds     2.69 GB                 $0.013
      Savings                                 87% less with Parquet   34x faster       99% less data scanned   99.7% cheaper
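The cost figures on this slide follow from the $5 per TB rate. The sketch below assumes binary units (1 TB = 1024 GB) and models only the per-TB rate; any rounding or minimum-charge rules in actual billing are not modeled.

```python
PRICE_PER_TB_USD = 5.0  # rate quoted on the slide
GB = 1024 ** 3          # assumption: binary units
TB = 1024 ** 4


def query_cost_usd(bytes_scanned: float) -> float:
    """Cost of one query, charged purely by the amount of data scanned."""
    return bytes_scanned / TB * PRICE_PER_TB_USD


text_cost = query_cost_usd(1.15 * TB)     # logs stored as text files
parquet_cost = query_cost_usd(2.69 * GB)  # same logs stored as Parquet

print(f"text files: ${text_cost:.2f}")
print(f"Parquet:    ${parquet_cost:.3f}")
```

With these inputs the function gives $5.75 for the text scan and about $0.013 for the Parquet scan, matching the Cost column of the slide.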
  34. 34. Athena complements Amazon Redshift and Amazon EMR (diagram: Amazon S3, EMR, Athena, QuickSight, Redshift)
  35. 35. AWS Glue, a fully managed ETL service
  36. 36. AWS has many ETL partners, such as Fivetran … but in practice there is more hand-written code than tooling
  37. 37. ETL is the most time-consuming part of analytics: ETL (70% of time spent here) → Data warehousing (Amazon Redshift) → Business intelligence (Amazon QuickSight)
  38. 38. The resulting data gap (chart of data volume, 1990 to 2020): the gap between data generated and data available for analysis
  39. 39. Glue automates ETL work ✓ Cataloging data sources ✓ Identifying data formats and data types ✓ Generating Extract, Transform, Load code ✓ Executing ETL jobs; managing dependencies ✓ Handling errors ✓ Managing and scaling resources
  40. 40. AWS Glue components. Data Catalog • Hive metastore-compatible metadata repository of data sources • Crawls data sources to infer table, data type, and partition format. Job Execution • Runs jobs in Spark containers, with automatic scaling based on SLA • Serverless: only pay for the resources you consume. Job Authoring • Generates Python code to move data from source to destination • Edit with your favorite IDE; share code snippets using Git
  41. 41. Glue Data Catalog: discover and organize your data sets
  42. 42. Glue Data Catalog. Manage table metadata through a Hive metastore API or Hive SQL. Supported by tools such as Hive, Presto, Spark, etc. We added a few extensions: • Search metadata for data discovery • Connection info: JDBC URLs, credentials • Classification for identifying and parsing files • Versioning of table metadata as schemas evolve and other metadata are updated. Populate using Hive DDL, bulk import, or automatically through crawlers.
  43. 43. Crawlers: automatic population of the Data Catalog. Automatic schema inference: • Built-in classifiers detect file type and extract schema: record structure and data types. • Add your own or share with others in the Glue community. It's all Grok and Python. Auto-detects Hive-style partitions, grouping similar files into one table. Run crawlers on a schedule to discover new data and schema changes. Serverless: only pay when crawls run.
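To illustrate what "auto-detects Hive-style partitions, grouping similar files into one table" could look like, here is a minimal sketch that parses `key=value` path segments from object keys. The keys are hypothetical and the logic is a simplification written for this transcript, not the crawler's actual implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_partitions(key: str) -> Tuple[str, Dict[str, str]]:
    """Split an object key into (table prefix, partition values)."""
    table_parts: List[str] = []
    partitions: Dict[str, str] = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            name, value = segment.split("=", 1)
            partitions[name] = value
        elif not partitions:
            # Segments before the first key=value pair form the table prefix.
            table_parts.append(segment)
    return "/".join(table_parts), partitions


def group_into_tables(keys: List[str]) -> Dict[str, List[Dict[str, str]]]:
    """Group files that share a prefix into one table with many partitions."""
    tables: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for key in keys:
        table, partitions = split_partitions(key)
        tables[table].append(partitions)
    return dict(tables)


# Hypothetical object keys, for illustration only.
keys = [
    "logs/year=2017/month=03/part-0000.csv",
    "logs/year=2017/month=04/part-0000.csv",
    "sales/region=kr/data.json",
]
catalog = group_into_tables(keys)
for table, partitions in catalog.items():
    print(table, partitions)
```

The two `logs/` files end up as two partitions of a single table rather than as two separate tables, which is the grouping behavior the slide describes.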
  44. 44. Authoring jobs in Glue: make ETL job authoring like code development, using your own tools
  45. 45. Automatic code generation. 1. Pick sources and targets from the data catalog 2. Glue generates a transformation graph and Python code 3. Specify a trigger condition, e.g., every Friday at 3PM GMT. Example graph: Source table @ Amazon S3 → Transform: Relationalize → Transform: Filter table → two target tables @ Amazon Redshift
  46. 46. Flexibility of Glue ETL scripts • Human-readable code run on a scalable platform, PySpark • Forgiving in the face of failures: handles bad data and crashes • Flexible: handles complex semi-structured data, and adapts to source schema changes
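The Relationalize and Filter steps named in the generated transformation graph can be approximated in plain Python to show the idea. This is a sketch only: it does not use the Glue or PySpark APIs, the records are hypothetical, and it flattens nested objects into dotted column names without handling arrays.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Record = Dict[str, Any]


def flatten(record: Record, parent: str = "") -> Record:
    """Flatten nested objects into dotted column names (arrays are left as-is)."""
    flat: Record = {}
    for key, value in record.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def run_pipeline(records: Iterable[Record], keep: Callable[[Record], bool]) -> Iterator[Record]:
    """Source -> flatten (relationalize-like step) -> filter -> target rows."""
    for record in records:
        row = flatten(record)
        if keep(row):
            yield row


# Hypothetical semi-structured source records.
source = [
    {"user": {"id": 1, "geo": {"country": "KR"}}, "event": "click"},
    {"user": {"id": 2, "geo": {"country": "US"}}, "event": "view"},
]
target = list(run_pipeline(source, keep=lambda row: row["event"] == "click"))
print(target)
```

Flattening first gives the filter, and a relational target such as Amazon Redshift, simple column names to work with instead of nested structures.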
  47. 47. Git integration. Glue integrates job authoring and execution with your preferred Git services (e.g., AWS CodeCommit). Push job code to your Git repository; Glue automatically pulls the latest on job invocation. Customize ETL jobs in your favorite IDE, with no need to learn new tools. No need to start from scratch.
  48. 48. Orchestration & resource management: fully managed, serverless job execution
  49. 49. Job composition and triggers. Compose jobs globally with event-based dependencies • Easy to reuse and leverage work across organization boundaries. Multiple triggering mechanisms • Schedule-based: e.g., time of day • Event-based: e.g., data availability, job completion • External sources: e.g., AWS Lambda. Example diagram: ad-click logs and weekly sales feed the Marketing (ad-spend by customer segment), Sales (revenue by customer segment), and Central (ROI by customer segment) jobs, using schedule triggers and data-based triggers (e.g., >10 MB of new data)
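The triggering mechanisms on this slide can be sketched as two small checks: a data-based trigger that fires once enough new data has arrived, and a dependency rule that releases a job when the jobs it depends on have completed. The job names and the dependency of the Central job on the Marketing and Sales jobs are assumptions read from the diagram labels; none of this is the Glue API.

```python
from dataclasses import dataclass, field
from typing import List, Set

MB = 1024 * 1024


@dataclass
class Job:
    name: str
    depends_on: Set[str] = field(default_factory=set)


def data_trigger(new_bytes: int, threshold_bytes: int = 10 * MB) -> bool:
    """Data-based trigger: fire once more than the threshold of new data exists."""
    return new_bytes > threshold_bytes


def runnable_jobs(jobs: List[Job], completed: Set[str]) -> List[str]:
    """Event-based trigger: a job may run when all of its dependencies completed."""
    return [
        job.name
        for job in jobs
        if job.name not in completed and job.depends_on <= completed
    ]


# Assumed job graph: Central (ROI) waits for Marketing and Sales.
jobs = [
    Job("marketing"),
    Job("sales"),
    Job("central", depends_on={"marketing", "sales"}),
]

print(data_trigger(12 * MB))                      # enough new ad-click logs
print(runnable_jobs(jobs, set()))                 # nothing completed yet
print(runnable_jobs(jobs, {"marketing", "sales"}))
```

A schedule-based trigger would sit alongside these checks and simply compare the current time against a configured schedule.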
  50. 50. Dynamic orchestration. Example: dynamic number of jobs based on application type and number of message types. Click logs from Application #1 (3 different message types), Application #2 (5 different message types), and Application #3 (4 different message types) … are split by message type, and each message type is then summarized. • Add jobs dynamically as the graph unfolds, which makes data-dependent orchestration possible • Glue provides fault-tolerant orchestration: retries on job failure • Monitoring and metrics: job run history and event tracking for debugging
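The two ideas on this slide, a job count that depends on the data and retries on failure, can be sketched as follows. The application and message-type names are hypothetical, and the retry helper is a generic pattern written for illustration, not Glue's orchestration engine.

```python
import time
from typing import Callable, Dict, List


def plan_summary_jobs(message_types_by_app: Dict[str, List[str]]) -> List[str]:
    """Create one summarize job per (application, message type) pair."""
    return [
        f"summarize:{app}:{message_type}"
        for app, message_types in message_types_by_app.items()
        for message_type in message_types
    ]


def run_with_retries(job: Callable[[], str], max_attempts: int = 3,
                     delay_seconds: float = 0.0) -> str:
    """Run a job, retrying on failure up to max_attempts times."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as error:  # a real system would catch narrower errors
            last_error = error
            print(f"attempt {attempt} failed: {error}")
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    raise RuntimeError(f"job failed after {max_attempts} attempts") from last_error


# Message-type counts per application, as on the slide (3, 5, and 4).
apps = {
    "application-1": [f"type-{i}" for i in range(3)],
    "application-2": [f"type-{i}" for i in range(5)],
    "application-3": [f"type-{i}" for i in range(4)],
}
planned = plan_summary_jobs(apps)
print(len(planned), "summarize jobs planned")

attempts = {"count": 0}


def flaky_job() -> str:
    """Hypothetical job that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise IOError("transient failure")
    return "done"


result = run_with_retries(flaky_job)
print(result, "after", attempts["count"], "attempts")
```

The number of planned jobs is only known after the logs have been split by message type, which is what makes the orchestration data-dependent.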
  51. 51. Serverless job execution • Warm pools: pre-configured fleets of instances to reduce job startup time • Auto-configure VPC and role-based access • Automatically scale resources to meet SLA and cost objectives • You pay only for the resources you consume, while consuming them. There is no need to provision, configure, or manage servers. (Diagram: warm pool of instances, customer VPCs)
  52. 52. Sign up for the Glue preview. So that's the basics of what we are doing. You can sign up for a preview at We should start adding people soon.
  53. 53. Thank you