Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019

Organisations are increasingly gaining insight and knowledge from a number of IoT, API, clickstream, unstructured, and log data sources. Learn how AWS Glue makes it easy to build and manage enterprise-grade data pipelines to ingest, clean, transform, and automatically catalogue data, which enables a variety of use cases such as ad-hoc analytics, data warehousing, big data analysis, and machine learning. Also, find out how to integrate an end-to-end CI/CD pipeline to automate the release management process for your serverless data pipelines.

  1. 1. SUMMIT SYDNEY
  2. 2. Building Serverless Analytics Pipelines with AWS Glue. Tom McMeekin, Solutions Architect, Amazon Web Services; Drew Paterson, Solutions Architect, Amazon Web Services. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  3. 3. There are more people accessing data, and more requirements for making data available
  4. 4. Data Engineering; Data stewardship; Data pipelines; Data structures; Data lakes; Extract, Transform, Load; Data modelling; Data marts; Data warehouse
  5. 5. AWS Glue: serverless data catalogue and ETL service
  7. 7. AWS Glue Crawlers: automatically build your Data Catalogue and keep it in sync; built-in classifiers, or custom classifiers using Grok expressions; run ad hoc or on a schedule; serverless. Diagram labels: OLTP; ERP; CRM; LOB; Devices; Web; Sensors; Social; Amazon S3 Data Lake Storage; AWS Glue Data Catalogue.
  8. 8. AWS Glue Data Catalogue: search metadata for data discovery; single view across all users, accounts, and workloads. Diagram labels: Amazon S3 Data Lake Storage; AWS Glue Data Catalogue; Amazon Athena; Amazon Redshift; Amazon EMR; Amazon QuickSight; Amazon SageMaker.
  10. 10. Use AWS Glue to cleanse, prep, and move: serverless Apache Spark or Python environment; auto-generate, write, or bring your own Python or Scala code. Diagram labels: Amazon S3 (Raw data); Amazon S3 (Staging data); Amazon S3 (Processed data); AWS Glue Data Catalogue.
  11. 11. Apache Spark and AWS Glue ETL: Apache Spark is a distributed data processing engine for complex analytics; AWS Glue builds on Apache Spark to offer ETL-specific functionality. Diagram labels: Apache Spark Core: RDDs; Apache Spark DataFrames; AWS Glue DynamicFrame; Apache SparkSQL; AWS Glue ETL.
  12. 12. DataFrames and DynamicFrames. DataFrames: core data structure for SparkSQL; like structured tables; need schema up-front; each row has same structure; suited for SQL-like analytics. DynamicFrames: like DataFrames for ETL; designed for processing semi-structured data, e.g. JSON, Avro, Apache logs ...
  13. 13. Developer Endpoints / Notebooks: connect your IDE to an AWS Glue development endpoint; environment to interactively develop, debug, and test ETL code. Diagram labels: Raw Dataset; Amazon SageMaker Notebook; Optimised Dataset; AWS Glue Data Catalogue.
  15. 15. AWS Glue: Job Execution - Serverless. There is no need to provision, configure, or manage servers • Specify the capacity that gets allocated to each job • Pay only for the resources you consume • Auto-configure VPC and role-based access • Connect to on-premises JDBC data stores as source. Diagram labels: VPC; Amazon RDS; AWS Glue; Corporate data center; Database; AWS Direct Connect.
  16. 16. Three ways to orchestrate an AWS Glue ETL pipeline • Schedule-driven • Event-driven • State machine–driven
  17. 17. Schedule-driven: work backwards from a daily SLA deadline. Timeline labels: Crawl raw dataset; Run ‘optimise’ job; Crawl optimised dataset; SLA deadline; Ready for reporting.
  18. 18. Event-driven: let Amazon CloudWatch Events and AWS Lambda drive the pipeline. Timeline labels: Crawl raw dataset; Run ‘optimise’ job; Crawl optimised dataset; SLA deadline; Ready for reporting.
  19. 19. State machine–driven: let AWS Step Functions drive the pipeline
  21. 21. Data Engineering; DevOps; CI/CD; Canary deployments; Feature flags; Chaos engineering; Configuration management
  22. 22. CI/CD for AWS Glue ETL • Help Data Engineers write quality code • Automate the ETL job release management process • Mitigate risk. Diagram labels: AWS CodePipeline.
  23. 23. CI/CD for AWS Glue ETL. Diagram labels: AWS CodePipeline; AWS CodeCommit; pipe_line_template.yaml; etl_job.py; live_test.py.
  24. 24. CI/CD for AWS Glue ETL. Diagram labels: AWS CodePipeline; AWS CodeCommit; AWS CloudFormation; Amazon S3 (Raw data); Amazon S3 (Test data); pipe_line_template.yaml; etl_job.py; Role.
  25. 25. CI/CD for AWS Glue ETL. Diagram labels: AWS CodeCommit; AWS CodeBuild; AWS CloudFormation; Amazon S3 (Raw data); Amazon S3 (Test data); AWS Glue Data Catalogue; live_test.py.
  26. 26. CI/CD for AWS Glue ETL. Diagram labels: AWS CodePipeline; AWS CodeCommit; AWS CodeBuild; AWS CloudFormation; Amazon Athena; Amazon S3 (Data Lake); Amazon S3 (Test Data). Queries: SELECT count(*) FROM "sales"."data_lake"; SELECT count(*) FROM "sales_parquet"."test_data"; ✓
  27. 27. CI/CD for AWS Glue ETL. Diagram labels: AWS CodePipeline; AWS CodeCommit; AWS CodeBuild; AWS CloudFormation (×2); Amazon S3 (Raw data); Amazon S3 (Prd data); pipe_line_template.yaml; etl_job.py; Role.
  28. 28. Go learn • Remember the three steps to build a serverless data pipeline • Use AWS Glue features • Leverage the breadth of the AWS Platform • Scan your badge to receive links to learning resources
  29. 29. Thank you! Tom McMeekin, Drew Paterson
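
Slide 17 suggests working backwards from a daily SLA deadline. The arithmetic can be sketched roughly as below. The step names, the durations, and the 06:00 UTC deadline are hypothetical values chosen for illustration, not figures from the talk. The output uses the six-field `cron(...)` syntax that AWS Glue schedules accept, evaluated in UTC.

```python
from datetime import datetime, timedelta


def backward_schedule(deadline, steps):
    """Return [(step_name, start_time)] so that the last step ends at `deadline`.

    `steps` is an ordered list of (name, expected_duration) pairs.
    """
    schedule = []
    finish = deadline
    # Walk the pipeline from the last step to the first, subtracting durations.
    for name, duration in reversed(steps):
        start = finish - duration
        schedule.append((name, start))
        finish = start
    schedule.reverse()
    return schedule


def glue_cron(start):
    """Format a daily schedule expression: minutes, hours, day, month, weekday, year."""
    return f"cron({start.minute} {start.hour} * * ? *)"


# Hypothetical durations; in practice these would include a safety margin.
steps = [
    ("crawl-raw", timedelta(minutes=15)),
    ("optimise-job", timedelta(minutes=60)),
    ("crawl-optimised", timedelta(minutes=15)),
]
deadline = datetime(2019, 5, 1, 6, 0)  # assumed SLA: ready for reporting at 06:00 UTC

plan = backward_schedule(deadline, steps)
for name, start in plan:
    print(name, glue_cron(start))
```

With these assumed durations, the raw crawl starts at 04:30 UTC, the job at 04:45 UTC, and the final crawl at 05:45 UTC. A fixed schedule like this only holds while each step finishes within its allotted time, which is one reason the deck also presents event-driven and state machine-driven alternatives.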

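Slide 18 proposes letting Amazon CloudWatch Events and AWS Lambda drive the pipeline. One possible shape for such a Lambda function is sketched below, assuming the function is subscribed to AWS Glue crawler and job state-change events. The crawler and job names and the `PIPELINE` table are hypothetical, and the event field names should be checked against the current AWS Glue event documentation before use.

```python
def next_action(event, pipeline):
    """Map a Glue state-change event to the next pipeline step, or None."""
    if event.get("source") != "aws.glue":
        return None
    kind = event.get("detail-type")
    detail = event.get("detail", {})
    if kind == "Glue Crawler State Change" and detail.get("state") == "Succeeded":
        return pipeline.get(("crawler", detail.get("crawlerName")))
    if kind == "Glue Job State Change" and detail.get("state") == "SUCCEEDED":
        return pipeline.get(("job", detail.get("jobName")))
    return None


# Hypothetical wiring: what finished -> what to start next.
PIPELINE = {
    ("crawler", "raw-crawler"): ("start_job_run", "optimise-job"),
    ("job", "optimise-job"): ("start_crawler", "optimised-crawler"),
}


def lambda_handler(event, context):
    """Lambda entry point: start the next crawler or job, if there is one."""
    action = next_action(event, PIPELINE)
    if action is None:
        return {"started": None}
    import boto3  # imported lazily; provided by the Lambda Python runtime

    glue = boto3.client("glue")
    operation, name = action
    if operation == "start_job_run":
        glue.start_job_run(JobName=name)
    else:
        glue.start_crawler(Name=name)
    return {"started": name}
```

Keeping the routing in a pure function (`next_action`) means the decision logic can be unit-tested without AWS credentials; only `lambda_handler` touches the Glue API.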
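
Slide 26 shows two Amazon Athena row-count queries with a check mark, which suggests the live test compares the source table with the ETL output. The slides do not include the contents of `live_test.py`, so the following is only a guess at what such a check could look like. The function names, the polling interval, and the results location are assumptions; the two SQL statements are taken from the slide.

```python
import time

SOURCE_SQL = 'SELECT count(*) FROM "sales"."data_lake";'
TARGET_SQL = 'SELECT count(*) FROM "sales_parquet"."test_data";'


def counts_match(count_rows, source_sql=SOURCE_SQL, target_sql=TARGET_SQL):
    """Return True when both queries report the same number of rows.

    `count_rows` is any callable that takes SQL text and returns an int.
    """
    return count_rows(source_sql) == count_rows(target_sql)


def athena_count(sql, output_location, poll_seconds=2):
    """Run a count(*) query on Athena and return the result as an int."""
    import boto3  # imported lazily so the comparison logic is testable offline

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    # Poll until the query reaches a terminal state.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended as {state}")
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # rows[0] holds the column header; rows[1] holds the count itself.
    return int(rows[1]["Data"][0]["VarCharValue"])
```

In a build step this might be called as `counts_match(lambda sql: athena_count(sql, "s3://example-bucket/athena-results/"))`, where the bucket name is a placeholder. A matching row count is a coarse check: it would catch dropped or duplicated records but not incorrectly transformed values.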