Connecting Your Data Analytics Pipeline


Organisations involved in Big Data and Analytics spend a lot of time preparing data for analysis, which often involves large-scale data movement and transformation. In this session we will explore AWS Glue, a new service designed to help you catalogue and transform your data and schedule the jobs in your data pipeline.

Speaker: Cassandra Bonner, Solutions Architect, Amazon Web Services

Published in: Technology

  1. Connecting your Data Analytics Pipeline. Cassandra Bonner, Solutions Architect, Amazon Web Services. Level 200. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. Data Analytics Landscape. [Diagram labels: Sources (Flat Files, Amazon S3, Amazon RDS, On Prem DB); Ingest; Speed (Near real-time); Scale (Batch); AWS Glue; Serving; Amazon Redshift; Data analysts; Data scientists; Business users; Engagement platforms; Automation / events]
  3. We Have Lots of ETL Partners. Amazon Redshift Partner Page for Data Integration
  4. Why Would Anyone Hand-code? Brittle. Error-prone. Laborious.
  5. You Also Need to Maintain this Code ► As data formats change ► As target schemas change ► As you add sources ► As data volume grows
  6. Code is Flexible. Code is Powerful. You can unit test. You can deploy with other code. You know your dev tools.
  7. ETL is the Most Time-consuming Part of Analytics. Stages: ETL (large % of time spent here); Data Warehousing (Amazon Redshift); Business Intelligence (Amazon QuickSight)
  8. The Data Gap. This Leads to Dark Data. [Chart: Data Volume and Generated Data, 1990 to 2020] Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Centre Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"
  9. Glue Automates The Undifferentiated Heavy-lifting Of ETL
  10. AWS Glue: High-Level View
  11. AWS Glue: Components
      Data Catalog
        • Hive metastore compatible metadata repository of data sources.
        • Crawls data source to infer table, data type, partition format.
      Job Execution
        • Runs jobs in Spark containers, with automatic scaling based on SLA.
        • Glue is serverless - only pay for the resources you consume.
      Job Authoring
        • Generates Python code to move data from source to destination.
        • Edit with your favourite IDE; share code snippets using Git.
  12. Glue Data Catalog. Discover and organise your data sets
  13. Glue Data Catalog. We added a few extensions:
        • Search metadata
        • Connection info
        • Classification
        • Versioning
  14. Crawlers: Populate Data Catalog. [Diagram, steps numbered 1 to 5. Labels: Crawler; Custom Classifiers; Built-in Classifiers; Connection; Amazon S3, Amazon RDS, Amazon Redshift, JDBC Data Stores]
  16. Job Authoring in Glue. Make ETL job authoring like code development using your own tools
  17. Authoring Jobs. [Diagram, steps numbered 1 to 6. Labels: Source; Target; Transform Data; Triggers; Generates; PySpark script]
  18. Automatic Code Generation. Domain Driven Design, Eric Evans
  19. Orchestration & Resource Management. Fully managed, serverless job execution
  20. Running & Monitoring Jobs. [Diagram, steps numbered 1 to 5. Data Sources: JDBC, Amazon S3, Amazon RDS, Amazon Redshift. Data Targets: JDBC, Amazon S3, Amazon RDS, Amazon Redshift. Labels: Triggers; Runs; Job; Extracts data; Transforms; Loads data]
  22. Sign Up For Glue Preview. You can sign up for a preview at
  23. Thank you!
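
The crawler and job flow described on slides 11, 14 and 20 can be pictured with a small, self-contained Python sketch. It is not Glue's API and not the code Glue generates: the functions `infer_type`, `crawl_csv` and `run_job`, the in-memory `catalog` dictionary, the type names and the sample data are all hypothetical stand-ins, assumed here only to show the idea of inferring a table schema into a catalog and then using that schema to extract, transform and load rows.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest type name that fits every non-empty value.

    Hypothetical stand-in for a classifier; not how Glue classifies data.
    """
    for name, cast in (("int", int), ("double", float)):
        try:
            for value in values:
                if value != "":
                    cast(value)
            return name
        except ValueError:
            continue
    return "string"


def crawl_csv(table_name, text, catalog):
    """Toy 'crawler': infer column types from CSV text and record them."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = {
        col: infer_type([row[col] for row in rows])
        for col in reader.fieldnames
    }
    catalog[table_name] = {"classification": "csv", "columns": columns}
    return rows


def run_job(rows, schema, mapping):
    """Toy 'job': extract rows, cast and rename columns, return loaded rows.

    `mapping` maps source column names to target column names; columns
    that are not in the mapping are dropped.
    """
    casts = {"int": int, "double": float, "string": str}
    loaded = []
    for row in rows:
        loaded.append({
            target: None if row[source] == "" else casts[schema[source]](row[source])
            for source, target in mapping.items()
        })
    return loaded


# Made-up sample data standing in for a flat file in a data store.
SAMPLE = "order_id,amount,country\n1,19.99,AU\n2,5.00,NZ\n"

catalog = {}
rows = crawl_csv("orders", SAMPLE, catalog)
loaded = run_job(
    rows,
    catalog["orders"]["columns"],
    {"order_id": "id", "amount": "total"},
)
print(catalog["orders"]["columns"])
print(loaded)
```

Keeping the inferred schema in a catalog, separate from the job, mirrors the split between the Data Catalog and Job Execution components on slide 11: the job only needs a source-to-target column mapping and can look up the types.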