AWS re:Invent 2016: Billions of Rows Transformed in Record Time Using Matillion ETL for Amazon Redshift (BDA203)



GE Power & Water develops advanced technologies to help solve some of the world’s most complex challenges related to water availability and quality. They had amassed billions of rows of data in on-premises databases, but decided to migrate some of their core big data projects to the AWS Cloud. When they decided to transform and store it all in Amazon Redshift, they knew they needed an ETL/ELT tool that could handle this enormous amount of data and safely deliver it to its destination. In this session, Ryan Oattes, Enterprise Architect at GE Water, shares his use case, requirements, outcomes, and lessons learned. He also shares the details of his solution stack, including Amazon Redshift and Matillion ETL for Amazon Redshift in AWS Marketplace. You learn best practices for Amazon Redshift ETL supporting enterprise analytics and big data requirements, simply and at scale. You learn how to simplify data loading, transformation, and orchestration on Amazon Redshift, and how to build out a real data pipeline. Get the insights to deliver your big data project in record time.


  1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Ryan Oattes, Enterprise Architect, GE Adam Gantt, Solution Architect, Matillion November 29, 2016 BDA203 How GE Transformed Billions of Rows in Record Time Using Matillion ETL for Amazon Redshift
  2. Speakers Ryan Oattes, Enterprise Architect, GE Digital Adam Gantt, Solution Architect, Matillion
  3. What to Expect from the Session • Understand GE’s challenge and use case • Explore the technical architecture: Amazon Redshift and Matillion ETL • ELT approach • The role of AWS Marketplace • Lessons learned and tips • Deep dive technical demo of Matillion ETL for Amazon Redshift • Share experiences, benefits and lessons learned • Technical Q&A
  4. Background and Requirement
  5. GE, our company and requirement • General Electric has been in business for over 140 years, investing $5.4B annually in R&D (6% of revenue) • Augmenting our operational technology depth with digital • Focus on the industrial Internet of Things, and creating insights based on business and machine data • Knew we needed best-in-class partners to let us focus on what we do best; GE is migrating 9,000 workloads into AWS
  6. Our Challenge • Raise the bar on data warehouse scalability, integration, stability, and development velocity • Needed scalability for machine and business data, as GE increasingly digitizes • Self-serve BI strategy meant we had to maintain our current compute capabilities • Increasingly critical dependencies require a rock-solid platform • Desired a more intuitive and accessible analytics solution
  7. Technical Journey • Had an on-premises, in-memory columnar database • Good, but hard to scale • We’re getting out of the business of managing infrastructure • Selected Amazon Redshift to replace the on-premises solution • Fully managed and scalable • Familiar SQL technology • Our previous ETL technology was a traditional enterprise ETL platform • Tried using it for our Amazon Redshift project, but it wasn’t working for us • Lots of manual SQL coding required • Deployment, management, scaling, etc. were hard, as it wasn’t cloud-first • Wanted a “cloud-first” solution, ideally built for Amazon Redshift • An AWS SA recommended we look at Matillion • AWS Marketplace allowed us to PoC Matillion quickly and cost-effectively
  8. Solution Architecture • Source data: SAP • Data warehouse: Amazon Redshift (32 x DC1 nodes) • Data integration: Matillion ETL for Amazon Redshift (m3.large) and HVR • Data visualisation: Tableau • Flow: SAP → CDC data replication (HVR) → staging on the Amazon Redshift cluster → Matillion ELT → DWH → Tableau
  9. ELT Approach • Amazon Redshift’s MPP columnar architecture is very fast at transforming data • Matillion ETL uses a push-down architecture, transforming (join/aggregate/filter/calculate, etc.) data on the Amazon Redshift cluster directly • This simplifies our infrastructure and scaling, and the speed and proximity to the data help developer productivity • The same architecture can be achieved manually (coding), but not as productively as with a tool
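The push-down approach described above can be sketched in a few lines. This is a minimal illustration, not Matillion's actual code: a push-down tool generates SQL that Amazon Redshift executes in place, rather than pulling rows out to an ETL server. Table and column names here are assumptions for illustration.

```python
# Sketch of the push-down (ELT) pattern: generate SQL that runs on the
# Amazon Redshift cluster itself, so rows never leave the database.
# Table/column names are illustrative, not from the GE deployment.

def pushdown_transform_sql(source: str, target: str) -> str:
    """Build a CREATE TABLE AS statement that filters and aggregates
    entirely inside Redshift (MPP columnar execution)."""
    return (
        f"CREATE TABLE {target} AS\n"
        f"SELECT region, SUM(amount) AS total_amount\n"
        f"FROM {source}\n"
        f"WHERE amount > 0\n"
        f"GROUP BY region;"
    )

print(pushdown_transform_sql("staging.sales", "dwh.sales_by_region"))
```

The generated statement would be submitted over JDBC/ODBC; all join, filter, and aggregate work happens on the cluster's slices in parallel.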
  10. Solution Details – the Transforms: Transform goals • Denormalise complex underlying SAP data structures → analysis-ready facts and dimensions • “Clean up” data, ironing out, for instance, differences between configurations in business units/geographies • Add metrics, KPIs and measures for the business to consume (and do so consistently) (Screenshot: one of our transformation jobs)
  11. Solution Details – the Transforms: Transform detail (cont.) • Use of an ETL tool was required to manage the complexity of the transforms • Graphical jobs help document, understand and share the business logic • Full range of transforms used, e.g., join, aggregate, filter, calculate, rename, convert type, map values, rank, etc. • ELT architecture means you can see data live in the job at each component, significantly streamlining development
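As a hedged illustration of the denormalisation goal above, the sketch below flattens SAP's sales-order header and item tables (VBAK/VBAP) into one wide fact table. The target name, schema, and column list are assumptions for illustration, not GE's actual job.

```python
# Illustrative only: denormalise SAP sales-order header (VBAK) and
# item (VBAP) staging tables into a single analysis-ready fact table.
# Schema and target names are assumptions, not GE's actual transforms.

def denormalise_orders_sql(staging_schema: str = "staging",
                           target: str = "dwh.fact_sales_order") -> str:
    """Build a CTAS that joins header and item rows into one wide table."""
    return (
        f"CREATE TABLE {target} AS\n"
        f"SELECT h.vbeln AS order_id,\n"
        f"       h.kunnr AS customer_id,\n"
        f"       i.matnr AS material_id,\n"
        f"       i.netwr AS net_value\n"
        f"FROM {staging_schema}.vbak h\n"
        f"JOIN {staging_schema}.vbap i ON i.vbeln = h.vbeln;"
    )

print(denormalise_orders_sql())
```

This also shows the renaming theme: cryptic SAP field names (VBELN, KUNNR, MATNR, NETWR) become human-readable column aliases in the same step.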
  12. Solution Details – Volumes & KPIs • Approx. 4 TB across 32 x Amazon Redshift compute-intensive nodes • Transactional business data from SAP • Financial transactions, inventory movements, sales orders, etc. • Data is staged to Amazon Redshift via CDC (HVR) • Then micro-batched into the warehouse: transformed, then upserted using the ‘Table Update’ component (set to update/insert strategy) in Matillion ETL • Under the hood, this does an insert for any new rows, and an update for any existing rows • Runs every hour using a schedule in the Matillion scheduler
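The update/insert behaviour described above corresponds to the standard Amazon Redshift merge pattern: within one transaction, delete target rows whose keys appear in the staging table, then insert everything from staging. The sketch below shows the shape of that SQL, not the exact statements Matillion's Table Update component generates; names are illustrative.

```python
# Standard Redshift upsert (merge) pattern. Both statements run in one
# transaction so readers never see a half-applied batch. Approximates
# an update/insert 'Table Update' step; table names are illustrative.

def upsert_sql(target: str, staging: str, key: str) -> str:
    return (
        "BEGIN;\n"
        f"DELETE FROM {target}\n"
        f"USING {staging}\n"
        f"WHERE {target}.{key} = {staging}.{key};\n"
        f"INSERT INTO {target} SELECT * FROM {staging};\n"
        "END;"
    )

print(upsert_sql("dwh.fact_sales_order",
                 "staging.fact_sales_order_delta",
                 "order_id"))
```

Delete-then-insert is preferred over row-by-row UPDATE on Redshift because both statements are set-based bulk operations, which suit its columnar MPP engine.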
  13. Deep Dive: Hands-On Demonstration
  14. Lessons learned and tips • Some of the jobs require very large, complex joins across billions of rows • In a Matillion job, each component builds an underlying Amazon Redshift view; Amazon Redshift optimises the collection of views into a performant execution plan • To improve performance on the largest joins, we split the jobs up into sub-jobs, outputting to a table between each sub-job • Distribution and sort keys of these intermediary tables were set to optimise the subsequent join (using the Table Output component in Matillion)
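Tuning an intermediary table for the next join, as described above, is plain Redshift DDL: distributing and sorting on the join key co-locates matching rows on the same slice, pre-sorted for a merge join. This sketch (with illustrative names) shows the CTAS form such a step effectively produces.

```python
# Intermediary table tuned for a subsequent join: DISTKEY and SORTKEY
# on the join key so matching rows land on the same slice, pre-sorted.
# Table and column names are illustrative assumptions.

def intermediate_table_sql(target: str, select_sql: str, join_key: str) -> str:
    """Build a CTAS with distribution and sort keys set for the next join."""
    return (
        f"CREATE TABLE {target}\n"
        f"DISTKEY({join_key})\n"
        f"SORTKEY({join_key})\n"
        f"AS\n"
        f"{select_sql};"
    )

print(intermediate_table_sql(
    "work.orders_stage1",
    "SELECT order_id, customer_id, net_value FROM dwh.fact_sales_order",
    "order_id",
))
```

When both sides of the next join share the same DISTKEY, the join runs without redistributing rows across the network, which is where the biggest savings on billion-row joins come from.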
  15. Lessons learned and tips • We record the last successful time of an upload in a state table in Amazon Redshift • The last time of a transform is read from the table • Data is processed from after that time only, then the table is updated • This is wrapped in transaction control, using the Matillion transaction control orchestration components • SAP uses four-character names for fields and tables • We use the Rename Column component in Matillion, in “Text Mode”, which lets us copy and paste a spreadsheet of mappings from the 4-char names to human-readable names easily • A useful hack which saved hours
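The high-water-mark pattern above (read the last successful run time, process only newer rows, update the state table, all in one transaction) can be sketched as follows. The state table, job, and source/target names are hypothetical, and the SQL is a sketch of the pattern rather than the Matillion-generated statements.

```python
# Incremental load via a state ("high-water-mark") table, wrapped in a
# single transaction so the data load and the watermark update commit
# or roll back together. All table/job names are hypothetical.

def incremental_load_sql(job: str, source: str, target: str,
                         ts_col: str = "changed_at") -> str:
    return (
        "BEGIN;\n"
        f"INSERT INTO {target}\n"
        f"SELECT * FROM {source}\n"
        f"WHERE {ts_col} > (SELECT last_success FROM etl.state"
        f" WHERE job = '{job}');\n"
        f"UPDATE etl.state SET last_success = GETDATE()"
        f" WHERE job = '{job}';\n"
        "COMMIT;"
    )

print(incremental_load_sql("fact_sales_order",
                           "staging.sales_order_delta",
                           "dwh.fact_sales_order"))
```

Wrapping both statements in one transaction is the key point: if the insert fails, the watermark is not advanced, so the next hourly run simply reprocesses the same window.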
  16. Lessons learned – AWS Marketplace • AWS Marketplace genuinely added value in both PoC/procurement and also architecturally • Took away any security issues, as it’s an AMI running in our own VPC • Same goes for performance – no data moving across the internet • Allowed us to try out several tools quickly and pick the best fit • Supported GE’s “fail fast” ethos • Made purchasing, and therefore project launch, faster and simpler – no small thing in a company the size of GE
  17. Outcomes
  18. Key Numbers • 45% operating cost reduction • 6 months from ideation to go-live • <$100 PoC pilot cost
  19. Outcomes for GE • Preserved performance and added stability • Simplified operations with the managed Amazon Redshift solution • OPEX savings, with no CAPEX required • Fast development from concept to reality • Highly resilient, seamless scaling
  20. Q & A
  21. Resources • Free 14-day trial • Tutorials • Support and documentation • Web • Visit booth 2338 for a hands-on demo and AWS credits to support your trial/PoC
  22. Thank you!
  23. Remember to complete your evaluations!