ABD327_Migrating Your Traditional Data Warehouse to a Modern Data Lake

In this session, we discuss the latest features of Amazon Redshift and Redshift Spectrum, and take a deep dive into its architecture and inner workings. We share many of the recent availability, performance, and management enhancements and how they improve your end user experience. You also hear from 21st Century Fox, who presents a case study of their fast migration from an on-premises data warehouse to Amazon Redshift. Learn how they are expanding their data warehouse to a data lake that encompasses multiple data sources and data formats. This architecture helps them tie together siloed business units and get actionable 360-degree insights across their consumer base.

  1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS re:Invent. Migrating Your Traditional Data Warehouse to a Modern Data Lake. Vidhya Srinivasan, General Manager, Amazon Redshift; Balaji Muthuramalingam, Executive Director, Data & Analytics at 21st Century Fox. November 28, 2017.
  2. Amazon’s Analytics Architecture
     Stages: Collect, Store, Analyze
     Services shown: Amazon Kinesis Firehose, AWS Direct Connect, Amazon Snowball, Amazon Kinesis Analytics, Amazon Kinesis Streams, Amazon S3, Amazon Glacier, Amazon CloudSearch, Amazon RDS, Amazon Aurora, Amazon DynamoDB, Amazon ES, Amazon EMR, Amazon Redshift, Amazon QuickSight, AWS Database Migration Service, AWS Glue, Amazon Athena, Amazon AI
  3. Forrester Wave Big Data Warehouse Q2 2017
     • “Amazon Redshift has the largest adoption of BDW in the cloud.”
     • “With more than 5,000 deployments, Amazon Redshift has the largest data warehouse deployments in the cloud – some over 10 petabytes in size.”
     • AWS received a score of 5/5 (the highest score possible) in the customer base, market awareness, ability to execute, road map, support, and partners criteria.
     The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.
  4. Amazon Redshift – Data Warehousing
     Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost. Massively parallel, scales from gigabytes to exabytes.
     • Fast at scale: Columnar storage technology to improve I/O efficiency and scale query performance
     • Inexpensive: As low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour
     • Open file formats: Analyze optimized data formats on direct-attached disks, and all open data formats in Amazon S3
     • Secure: Audit everything; encrypt data end-to-end; extensive certification and compliance
  5. Redshift Spectrum: Extend the data warehouse to your S3 data lake
     Diagram labels: S3 data lake; Amazon Redshift data; Redshift Spectrum query engine
     • Exabyte Amazon Redshift SQL queries against S3
     • Join data across Amazon Redshift and S3
     • Scale compute and storage separately
     • Stable query performance and unlimited concurrency
     • Parquet, ORC, Grok, Avro, & CSV data formats
     • Pay only for the amount of data scanned
  6. Redshift Spectrum: Query your data lake
     Diagram labels: JDBC/ODBC; Amazon Redshift; Redshift Spectrum (scale-out serverless compute; nodes 1, 2, 3, 4 ... N); AWS Glue Data Catalog; Amazon S3 (exabyte-scale object storage)
     Query: SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY …
  7. Spectrum: Exabyte query in less than three minutes
       SELECT P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE,
              SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
       FROM s3.d_customer_order_item_details D, asin_attributes A, products P, regions R
       WHERE D.ASIN = P.ASIN
         AND P.ASIN = A.ASIN
         AND D.REGION_ID = R.REGION_ID
         AND A.EDITION LIKE '%FIRST%'
         AND P.TITLE LIKE '%Potter%'
         AND P.AUTHOR = 'J. K. Rowling'
         AND R.COUNTRY_CODE = 'US'
         AND R.CITY = 'Seattle'
         AND R.STATE = 'WA'
         AND D.ORDER_DAY::DATE >= P.RELEASE_DATE
         AND D.ORDER_DAY::DATE < dateadd(day, 3, P.RELEASE_DATE)
       GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE
       ORDER BY SALES_sum DESC
       LIMIT 20;
     Data set:
     • Roughly 140 TB of customer item order detail records for each day over the past 20 years
     • 190 million files across 15,000 partitions in S3
     • One partition per day for USA and rest of world
     • Total data size is over an exabyte
     Optimization:
     • Compression: 5X
     • Columnar file format: 10X
     • Scanning with 2500 nodes: 2500X
     • Static partition elimination: 2X
     • Dynamic partition elimination: 350X
     • Amazon Redshift query optimizer: 40X
     Runtime: Hive (1000 nodes): 5 years; Redshift Spectrum: 155 seconds
     * Estimated using 20 node Hive cluster & 1.4TB, assume linear
     * Query used a 20 node DC1.8XLarge Amazon Redshift cluster
     * Not actual sales data - generated for this demo based on data format used by Amazon Retail.
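
The figures on this slide can be cross-checked with a little arithmetic. The sketch below is only a sanity check: it assumes decimal units (1 EB = 1,000,000 TB) and 365-day years, neither of which is stated on the slide, and it simply multiplies the listed optimization factors together without claiming what baseline they are measured against.

```python
# Back-of-the-envelope check of the slide's figures.
# Assumptions (not from the slide): decimal units and 365-day years.
TB_PER_EB = 1_000_000
DAYS_PER_YEAR = 365

daily_tb = 140   # "roughly 140 TB ... for each day"
years = 20       # "over the past 20 years"
total_eb = daily_tb * DAYS_PER_YEAR * years / TB_PER_EB

# Optimization factors exactly as listed on the slide.
factors = {
    "compression": 5,
    "columnar file format": 10,
    "scanning with 2500 nodes": 2500,
    "static partition elimination": 2,
    "dynamic partition elimination": 350,
    "query optimizer": 40,
}
combined = 1
for value in factors.values():
    combined *= value

# Ratio between the two quoted runtimes (5 years vs. 155 seconds).
hive_seconds = 5 * DAYS_PER_YEAR * 24 * 3600
spectrum_seconds = 155
ratio = hive_seconds / spectrum_seconds

print(f"total data: {total_eb:.3f} EB")
print(f"combined factor: {combined:,}x")
print(f"Hive estimate / Spectrum runtime: {ratio:,.0f}x")
```

Under these assumptions the data volume comes out at about 1.02 EB, which is consistent with the slide's "over an exabyte".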
  8. NUVIAD - Data Lake Analytics with Redshift Spectrum
     NUVIAD is a mobile marketing platform providing professional marketers, agencies and local businesses with hyper-targeted analytics at petabyte scale.
     • Seamlessly analyzing open file formats directly in Amazon S3 to provide fresh, up-to-the-minute insights
     • Unlimited analytics and query concurrency with Amazon Redshift, and unlimited data capacity with Amazon S3
     • Scaling compute separately from storage in Amazon S3 for flexibility, fast performance and cost-effectiveness
     “Spectrum is a game changer for us. Reports that took minutes to produce are now delivered in seconds and we like the ability scale compute on-demand to query petabytes of data in S3 in various open file formats.” – Rafi Ton, CEO, NUVIAD
     Diagram labels: Data sources; AWS Glue; Amazon S3; Amazon Redshift; Redshift Spectrum; BI Tools
  9. Amazon Redshift is widely available
     Currently available: Ireland, Frankfurt, London, Beijing, Mumbai, Seoul, Singapore, Sydney, Tokyo, Sao Paulo, US East – N Virginia, US East – Ohio, US West – Oregon, US West – N California, GovCloud, Canada – Central (Montreal)
  10. Selected Amazon Redshift Customers
  11. Selected Amazon Redshift Partners: Data Integration; Systems Integrators; Business Intelligence
  12. Recent and upcoming launches
  13. New Dense Compute Node - DC2
     • 2X performance at the same price as DC1
     • 3x more I/O with 30% better storage utilization than DC1
     • Hardware: NVMe SSD; DDR4 memory; Intel E5-2686 v4 (Broadwell)
     “We saw a 9x reduction in month-end reporting time with Amazon Redshift dc2 nodes as compared to dc1” - Bradley Todd, Technical Architect, Liberty Mutual
  14. Result Caching - Sub-second query response times
     Diagram labels: BI / Dashboard tools; Analytics and Amazon Redshift; RESULTS CACHE (QUERY_ID, RESULT)
     Flow:
     1. Queries go to the leader node
     2. If the cache contains the query result, it’s returned with no processing
     3. If the query is not in the cache, it’s executed and the result is cached
     • In-memory leader node cache, resulting in sub-second response
     • Transparent – it just works
     • Skip WLM, skip processing, skip optimization
     • Cache persists across sessions
     • Caching frees up the Amazon Redshift cluster, increasing performance for other non-repetitive queries
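
The three-step flow above can be illustrated with a toy model. This is a minimal sketch, not how Amazon Redshift implements its cache: the `ResultCache` class, the exact-text cache key, and the `fake_execute` stand-in for the cluster are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


class ResultCache:
    """Toy result cache keyed by query text (hypothetical, for illustration)."""

    def __init__(self, execute: Callable[[str], List[tuple]]) -> None:
        self._execute = execute
        self._results: Dict[str, List[tuple]] = {}
        self.hits = 0
        self.misses = 0

    def run(self, sql: str) -> List[tuple]:
        key = sql.strip()  # assumption: only identical query text matches
        if key in self._results:
            # Step 2: cached result is returned with no processing.
            self.hits += 1
            return self._results[key]
        # Step 3: not cached, so execute and remember the result.
        self.misses += 1
        rows = self._execute(key)
        self._results[key] = rows
        return rows


calls: List[str] = []


def fake_execute(sql: str) -> List[tuple]:
    """Stand-in for real query execution on the cluster."""
    calls.append(sql)
    return [(42,)]


cache = ResultCache(fake_execute)
first = cache.run("SELECT COUNT(*) FROM sales")
second = cache.run("SELECT COUNT(*) FROM sales")
print(cache.hits, cache.misses, len(calls))
```

The second call never reaches `fake_execute`, which mirrors the "skip processing" bullet. The sketch ignores invalidation when underlying tables change, which any real cache has to handle.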
  15. Result Caching: From the lab
     • Higher is better! (Queries per hour)
     • Read-write workload with a mix of small and large queries, Inserts, Copy and Vacuum
     • 4-node ds2.8xL cluster
     Chart: Query throughput (QPH) with result caching. Categories: Dashboard, Heavy Reporting. Series: No Caching, Caching. Data labels: 138, 8, 2979, 117.
  16. Result Caching: A customer perspective
     • Lower is better! (Query Latency)
     • 4-node dc2.8xL cluster
     • Tableau dashboard; 10-user test
     Chart: Various dashboard queries (names removed for confidentiality). Series: Caching, No Caching.
     “That’s not a mistake...the results for average execution time on the caching test run were sub-second and so don't show up on the y-axis at this scale”
  17. Short Query Acceleration – Express lane for short queries
     Diagram labels: BI / Dashboard tools; Analytics and Amazon Redshift; Machine Learning Classifier
     • Short queries do not get stuck behind long running queries
     • Higher throughput, less variability
     • Customized for your workload
     • Transparent – it just works!
  18. Short Query Acceleration: Results
     Charts: No SQA, 5 concurrency; SQA, 5 concurrency
     “This configuration showed a distinct improvement in short query runtimes with the SQA feature enabled. Many of the shortest queries saw a 5x or greater improvement while the longer running queries saw a corresponding increase. This is exactly how we expect the feature to work.”
     • Average wait time reduces from 36 seconds to 0 for queries that execute under a second
     • P90 wait time on a very busy cluster reduces from 370 seconds to 32.1 seconds
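
The trade-off described in the quote (short queries wait less, long queries wait slightly more) can be seen in a toy single-slot model. This is only a sketch under strong assumptions: the runtimes are made up, a fixed `SHORT_THRESHOLD_S` cutoff stands in for the machine learning classifier, and the real feature may schedule queries quite differently.

```python
from typing import List

SHORT_THRESHOLD_S = 5  # hypothetical cutoff standing in for the classifier


def wait_times(durations: List[int]) -> List[int]:
    """Wait before start for each query when run one at a time, in order."""
    waits, clock = [], 0
    for duration in durations:
        waits.append(clock)
        clock += duration
    return waits


def express_lane_order(durations: List[int]) -> List[int]:
    """Move predicted-short queries ahead; sorted() is stable, so ties keep order."""
    return sorted(durations, key=lambda d: d >= SHORT_THRESHOLD_S)


def mean_wait(durations: List[int], short: bool) -> float:
    """Mean wait of the short (or long) queries in the given run order."""
    selected = [
        wait
        for duration, wait in zip(durations, wait_times(durations))
        if (duration < SHORT_THRESHOLD_S) == short
    ]
    return sum(selected) / len(selected)


fifo = [300, 1, 200, 1, 1]           # made-up runtimes in seconds
express = express_lane_order(fifo)   # [1, 1, 1, 300, 200]

print("short queries:", mean_wait(fifo, True), "->", mean_wait(express, True))
print("long queries: ", mean_wait(fifo, False), "->", mean_wait(express, False))
```

In this model the short queries' mean wait drops sharply while the long queries wait a few seconds longer, which matches the direction of the quoted result.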
  19. Coming soon: Nested data support
     • Analyze nested and semi-structured data in Amazon S3 with Spectrum
     • Allows easy ETL of nested data into Amazon Redshift using CTAS
     • Support for open file formats: Parquet, ORC, JSON, Ion and Avro
     • Uses dot notation to extend your existing SQL
     s3data.clickStream:
       <<
         { "session_time": "20171013 14:05:00",
           "clicks": [ {"page": "/home", "referrer": ""},
                       {"page": "/products", "referrer": "/home"} ] },
         { "session_time": "20171013 14:06:00",
           "clicks": [ {"page": "/contact", "referrer": "/home"} ] }
       >>
     Example: Find click frequency for links on “/home”:
       SELECT c.page, COUNT(*) AS count
       FROM s3data.clickStream s, s.clicks c
       WHERE s.session_time > '2017-10-01 00:00:00'
         AND c.referrer = '/home'
       GROUP BY c.page;
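
To make the semantics of the example query concrete, the sketch below reproduces it in plain Python over the two sample sessions from the slide. It is an illustration of what the `FROM s3data.clickStream s, s.clicks c` unnesting does, not Spectrum code; parsing `session_time` with the format `%Y%m%d %H:%M:%S` is an assumption based on the sample values.

```python
from collections import Counter
from datetime import datetime

# The two sample sessions shown on the slide.
click_stream = [
    {
        "session_time": "20171013 14:05:00",
        "clicks": [
            {"page": "/home", "referrer": ""},
            {"page": "/products", "referrer": "/home"},
        ],
    },
    {
        "session_time": "20171013 14:06:00",
        "clicks": [{"page": "/contact", "referrer": "/home"}],
    },
]

cutoff = datetime(2017, 10, 1)
counts: Counter = Counter()

for s in click_stream:                       # FROM s3data.clickStream s
    session_time = datetime.strptime(s["session_time"], "%Y%m%d %H:%M:%S")
    if session_time <= cutoff:               # WHERE s.session_time > cutoff
        continue
    for c in s["clicks"]:                    # , s.clicks c (one row per click)
        if c["referrer"] == "/home":         # AND c.referrer = '/home'
            counts[c["page"]] += 1           # GROUP BY c.page, COUNT(*)

print(dict(counts))
```

Each nested `clicks` element behaves like a row joined to its parent session, so the filter and grouping work as they would on a flat table.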
  20. Coming soon: Nested data support
     Improve query performance by analyzing nested data
     Orders (columns): OrderID, CustomerID, OrderTime, ShipMode
     OrderItems:
       OrderID  ItemID  Quantity  Price
       5        23      10.00     12.50
       8        32      1.00      5.60
       5        16      1.00      1.99
       8        24      5.00      26.50
     OrdersWithItems (columns): OrderID, CustomerID, OrderTime, ShipMode, plus nested OrderItems (ItemID, Quantity, Price):
       OrderID  ItemID  Quantity  Price
       5        23      10.00     12.50
                16      1.00      1.99
       8        32      1.00      5.60
                24      5.00      26.50
     To improve query performance, the new Orders table includes the OrdersWithItems as a nested column, eliminating join processing
  21. Coming Soon: Enhanced Monitoring
     • Optimize your Amazon Redshift cluster for peak performance by using query throughput metrics
     • Get greater insights into your cluster performance by accessing database and workload metrics
     • Get alerts and notifications via Amazon SNS
     • Monitor query latency and throughput to optimize your workload
  22. Closing thoughts
     • Increase performance by 2x, at the same price, by switching to DC2
     • Redshift Spectrum extends your Amazon Redshift cluster to all of your data in S3, seamlessly, efficiently and cost-effectively
     • Query monitoring rules, along with Short Query Acceleration and Result Caching, can significantly improve performance
     Please continue to provide us feedback at redshift-pm@amazon.com
  23. Fox Film Entertainment: 21st Century Fox Data Lake on AWS
  24. Fox Film Entertainment (21CF): Customer Platforms; Broadcasters Platforms
  25. Landscape: +100 TB data; +150 billion rows; +100 sources; +25,000 user queries per day; +35,000 data processes per day; 24x7; all regions
  26. Challenges: Faster time to market; Variety & Volume; High Availability; Stability; Automation; Technology & Beyond
  27. Key Principles: Data Democratization; Cloud First; Faster Time to Market; Scale to Grow; Total Cost of Ownership
  28. Fox–AWS Architecture
     Stages: Collect, Store, Analyze
     Components: Data Transfer; Scheduled Ingest; Data Lake (Object Storage); EDW/DM (SQL MPP); ETL; E(L)T; Spark; Visualize & Analysis; Catalog, Management, Security
  29. Services
     Stages: Collect, Store, Analyze, Visualize
     Components: Tag & Transfer; Raw data to S3; Amazon S3; AWS Glue Data Catalog; Amazon EMR; Amazon Redshift; AWS Lambda; MicroStrategy; AWS Glue ETL
  30. Lessons Learned
     Design Considerations
     • Segregate workload/application into separate clusters
     • Scale up & down as needed by each application
     Analyze tables regularly
     • Every single load for 'PREDICATE COLUMNS'
     • Weekly for all columns
     • Query SVV_TABLE_INFO (stats_off column) to decide when to trigger ANALYZE
     Vacuum tables regularly
     • Daily vacuum on frequently accessed/modified tables (identified via STL_SCAN & STL_DELETE)
     • Weekly on all tables
     • Deep copy might be faster for tables with a high unsorted percentage
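
One way to act on the analyze and vacuum advice above is to turn rows read from `SVV_TABLE_INFO` into maintenance statements. The sketch below is an illustration only: the two thresholds and the sample rows are invented, and the function just builds SQL strings rather than connecting to a cluster. It relies on the `schema`, `table`, `stats_off`, and `unsorted` columns of `SVV_TABLE_INFO`.

```python
from typing import Dict, List, Optional, Union

# Hypothetical thresholds; the slides do not give specific values.
STATS_OFF_THRESHOLD = 10.0   # percent of stale statistics
UNSORTED_THRESHOLD = 20.0    # percent of unsorted rows

Row = Dict[str, Union[str, Optional[float]]]


def maintenance_statements(rows: List[Row]) -> List[str]:
    """Build ANALYZE / VACUUM statements for tables that exceed the thresholds."""
    statements = []
    for row in rows:
        name = f'{row["schema"]}.{row["table"]}'
        # stats_off / unsorted can be NULL (None); treat that as 0.
        if (row.get("stats_off") or 0) > STATS_OFF_THRESHOLD:
            statements.append(f"ANALYZE {name} PREDICATE COLUMNS;")
        if (row.get("unsorted") or 0) > UNSORTED_THRESHOLD:
            statements.append(f"VACUUM FULL {name};")
    return statements


# Made-up rows shaped like:
#   SELECT "schema", "table", stats_off, unsorted FROM svv_table_info;
sample_rows: List[Row] = [
    {"schema": "public", "table": "orders", "stats_off": 35.0, "unsorted": 5.0},
    {"schema": "public", "table": "clicks", "stats_off": 0.0, "unsorted": 48.5},
    {"schema": "public", "table": "dim_region", "stats_off": 2.0, "unsorted": None},
]

plan = maintenance_statements(sample_rows)
for statement in plan:
    print(statement)
```

For tables with a very high unsorted percentage, the slide's suggestion of a deep copy could replace the `VACUUM` branch.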
  31. Lessons Learned (cont.)
     Commit Queue
     • Amazon Redshift has higher commit overhead
     • ETL tool SQL push-down was creating too many commits. Had to work with vendor to optimize the commits
     • Optimize ETL design to batch up commits
     Schema Design
     • Choose the best DIST KEY and SORT KEY
     • Create small tables as DIST ALL to optimize the table size and join performance
     • Sort Keys: Avoid having interleaved sort keys on frequently ETL’d tables
     • Sort Keys: Avoid compressing primary sort keys
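
The "batch up commits" advice can be sketched as follows. This is a toy illustration, not the ETL tool or Redshift driver used by Fox: `CountingConnection` is a hypothetical stand-in for a database connection that only counts commits, and the batch size of 250 is arbitrary.

```python
from typing import Iterable, Iterator, List


def batches(rows: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive chunks of at most `size` rows."""
    batch: List[int] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


class CountingConnection:
    """Hypothetical stand-in for a database connection; counts commits."""

    def __init__(self) -> None:
        self.rows: List[int] = []
        self.commits = 0
        self._pending: List[int] = []

    def insert(self, row: int) -> None:
        self._pending.append(row)

    def commit(self) -> None:
        self.rows.extend(self._pending)
        self._pending.clear()
        self.commits += 1


def load(conn: CountingConnection, rows: Iterable[int], batch_size: int) -> None:
    """Insert rows, committing once per batch instead of once per row."""
    for batch in batches(rows, batch_size):
        for row in batch:
            conn.insert(row)
        conn.commit()


data = list(range(1000))

per_row = CountingConnection()
load(per_row, data, batch_size=1)

batched = CountingConnection()
load(batched, data, batch_size=250)

print(per_row.commits, batched.commits)
```

Both loads store the same rows, but the batched version issues far fewer commits, which is the effect the slide is after.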
  32. Lessons Learned (cont.)
     Close engagement with AWS, ProServe, and APN Partners
     • Close partnering with our AWS account/support team, AWS product teams, AWS ProServe, and AWS ISV and SI partners allowed us to quickly address any issues that came up during migration
     • AWS ProServe brought deep experience and expertise to help accelerate our success
     WLM
     • Dynamic WLM setting for resource prioritization and allocation (daytime vs. nighttime WLM settings)
     • Queue jobs at ETL and Reporting tool level to avoid submitting too many queries to Amazon Redshift at once
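
Queuing jobs on the client side, as the last bullet suggests, can be sketched with a semaphore that caps the number of queries in flight. This is a minimal illustration under assumptions: `ThrottledSubmitter`, the limit of 3, and the `time.sleep` placeholder for the actual database call are all hypothetical and not taken from the talk.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List

MAX_IN_FLIGHT = 3  # hypothetical cap on concurrent queries


class ThrottledSubmitter:
    """Caps how many queries are submitted at once (illustrative only)."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._lock = threading.Lock()
        self._in_flight = 0
        self.peak = 0

    def submit(self, sql: str) -> str:
        with self._slots:                 # wait here if the cap is reached
            with self._lock:
                self._in_flight += 1
                self.peak = max(self.peak, self._in_flight)
            time.sleep(0.01)              # placeholder for the real database call
            with self._lock:
                self._in_flight -= 1
        return sql


submitter = ThrottledSubmitter(MAX_IN_FLIGHT)
queries: List[str] = [f"SELECT {i}" for i in range(12)]

# More worker threads than slots, so the semaphore is what limits concurrency.
with ThreadPoolExecutor(max_workers=8) as pool:
    done = list(pool.map(submitter.submit, queries))

print("peak concurrent queries:", submitter.peak)
```

The thread pool stands in for an ETL or reporting tool with many workers; the semaphore keeps the database from seeing more than `MAX_IN_FLIGHT` queries at a time.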
  33. Benefits of migrating to Amazon Redshift
     Cost
     • Saved 15-20% of annual cost
     • 21CF studios data center space reduction
     Performance
     • 30-35% performance gain
     Business Agility
     • Streamlined provisioning of new gear
     • No longer have to deal with storage “wall”
     • Improved interoperability with native AWS services across multiple business units, leveraging our new data lake
  34. Looking ahead
  35. Summarizing New Feature Benefits
     Amazon Redshift’s new DC2 node
     • Read/write performance is ~50% higher than dc1
     Short Query Acceleration
     • Helped smooth out our overall performance; we saw gains of ~50% with the MicroStrategy (MSTR) client
     Redshift Spectrum
     • Allows us to extend the reach of our Data Hub to “cold” data stored in Amazon S3
     Query result set caching (pending)
     • Expected to yield 10x improvement on MSTR workload for cached responses (sub-second latencies)
  36. What’s next?
     • Continue building the Data Lake (AWS Glue, Amazon Redshift Spectrum)
     • Stream data processing with Amazon Kinesis
     • Artificial Intelligence and Machine Learning: use Amazon Redshift as a training source (Amazon Machine Learning, Spark); natural language interfaces (Amazon Lex, Amazon Polly)
  37. THANK YOU!
