Modern Cloud Data Warehousing ft. Intuit: Optimize Analytics Practices (ANT202-R1) - AWS re:Invent 2018

Most companies are overrun with data, yet they lack critical insights to make timely and accurate business decisions. They are missing the opportunity to combine large amounts of new, unstructured big data that resides outside their data warehouse with trusted, structured data inside their data warehouse. In this session, we discuss the most common use cases with Amazon Redshift, and we take an in-depth look at how modern data warehousing blends and analyzes all your data to give you deeper insights to run your business. Intuit joins us to share their experience modernizing their analytics pipeline.  

  1. 1. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Modern Cloud Data Warehousing ft. Intuit: Optimize Analytics Practices (ANT202-R1). Maor Kleider, Principal Product Manager, Amazon Web Services. Jason Rhoades, Systems Architect, Intuit.
  2. 2. Raise your hand if you’re using Amazon Redshift
  3. 3. AWS Databases and Analytics: Broad and deep portfolio, built for builders. Analytics: Redshift (Data warehousing), EMR (Hadoop + Spark), Athena (Interactive analytics), Kinesis Analytics (Real-time), Elasticsearch Service (Operational analytics). Databases: RDS (MySQL, PostgreSQL, MariaDB, Oracle, SQL Server), Aurora (MySQL, PostgreSQL), DynamoDB (Key value, Document), ElastiCache (Redis, Memcached), Neptune (Graph), Timestream (Time Series), QLDB (Ledger Database), RDS on VMware. Business Intelligence & Machine Learning: QuickSight, SageMaker, Comprehend, Rekognition, Lex, Transcribe, DeepLens. Data Lake: S3/Glacier, Glue (ETL & Data Catalog), Lake Formation (Data Lakes). Data Movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams | Data Pipeline | Direct Connect. Blockchain: Managed Blockchain, Blockchain Templates. AWS Marketplace: 250+ Solutions, 730+ Database solutions, 600+ Analytics solutions, 25+ Blockchain solutions, 20+ Data lake solutions, 30+ solutions.
  4. 4. There is more data than people think. Data grows … every 5 years. Data platforms need to live for … years. Data platforms need to scale ….
  5. 5. There are more data types than ever before.
  6. 6. There are more ways to analyze data than ever before. Tools that didn’t exist years ago: Hadoop, Elasticsearch, Presto, Spark (chart values, in years ago: 11, 8, 5, 4).
  7. 7. What does data warehouse modernization mean? Easy to use: Don’t waste time on menial administrative tasks and maintenance. Extends to your Data Lake: Directly analyze data stored in your data lake in open formats. Any scale of data, workloads, and users: Dynamically scale up to guarantee performance even with unpredictable demands and data volumes. Faster time-to-insights: Consistently fast performance, even with thousands of concurrent queries and users.
  8. 8. Amazon Redshift: Fast, simple, cost-effective data warehouse that can extend queries to your Data Lake. Fastest: Get faster time-to-insight for all types of analytics workloads; powered by machine learning, columnar storage and MPP. Unlimited scale: Dynamically scale up to guarantee performance even with unpredictable analytical demands and data volumes. Extends your Data Lake: Analyze data in the Amazon S3 Data Lake in-place and in open formats, together with data loaded into Redshift’s high performance SSDs. Analyze data in open formats such as Parquet, ORC, and JSON, using SQL tools. 1/10th the cost: Start at $0.25 per hour, save costs with automated administration tasks and eliminate business impact due to downtime; as low as $1,000 per terabyte per year.
  9. 9. for their cloud data warehouse workloads than anyone else Amazon Redshift
  10. 10. Selected Amazon Redshift Partners: Data Integration, Business Intelligence, Systems Integrators
  11. 11. Amazon Redshift: The 4 things that matter most: Speed, Scale, Security, Simplicity
  12. 12. Let’s dig into what we’ve done in the past several months and what’s coming…
  13. 13. Amazon Redshift is growing fast and innovating faster. Since we last spoke… features and enhancements released since re:Invent 2017: Automatically enabled short query acceleration; Support for lateral column alias reference; New Quick Starts; New CloudWatch metrics; Customized Recommendations with Advisor; Current and trailing tracks for release update; Federated authentication with single sign-on; Improved performance for commits; COPY from Parquet and ORC file formats; Additional Spectrum regions; Support for Scalar JSON and Ion data types; Late materialization for faster query processing; Support for DATE data type with Spectrum; Short Query Acceleration; Utilization reports; Machine learning integration to accelerate dashboards and interactive analysis; Improved resource management for memory-intensive queries; Faster string manipulation; Support for Parquet and ORC in Kinesis Data Firehose; Improved workload management console experience; Query Editor; Support for late-binding views; SQL Scalar user-defined functions; Integration with AWS Glue; Support for Nested Data with Spectrum; Spectrum support for DATE data type; Improved performance for UNION ALL queries; Free upgrade from DC1 to DC2 RIs; Query monitoring rules (QMR); Support for Zstandard high compression encoding; Query processing improvements; Support for Python UDF logging module; Enhanced VPC routing; Automatically hopping queries without restarts; Support for uppercase column names; Result Caching for Repeat Queries; Support for LISTAGG DISTINCT; Support for ORC and Grok file formats; Integration with QuickSight; DMS support with Redshift; 3.5x Improved Throughput; Improved performance for repeat queries.
  14. 14. Amazon Redshift is now >3x faster than 6 months ago. Chart: normalized queries per hour (QPH) as a % of Redshift 6 months ago, assuming Redshift’s QPH 6 months ago = 100% (higher is better): MAY 2018 100%, JUN 2018 115%, JUL 2018 181%, AUG 2018 237%, SEP 2018 284%, OCT 2018 350%.
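The ">3x faster" headline on slide 14 can be checked against the chart's own numbers. The sketch below is illustrative only: it assumes the six percentages on the slide pair with the six months in chronological order, with May 2018 as the 100% baseline.

```python
# Normalized queries-per-hour (QPH) values read off slide 14.
# Assumption: values pair with months in chronological order,
# and May 2018 is the 100% baseline.
normalized_qph = {
    "MAY 2018": 100,
    "JUN 2018": 115,
    "JUL 2018": 181,
    "AUG 2018": 237,
    "SEP 2018": 284,
    "OCT 2018": 350,
}


def speedup(series, start, end):
    """Ratio of throughput at `end` to throughput at `start`."""
    return series[end] / series[start]


overall = speedup(normalized_qph, "MAY 2018", "OCT 2018")
print(f"Overall speedup: {overall:.1f}x")

# Month-over-month growth, as a ratio of each month to the previous one.
months = list(normalized_qph)
for prev, cur in zip(months, months[1:]):
    print(f"{prev} -> {cur}: {speedup(normalized_qph, prev, cur):.2f}x")
```

A ratio of 350 to 100 gives 3.5x, which matches the "3.5x Improved Throughput" item on the preceding slide.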
  15. 15. Improvements to speed (since re:Invent 2017): Compiled code cache; Support for lateral column alias reference; Resource management for memory-intensive queries; Late materialization; Result caching; Joins involving large numbers of NULL values in a join key column; Queries with intermediate subquery results that can be distributed; Cluster resize operations; Queries that refer to stable functions with constant expressions; Short query acceleration; Queries operating over CHAR and VARCHAR columns; Single-row inserts; Expressions on the partition columns of external tables; Faster string manipulation; Complex EXCEPT subqueries; Commit processing enhancements; DC2 nodes; 2x the number of tables in a cluster; Hash join memory utilization optimizations and cache line prefetching; COPY operation when ingesting data from Parquet and ORC formats; Performance improvement for queries that refer to stable functions over constant expressions; Improvements for the COPY operation when ingesting data from Parquet and ORC formats; Query processing improvements; Query rewrites that pushdown selective joins into a subquery; Query planning.
  16. 16. How we leverage fleet telemetry
  17. 17. Performance improvements in query speed. “Redshift query performance and scalability has been increasing, even though our data has grown. In the last 10 months, we have seen commit performance increase by 500% without any increase in cost.” -Minero Aoki, Senior Data Engineer, Cookpad Inc.
  18. 18. “20 percent of our queries now complete in less than one second. Best of all, we didn’t have to change anything to get this speed-up with Redshift, which supports our mission-critical workloads.” -Greg Rokita, Executive Director of Technology, Edmunds
  19. 19. Amazon Redshift Elastic Resize (GA). New! Adds additional nodes to Redshift cluster; Distributes data across new configuration; Minimal transition time; Quickly scale for varying workload demands; Scale up and down in minutes. Diagram labels: JDBC/ODBC, Leader Node, Redshift Cluster, Backup, Redshift Managed S3.
  20. 20. Concurrency Scaling for bursts of user activity (Preview). New! Creates more clusters automatically on-demand; Consistently fast performance even with thousands of concurrent queries; No advance hydration required; Handles unpredictable demand variability. Diagram labels: Caching Layer, Backup, Redshift Managed S3.
  21. 21. Concurrency Scaling for bursts of user activity (Preview). New! For every 24 hours that your main cluster is in use, you accrue a one-hour credit for Concurrency Scaling. Concurrency Scaling is free for more than 97% of Redshift customers.
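The credit rule on slide 21 (one hour of Concurrency Scaling credit per 24 hours of main-cluster use) can be sketched as simple arithmetic. This is a rough illustration, not a billing calculator: it assumes credits accrue proportionally for partial days, and it does not model any cap or expiry on credits, since the slide states neither. The function names are hypothetical.

```python
def accrued_credit_hours(main_cluster_hours: float) -> float:
    """Credit hours earned: one hour per 24 hours of main-cluster use.

    Assumption: partial days earn proportional credit; no cap or expiry.
    """
    return main_cluster_hours / 24.0


def billable_scaling_hours(main_cluster_hours: float,
                           scaling_hours_used: float) -> float:
    """Concurrency Scaling hours left to pay for after applying credits."""
    credit = accrued_credit_hours(main_cluster_hours)
    return max(0.0, scaling_hours_used - credit)


# A cluster that runs around the clock for a 30-day month.
month_hours = 24 * 30
print(accrued_credit_hours(month_hours))        # credit hours earned
print(billable_scaling_hours(month_hours, 25))  # usage below the credit
print(billable_scaling_hours(month_hours, 36))  # usage above the credit
```

Under these assumptions, an always-on cluster earns 30 credit hours in a 30-day month, so bursts totalling under an hour a day on average would be covered by credits.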
  22. 22. Improvements to simplicity (since re:Invent 2017): CloudWatch metrics for Workload Execution Breakdown; Current and trailing tracks for release updates; Lateral column alias reference; CloudWatch metrics for Query Duration by WLM Queues; Cluster resize operations; CloudWatch Query Runtime Breakdown metric; Stream real-time data in Parquet or ORC formats using Kinesis Data Firehose; DISTSTYLE AUTO distribution style; Free upgrade for DC1 RIs to DC2; Query Monitoring Rules (QMR) now support 3x more rules; Short query acceleration is self-optimizing; Redshift Advisor for best practice recommendations; CloudWatch metrics for Query Throughput by WLM Queues; Cluster resize; Query Editor; Enhancements to VACUUM DELETE; Manage components of a multi-part query in the AWS console; Automatic vacuum delete; Efficiency of backup performance; CloudWatch metrics for Query Throughput, Query Duration.
  23. 23. Redshift Query Editor (launched in October!): Query data directly from the AWS Console; Results are instantly visible within the console; No need to install an external JDBC/ODBC client.
  24. 24. Amazon Redshift intelligent maintenance (coming soon!): Auto Vacuum, Auto Analyze, Auto WLM Concurrency Setting. Maintenance processes like vacuum and analyze will automatically run in the background. Redshift will automatically adjust the WLM concurrency setting to deliver optimal throughput. Moving towards zero-maintenance.
  25. 25. Improvements to scale (since re:Invent 2017). Integrate seamlessly with your data lake: DATE data type; Retrieving metadata for late-binding views; Support for Enhanced VPC Routing; IN-list predicate processing in Spectrum scans; Query external tables during a resize operation; Specify the root of an S3 bucket as the source for an existing table; Spectrum queries with aggregations on partition columns; Renaming external table columns; Table property to specify the file compression type for external tables; Push the LENGTH() string function to Spectrum; ALTER TABLE ADD/DROP COLUMN for external tables is now supported via standard JDBC calls; Map datatypes in Spectrum to contain arrays; Support for Parquet, ORC, Avro, CSV, and other open file formats; New Spectrum regions; Spectrum support for JSON and ION; Spectrum support for nested data; Arrays of arrays and arrays of maps.
  26. 26. Amazon Redshift Spectrum: Extend the data warehouse to exabytes of data in Amazon S3 Data Lake. The Redshift Spectrum query engine queries across Redshift and S3 (Redshift data, S3 data lake). No data loading required; Scale compute and storage separately; Directly query data stored in Amazon S3; Parquet, ORC, Avro, JSON, and CSV data formats. Coming soon: Unload to Parquet; Spectrum Request Accelerator.
  27. 27. Amazon Redshift is Scalable. Redshift Spectrum: Exabyte data lake query in under three minutes. Scenario: Imagine you are the manager at a Seattle book store. An author released her 8th book in a popular series, and you need to figure out how many copies to order. Data: Roughly 140 terabytes of customer item order detail records for each day over the past 20 years; 190 million files across 15,000 partitions in S3; one partition per day for USA and rest of world; total data size is over an exabyte. Factors shown on the slide, in slide order: Compression 5X; Columnar file format 10X; Scanning with 2500 nodes 2,500X; Static partition elimination 2X; Dynamic partition elimination 350X; Amazon Redshift query optimizer 40X. Result: Amazon S3 queried through Redshift Spectrum in <3 minutes. * Query used a 20 node DC1.8XLarge Amazon Redshift cluster. * Not actual sales data; generated for this demo based on data format used by Amazon Retail.
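The figures on slide 27 can be sanity-checked with a few lines of arithmetic. The sketch below is an illustration under stated assumptions: a 365-day year, decimal units (1 EB = 1,000,000 TB), and the six factors treated as independent multipliers. The slide does not say how the factors combine, and the pairing of each factor with a technique follows the order in which they appear on the slide.

```python
import math

# Data volume: roughly 140 TB per day for 20 years (slide 27).
TB_PER_DAY = 140
YEARS = 20
DAYS_PER_YEAR = 365  # assumption: leap days ignored

total_tb = TB_PER_DAY * YEARS * DAYS_PER_YEAR
total_eb = total_tb / 1_000_000  # assumption: decimal units

# Partitions: one per day for each of two regions (USA, rest of world).
estimated_partitions = YEARS * DAYS_PER_YEAR * 2

# Factors from the slide, paired with techniques in slide order (assumed).
factors = {
    "Compression": 5,
    "Columnar file format": 10,
    "Scanning with 2500 nodes": 2_500,
    "Static partition elimination": 2,
    "Dynamic partition elimination": 350,
    "Amazon Redshift query optimizer": 40,
}
# Assumption: the factors are independent and multiply.
combined = math.prod(factors.values())

print(f"Total data: {total_tb:,} TB (~{total_eb:.2f} EB)")
print(f"Estimated partitions: {estimated_partitions:,}")
print(f"Combined factor: {combined:,}x")
```

The volume comes out just above one exabyte and the partition estimate lands close to the slide's 15,000, so the slide's numbers are internally consistent.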
  28. 28. Security is built-in: End-to-end encryption; Integration with AWS Key Management Service; Compliance certifications. Diagram labels: JDBC/ODBC, Customer VPC, Leader Node, Internal VPC, Compute Nodes, 10 GigE (HPC).
  29. 29. The power of data lakes: Most ways to bring data in; Terabyte – Exabyte scale; Security, compliance, and audit capabilities; Run any analytics on the same data without movement; Scale storage and compute independently; Designed for low-cost storage and analytics. Services shown: Redshift, EMR, Athena, AI Services, Elasticsearch, Kinesis, Snowball, Kinesis Video Streams, Kinesis Data Streams, Kinesis Data Firehose, Snowmobile.
  30. 30. Amazon Redshift new features, across Speed, Scale, Simplicity, and Security: Unload to Parquet; WLM Concurrency Setting; AWS Lake Formation integration; Auto Data Distribution; Deferred Maintenance; Snapshot Scheduler; Spectrum Request Accelerator; Auto data distribution; Elastic resize; Concurrency Scaling; Improving short query acceleration; Auto-vacuum; Auto-analyze.
  32. 32. From desktop software to web-scale SaaS
  33. 33. From traditional datacenter to AWS. November 2014 – Intuit announces it’s going “all-in” with AWS at re:Invent. July 2018 – With the end of its transition in sight, Intuit sells its major data center in Quincy, WA.
  34. 34. Intuit’s growing cloud business platform. Cloud cost optimization program: $100s of Thousands saved per day; $100s of Millions prepay under management; +4 Billion rows processed per day per node. Progress (chart axis: Time) is ~70% migrated to AWS. Focus is shifting from migration speed to efficient operations and growth.
  35. 35. Our goals: Handle explosive growth in data volume; Maximize investment in value-add, not operations; Provide deeper insights faster, fresher; Maintain compliance with SOX regulations.
  36. 36. Scaling challenges with our previous solution. Previous solution’s performance was constant, not accommodating increasing data volumes. Scaling in the datacenter took weeks and required significant manual effort and cost. [Chart: M rows per minute (~1M steady), batch duration (minutes), and batch size (M rows), per batch from 6/4/2017 to 11/2/2017.]
  37. 37. Amazon Redshift performance scales to demand. [Chart: rows per minute (millions), batch duration (minutes), and rows processed (millions), per batch from 10/1/2017 to 9/29/2018.]
  38. 38. Intuit cloud business architecture. Diagram labels: Stage, Process, Consume; Data Platform Ingestion; Processing Platform; Processing Platform; Orchestration Layer; API; Data Demarcation; Downstream Consumers: AI/ML, Visualizations, Business Intelligence.
  39. 39. Intuit cloud business services. Diagram labels (Stage, Process, Consume): Amazon S3, Amazon S3, Amazon SageMaker, Amazon QuickSight, Amazon CloudWatch, Amazon RDS, AWS Step Functions, Amazon SNS, AWS Lambda, Amazon EC2, AWS Lambda, Amazon EC2, AWS Lambda, Amazon Redshift.
  40. 40. Lessons learned – orchestration applications. Alternative warehouses (MS SQL, Oracle) provide all-in-one database application development platforms. AWS provides an extensive collection of services to supplement Amazon Redshift. The absence of system-native workflows can be intimidating at first. However, the broad collection of low-overhead compute, storage, and application development services provided by AWS allows for higher performing, more scalable, and lower cost solutions than previously possible. Services shown: Amazon S3, AWS Snowball*, AWS Batch, AWS Lambda, Amazon EC2, Amazon RDS, AWS DMS, Amazon CloudWatch, AWS CloudTrail, AWS Glue, Amazon Kinesis, Amazon EMR, AWS Step Functions … and many more!
  41. 41. Lessons learned. By linking Amazon Redshift with RDS PostgreSQL, the combined feature set can power a broader array of use cases and provide the best solution for each task. Amazon Redshift: Fast, simple, cost-effective data warehouse that can extend queries to your data lake. Redshift strengths: High performance against large data sets; Easily scaled MPP platform; Fast, simple ingestion from Amazon S3. PostgreSQL: Amazon RDS PostgreSQL instances provide strong affinity to Redshift due to common PostgreSQL code roots. RDS PostgreSQL strengths: Performance for many small writes; Stored Procedure support; Additional Postgres 9.x features.
  42. 42. Next steps – Concurrency Scaling for key workloads. Redshift Concurrency Scaling is expected to provide consistently fast performance for our analysts, even with thousands of concurrent queries, at minimal-to-no additional cost. Our platform’s next stage of intelligence and optimization will be derived from AI/ML applied against our data. Query patterns that are more complex and less predictable might increase the chances of concurrency conflicts with our key automated jobs. Further opening the system to internal data science teams means increasing the Redshift analyst user base several-fold.
  43. 43. Next steps – scaling with Amazon Redshift. Unload historical data from our largest tables to Amazon S3 using Unload to Parquet. Transparently query unloaded Amazon S3 data with Redshift-resident data using Redshift Spectrum. Performance excels for infrequently accessed data when Parquet’s columnar format is combined with the Redshift Spectrum Request Accelerator. These three features in concert allow one to seamlessly scale data outside of Amazon Redshift, increasing flexibility of storage and compute provisioning. Specifically, it will allow us to age older data out to S3, while keeping its retrieval seamless and performant.
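The age-out pattern described on slide 43 can be sketched as three SQL statements: unload old rows to S3 as Parquet, expose the S3 data through a Spectrum external schema, and combine both in a late-binding view. The Python below only builds the statement text; it does not connect to a cluster. Unload to Parquet was announced as coming soon in this talk, so the syntax follows Redshift's `UNLOAD ... FORMAT AS PARQUET` form as later documented. Every table, schema, bucket, and role name is hypothetical, and the sketch assumes an external table over the unloaded files has been defined separately with the same columns as the local table.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def unload_to_parquet(select_sql: str, s3_prefix: str, iam_role: str) -> str:
    """UNLOAD statement that writes a query result to S3 as Parquet."""
    return "\n".join([
        f"UNLOAD ({sql_literal(select_sql)})",
        f"TO {sql_literal(s3_prefix)}",
        f"IAM_ROLE {sql_literal(iam_role)}",
        "FORMAT AS PARQUET;",
    ])


def external_schema(schema: str, catalog_db: str, iam_role: str) -> str:
    """External schema that exposes a data catalog database to Spectrum."""
    return "\n".join([
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}",
        f"FROM DATA CATALOG DATABASE {sql_literal(catalog_db)}",
        f"IAM_ROLE {sql_literal(iam_role)}",
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;",
    ])


def combined_view(view: str, local_table: str, external_table: str) -> str:
    """Late-binding view over Redshift-resident and S3-resident rows."""
    return "\n".join([
        f"CREATE VIEW {view} AS",
        f"SELECT * FROM {local_table}",
        "UNION ALL",
        f"SELECT * FROM {external_table}",
        "WITH NO SCHEMA BINDING;",
    ])


# Hypothetical names throughout.
ROLE = "arn:aws:iam::123456789012:role/example-spectrum-role"
statements = [
    unload_to_parquet(
        "SELECT * FROM public.orders WHERE order_date < '2017-01-01'",
        "s3://example-bucket/orders/history/",
        ROLE,
    ),
    external_schema("spectrum_hist", "example_hist_db", ROLE),
    combined_view("public.orders_all", "public.orders", "spectrum_hist.orders"),
]
for statement in statements:
    print(statement, end="\n\n")
```

The view uses `WITH NO SCHEMA BINDING` because Redshift views that reference external tables must be late-binding. Analysts would then query `public.orders_all` without needing to know which rows live in S3.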
  44. 44. Amazon Redshift migration benefits. Performance & scaling: Architecture has scaled over 7x data volume with no effort on our end; >20x hardware-normalized performance with large batches. Cost: 66% reduction in operations overhead more than offsets slight Opex increase. Business outcomes: >90% reduction in time-to-insight; 0 minutes of unscheduled downtime; 50% reduction in story cycle time to implement new features.
  45. 45. Thank you! Maor Kleider, Maor@amazon.com. Jason Rhoades, Jason_Rhoades@intuit.com.