
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon Redshift

In this session, we take a deep dive on Amazon Redshift architecture and the latest performance enhancements that give you faster insights into your data. We also cover Redshift Spectrum, a feature of Redshift that enables you to analyze data across Redshift and your Amazon S3 data lake to deliver unique insights not possible by analyzing independent data silos. A customer is joining us to share how they were able to extend their data warehouse to their data lake to encompass multiple data sources and data formats. This modern architecture helps them tie together data sources to get actionable insights across their business units.

Published in: Technology

  1. BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon Redshift
     Michalis Petropoulos, Engineering Manager, Amazon Redshift
     Greg Rokita, Executive Director, Edmunds
     © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. AWS Analytics Portfolio (Collect, Store, Analyze)
     Amazon Kinesis Data Firehose, AWS Direct Connect, Amazon Snowball, Amazon Kinesis Data Analytics, Amazon Kinesis Data Streams, Amazon S3, Amazon Glacier, Amazon CloudSearch, Amazon RDS, Amazon Aurora, Amazon DynamoDB, Amazon ES, Amazon EMR, Amazon Redshift, Amazon QuickSight, AWS Database Migration Service, AWS Glue, Amazon Athena, Amazon AI
  3. Amazon Redshift: 10x faster at 1/10th the cost
     • Fast: delivers fast results for all types of workloads
     • Cost-effective: no upfront costs, start small, and pay as you go
     • Secure: audit everything; encrypt data end-to-end; extensive certification and compliance
     • Integrated: integrated with Amazon S3 data lakes, AWS services, and third-party tools
     • Simple: create and start using a data warehouse in minutes
     • Scalable: gigabytes to petabytes to exabytes
  4. Redshift Spectrum: Extend the data warehouse to your Amazon S3 data lake
     • Scale compute and storage separately
     • Join data across Amazon Redshift and S3
     • Exabyte-scale Amazon Redshift SQL queries against S3
     • Stable query performance and unlimited concurrency
     • Parquet, ORC, JSON, Grok, Avro, & CSV formats
     • Pay only for the amount of data scanned
     [Diagram: S3 data lake; Amazon Redshift data; Redshift Spectrum query engine]
  5. Amazon Redshift Architecture
     [Diagram: JDBC/ODBC; Amazon Redshift; Redshift Spectrum (scale-out serverless compute, nodes 1, 2, 3, 4 ... N); Amazon S3 (exabyte-scale object storage); AWS Glue Data Catalog]
     Query: SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY …
  6. Thousands of Companies Run Mission Critical Workloads on Amazon Redshift
  7. Forrester Wave Big Data Warehouse Q2 2017
     • “Amazon Redshift has the largest adoption of BDW in the cloud.”
     • “With more than 5,000 deployments, Amazon Redshift has the largest data warehouse deployments in the cloud – some over 10 petabytes in size.”
     • AWS received a score of 5/5 (the highest score possible) in the customer base, market awareness, ability to execute, road map, support, and partners criteria.
     The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.
  8. Amazon Redshift is Widely Available
     Ireland; Frankfurt; London; Paris; Beijing; Mumbai; Seoul; Singapore; Sydney; Tokyo; Osaka; Sao Paulo; US East – N Virginia; US East – Ohio; US West – Oregon; US West – N California; AWS GovCloud (US); Canada – Central, Montreal
  9. Selected Amazon Redshift Partners: Data Integration, Systems Integrators, Business Intelligence
  10. Recently Released Features
  11. Customer Comments
      • “We have terabytes of event data coming from our websites and applications to Amazon S3 and then to Amazon Redshift in near real-time. Redshift is at the core of our operations and used by our marketing automation tools,” said Jarno Kartela, Head of Analytics and Chief Data Scientist, DNA (Finnish Telecom Service Provider). “We can now run queries in half the time.”
      • “Redshift allows us to quickly spin up clusters and provide our data scientists with a fast and easy method to access data and generate insights,” said Bradley Todd, Liberty Mutual’s Technology Architect. “We saw a 9x reduction in month-end reporting time with Redshift DC2 nodes as compared to DC1.”
  12. Dense Compute Nodes (DC2)
      • 2x performance at the same price as DC1
      • 3x more I/O with 30% better storage utilization than DC1
      • Hardware: NVMe SSD, DDR4 memory, Intel E5-2686 v4 (Broadwell)
      “Amazon Redshift’s new DC2 node is giving us a 100 percent performance increase, allowing us to provide faster insights for our retailers, more cost effectively, to drive incremental revenue.”
  13. Short Query Acceleration: Express Lane for Short Queries
      • Short queries do not get stuck behind long running queries
      • Higher throughput – Less variability
      • Adapts to your workload
      • Transparent – it just works!
      [Chart: Average Queue Time for Short Queries (<1sec)]
  14. Short Query Acceleration: Express Lane for Short Queries
      How it works:
      • Machine learning predicts the runtime of queries
      • Short queries are routed to an express queue
      • Resources are dynamically dedicated to short queries
      • Enable it today from your AWS Management Console
      • Coming soon: Dynamic timeout based on workload
      [Diagram: Analytics and BI / Dashboard tools; Amazon Redshift; Machine Learning Classifier]
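The routing idea on this slide can be sketched in a few lines of Python. This is a toy illustration only, not how Amazon Redshift implements it: the `toy_predictor` function stands in for the machine learning classifier, and the `SHORT_QUERY_SECONDS` cutoff, the queue names, and `route_query` are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict

# Hypothetical cutoff; the slide does not state the threshold Redshift uses.
SHORT_QUERY_SECONDS = 1.0


def route_query(sql: str,
                predict_runtime: Callable[[str], float],
                queues: Dict[str, Deque[str]]) -> str:
    """Append the query to the express queue if its predicted runtime is short."""
    name = "express" if predict_runtime(sql) < SHORT_QUERY_SECONDS else "default"
    queues[name].append(sql)
    return name


def toy_predictor(sql: str) -> float:
    """Stand-in for the classifier: queries containing a join are assumed slow."""
    return 30.0 if " join " in sql.lower() else 0.2


queues: Dict[str, Deque[str]] = {"express": deque(), "default": deque()}
short = route_query("SELECT COUNT(*) FROM sales", toy_predictor, queues)
long_running = route_query(
    "SELECT * FROM sales s JOIN items i ON s.id = i.id", toy_predictor, queues
)
```

With this toy predictor the count query lands on the express queue and the join on the default queue, so the short query never waits behind the long one.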
  15. Result-set Caching: Subsecond repeat queries
      How it works:
      1. Queries go to the leader node
      2. If the cache contains the query result, it is returned with no processing
      3. If the query result is not in cache, it is executed, and the result is cached
      Caching frees up the Amazon Redshift cluster, increasing performance for all queries
      [Diagram: Analytics and BI / Dashboard tools; Amazon Redshift; results cache mapping QUERY_ID to RESULT]
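The three steps above can be sketched as a small Python class. This is a minimal sketch under stated assumptions: `LeaderNode`, its `run` method, and the use of the trimmed query text as the cache key are hypothetical, and cache invalidation when the underlying data changes is deliberately not modeled.

```python
from typing import Callable, Dict, List, Tuple

Rows = List[Tuple]


class LeaderNode:
    """Toy leader node that answers repeat queries from a result cache."""

    def __init__(self, execute: Callable[[str], Rows]) -> None:
        self._execute = execute            # stands in for the compute nodes
        self._cache: Dict[str, Rows] = {}  # query text -> cached result
        self.executions = 0                # how often the cluster did real work

    def run(self, sql: str) -> Rows:
        key = sql.strip()  # assumption: identical query text means identical result
        if key in self._cache:
            # Step 2: cache hit, returned with no processing.
            return self._cache[key]
        # Step 3: cache miss, execute and remember the result.
        self.executions += 1
        result = self._execute(sql)
        self._cache[key] = result
        return result


leader = LeaderNode(lambda sql: [(42,)])  # fake executor returning one row
first = leader.run("SELECT COUNT(*) FROM sales")
second = leader.run("SELECT COUNT(*) FROM sales")
```

After both calls `leader.executions` is 1: the second, identical query is served from the cache without touching the executor.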
  16. Result-set Caching: Subsecond repeat queries
      • Amazon Redshift customers can now serve 35% more queries on average, using the same compute resources
      • Tens of thousands of compute hours are freed up daily to serve the remaining queries and data ingestion
      • Transparent – it just works!
  17. Commit Enhancements
      • 50% faster data commits for busy clusters
      • 16% faster data ingestion and insertion
      [Charts: Commit Duration Per Transaction for Busy Clusters; Total Commit Time by Month. Labels: Nov, Jan, Mar; p99, p95, p90, p50, Linear (p99); ds2.8xlarge, cluster size: 10 and up, us-west-2; clusters with more than 90 backups a day]
  18. Query Performance Improvements
      • Faster hash joins
      • Improvements to hash algorithm (Jan '18)
      • Significant improvement in memory utilization (Feb '18)
      • Cache line prefetching to improve join performance (Mar '18)
      • Join-intensive workloads like TPC-H and TPC-DS show a performance improvement ranging from 28% to 2x for several queries
      • 64x reduction of memory footprint fleet wide for hash joins and aggregations. Significant improvement to overall throughput
      • Read and write queries can now hop WLM queues without restarting
  19. Redshift Spectrum Enhancements
      • Available in 14 AWS Regions
      • Added support for processing scalar JSON and Ion file formats in S3, in addition to Parquet, ORC, Avro, CSV, Grok, RCFile, RegexSerDe, OpenCSV, SequenceFile, TextFile, and TSV
      • Support for DATE data type
      • Support for IAM role-chaining to assume cross-account roles
      • Coming Soon: COPY from Parquet, ORC, RCFile, and Sequence files
  20. Coming soon: Nested Data Support
      • Analyze nested and semi-structured data in Amazon S3 with Spectrum
      • Allows easy ETL of nested data into Amazon Redshift using CTAS
      • Support for open file formats: Parquet, ORC, JSON, and Ion
      • Uses dot notation to extend your existing SQL
      Example: Find click frequency for links on "/home":
      s3data.clickStream:
        << { "session_time": "20171013 14:05:00",
             "clicks": [ {"page": "/home", "referrer": ""},
                         {"page": "/products", "referrer": "/home"} ] },
           { "session_time": "20171013 14:06:00",
             "clicks": [ {"page": "/contact", "referrer": "/home"} ] } >>
      SELECT c.page, COUNT(*) AS count
      FROM s3data.clickStream s, s.clicks c
      WHERE s.session_time > '2017-10-01 00:00:00'
        AND c.referrer = '/home'
      GROUP BY c.page;
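To make the semantics of the example query concrete, here is a rough Python equivalent run over the same two sample sessions. It is an illustration of what the query computes, not of how Spectrum evaluates it; the variable names and the choice to parse `session_time` into a `datetime` are assumptions.

```python
from collections import Counter
from datetime import datetime

# The sample rows from the slide's s3data.clickStream example.
click_stream = [
    {"session_time": "20171013 14:05:00",
     "clicks": [{"page": "/home", "referrer": ""},
                {"page": "/products", "referrer": "/home"}]},
    {"session_time": "20171013 14:06:00",
     "clicks": [{"page": "/contact", "referrer": "/home"}]},
]

cutoff = datetime(2017, 10, 1)

# "FROM clickStream s, s.clicks c" unnests each session's clicks;
# the two filters mirror the WHERE clause, and Counter is the GROUP BY.
counts = Counter(
    click["page"]
    for session in click_stream
    if datetime.strptime(session["session_time"], "%Y%m%d %H:%M:%S") > cutoff
    for click in session["clicks"]
    if click["referrer"] == "/home"
)
```

For this sample data, `counts` holds one click each for "/products" and "/contact"; the "/home" page view is excluded because its referrer is empty.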
  21. Coming soon: Nested Data Support
      Improve query performance by analyzing nested data
      Separate tables (join required):
        Orders: OrderID | CustomerID | OrderTime | ShipMode
          5 | 23 | 10.00 | 12.50
          8 | 32 | 1.00 | 5.60
        OrderItems: OrderID | ItemID | Quantity | Price
          5 | 23 | 10.00 | 12.50
          8 | 32 | 1.00 | 5.60
          5 | 16 | 1.00 | 1.99
          8 | 24 | 5.00 | 26.50
      Nested table:
        Orders: OrderID | CustomerID | OrderTime | ShipMode | OrdersWithItems
          5 | 23 | 10.00 | 12.50
          8 | 32 | 1.00 | 5.60
        OrdersWithItems (nested column): ItemID | Quantity | Price
          23 | 10.00 | 12.50
          16 | 1.00 | 1.99
          32 | 1.00 | 5.60
          24 | 5.00 | 26.50
      To improve query performance, the new Orders table includes the OrdersWithItems as a nested column, eliminating join processing
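A small Python sketch of the same idea, using the OrderItems rows from the slide: the items are folded into each order once, so a later per-order computation reads the nested list instead of joining two tables. The dictionary layout and the `order_total` helper are hypothetical, and only the order IDs are taken from the Orders table.

```python
from collections import defaultdict

# OrderItems rows as listed on the slide.
order_items = [
    {"OrderID": 5, "ItemID": 23, "Quantity": 10.00, "Price": 12.50},
    {"OrderID": 8, "ItemID": 32, "Quantity": 1.00, "Price": 5.60},
    {"OrderID": 5, "ItemID": 16, "Quantity": 1.00, "Price": 1.99},
    {"OrderID": 8, "ItemID": 24, "Quantity": 5.00, "Price": 26.50},
]
order_ids = [5, 8]

# One-time nesting step: group the items under their order.
items_by_order = defaultdict(list)
for item in order_items:
    items_by_order[item["OrderID"]].append(
        {key: value for key, value in item.items() if key != "OrderID"}
    )

nested_orders = [
    {"OrderID": order_id, "OrdersWithItems": items_by_order[order_id]}
    for order_id in order_ids
]


def order_total(order: dict) -> float:
    """Sum quantity * price over the nested items; no join is needed."""
    return sum(i["Quantity"] * i["Price"] for i in order["OrdersWithItems"])


totals = {o["OrderID"]: round(order_total(o), 2) for o in nested_orders}
```

Each nested order now carries its own items, so `totals` is computed with a single pass over `nested_orders`.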
  22. Amazon Redshift is Self-Healing
      • Machine-learning based prediction and remediation of degraded disks, nodes and network
      • Ensure overall cluster and query performance
  23. becoming data-driven
  24. who is edmunds?
  25. greg rokita: Exec Director, Technology | M.S. in Computer Science | Founder
  26. our history
      • 1966: Edmunds is incorporated after being founded by Louis Arons and Michael Mayor, publishers of New Car Prices and Used Car Prices.
      • 1988: Peter Steinlauf buys the company with AJA Holding Corp. Edmunds has 9 publications (7 automotive-related) selling for $3.95 each.
      • 1995: Edmunds is first to publish car info on the internet. It evolves into the very first automotive information website, before any carmaker has one.
      • 2007: Edmunds launches Dealer Ratings & Reviews.
      • 2014: Edmunds introduces a proprietary messaging platform. Now car shoppers can text dealers directly. We call it CarCode.
      • 2017: Edmunds unveils its new brand ecosystem and voice, launches a re-imagined customer-centered site with new content, and forms new partnerships with a suite of new digital marketing services.
  27. transformations
  28. becoming data driven
  29. batch data access
      Now - Data Warehouse
      • Structured, Processed
      • Schema-on-write
      • Expensive for large volume of data
      • Less agile
      • Used by business
      Future - Data Lake
      • Structured, Semi-structured, raw
      • Schema-on-read
      • Designed for low cost storage
      • Highly agile
      • Used by data scientists
      Trends
      • Processing power and storage getting cheaper
      • More use of data by Data Scientists
      • Volume of unstructured data is increasing
  30. evolution of data warehousing: 2008
      • High development cost
      • Cannot scale easily / costly
      • No separation of processing and access
      • Hard to find talent
  31. evolution of data warehousing: 2011
      Processing
      • Scalable and inexpensive
      • Talent issue somewhat resolved
      Data Access
      • Expensive
      • High maintenance
  32. evolution of data warehousing: 2014 (Amazon Redshift)
      Wins
      • Cost efficient data access
      • Low maintenance
      Challenges
      • Data and storage tightly coupled
      • No differential SLAs
      • Low flexibility
  33. IT-centric view of data
      [Diagram labels: Website/Apps; Data Services/APIs; Third-Party Data; Data Warehouse; Third-Party Data; Analytics; EAS. Flow labels: Feed, Import, Load, Load, Read, Track, Feed]
  34. data-driven approach
      [Diagram labels: Website/Apps; Analytics; APIs; Data Engineering; Third-Party Data]
  35. blueprint
      • Use Case Layer: Reporting, Last Mile Processing, Visual Analytics, Transformations
      • Query/Processing Engine Layer: Engine 1 Cluster A, Engine 1 Cluster B, Engine 2 Cluster, Engine 3 Cluster
      • Data Layer: S3 Data (Parquet)
  36. Edmunds: AWS Beta Customer for Amazon Redshift Copy from Parquet
  37. commit performance optimizations, month over month
  38. wins (Amazon Redshift)
      • 39 8xlarge instances: Performance for mission-critical workloads
      • Result-set Caching: 20% of queries under 1 sec
      • Commit Performance Optimizations: +50% Speedup overall, 1000s of hourly jobs
      • Workflow Management: Manage priorities within workloads
      • Short Query Acceleration: Automated prioritization for ad-hoc queries
  39. data warehouse vs data engineering
      Use Case Layer | Query/Processing Engine Layer
      Ad Hoc Analytics | Redshift
      Reporting | Redshift
      Real-Time Apps | Spark (Streaming) Scala/Java
      Metadata Management | AWS Glue
      Data Science | PySpark
      Legacy Map-Reduce | EMR
      Data Layer: S3 Data (Parquet)
  41. thank you
  42. Please complete the session survey in the summit mobile app.
  43. Submit Session Feedback
      1. Tap the Schedule icon.
      2. Select the session you attended.
      3. Tap Session Evaluation to submit your feedback.
  44. Get Started With Amazon Redshift
      • Find out more: https://aws.amazon.com/redshift/
      • Try Amazon Redshift
      • Get help with your Proof-of-Concept
      • Read Amazon Redshift blog articles: https://aws.amazon.com/redshift/blog-posts/
