
Intro to Amazon Redshift Spectrum: Quickly Query Exabytes of Data in S3 - June 2017 AWS Online Tech Talks

Learning Objectives:
- Learn about Redshift Spectrum, a new feature that allows you to run Redshift queries directly against your data in Amazon S3
- Understand common use cases for Redshift Spectrum
- Identify strategies for improving performance and saving costs when querying your data in Amazon S3

Amazon Redshift Spectrum is a new feature that extends Amazon Redshift's analytics capabilities beyond the data stored in your data warehouse, letting you also query your data in Amazon S3. You can use Amazon Redshift and your existing business intelligence tools to run SQL queries against exabytes of data.

In this session, we will show you how you can easily start querying your data stored in Amazon S3 with Redshift Spectrum. You can run Amazon Redshift queries on Amazon S3 data on its own, or run queries that join data in S3 with data already in your Redshift data warehouse. We will highlight technical details of Redshift Spectrum's query execution and implementation. We will also cover supported queries, data formats, and strategies to save costs by compressing your data or transforming it into a columnar format.
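
For example, here is a minimal, hedged sketch of the kind of join described above, combining an external table in S3 with a local Amazon Redshift table (the schema, table, and column names are hypothetical, not from the talk):

    -- s3.clickstream_events: hypothetical external table over files in Amazon S3
    -- users: hypothetical local Amazon Redshift table
    SELECT u.segment,
           COUNT(*) AS clicks
    FROM s3.clickstream_events e
    JOIN users u ON e.user_id = u.user_id
    WHERE e.event_date = '2017-06-01'   -- filters are pushed down to the Spectrum layer
    GROUP BY u.segment
    ORDER BY clicks DESC;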

  1. June 7th, 2017. Introduction to Amazon Redshift Spectrum. Maor Kleider, Sr. Product Manager, Amazon Redshift. ©2016, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. What is Big Data? When your data sets become so large and diverse that you have to start innovating around how to collect, store, process, analyze, and share them.
  3. Generate → Collect & Store → Analyze → Collaborate & Act. It's never been easier to generate vast amounts of data: individual AWS customers generate over a PB/day.
  4. Generate → Collect & Store → Analyze → Collaborate & Act. Amazon S3 lets you collect and store all this data: store exabytes of data in S3.
  5. Generate → Collect & Store → Analyze → Collaborate & Act. But how do you analyze it? The Analyze step is highly constrained.
  6. The Dark Data Problem: most generated data is unavailable for analysis. [Chart: data volume by year, 1990-2020, comparing generated data with data available for analysis.] Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012-2016 Forecast and 2011 Vendor Shares".
  7. The tyranny of "OR". Amazon EMR: directly access data in S3; scale out to thousands of nodes; open data formats; popular big data frameworks; anything you can dream up and code. Amazon Redshift: super-fast local disk performance; sophisticated query optimization; join-optimized data formats; query using standard SQL; optimized for data warehousing.
  8. But I don't want to choose. I shouldn't have to choose. I want "all of the above".
  9. I want sophisticated query optimization and scale-out processing, super-fast performance and support for open formats, the throughput of local disk and the scale of S3.
  10. I want all this from one data processing engine, with my data accessible from all data processing engines, now and in the future.
  11. Amazon Redshift Spectrum
  12. Amazon Redshift Spectrum: run SQL queries directly against data in S3 using thousands of nodes. Fast at exabyte scale; elastic and highly available; on-demand, pay-per-query; high concurrency (multiple clusters access the same data); no ETL (query data in place using open file formats); full Amazon Redshift SQL support.
  13. Life of a query, step 1: A query (SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY ...) arrives via JDBC/ODBC. [Diagram on slides 13-21: an Amazon Redshift cluster with leader and compute nodes 1..N, Amazon S3 exabyte-scale object storage, and a Data Catalog / Apache Hive Metastore.]
  14. Life of a query, step 2: The query is optimized and compiled at the leader node, which determines what gets run locally and what goes to Amazon Redshift Spectrum.
  15. Life of a query, step 3: The query plan is sent to all compute nodes.
  16. Life of a query, step 4: Compute nodes obtain partition info from the Data Catalog and dynamically prune partitions (a sketch of this pruning follows these steps).
  17. Life of a query, step 5: Each compute node issues multiple requests to the Amazon Redshift Spectrum layer.
  18. Life of a query, step 6: Amazon Redshift Spectrum nodes scan your S3 data.
  19. Life of a query, step 7: Amazon Redshift Spectrum projects, filters, joins, and aggregates.
  20. Life of a query, step 8: Final aggregations and joins with local Amazon Redshift tables are done in-cluster.
  21. Life of a query, step 9: The result is sent back to the client.
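
To make step 4 concrete, here is a minimal sketch of dynamic partition pruning, assuming a hypothetical external table partitioned by order_date (the schema and table names are illustrative, not from the talk):

    -- Hypothetical external table s3.orders, partitioned by order_date.
    -- The filter restricts the scan to a single partition; the remaining
    -- partitions are pruned using Data Catalog metadata and never read from S3.
    SELECT COUNT(*)
    FROM s3.orders
    WHERE order_date = '2017-06-01';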
  22. Running an analytic query over an exabyte in S3
  23. Let's build an analytic query, #1. An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's get the prior books she's written. (1 table, 2 filters.) SELECT P.ASIN, P.TITLE FROM products P WHERE P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling';
  24. Let's build an analytic query, #2. An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's compute the sales of the prior books she's written in this series and return the top 20 values. (2 tables (1 S3, 1 local), 2 filters, 1 join, 2 GROUP BY columns, 1 ORDER BY, 1 LIMIT, 1 aggregation.) SELECT P.ASIN, P.TITLE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, products P WHERE D.ASIN = P.ASIN AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' GROUP BY P.ASIN, P.TITLE ORDER BY SALES_sum DESC LIMIT 20;
  25. Let's build an analytic query, #3. An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions. (3 tables (1 S3, 2 local), 5 filters, 2 joins, 3 GROUP BY columns, 1 ORDER BY, 1 LIMIT, 1 aggregation, 1 function, 2 casts.) SELECT P.ASIN, P.TITLE, P.RELEASE_DATE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, asin_attributes A, products P WHERE D.ASIN = P.ASIN AND P.ASIN = A.ASIN AND A.EDITION LIKE '%FIRST%' AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' AND D.ORDER_DAY::DATE >= P.RELEASE_DATE AND D.ORDER_DAY::DATE < dateadd(day, 3, P.RELEASE_DATE) GROUP BY P.ASIN, P.TITLE, P.RELEASE_DATE ORDER BY SALES_sum DESC LIMIT 20;
  26. Let's build an analytic query, #4. An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions in the city of Seattle, WA, USA. (4 tables (1 S3, 3 local), 8 filters, 3 joins, 4 GROUP BY columns, 1 ORDER BY, 1 LIMIT, 1 aggregation, 1 function, 2 casts.) SELECT P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, asin_attributes A, products P, regions R WHERE D.ASIN = P.ASIN AND P.ASIN = A.ASIN AND D.REGION_ID = R.REGION_ID AND A.EDITION LIKE '%FIRST%' AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' AND R.COUNTRY_CODE = 'US' AND R.CITY = 'Seattle' AND R.STATE = 'WA' AND D.ORDER_DAY::DATE >= P.RELEASE_DATE AND D.ORDER_DAY::DATE < dateadd(day, 3, P.RELEASE_DATE) GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE ORDER BY SALES_sum DESC LIMIT 20;
  27. Now let's run that query over an exabyte of data in S3. Roughly 140 TB of customer item order detail records for each day over the past 20 years; 190 million files across 15,000 partitions in S3, one partition per day for USA and rest of world. We need a billion-fold reduction in data processed: running this query using a 1000-node Hive cluster would take over 5 years.* The reduction factors multiply: compression (5X) × columnar file format (10X) × scanning with 2500 nodes (2500X) × static partition elimination (2X) × dynamic partition elimination (350X) × Redshift's query optimizer (40X), for a total reduction of about 3.5 billion X (5 × 10 × 2500 × 2 × 350 × 40 = 3.5 × 10^9). *Estimated using a 20-node Hive cluster and 1.4 TB, assuming linear scaling. The query used a 20-node DC1.8XLarge Amazon Redshift cluster. Not actual sales data; generated for this demo based on the data format used by Amazon Retail.
  28. Amazon Redshift Spectrum is fast. It leverages Amazon Redshift's advanced cost-based optimizer; pushes down projections, filters, aggregations, and join reduction; uses dynamic partition pruning to minimize data processed; automatically parallelizes query execution against S3 data; and performs efficient join processing within the Amazon Redshift cluster.
  29. Amazon Redshift Spectrum is cost-effective. You pay for your Amazon Redshift cluster plus $5 per TB scanned from S3. Each query can leverage thousands of Amazon Redshift Spectrum nodes. You can reduce the TB scanned and improve query performance by partitioning data, using a columnar file format, and compressing data (see the sketch after this slide).
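
As a hedged way to check whether these optimizations are paying off, Amazon Redshift records Spectrum scan statistics in the SVL_S3QUERY_SUMMARY system view; a sketch (column names are to the best of our knowledge, so verify against your cluster's documentation):

    -- Inspect recent Spectrum queries and how much S3 data each scanned;
    -- fewer bytes scanned means lower cost at $5 per TB.
    SELECT query, elapsed, s3_scanned_rows, s3_scanned_bytes
    FROM svl_s3query_summary
    ORDER BY query DESC
    LIMIT 10;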
  30. Amazon Redshift Spectrum is secure. End-to-end data encryption: encrypt S3 data using SSE and AWS KMS; encrypt all Amazon Redshift data using KMS, AWS CloudHSM, or your on-premises HSMs; enforce SSL with perfect forward secrecy using ECDHE. Virtual private cloud: the Amazon Redshift leader node runs in your VPC, compute nodes run in a private VPC, and Spectrum nodes run in a private VPC and store no state. Alerts & notifications: communicate event-specific notifications via email, text message, or call with Amazon SNS. Audit logging: all API calls are logged using AWS CloudTrail, and all SQL statements are logged within Amazon Redshift. Certifications & compliance: PCI DSS, FedRAMP, SOC 1/2/3, HIPAA/BAA.
  31. Amazon Redshift Spectrum uses standard SQL. Redshift Spectrum seamlessly integrates with your existing SQL & BI apps. Support for complex joins, nested queries, and window functions. Support for data partitioned in S3 by any key: date, time, and any other custom keys, e.g., year, month, day, hour.
  32. Defining external schemas and creating tables. Define an external schema in Amazon Redshift using the Amazon Athena data catalog or your own Apache Hive Metastore: CREATE EXTERNAL SCHEMA <schema_name>. Query external tables using <schema_name>.<table_name>. Register external tables using Athena, your Hive Metastore client, or from Amazon Redshift. CREATE EXTERNAL TABLE syntax: CREATE EXTERNAL TABLE <table_name> [PARTITIONED BY (<column_name> <data_type>, ...)] STORED AS file_format LOCATION s3_location [TABLE PROPERTIES property_name=property_value, ...]; (A concrete sketch follows this slide.)
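
A concrete, hedged sketch of the DDL above, assuming a hypothetical S3 bucket, IAM role, and catalog database (adjust names and options for your environment):

    -- Define an external schema backed by the Athena/Glue data catalog.
    -- The IAM role ARN and database name are placeholders.
    CREATE EXTERNAL SCHEMA spectrum
    FROM DATA CATALOG
    DATABASE 'spectrumdb'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- Register an external table over delimited text files in S3,
    -- partitioned by date (DATE is allowed as a partitioning key).
    CREATE EXTERNAL TABLE spectrum.orders (
      order_id  BIGINT,
      asin      VARCHAR(10),
      quantity  INT,
      our_price DECIMAL(8,2)
    )
    PARTITIONED BY (order_date DATE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://mybucket/orders/';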
  33. Amazon Redshift Spectrum, current support. File formats: Parquet, CSV, Sequence, RCFile, ORC (coming soon), RegexSerDe (coming soon). Compression: Gzip, Snappy, Lzo (coming soon), Bz2. Encryption: SSE with AES256, SSE-KMS with default key. Column types: numeric (bigint, int, smallint, float, double, and decimal), char/varchar/string, timestamp, boolean; the DATE type can be used only as a partitioning key. Table types: non-partitioned tables (s3://mybucket/orders/..) and partitioned tables (s3://mybucket/orders/date=YYYY-MM-DD/..); adding a partition is sketched below.
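
For the partitioned layout above, each partition is registered explicitly; a hedged sketch using the hypothetical table from the previous slide's example:

    -- Map one day's partition key to its S3 prefix (names are placeholders).
    ALTER TABLE spectrum.orders
    ADD PARTITION (order_date='2017-06-01')
    LOCATION 's3://mybucket/orders/date=2017-06-01/';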
  34. Converting to Parquet and ORC using Amazon EMR. You can use Hive CREATE TABLE AS SELECT to convert data: CREATE TABLE data_converted STORED AS PARQUET AS SELECT col1, col2, col3 FROM data_source; Or use Spark: about 20 lines of PySpark code running on Amazon EMR. 1 TB of text data reduced to 130 GB in Parquet format with Snappy compression; total cost of the EMR job to do this: $5. https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
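
A hedged sketch of the Hive conversion above with the compression made explicit (table and column names are placeholders; using the parquet.compression table property to request Snappy output is our assumption, not from the talk):

    -- Convert a text-format table to Snappy-compressed Parquet via Hive CTAS.
    CREATE TABLE orders_parquet
    STORED AS PARQUET
    TBLPROPERTIES ('parquet.compression'='SNAPPY')
    AS
    SELECT order_id, asin, quantity, our_price, order_date
    FROM orders_text;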
  35. Is Amazon Redshift Spectrum useful if I don't have an exabyte? Your data will get bigger: on average, data warehousing volumes grow 10x every 5 years, and the average Amazon Redshift customer doubles their data each year. Amazon Redshift Spectrum makes data analysis simpler: access your data without ETL pipelines, and teams using Amazon EMR, Athena, & Redshift can collaborate using the same data lake. Amazon Redshift Spectrum improves availability and concurrency: run multiple Amazon Redshift clusters against common data, and isolate jobs with tight SLAs from ad hoc analysis.
  36. The Emerging Analytics Architecture, spanning storage, serverless compute, and data processing: Amazon S3 (exabyte-scale object storage), Amazon Kinesis Firehose (real-time data streaming), Amazon Athena (interactive query), AWS Glue (ETL & data catalog), AWS Glue Data Catalog (Hive-compatible metastore), AWS Lambda (trigger-based code execution), Amazon EMR (managed Hadoop applications), Amazon Redshift (petabyte-scale data warehousing), and Amazon Redshift Spectrum (fast at exabyte scale).
  37. Over 20 customers helped preview Amazon Redshift Spectrum
  38. Thank you! Questions?
