Getting Started with Amazon Redshift

Learn how Amazon Redshift, our fully managed, petabyte-scale data warehouse, can help you quickly and cost-effectively analyze all of your data using your existing business intelligence tools. Get an introduction to how Amazon Redshift uses massively parallel processing, scale-out architecture, and columnar direct-attached storage to minimize I/O time and maximize performance. Learn how you can gain deeper business insights and save money and time by migrating to Amazon Redshift. Take away strategies for migrating from on-premises data warehousing solutions, tuning schema and queries, and utilizing third party solutions.

Published in: Technology
  1. 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Pavan Pothukuchi, Principal Product Manager, AWS Chris Liu, Data Infrastructure Engineer, Coursera July 13, 2016 Getting Started with Amazon Redshift
  2. 2. Agenda • Introduction • Benefits • Use cases • Redshift @ Coursera • Q&A
  3. 3. AWS big data portfolio. Collect: Amazon Kinesis Streams, Amazon Kinesis Firehose, AWS Direct Connect, AWS Import/Export Snowball. Store: Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Aurora. Analyze: Amazon Redshift, Amazon EMR, Amazon EC2, Amazon CloudSearch, Amazon Machine Learning, Amazon Elasticsearch Service, Amazon QuickSight. Plus AWS Data Pipeline and AWS Database Migration Service.
  4. 4. Amazon Redshift: relational data warehouse; massively parallel, petabyte scale; fully managed; HDD and SSD platforms; $1,000/TB/year, starts at $0.25/hour. A lot faster, a lot simpler, a lot cheaper.
  5. 5. The Amazon Redshift view of data warehousing: 10x cheaper, easy to provision, higher DBA productivity; 10x faster, no programming, easily leverage BI tools, Hadoop, machine learning, streaming; analysis inline with process flows, pay as you go, grow as you need, managed availability and disaster recovery. (Slide columns: enterprise, big data, SaaS.)
  6. 6. The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change. Forrester Wave™ enterprise data warehouse Q4 ’15
  7. 7. Selected Amazon Redshift customers
  8. 8. Amazon Redshift architecture. Leader node: simple SQL endpoint (JDBC/ODBC), stores metadata, optimizes query plan, coordinates query execution. Compute nodes: local columnar storage, parallel/distributed execution of all queries, loads, backups, restores, resizes; connected by a 10 GigE (HPC) interconnect, with ingestion, backup, and restore paths. Start at just $0.25/hour, grow to 2 PB (compressed). DC1: SSD; scale from 160 GB to 326 TB. DS2: HDD; scale from 2 TB to 2 PB.
  9. 9. Benefit #1: Amazon Redshift is fast. Dramatically less I/O: column storage, data compression, zone maps, direct-attached storage, large data block sizes. Example output of "analyze compression listing;": listid delta, sellerid delta32k, eventid delta32k, dateid bytedict, numtickets bytedict, priceperticket delta32k, totalprice mostly32, listtime raw. (Zone map illustration: each block records its min/max values, e.g. 10–324, 375–623, 637–959, so blocks whose range falls outside a query's filter are skipped.)
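A minimal sketch of how these mechanisms are exercised in Redshift SQL. The listing table mirrors the slide's example, but the column types and the sample query are assumptions:

    -- Hypothetical listing table: columnar storage means a query touching two
    -- columns reads only those columns' blocks; the SORTKEY makes zone maps on
    -- listtime effective; ENCODE sets per-column compression (sort key left raw).
    CREATE TABLE listing (
        listid         INTEGER,
        sellerid       INTEGER,
        eventid        INTEGER,
        dateid         SMALLINT,
        numtickets     SMALLINT,
        priceperticket DECIMAL(8,2),
        totalprice     DECIMAL(8,2),
        listtime       TIMESTAMP ENCODE raw
    )
    SORTKEY (listtime);

    -- After loading data, ask Redshift to recommend encodings from a sample;
    -- this is the command whose output the slide shows.
    ANALYZE COMPRESSION listing;

    -- Range-restricted scan: zone maps let Redshift skip 1 MB blocks whose
    -- min/max listtime falls outside the filter.
    SELECT dateid, SUM(totalprice)
    FROM listing
    WHERE listtime BETWEEN '2016-07-01' AND '2016-07-07'
    GROUP BY dateid;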
  10. 10. Benefit #1: Amazon Redshift is fast Parallel and distributed Query Load Export Backup Restore Resize
  11. 11. Benefit #1: Amazon Redshift is fast Hardware optimized for I/O intensive workloads, 4 GB/sec/node Enhanced networking, over 1 million packets/sec/node Choice of storage type, instance size Regular cadence of autopatched improvements
  12. 12. Benefit #1: Amazon Redshift is fast New dense storage (HDD) instance type (Jun 2015) Improved memory 2x, compute 2x, disk throughput 1.5x Cost: Same as our prior generation! Performance improvement: 50% Enhanced I/O and commit improvements (Jan 2016) Performance improvement: 35% Memory allocation improvements (May 2016) Performance improvement: 60%
  13. 13. Benefit #2: Amazon Redshift is inexpensive. DS2 (HDD), price per hour for a DS2.XL single node / effective annual price per TB compressed: on demand $0.850 / $3,725; 1-year reservation $0.500 / $2,190; 3-year reservation $0.228 / $999. DC1 (SSD), price per hour for a DC1.L single node / effective annual price per TB compressed: on demand $0.250 / $13,690; 1-year reservation $0.161 / $8,795; 3-year reservation $0.100 / $5,500. Pricing is simple: number of nodes x price/hour, no charge for the leader node, no upfront costs, pay as you go.
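The effective annual price per TB follows from the hourly rate and per-node capacity, assuming a DS2.XL node holds 2 TB and a DC1.L node holds 160 GB of compressed data: $0.228/hour x 8,760 hours ≈ $1,997 per node-year, or about $999 per TB across 2 TB; $0.100/hour x 8,760 hours = $876 per node-year, or roughly $5,475 per TB across 0.16 TB, in line with the ~$5,500 shown above.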
  14. 14. Benefit #3: Amazon Redshift is fully managed Continuous/incremental backups Multiple copies within cluster Continuous and incremental backups to Amazon S3 Continuous and incremental backups across regions Streaming restore Amazon S3 Amazon S3 Region 1 Region 2
  15. 15. Benefit #3: Amazon Redshift is fully managed Amazon S3 Amazon S3 Region 1 Region 2 Fault tolerance Disk failures Node failures Network failures Availability Zone/region level disasters
  16. 16. Benefit #4: Security is built in • Load encrypted from Amazon S3 • SSL to secure data in transit • ECDHE perfect forward secrecy • Amazon VPC for network isolation • Encryption to secure data at rest • All blocks on disks and in Amazon S3 encrypted • Block key, cluster key, master key (AES-256) • On-premises HSM and AWS CloudHSM support • Audit logging and AWS CloudTrail integration • SOC 1/2/3, PCI-DSS, FedRAMP, BAA (Diagram labels: customer VPC, internal VPC, JDBC/ODBC, ingestion, backup, restore, 10 GigE (HPC).)
  17. 17. Benefit #5: We innovate quickly Well over 125 new features added since launch Release every two weeks Automatic patching Service Launch (2/14) PDX (4/2) Temp Credentials (4/11) DUB (4/25) SOC1/2/3 (5/8) Unload Encrypted Files NRT (6/5) JDBC Fetch Size (6/27) Unload logs (7/5) SHA1 Builtin (7/15) 4 byte UTF-8 (7/18) Sharing snapshots (7/18) Statement Timeout (7/22) Timezone, Epoch, Autoformat (7/25) WLM Timeout/Wildcards (8/1) CRC32 Builtin, CSV, Restore Progress (8/9) Resource Level IAM (8/9) PCI (8/22) UTF-8 Substitution (8/29) JSON, Regex, Cursors (9/10) Split_part, Audit tables (10/3) SIN/SYD (10/8) HSM Support (11/11) Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, Cross Region Backup (11/13) Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13) EIP Support for VPC Clusters (12/28) New query monitoring system tables and diststyle all (1/13) Redshift on DW2 (SSD) Nodes (1/23) Compression for COPY from SSH, Fetch size support for single node clusters, new system tables with commit stats, row_number(), strotol() and query termination (2/13) Resize progress indicator & Cluster Version (3/21) Regex_Substr, COPY from JSON (3/25) 50 slots, COPY from EMR, ECDHE ciphers (4/22) 3 new regex features, Unload to single file, FedRAMP(5/6) Rename Cluster (6/2) Copy from multiple regions, percentile_cont, percentile_disc (6/30) Free Trial (7/1) pg_last_unload_count (9/15) AES-128 S3 encryption (9/29) UTF-16 support (9/29)
  18. 18. Benefit #6: Amazon Redshift is powerful • Approximate functions • User-defined functions • Machine learning • Data science
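Two of these capabilities sketched in SQL; the table and column names (clickstream, users, user_id, email) are hypothetical:

    -- Approximate distinct count via HyperLogLog: much cheaper than an exact
    -- COUNT(DISTINCT ...) on large tables, with a small bounded error.
    SELECT APPROXIMATE COUNT(DISTINCT user_id) AS approx_users
    FROM clickstream;

    -- Scalar user-defined function written in Python.
    CREATE OR REPLACE FUNCTION f_normalize_email(email VARCHAR)
    RETURNS VARCHAR
    STABLE
    AS $$
        return email.strip().lower() if email else None
    $$ LANGUAGE plpythonu;

    SELECT f_normalize_email(email) AS email FROM users LIMIT 10;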
  19. 19. Benefit #7: Amazon Redshift has a large ecosystem: data integration, business intelligence, systems integrators
  20. 20. Benefit #8: Service-oriented architecture Amazon DynamoDB Amazon EMR Amazon S3 Amazon EC2/SSH Amazon RDS/ Amazon Aurora Amazon Redshift Amazon Kinesis Amazon ML AWS Data Pipeline Amazon CloudSearch Amazon Mobile Analytics
  21. 21. Recent launches, spanning performance, ease of use, security, analytics and functionality, and SOA: dynamic WLM parameters; queue hopping for timed-out queries; merge rows from staging to prod. table; 2x improvement in query throughput; 10x latency improvement for UNION ALL queries; bzip2 format for ingestion; table-level restore; 10x improvement in vacuum perf.; default access privileges; tag-based AWS IAM access; IAM roles for COPY/UNLOAD; SAS connector enhancements, implicit conversion of SAS queries to Amazon Redshift; DMS support from OLTP sources; enhanced data ingestion from Kinesis Firehose; improved data schema conversion to Amazon ML
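The "merge rows from staging to prod. table" item refers to the usual Redshift upsert pattern: load a staging table, then delete-and-insert inside one transaction. A sketch with hypothetical table names and a placeholder IAM role ARN:

    BEGIN;

    -- Stage the incoming batch. IAM-role authentication for COPY is one of the
    -- launches listed above; the role ARN below is a placeholder.
    COPY sales_staging
    FROM 's3://example-bucket/sales/2016-07-13/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV;

    -- Replace any rows that already exist in the production table...
    DELETE FROM sales
    USING sales_staging
    WHERE sales.sale_id = sales_staging.sale_id;

    -- ...and append the fresh versions.
    INSERT INTO sales
    SELECT * FROM sales_staging;

    COMMIT;

    TRUNCATE sales_staging;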
  22. 22. Use cases
  23. 23. NTT Docomo: Japan’s largest mobile service provider 68 million customers Tens of TBs per day of data across a mobile network 6 PB of total data (uncompressed) Data science for marketing operations, logistics, and so on Greenplum on premises Scaling challenges Performance issues Need same level of security Need for a hybrid environment
  24. 24. NTT Docomo: Japan’s largest mobile service provider. 125 node DS2.8XL cluster; 4,500 vCPUs, 30 TB RAM; 2 PB compressed; 10x faster analytic queries; 50% reduction in time for new BI application deployment; significantly less operations overhead. (Architecture diagram: data source, ET, AWS Direct Connect, client forwarder, loader, state management, sandbox, Amazon Redshift, S3.)
  25. 25. Nasdaq: powering 100 marketplaces in 50 countries Orders, quotes, trade executions, market “tick” data from 7 exchanges 7 billion rows/day Analyze market share, client activity, surveillance, billing, and so on Microsoft SQL Server on premises Expensive legacy DW ($1.16 M/yr.) Limited capacity (1 yr. of data online) Needed lower TCO Must satisfy multiple security and regulatory requirements Similar performance
  26. 26. 23 node DS2.8XL cluster 828 vCPUs, 5 TB RAM 368 TB compressed 2.7 T rows, 900 B derived 8 tables with 100 B rows 7 man-month migration ¼ the cost, 2x storage, room to grow Faster performance, very secure Nasdaq: powering 100 marketplaces in 50 countries
  27. 27. Amazon Redshift @ Coursera. Chris Liu – Data Infrastructure Engineer
  28. 28. Education at scale 140 partners 3.7 million course completions 20 million learners worldwide 1900 courses
  29. 29. Takeaways • Simplicity • Scalability • Flexibility • Extensibility
  30. 30. Outline • Moving from no data warehouse to the Amazon Redshift ecosystem • No warehouse: m2.2xlarge read replica – 4 CPUs, 32 GB RAM on Amazon RDS • First Amazon Redshift cluster: 1 ds1.xl node – 2 CPU, 16 GB RAM
  31. 31. Outline • Moving from no data warehouse to the Amazon Redshift ecosystem • No warehouse: m2.2xlarge read replica – 4 CPUs, 32 GB RAM on Amazon RDS • First Amazon Redshift cluster: 1 ds1.xl node – 2 CPU, 16 GB RAM • The Amazon Redshift ecosystem at Coursera • Current day: 9 dc1.8xl nodes – 288 CPUs, 2.4 TB RAM
  32. 32. Outline • Moving from no data warehouse to the Amazon Redshift ecosystem • No warehouse: m2.2xlarge read replica – 4 CPUs, 32 GB RAM on Amazon RDS • First Amazon Redshift cluster: 1 ds1.xl node – 2 CPU, 16 GB RAM • The Amazon Redshift ecosystem at Coursera • Current day: 9 dc1.8xl nodes – 288 CPUs, 2.4 TB RAM • Learnings from 3 years on Amazon Redshift • Lessons in communication, surprises, reflections
  33. 33. Starting point • Querying production read replica • Makeshift libraries providing thin abstraction layer • 45 minutes to provide aggregate metrics over all classes running on Coursera =(
  34. 34. Starting point • Querying production read replica • Makeshift libraries providing thin abstraction layer • 45 minutes to provide aggregate metrics over all classes running on Coursera =(
  35. 35. Move in progress • Risk-free deployment • "Let's try it out" • Few clicks to deploy cluster, connect to cluster, resize
  36. 36. Move in progress • Risk-free deployment • "Let's try it out" • Few clicks to deploy cluster, connect to cluster, resize • AWS ecosystem integration • COPY from S3/EMR/SSH • Unload to S3 • UNLOAD(COPY(data)) == COPY(UNLOAD(data)) == data
  37. 37. Move in progress • Risk-free deployment • "Let's try it out" • Few clicks to deploy cluster, connect to cluster, resize • AWS ecosystem integration • COPY from S3/EMR/SSH • Unload to S3 • UNLOAD(COPY(data)) == COPY(UNLOAD(data)) == data • Minimal administration • In aggregate, less than 1 full-time employee for administration • Automation and tooling for monitoring usage and performance
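A sketch of the COPY/UNLOAD round trip mentioned above; the bucket, table names, and IAM role ARNs are placeholders:

    -- Export query results to S3 in parallel (one file per slice by default).
    UNLOAD ('SELECT * FROM course_enrollments WHERE enrolled_at >= ''2016-01-01''')
    TO 's3://example-bucket/exports/enrollments_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
    DELIMITER '|'
    GZIP;

    -- Load the same files back into another table: COPY(UNLOAD(data)) == data.
    COPY course_enrollments_copy
    FROM 's3://example-bucket/exports/enrollments_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    DELIMITER '|'
    GZIP;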
  38. 38. Minimal administration
  39. 39. Amazon Redshift ecosystem at Coursera • Data flow in and out of Amazon Redshift • Business insights and reporting • Deriving value from data • Democratizing data access
  40. 40. Amazon Redshift ecosystem at Coursera • Data flow in and out of Amazon Redshift • Business insights and reporting • Deriving value from data • Democratizing data access
  41. 41. Data flow (architecture diagram): Amazon Redshift at the center, exchanging data with Amazon RDS, Amazon EMR, Amazon S3, event feeds, Amazon EC2, and Cassandra via AWS Data Pipeline and Amazon S3; BI applications, internal applications, and third-party tools sit on the consuming side.
  42. 42. Amazon Redshift ecosystem at Coursera • Data flow in and out of Amazon Redshift • Business insights and reporting • Data products • Democratizing data access
  43. 43. Business insights and reporting • Provide directional insights and aggregate metrics • Aggregate metrics over all classes on Coursera: < 5 seconds • Goal: insight at the speed of thought • Results¹: 0.8s median, 28s p95, 120s p99 • Companywide goal tracking • Scheduled reports to internal and external stakeholders • Crucial part of a data-informed culture (¹ Results for ad hoc queries run in the last 90 days)
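A hedged way to compute numbers like these from Redshift's own query log. stl_query is a real system table, but it only retains a few days of history, so a 90-day window requires persisting the logs elsewhere (for example via audit logging to S3):

    -- Runtime percentiles, in seconds, for queries logged over the last 2 days
    -- (STL tables keep roughly 2–5 days of history).
    SELECT
        PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY DATEDIFF(millisecond, starttime, endtime) / 1000.0) AS median_s,
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY DATEDIFF(millisecond, starttime, endtime) / 1000.0) AS p95_s,
        PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY DATEDIFF(millisecond, starttime, endtime) / 1000.0) AS p99_s
    FROM stl_query
    WHERE starttime >= DATEADD(day, -2, GETDATE());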
  44. 44. Amazon Redshift ecosystem at Coursera • Data flow in and out of Amazon Redshift • Business insights and reporting • Data products • Democratizing data access
  45. 45. Data products • AB experimentation • 300 M impression table joined with 1.8 B events table in 12 minutes • Recommendations model • Amazon Redshift for relational transformation • Unload to S3 for model training • Providing university partners with analytical dashboards and research exports
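A sketch of the kind of query behind the A/B numbers; the impressions and events tables, their columns, and the distribution-key choice are assumptions, not Coursera's actual schema:

    -- If both tables are created DISTKEY(user_id), the 300 M x 1.8 B join runs
    -- slice-local instead of redistributing rows across the cluster.
    SELECT i.experiment_id,
           i.variant,
           e.event_type,
           COUNT(*) AS events
    FROM impressions i
    JOIN events e
      ON e.user_id = i.user_id
     AND e.event_time >= i.impression_time
    GROUP BY 1, 2, 3;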
  46. 46. Amazon Redshift ecosystem at Coursera • Data flow in and out of Amazon Redshift • Business insights and reporting • Data products • Democratizing data access
  47. 47. Democratizing data access
  48. 48. Democratizing data access
  49. 49. Learnings from 3 years on Amazon Redshift • Thinking in Amazon Redshift • Communicate to users • Surprises • Reflections
  50. 50. Thinking in Amazon Redshift • Columnar • SELECT * considered harmful in most cases
  51. 51. Thinking in Amazon Redshift • Columnar • SELECT * considered harmful in most cases • Nodes, slices, blocks • 1 MB blocks per slice, n slices per node depending on node type
  52. 52. Thinking in Amazon Redshift • Columnar • SELECT * considered harmful in most cases • Nodes, slices, blocks • 1 MB blocks per slice, n slices per node depending on node type • Sorting and distribution • Shared-nothing massively parallel processing => data is sorted per slice • Up to 2 orders of magnitude improvement in JOIN/GROUP BY performance for merge join vs hash join
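A minimal sketch of the table design that lets the planner pick a merge join; the learners and enrollments tables are hypothetical. When both sides are distributed and sorted on the join key (and kept sorted), EXPLAIN shows "Merge Join" rather than "Hash Join":

    CREATE TABLE learners (
        learner_id BIGINT,
        country    VARCHAR(2)
    )
    DISTKEY (learner_id)
    SORTKEY (learner_id);

    CREATE TABLE enrollments (
        learner_id  BIGINT,
        course_id   INTEGER,
        enrolled_at TIMESTAMP
    )
    DISTKEY (learner_id)
    SORTKEY (learner_id);

    -- Check which join strategy the planner chose.
    EXPLAIN
    SELECT l.country, COUNT(*)
    FROM learners l
    JOIN enrollments e ON e.learner_id = l.learner_id
    GROUP BY l.country;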
  53. 53. Communicating to users • Prefer the scientific method over gut feel • Investigate how many rows were materialized with svl_query_report • Understand EXPLAIN plan for data distribution, join strategy, predicate order
  54. 54. Communicating to users • Prefer the scientific method over gut feel • Investigate how many rows were materialized with svl_query_report • Understand EXPLAIN plan for data distribution, join strategy, predicate order • SQL style guide for readability • Leading commas, capitalized SQL keywords, conventions for handling dates/timestamps, conventions for table names, mapping tables
  55. 55. Communicating to users • Prefer the scientific method over gut feel • Investigate how many rows were materialized with svl_query_report • Understand EXPLAIN plan for data distribution, join strategy, predicate order • SQL style guide for readability • Leading commas, capitalized SQL keywords, conventions for handling dates/timestamps, conventions for table names, mapping tables • Use the right tool for the right task • Amazon Redshift is not for online traffic serving • Amazon Redshift is not for stream processing
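A hedged example of the svl_query_report investigation mentioned above; pg_last_query_id() refers to the last query run in the current session:

    -- Rows and bytes materialized by each step of the most recent query, so
    -- you can see where a plan blows up instead of trusting gut feel.
    SELECT query, segment, step, label,
           SUM(rows)  AS rows,
           SUM(bytes) AS bytes
    FROM svl_query_report
    WHERE query = pg_last_query_id()
    GROUP BY query, segment, step, label
    ORDER BY segment, step;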
  56. 56. Surprises • "Fundamental theorem of Redshift at Coursera" • Most queries involve full table scans • 9 nodes x 32 slices/node x 1 block/slice x 1 MB/block => at least 288 MB allocated per column • Store ~75 M integer values and maintain 1 block per slice
  57. 57. Surprises • "Fundamental theorem of Redshift at Coursera" • Most queries involve full table scans • 9 nodes x 32 slices/node x 1 block/slice x 1 MB/block => at least 288 MB allocated per column • Store ~75 M integer values and maintain 1 block per slice • Features may behave in unexpected fashions • Sort key compression • Primary and foreign keys
  58. 58. Surprises • "Fundamental theorem of Redshift at Coursera" • Most queries involve full table scans • 9 nodes x 32 slices/node x 1 block/slice x 1 MB/block => at least 288 MB allocated per column • Store ~75 M integer values and maintain 1 block per slice • Features may behave in unexpected fashions • Sort key compression • Primary and foreign keys • Features may be unexpectedly expensive • COMMIT – Batch work, monitor with stl_commit_stats • VACUUM – Prefer TRUNCATE, monitor with stl_vacuum, stl_query • Your mileage may vary
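Hedged sketches of the two mitigations above: batch writes into one transaction so the commit cost is paid once, and watch the commit and vacuum system tables. daily_metrics and enrollments are hypothetical tables; the STL views are real:

    -- One commit for the whole batch instead of one per statement.
    BEGIN;
    DELETE FROM daily_metrics WHERE metric_date = '2016-07-13';
    INSERT INTO daily_metrics (metric_date, metric_name, value)
    SELECT '2016-07-13', 'enrollments', COUNT(*)
    FROM enrollments
    WHERE enrolled_at::date = '2016-07-13';
    COMMIT;

    -- How long commits are queuing and executing.
    SELECT node, startqueue, startwork, endtime, queuelen
    FROM stl_commit_stats
    ORDER BY endtime DESC
    LIMIT 20;

    -- What recent VACUUMs did; for tables that are fully reloaded, TRUNCATE is
    -- usually far cheaper than DELETE followed by VACUUM.
    SELECT xid, table_id, status, rows, sortedrows, blocks, eventtime
    FROM stl_vacuum
    ORDER BY eventtime DESC
    LIMIT 20;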
  59. 59. Reflections • Simplicity – Relational model, Postgres 8.0 compliant SQL, things just work. Minimal administration. Minimal tuning.
  60. 60. Reflections • Simplicity – Relational model, Postgres 8.0 compliant SQL, things just work. Minimal administration. Minimal tuning. • Scalability – Scaled cluster up 5 times in the last 3 years as data volume and usage increased.
  61. 61. Reflections • Simplicity – Relational model, Postgres 8.0 compliant SQL, things just work. Minimal administration. Minimal tuning. • Scalability – Scaled cluster up 5 times in the last 3 years as data volume and usage increased. • Flexibility – No strict requirement on data modeling; dusty knobs for tuning in majority of cases. Handles both heavily normalized data model and denormalized clickstream data.
  62. 62. Reflections • Simplicity – Relational model, Postgres 8.0 compliant SQL, things just work. Minimal administration. Minimal tuning. • Scalability – Scaled cluster up 5 times in the last 3 years as data volume and usage increased. • Flexibility – No strict requirement on data modeling; dusty knobs for tuning in majority of cases. Handles both heavily normalized data model and denormalized clickstream data. • Extensibility – Standard API (JDBC/ODBC/libpq) and integration points.
  63. 63. Resources Pavan Pothukuchi | pavanpo@amazon.com | Chris Liu | cliu@coursera.org | Detail pages • http://aws.amazon.com/redshift • https://aws.amazon.com/marketplace/redshift/ Best practices • http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html • http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html • http://docs.aws.amazon.com/redshift/latest/dg/c-optimizing-query-performance.html Related breakout sessions • Deep Dive on Amazon QuickSight (2:15–3:15 pm) • Getting Started with Amazon QuickSight (2:15–3:15 pm) • Database Migration: Simple, Cross-Engine and Cross-Platform Migrations with Minimal Downtime (4:45–5:45 pm)
