2. Amazon Redshift
Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year
Rahul Pathak |Senior Product Manager
3. a lot faster
a lot cheaper
a whole lot simpler
Petabyte scale
Massively parallel
Amazon
Redshift
Relational data warehouse
Fully managed; zero admin
5. Amazon Redshift architecture
•
Leader Node
–
–
–
•
SQL endpoint
Stores metadata
Coordinates query execution
JDBC/ODBC
Compute Nodes
–
–
–
–
Local, columnar storage
Execute queries in parallel
Load, backup, restore via
Amazon S3
Parallel load from Amazon DynamoDB
•
Hardware optimized for data processing
•
Two hardware platforms
10 GigE
(HPC)
–
–
DW1: HDD; scale from 2TB to 1.6PB
DW2: SSD; scale from 160GB to 256TB
Ingestion
Backup
Restore
6. Amazon Redshift has security built-in
•
•
Customer VPC
SSL to secure data in transit
Encryption to secure data at rest
–
–
–
JDBC/ODBC
AES-256; hardware accelerated
All blocks on disks and in Amazon S3
encrypted
HSM Support
•
10 GigE
(HPC)
No direct access to compute nodes
•
Internal
VPC
Audit logging & AWS CloudTrail
integration
•
Amazon VPC support
Ingestion
Backup
Restore
7. Amazon Redshift is easy to use
•
Provision in minutes
•
Monitor query performance
•
Point and click resize
•
Built in security
•
Automatic backups
10. Point and click resize
•
Resize while remaining online via AWS
Console or API
•
Provision a new cluster in the background
and copy data in parallel from node to
node
•
Only charged for source cluster until SQL
endpoint has automatically been switched
over via DNS
11. Amazon Redshift continuously backs up your data and
recovers from failures
•
Replication within the cluster and backup to Amazon S3 to maintain multiple
copies of data at all times
•
Backups to Amazon S3 are continuous, automatic, and incremental
–
Designed for eleven nines of durability
•
Continuous monitoring and automated recovery from failures of drives and nodes
•
Able to restore snapshots to any Availability Zone within a region
•
Easily enable backups to a second region for disaster recovery
12. Amazon Redshift integrates with multiple data sources
Corporate Datacenter
DynamoDB
Amazon Redshift
Amazon S3
Amazon RDS
Amazon EMR
13. New Features That Introduced After re:Invent 2013
re:Invent 2013以降の主なアップデート
15. Summary of Updates after re:Invent
•
Amazon Redshift - New Features Galore (2013/11/11)
–
–
–
–
–
–
–
–
•
•
•
Distributed Tables - You now have more control over the distribution of a table's rows across compute
nodes.
Remote Loading - You can now load data into Redshift from remote hosts across an SSH connection.
Approximate Count Distinct - You can now use a variant of the COUNT function to approximate the
number of matching rows.
Workload Queue Memory Management - You can now apportion available memory across work
queues.
Key Rotation - You can now direct Redshift to rotate keys for an encrypted cluster.
HSM Support - You can now direct Redshift to use an on-premises Hardware Security Module (HSM) or
AWS CloudHSM to manage the encryption master and cluster encryption keys.
Database Auditing and Logging - You can log connections and user activity to Amazon S3.
SNS Notification - Redshift can now issue notifications to an Amazon SNS topic when certain events
occur.
Automated Cross-Region Snapshot Copy for Amazon Redshift (2013/11/14)
Faster & More Cost-Effective SSD-Based Nodes for Amazon Redshift(2014/01/24)
AWS CloudFormation Adds Support for Redshift and More (2014/02/10)
16. Amazon Redshift Node Types
DW1.XL: 16 GB RAM, 2 Cores
3 Spindles, 2 TB compressed storage
•
Optimized for I/O intensive workloads
•
High disk density
DW1.8XL: 128 GB RAM, 16 Cores, 24
Spindles 16 TB compressed, 2 GB/sec scan
rate
•
On demand at $0.85/hour
•
As low as $1,000/TB/Year
•
Scale from 2TB to 1.6PB
DW2.L *New*: 16 GB RAM, 2 Cores,
160 GB compressed SSD storage
•
High performance at smaller storage size
•
High compute and memory density
•
On demand at $0.25/hour
•
As low as $5,500/TB/Year
•
Scale from 160GB to 256TB
DW2.8XL *New*: 256 GB RAM, 32 Cores,
2.56 TB of compressed SSD storage
17. Amazon Redshift is priced to let you analyze all your data
Price Per Hour for
DW1.XL Single Node
Effective Annual
Price per TB
On-Demand
$ 1.250
$ 5,475
1 Year Reservation
$ 0.750
$ 3,283
3 Year Reservation
$ 0.452
$ 1,981
DW1 (HDD)
Effective Annual
Price per TB
On-Demand
$ 0.330
$ 18,068
1 Year Reservation
$ 0.211
$ 11,570
3 Year Reservation
$ 0.130
$ 7,127
No charge for leader node
•
Price Per Hour for
DW2.L Single Node
Number of nodes x cost per
hour
•
DW2 (SSD)
•
No upfront costs
•
Pay as you go
30. Common Customer Use Cases
Traditional Enterprise DW
SaaS Companies
•
Improve performance by
an order of magnitude
•
Add analytic functionality
to applications
Make more data
available for analysis
•
Scale DW capacity as
demand grows
•
•
•
•
•
Reduce costs by
extending DW rather than
adding HW
Companies with Big Data
Access business data via
standard reporting tools
•
Reduce HW & SW costs
by an order of magnitude
Migrate completely from
existing DW systems
Respond faster to
business; provision in
minutes
32. Japanese Redshift Customer – ALBERT
•
Business Challenge
–
•
Why AWS?
–
–
•
Given their data volumes, RDBMS tuning and archiving was causing them a lot of
operational pain and costing them money
Amazon Redshift’s performance and ability to handle large data sets allowed them to
make it the core engine of their analytics, enabling them to provide a private DMP (Data
Management Platform) for their customers on the Cloud
PostgreSQL is their primary RDBMS, and connectivity by PostgreSQL drivers is big technical
advantage to choose Redshift.
Benefits for their business
–
–
Ability to start small and scale as needed
Scalability and flexibility dramatically lowered the cost of ownership
33. Japanese Redshift Customer – Sansan
•
Business Challenge
– Since “Eight” is business card management solution for consumers, they
needed infrastructure that could start small and scale as needed
•
Why AWS?
– When they tried out AWS first, they were surprised with the ease of use. AWS
functionality and elasticity were critical factors
•
Benefits for their business
– Lower costs substantially using reserved instances
– Automation is a key to reduce operational and administration costs. They
utilize services such as Amazon SES and Amazon SWF.
– They use Redshift for KPI analytics of their services.
35. Multiple Data Loading Options
Data Integration
•
Parallel upload to Amazon S3
•
AWS Direct Connect
•
AWS Import/Export
•
ETL Software
•
Systems integrators
Systems Integrators
36. Customers on Performance
“Redshift is twenty times faster than Hive” (5x – 20x reduction in query times) link
…[Redshift] performance has blown away everyone here (we generally see 50-100x speedup
over Hive). link
“We saw…2x improvement in query times and a 50% reduction in costs”
We regularly process multibillion row datasets and we do that in a matter of hours. link
“Queries that used to take hours came back in seconds. Our analysts are orders of magnitude
more productive.” (20x – 40x reduction in query times) link
“Did I mention it's ridiculously fast? We'll be using it immediately to provide our analysts
an alternative to Hadoop.”
37. Customers on Cost
“We found that Amazon Redshift offers the performance we needed while freeing us from
the licensing costs of our previous solution” link
“[Redshift] cost saving is even more impressive…Our analysts like [Redshift] so much they
don’t want to go back.” (4x reduction in cost over HIVE) link
“We saw 50% reduction in costs”
“Not only did we avoid 3 months of development work [we] saved approximately $80,000 in
labor…Competitive Advantage realized with just a few clicks.”
“[Amazon Redshift] took an industry famous for its opaque pricing, high TCO and unreliable
results and completely turned it on its head.” link
“[Redshift] has reduced our storage and processing costs significantly, helping us to realize
another 60-70 percent savings.” link
38. Customer on Ease of Use
“With Amazon Redshift and Tableau, anyone in the company can set up any queries they
like…It’s very flexible.” link
“Compared to Hadoop [Redshift] is much easier for analysts to use. What may have been a
Hadoop project can become just a query in Redshift.” link
“We can spin up an Amazon Redshift cluster, take a snapshot, and scale servers in minutes
instead of days.” link
“…our team was able to provision Redshift in a matter hours vs. weeks with on-premises
servers.”
“Amazon Redshift is simple to use and reliable. With one click, we can rapidly scale up or down
in real time in alignment with business requirements.” link
“Customers can get consistent, accurate, and useful data fast - in weeks not months or years.”
link
39. AWS Marketplace
•
Find software to use with Amazon
Redshift
•
One-click deployments
•
Flexible pricing options
http://aws.amazon.com/marketplace
42. Resources
•
Detail Pages
–
–
•
New Features
–
–
•
http://docs.aws.amazon.com/redshift/latest/dg/doc-history.html
http://docs.aws.amazon.com/redshift/latest/mgmt/document-history.html
Best Practices
–
–
–
•
http://aws.amazon.com/redshift
https://aws.amazon.com/marketplace/redshift/
http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html
http://docs.aws.amazon.com/redshift/latest/dg/c-optimizing-query-performance.html
Presentations & Webinars:
–
–
–
http://www.youtube.com/watch?v=JxLpj_TnisM (2013 SF Summit Presentation)
http://www.youtube.com/watch?v=R1m-fwzXMow (Best Practices 1 of 2)
http://www.youtube.com/watch?v=7ySzRTOyK6o (Best Practices 2 of 2)
43. Amazon Redshift dramatically reduces I/O
Column storage
Data compression
Age
State
Amount
20
CA
500
345
25
WA
250
678
•
ID
123
•
40
FL
125
37
WA
375
•
Zone maps
957
•
Direct-attached storage
•
With row storage you do
unnecessary I/O
•
To get total amount, you have to
read everything
44. Amazon Redshift dramatically reduces I/O
Column storage
Data compression
Age
State
Amount
20
CA
500
345
25
WA
250
678
•
ID
123
•
40
FL
125
37
WA
375
•
Zone maps
957
•
Direct-attached storage
•
With column storage, you only
read the data you need
45. Amazon Redshift dramatically reduces I/O
•
Column storage
analyze compression listing;
Table |
Column
| Encoding
---------+----------------+---------listing | listid
| delta
listing | sellerid
| delta32k
listing | eventid
| delta32k
listing | dateid
| bytedict
listing | numtickets
| bytedict
listing | priceperticket | delta32k
listing | totalprice
| mostly32
listing | listtime
| raw
•
Data compression
•
Zone maps
•
Direct-attached storage
•
COPY compresses automatically
•
You can analyze and override
•
More performance, less cost
Slides not intended for redistribution.
46. Amazon Redshift dramatically reduces I/O
•
Column storage
10
324
•
Data compression
375
623
•
Zone maps
•
Direct-attached storage
637
959
10 | 13 | 14 | 26 |…
… | 100 | 245 | 324
375 | 393 | 417…
… 512 | 549 | 623
637 | 712 | 809 …
… | 834 | 921 | 959
•
Track the minimum and maximum
value for each block
•
Skip over blocks that don’t contain
relevant data
47. Amazon Redshift dramatically reduces I/O
•
Column storage
•
Use local storage for performance
•
Maximize scan rates
•
Data compression
•
Zone maps
•
Automatic replication and
continuous backup
•
Direct-attached storage
•
HDD & SSD platforms
49. Amazon Redshift parallelizes and distributes everything
•
Query
•
Load
•
Backup/Restore
•
Resize
•
Load in parallel from Amazon S3 or
Amazon DynamoDB or any SSH
connection
•
Data automatically distributed and
sorted according to DDL
•
Scales linearly with number of nodes
50. Amazon Redshift parallelizes and distributes everything
•
Query
•
Load
•
Backup/Restore
•
Backups to Amazon S3 are automatic, continuous
and incremental
•
Resize
•
Configurable system snapshot retention period. Take
user snapshots on-demand
•
Cross region backups for disaster recovery
•
Streaming restores enable you to resume querying
faster
51. Amazon Redshift parallelizes and distributes everything
•
Query
•
Load
•
Backup/Restore
•
Resize
•
Resize while remaining online
•
Provision a new cluster in the background
•
Copy data in parallel from node to node
•
Only charged for source cluster
52. Amazon Redshift parallelizes and distributes everything
•
Query
•
Load
•
Backup/Restore
•
•
Automatic SQL endpoint switchover
via DNS
•
Decommission the source cluster
•
Simple operation via Console or API
Resize