This is an introduction to Amazon Redshift and cover the essentials you need to deploy your data warehouse in the cloud so that you can achieve faster analytics and save costs.
3. AnalyzeStore
Amazon
Glacier
Amazon S3
Amazon
DynamoDB
Amazon RDS,
Amazon Aurora
AWS big data portfolio
AWS Data Pipeline
Amazon
CloudSearch
Amazon EMR Amazon EC2
Amazon
Redshift
Amazon
Machine
Learning
Amazon
Elasticsearch
Service
AWS Database
Migration Service
Amazon
QuickSight
Amazon
Kinesis
Firehose
AWS Import/Export
AWS Direct
Connect
Collect
Amazon Kinesis
Streams
4. Relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
$1,000/TB/year; starts at $0.25/hour
Amazon
Redshift
a lot faster
a lot simpler
a lot cheaper
5. The Amazon Redshift view of data warehousing
10x cheaper
Easy to provision
Higher DBA productivity
10x faster
No programming
Easily leverage BI tools,
Hadoop, machine learning,
streaming
Analysis inline with process
flows
Pay as you go, grow as you
need
Managed availability and
disaster recovery
Enterprise Big data SaaS
6. The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical
representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any
vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.
Forrester Wave™ Enterprise Data Warehouse Q4 ’15
10. Benefit #1: Amazon Redshift is fast
Parallel and distributed
Query
Load
Export
Backup
Restore
Resize
11. Benefit #1: Amazon Redshift is fast
Hardware optimized for I/O intensive workloads, 4 GB/sec/node
Enhanced networking, over 1 million packets/sec/node
Choice of storage type, instance size
Regular cadence of auto-patched improvements
12. Benefit #1: Amazon Redshift is fast
New Dense Storage (HDD) instance type (Jun 15)
Improved memory 2x, compute 2x, disk throughput 1.5x
Cost: Same as our prior generation!
Performance improvement: 50%
Enhanced I/O and commit improvements (Jan 16)
Reduce amount of time to commit data
Throughput performance improvement: 35%
Improved memory allocation for query processing (May 16)
Increased overall throughput by up to 60%
13. Benefit #2: Amazon Redshift is inexpensive
DS2 (HDD)
Price per hour for
DS2.XL single node
Effective annual
price per TB compressed
On-demand $ 0.850 $ 3,725
1 year reservation $ 0.500 $ 2,190
3 year reservation $ 0.228 $ 999
DC1 (SSD)
Price per hour for
DC1.L single node
Effective annual
price per TB compressed
On-demand $ 0.250 $ 13,690
1 year reservation $ 0.161 $ 8,795
3 year reservation $ 0.100 $ 5,500
Pricing is simple
Number of nodes x price/hour
No charge for leader node
No upfront costs
Pay as you go
14. Benefit #3: Amazon Redshift is fully managed
Continuous/incremental backups
Multiple copies within cluster
Continuous and incremental backups
to Amazon S3
Continuous and incremental backups
across regions
Streaming restore
Amazon S3
Amazon S3
Region 1
Region 2
15. Benefit #3: Amazon Redshift is fully managed
Amazon S3
Amazon S3
Region 1
Region 2
Fault tolerance
Disk failures
Node failures
Network failures
Availability Zone/region level disasters
16. Benefit #4: Security is built-in
• Load encrypted from S3
• SSL to secure data in transit
• ECDHE perfect forward security
• Amazon VPC for network isolation
• Encryption to secure data at rest
• All blocks on disks and in S3 encrypted
• Block key, cluster key, master key (AES-256)
• On-premises HSM & AWS CloudHSM support
• Audit logging and AWS CloudTrail integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA
10 GigE
(HPC)
Ingestion
Backup
Restore
Customer VPC
Internal
VPC
JDBC/ODBC
17. Benefit #5: We innovate quickly
Well over 100 new features added since launch
Release every two weeks
Automatic patching
18. Benefit #6: Redshift is powerful
• Approximate functions
• User defined functions
• Machine learning
• Data science
19. Benefit #7: Amazon Redshift has a large ecosystem
Data integration Systems integratorsBusiness intelligence
20. Benefit #8: Service oriented architecture
DynamoDB
EMR
S3
EC2/SSH
RDS/Aurora
Amazon
Redshift
Amazon Kinesis
Amazon ML
Data Pipeline
CloudSearch
Amazon
Mobile
Analytics
22. NTT Docomo: Japan’s largest mobile service provider
68 million customers
Tens of TBs per day of data across a
mobile network
6 PB of total data (uncompressed)
Data science for marketing
operations, logistics, and so on
Greenplum on-premises
Scaling challenges
Performance issues
Need same level of security
Need for a hybrid environment
23. 125 node DS2.8XL cluster
4,500 vCPUs, 30 TB RAM
2 PB compressed
10x faster analytic queries
50% reduction in time for new
BI application deployment
Significantly less operations
overhead
Data
Source
ET
AWS Direct
Connect
Client
Forwarder
LoaderState
Management
SandboxAmazon Redshift
S3
NTT Docomo: Japan’s largest mobile service provider
24. Nasdaq: powering 100 marketplaces in 50 countries
Orders, quotes, trade executions,
market “tick” data from 7 exchanges
7 billion rows/day
Analyze market share, client activity,
surveillance, billing, and so on
Microsoft SQL Server on-premises
Expensive legacy DW
($1.16 M/yr.)
Limited capacity (1 yr. of data
online)
Needed lower TCO
Must satisfy multiple security
and regulatory requirements
Similar performance
25. 23 node DS2.8XL cluster
828 vCPUs, 5 TB RAM
368 TB compressed
2.7 T rows, 900 B derived
8 tables with 100 B rows
7 man-month migration
¼ the cost, 2x storage, room to
grow
Faster performance, very
secure
Nasdaq: powering 100 marketplaces in 50 countries
32. Resize
• Resize while remaining online
• Provision a new cluster in the
background
• Copy data in parallel from node to
node
• Only charged for source cluster
36. Single Column
• Table is sorted by 1 column
Date Region Country
2-JUN-2015 Oceania New Zealand
2-JUN-2015 Asia Singapore
2-JUN-2015 Africa Zaire
2-JUN-2015 Asia Hong Kong
3-JUN-2015 Europe Germany
3-JUN-2015 Asia Korea
[ SORTKEY ( date ) ]
• Best for:
• Queries that use 1st column (i.e. date) as primary filter
• Can speed up joins and group bys
• Quickest to VACUUM
37. Compound
• Table is sorted by 1st column , then 2nd column etc.
Date Region Country
2-JUN-2015 Africa Zaire
2-JUN-2015 Asia Korea
2-JUN-2015 Asia Singapore
2-JUN-2015 Europe Germany
3-JUN-2015 Asia Hong Kong
3-JUN-2015 Asia Korea
[ SORTKEY COMPOUND ( date, region, country) ]
• Best for:
• Queries that use 1st column as primary filter, then other cols
• Can speed up joins and group bys
• Slower to VACUUM
38. Interleaved
• Equal weight is given to each column.
Date Region Country
2-JUN-2015 Africa Zaire
3-JUN-2015 Asia Singapore
2-JUN-2015 Asia Korea
2-JUN-2015 Europe Germany
3-JUN-2015 Asia Hong Kong
2-JUN-2015 Asia Korea
[ SORTKEY INTERLEAVED ( date, region, country) ]
• Best for:
• Queries that use different columns in filter
• Queries get faster the more columns used in the filter
• Slowest to VACUUM
40. ID Gender Name
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M James White
306 F Lisa Green
2
3
4
ID Gender Name
101 M John Smith
306 F Lisa Green
ID Gender Name
292 F Jane Jones
209 M James White
ID Gender Name
139 M Peter Black
164 M Brian Snail
ID Gender Name
446 M Pat Partridge
658 F Sarah Cyan
Round
Robin
DISTSTYLE EVEN
41. ID Gender Name
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M James White
306 F Lisa Green
Hash
Function
ID Gender Name
101 M John Smith
306 F Lisa Green
ID Gender Name
292 F Jane Jones
209 M James White
ID Gender Name
139 M Peter Black
164 M Brian Snail
ID Gender Name
446 M Pat Partridge
658 F Sarah Cyan
DISTSTYLE KEY
42. ID Gender Name
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M James White
306 F Lisa Green
Hash
Function
ID Gender Name
101 M John Smith
139 M Peter Black
446 M Pat Partridge
164 M Brian Snail
209 M James White
ID Gender Name
292 F Jane Jones
658 F Sarah Cyan
306 F Lisa Green
DISTSTYLE KEY
43. ID Gender Name
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M James White
306 F Lisa Green
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M Lisa Green
306 F James White
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M Lisa Green
306 F James White
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M Lisa Green
306 F James White
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M Lisa Green
306 F James White
ALL
DISTSTYLE ALL
44. • KEY
• Large Fact tables
• Large dimension tables
• ALL
• Medium dimension tables (1K – 2M)
• EVEN
• Tables with no joins or group by
• Small dimension tables (<1000)
49. Use multiple input files to maximize
throughput
Use the COPY command
Each slice can load one file at a
time
A single input file means only one
slice is ingesting data
Instead of 100MB/s, you’re only
getting 6.25MB/s
50. Use multiple input files to maximize
throughput
Use the COPY command
You need at least as many input
files as you have slices
With 16 input files, all slices are
working so you maximize
throughput
Get 100MB/s per node; scale
linearly as you add nodes
Welcome: Amazon Redshift day
11AM-Noon - Migrating to Amazon Redshift
Noon-1PM - Lunch Break
1PM-2PM - Deep Dive on Amazon Redshift
2PM-3PM - Streaming Data Analytics with Amazon Redshift and Kinesis Firehose
3PM-5PM - Hands-on Lab: Introduction to Amazon Redshift
This is the AWS Big Data portfolio. We have tools like Direct Connect and Import Export that can bring in a lot of data. We can persist that data into a number of storage services from S3 to DynamoDB to EMR and RedShift for further analysis.
Amazon Redshift provides a fast, fully managed, petabyte-scale data warehouse for less than $1000 per terabyte per year. Amazon Elastic MapReduce provides a managed, easy to use analytics platform built around the powerful Hadoop framework. Recently we announced Amazon Kinesis, a managed service for real-time processing of streaming big data. Amazon Glacier allows you to backup and archive an unlimited amount of data at just 1 cent per GB per month. Automate and schedule big data processing workloads with Data Pipeline.
The tools to support big data collection, computation along with collaboration and sharing are all available in a couple of clicks, with AWS.
Redshift value proposition: fast, simple, cost-effective data warehousing. Massively parallel and columnar technology. Easily scalable to petabytes or more. Makes administration simple, no DBA needed. You can choose from hard disk drive or solid state drive platform depending on your needs. All this for $1,000/TB/year, which is one-tenth of the cost of traditional data warehousing solutions.
Our view of data warehousing and why we built Amazon Redshift.
We want to make data warehousing fast, inexpensive and easy to use for companies of all sizes
For Enterprises, benefits are: 1/10th cost, easy to provision and manage, productive DBAs
For Big data companies: 10x faster, use standard SQL-no need to write custom code, use existing BI tools
For SaaS: analysis in-line with process flows, pay as you go, grow as you need, managed availability & DR
Amazon Redshift has been ranked as a leader in the recent data warehousing Forrester Wave
Nasdaq security – Ingests 5B rows/trading day, analyzes orders, trades and quotes
Yelp – 10s of millions of ads/day, 250M mobile events/day, analyzes ad campaign performance and new feature usage
Desk.com – Ingests 200K case related events/hour and runs a user facing portal on Redshift
High level overview of Amazon Redshift Architecture – MPP based with leader node that acts as a SQL end point and coordinates query execution with compute nodes. The computes store data locally in columnar format.
Column storage fetches only those columns that are required for queries
Customer typically get 2-4x compression. Compression also implies less I/O
Zone maps: Each column is divided into 1MB blocks, we store the min/max values of each block in memory. We are able to quickly identify the blocks that are required to be fetched for a query
Direct-attached storage/large block sizes also enable fast I/O
All the operations that you care about from a database management and performance perspective are all parallelized, and hence scale horizontally with the number of nodes
We work with EC2 and infrastructure teams closely to optimize the h/w for data warehousing needs.
We work with EC2 and infrastructure teams closely to optimize the h/w for data warehousing needs.
Prices start at $1,000/TB/yr. for a 3 year DS2.8XL RI. RIs provide up to 70% discounts.
Backups are managed, continuous and incremental
Multiple copies are made within and outside the cluster on S3
Streaming restore =>While restore happens in the background, cluster is made available for reads/write within minutes of initiating the restoring operation (whether the backup is for a 1TB or a 100TB cluster)
Redshift monitors the health of a variety of components and failure conditions within an AZ and recovers from them automatically
For AZ/regional level disasters, streaming restore will help get back to business within minutes
From the time data is generated to the time it is queried, data can be encrypted end-to-end on Redshift
Pace of innovation – Continuous deployment, different model, Oracle/Teradata etc. still took a 6 month deployment plan vs. us
Python 2.7 – do you need support for other languages? Let us know
We respect the investments you made in your analytics platforms and work with a variety of ETL, BI and SI vendors. Standard JDBC/ODBC drivers make the integration process seamless but we work with these vendors closely to certify the joint platforms
We believe in SOA, pick and choose the components that work for your use case instead of bundling it all
Sorted columns enable fetching the minimum number of blocks required for query execution. In this example, an unsorted table al most leads to a full table scan O(N) and a sorted table leads to one block scanned O(1).
Compound – order of 1
Interleaved order of (N)^(1/3) for any column
Redshift works with customer’s BI tool of choice through Postgres drivers and a JDBC, ODBC connection. A number of partners shown here have certified integration with Redshift, meaning they have done testing to validate/build Redshift integration and make using Redshift easy from a UI perspective. If there are tools customer’s use not shown we can work on getting them integrated.
Console metrics
Table level restore
List of queries