We have run benchmarks to compare Redshift SSD instances to Redshift HDD instances. See our blog at https://flydata.com/blog/posts/with-amazon-redshift-ssd-querying-a-tb-of-data-took-less-than-10-seconds
Amazon Redshift SSD - Queries on TBs of data can run in a few seconds
1. Amazon Redshift SSD - Queries on TBs of data can run in a few seconds
FlyData: Amazon Redshift BENCHMARK Series 03
www.flydata.com
2. Amazon Redshift HDD took 33.32 seconds to run our queries on 300GB of data
Amazon Redshift SSD took 4.32 seconds to run our queries on 300GB of data
Amazon Redshift SSD performed roughly 8x faster (33.32 / 4.32 ≈ 7.7)
Takeaways:
• 1.2TB can now be handled in under 10 seconds.
• Use cases could spread to ad-delivery optimization and financial trading systems.
3. Amazon Redshift is a popular data warehouse for big data on the cloud. AWS added the SSD instance type on January 24, 2014.
We have run benchmarks to compare Redshift SSD instances to Redshift HDD instances using the following parameters:
• Data size: 1.2TB and 300GB
• Query performance when querying against all records in the cluster
• Loading speed
• Cost comparison
4. 1. Query Speed for similar cluster sizes
• SSD version is faster.
• Query against 1.2TB (entire data set) took less than 10 seconds!
• For 1.2TB of data, comparing similar node sizes: query time 9.22s (SSD) vs 28.48s (HDD 8XL x2)
* See Appendix for queries being used.
Chart: Comparison of query speed against dw1.xlarge (HDD) and dw2.large (SSD) for 1.2TB of data, in order of cost.
5. 1. Query Speed at similar pricing points
• Query performance comparison based on similar pricing point.
• 4 nodes of dw2.large cost: $0.25 (/hour) * 4 (nodes) = $1.00 (/hour)
• 1 node of dw1.xlarge cost: $0.85 (/hour)
• Direct comparison is difficult, but we can see much better query performance for the dw2 (SSD) Redshift.
* See Appendix for queries being used.
Chart: Comparison of query speed for cluster configurations with similar pricing for 300GB of data.
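As a rough illustration of the "similar pricing point" comparison, the sketch below combines the hourly prices on this slide with the 300GB query times quoted on slide 2. The "cost per query" metric and all function names are assumptions made for this example; they are not part of the original benchmark.

```python
# Hourly on-demand prices and 300GB query times quoted in the slides.
DW2_LARGE_PER_HOUR = 0.25   # USD per node-hour (SSD)
DW1_XLARGE_PER_HOUR = 0.85  # USD per node-hour (HDD)


def cluster_cost_per_hour(node_price, nodes):
    """Hourly cost of a cluster made of identical nodes."""
    return node_price * nodes


def cost_per_query(cluster_price_per_hour, query_seconds):
    """Hypothetical metric: cluster cost accrued while one query runs."""
    return cluster_price_per_hour * query_seconds / 3600.0


ssd_cluster = cluster_cost_per_hour(DW2_LARGE_PER_HOUR, 4)   # 4 x dw2.large
hdd_cluster = cluster_cost_per_hour(DW1_XLARGE_PER_HOUR, 1)  # 1 x dw1.xlarge

ssd_query_cost = cost_per_query(ssd_cluster, 4.32)   # SSD query time, slide 2
hdd_query_cost = cost_per_query(hdd_cluster, 33.32)  # HDD query time, slide 2
query_cost_ratio = hdd_query_cost / ssd_query_cost

print(f"SSD cluster: ${ssd_cluster:.2f}/hour, ${ssd_query_cost:.5f} per query")
print(f"HDD cluster: ${hdd_cluster:.2f}/hour, ${hdd_query_cost:.5f} per query")
print(f"HDD / SSD cost per query: {query_cost_ratio:.1f}x")
```

Under these assumptions the SSD cluster costs slightly more per hour but noticeably less per query, because the query finishes much sooner.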
6. 2. Loading Time
• For similar cost (DW2: $1.00/hour vs DW1: $0.85/hour), loading time was 4.6x faster on SSD.
• For similar node counts (DW2: 12 nodes vs DW1: 16 nodes), loading time was 1.65x faster on SSD.
* See Appendix for queries being used.
Chart: Loading time, grouped by "Similar Cost" and "Similar Node Count".
8. Summary
• Consider DW2 SSD Redshift
– If query and loading performance is primary and cost considerations are secondary
– If your data is smaller than 0.48TB
• Consider DW1 HDD Redshift
– If current DW1 Redshift performance is sufficient
– If DW2 costs are too expensive for your use case
9. About Us - FlyData
• FlyData Enterprise
– Enables continuous loading to Amazon Redshift, with real-time data loading
– Automated ETL process with multiple supported data formats
– Auto scaling, data integrity and high durability
– FlyData Sync feature allows real-time replication from RDBMS to Amazon Redshift
Contact us at: info@flydata.com
We are an official data integration partner of Amazon Redshift
11. Appendix: Data Loaded for Testing
TSV files, gzip compressed
• Imp_log
– Size: 1) 300GB / 300M records, 2) 1.2TB / 1.2B records
– Columns: date datetime, publisher_id integer, ad_campaign_id integer, bid_price real, country varchar(30), attr1-4 varchar(255)
• click_log
– Size: 1) 1.4GB / 1.5M records, 2) 5.6GB / 6M records
– Columns: date datetime, publisher_id integer, ad_campaign_id integer, country varchar(30), attr1-4 varchar(255)
• ad_campaign: 100MB / 100k records
• publisher: 10MB / 10k records
• advertiser: 10MB / 10k records
1) for 1 month, 2) for 4 months
We used 5 tables to run a query which joins tables and creates a report.
12. Appendix: Sample Query
select
    ac.ad_campaign_id as ad_campaign_id,
    adv.advertiser_id as advertiser_id,
    cs.spending as spending,
    ims.imp_total as imp_total,
    cs.click_total as click_total,
    click_total/imp_total as CTR,
    spending/click_total as CPC,
    spending/(imp_total/1000) as CPM
from
    ad_campaigns ac
join
    advertisers adv
    on (ac.advertiser_id = adv.advertiser_id)
join
    (select
        il.ad_campaign_id,
        count(*) as imp_total
    from
        imp_logs il
    group by
        il.ad_campaign_id
    ) ims on (ims.ad_campaign_id = ac.ad_campaign_id)
join
    (select
        cl.ad_campaign_id,
        sum(cl.bid_price) as spending,
        count(*) as click_total
    from
        click_logs cl
    group by
        cl.ad_campaign_id
    ) cs on (cs.ad_campaign_id = ac.ad_campaign_id);
The query generates a basic report on ad campaign performance: impression and click counts, advertiser spending, CTR, CPC and CPM. The query runs against all data in the cluster.
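To make the three derived columns concrete, here is a minimal Python sketch of the same formulas the sample query uses. The function name and the input totals are made up for illustration and do not come from the benchmark data.

```python
def campaign_metrics(spending, imp_total, click_total):
    """Return CTR, CPC and CPM using the same formulas as the sample query."""
    ctr = click_total / imp_total        # clicks per impression
    cpc = spending / click_total         # spending per click
    cpm = spending / (imp_total / 1000)  # spending per thousand impressions
    return {"CTR": ctr, "CPC": cpc, "CPM": cpm}


# Hypothetical totals for a single campaign, for illustration only.
metrics = campaign_metrics(spending=250.0, imp_total=100_000, click_total=500)
print(metrics)
```

Note that Python's `/` always performs floating-point division. In SQL engines where dividing two integer values truncates the result, the two `count(*)` columns would need a cast to yield a fractional CTR.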
14. Query Performance: Data Size = 300GB
Query process time (300GB) for the Sample Query, in seconds:
trial     4x DW2.large   1x DW1.xlarge
1         9.05           58 (ignored)
2         4.31           42.69
3         4.65           30.84
4         4.13           30.14
5         4.17           29.6
average   4.315          33.3175
The averages cover trials 2 to 5; trial 1 is excluded for both clusters.
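The reported averages can be checked with a short script. This is only a sketch: it assumes trial 1 is excluded for both clusters, which is what the published averages imply, and the variable and function names are chosen for this example.

```python
from statistics import mean

# Per-trial query times in seconds, copied from the table above.
trials = {
    "4x DW2.large": [9.05, 4.31, 4.65, 4.13, 4.17],
    "1x DW1.xlarge": [58.0, 42.69, 30.84, 30.14, 29.6],
}


def warm_average(times):
    """Average of every trial except the first one (assumed to be ignored)."""
    return mean(times[1:])


averages = {name: warm_average(times) for name, times in trials.items()}
speedup = averages["1x DW1.xlarge"] / averages["4x DW2.large"]

for name, avg in averages.items():
    print(f"{name}: {avg:.4f} s")
print(f"SSD speedup: {speedup:.2f}x")
```

The resulting ratio is a little under 8, which is where the rounded "8x faster" headline comes from.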
15. Appendix: Additional Information
• All resources for our benchmark are on our GitHub repository
– https://github.com/hapyrus/redshift-benchmark
– The dataset we use is open on S3, so you can reproduce the benchmark
16. Summary: Amazon Redshift Pricing
• DW1: Amazon Redshift (HDD)
• DW2: Amazon Redshift (SSD)
– Cost is around 4x higher than DW1
– If storage need is less than 0.48TB, then DW2 is cheaper
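One plausible way to arrive at the 0.48TB threshold is sketched below. It uses the hourly prices quoted earlier in the deck ($0.25 per dw2.large, $0.85 per dw1.xlarge) and assumes 160GB per dw2.large node, inferred from the 0.64TB quoted for four nodes; the slides do not spell out this derivation, so treat it as an assumption.

```python
# Prices in cents per node-hour, kept as integers so the arithmetic is exact.
DW2_LARGE_CENTS = 25
DW1_XLARGE_CENTS = 85
# Assumption: 0.64TB across 4 dw2.large nodes implies 160GB per node.
DW2_LARGE_GB = 640 // 4

# Largest dw2.large cluster that is still cheaper than one dw1.xlarge node.
nodes = 0
while (nodes + 1) * DW2_LARGE_CENTS < DW1_XLARGE_CENTS:
    nodes += 1

break_even_gb = nodes * DW2_LARGE_GB
print(f"{nodes} x dw2.large = {break_even_gb}GB at {nodes * DW2_LARGE_CENTS} cents/hour")
```

With these inputs, three dw2.large nodes stay below the price of a single dw1.xlarge, and a fourth node pushes the cluster above it.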
17. Cost comparison: 1 x dw1.xlarge (2TB), 4 x dw2.large (0.64TB) and 12 x dw2.large (1.92TB)
18. For the same storage space, the cost of DW2 SSD can be 5.2 times higher
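As a hedged sanity check on the storage-cost gap, the sketch below compares hourly cost per TB using the on-demand prices quoted earlier in the deck and the storage sizes from the cost comparison slide. It assumes the two configurations are 1 x dw1.xlarge (2TB) and 12 x dw2.large (1.92TB). The inputs behind the 5.2x figure are not shown in this transcript, so the sketch does not try to reproduce it.

```python
def cost_per_tb_hour(node_price, nodes, storage_tb):
    """Hourly cluster cost divided by total storage, in USD per TB-hour."""
    return node_price * nodes / storage_tb


hdd_per_tb = cost_per_tb_hour(0.85, 1, 2.0)    # assumed: 1 x dw1.xlarge, 2TB
ssd_per_tb = cost_per_tb_hour(0.25, 12, 1.92)  # assumed: 12 x dw2.large, 1.92TB
ratio = ssd_per_tb / hdd_per_tb

print(f"HDD: ${hdd_per_tb:.4f} per TB-hour")
print(f"SSD: ${ssd_per_tb:.4f} per TB-hour")
print(f"SSD / HDD: {ratio:.2f}x")
```

With on-demand prices the ratio comes out between 3.5x and 4x, which sits inside the range quoted in the Additional Comments slide.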
21. Additional Comments
• SSD could be 3.5x to 5x more expensive than HDD for the same amount of storage space (SSD is really optimized for performance)
• DW1.8xlarge is exactly 8 times a DW1.xlarge, but DW2.8xlarge is actually 16 times a DW2.large. This is because DW2.large nodes are not “xlarge”; a bit confusing… ;) (as of Jan. 27, 2014)
22. Check us out!
-> http://flydata.com
sales@flydata.com
Toll Free: 1-855-427-9787