Scalability of Amazon Redshift Data Loading and Query Speed
1. Hapyrus: Amazon Redshift
BENCHMARK Series 02
Scalability of Amazon Redshift Data Loading and Query Speed
Comparisons between the performance of different instances
www.flydata.com
2. Amazon Redshift can load 1.2TB of data using:
• a single-node XL instance, taking about 17 hours
• a two-node XL cluster, taking about 10.5 hours
• a two-node 8XL cluster (equivalent to 16 XL nodes), taking about 2.5 hours
Load speed is almost proportional to the number of nodes.
Running identical queries on Amazon Redshift:
• a single-node XL instance took 155 seconds
• a two-node XL cluster took 55 seconds
• a two-node 8XL cluster (equivalent to 16 XL nodes) took 31 seconds
The more nodes, the faster a query runs, but the gain is less than proportional: going from 2 XL nodes to the equivalent of 16 only cuts query time from 55 to 31 seconds.
3. The most important feature of Amazon Redshift is
that it has flexible scalability compared to other
columnar data warehouses such as Vertica,
Netezza and Teradata.
We have run benchmarks to compare the cost and
speed of Redshift instances using:
• 1.2TB of data and identical queries
– same data as previous benchmark: http://www.slideshare.net/Hapyrus/amazon-redshift-is-10x-faster-and-cheaper-than-hadoop-hive
• 3 different types of instances: single node XL, two node XL, and two node 8XL
4. 1. Data Loading
• XL x1 instance took 17 hours (17h 19m)
• XL x2 instance took 10 hours (10h 38m)
• 8XL x2 instance (equivalent to XL 16x) took 2 hours (2h 36m)
* The query and data used can be referenced in our Appendix
Data is loaded with the COPY command. All the data is copied at once.
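The slide does not show the exact COPY statement, so the following is only a minimal sketch of what a bulk load of gzip-compressed TSV files from S3 could look like. The bucket path and credential placeholders are hypothetical, and the helper `build_copy_statement` is an illustrative name, not part of the benchmark code.

```python
def build_copy_statement(table: str, s3_prefix: str, credentials: str) -> str:
    """Return a Redshift COPY statement for gzip-compressed, tab-separated files."""
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"CREDENTIALS '{credentials}' "
        "DELIMITER '\\t' GZIP;"
    )


# Hypothetical S3 location and placeholder credentials (assumptions, not from the slides).
statement = build_copy_statement(
    table="imp_logs",
    s3_prefix="s3://example-bucket/imp_logs/",
    credentials="aws_access_key_id=<key>;aws_secret_access_key=<secret>",
)
print(statement)
```

Pointing COPY at an S3 prefix rather than a single file lets one statement pick up every file under that prefix, which fits the "all the data is copied at once" setup described above.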
5. 2. Query Speed
• XL x1 instance took 155 seconds (155.02s)
• XL x2 instance took 55 seconds (54.77s)
• 8XL x2 instance (equivalent to XL 16x) took 31 seconds (30.61s)
* The query used can be referenced in our Appendix
6. Redshift Data Loading Results
* The data used can be referenced in our Appendix
The biggest data (imp_logs: 1.2TB)
Instance Type    Nodes  Machine Spec (XL equivalents)  Time (hours)  Speed     Cost per TB
dw.hs1.xlarge    1      1                              17.32         70GB/h    $12.27
dw.hs1.xlarge    2      2                              10.63         110GB/h   $15.07
dw.hs1.8xlarge   2      16                             2.62          460GB/h   $29.64
Loading Time (imp_logs and other tables)
Table Name             XL x1            XL x2            8XL x2
ad_campaigns (100MB)   5.82s            8.97s            8.5s
publishers (10MB)      552ms            2.83s            2.22s
advertisers (10MB)     770ms            3.71s            4.75s
imp_logs (1.2TB)       17h 19m 22.27s   10h 38m 5.66s    2h 36m 55.49s
click_logs (6GB)       11m 31.37s       10m 6.24s        37.87s
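As a rough check on how the hours, speed, and cost-per-TB columns relate to the raw imp_logs load times, the sketch below recomputes them. The per-node hourly prices ($0.85 for dw.hs1.xlarge, $6.80 for dw.hs1.8xlarge) are assumptions that do not appear on the slides, and 1.2TB is treated as 1200GB.

```python
import re

# Seconds-free unit table: every unit is expressed as a fraction of an hour.
UNIT_HOURS = {"h": 1.0, "m": 1 / 60, "s": 1 / 3600, "ms": 1 / 3_600_000}


def to_hours(duration: str) -> float:
    """Convert a string such as '17h 19m 22.27s' or '552ms' into hours."""
    parts = re.findall(r"([\d.]+)(h|ms|m|s)", duration)
    return sum(float(value) * UNIT_HOURS[unit] for value, unit in parts)


DATA_TB = 1.2
# (label, node count, ASSUMED on-demand price per node-hour, imp_logs load time)
RUNS = [
    ("XL x1", 1, 0.85, "17h 19m 22.27s"),
    ("XL x2", 2, 0.85, "10h 38m 5.66s"),
    ("8XL x2", 2, 6.80, "2h 36m 55.49s"),
]

results = {}
for label, nodes, price, duration in RUNS:
    hours = to_hours(duration)
    gb_per_hour = DATA_TB * 1000 / hours
    cost_per_tb = nodes * price * hours / DATA_TB
    results[label] = (hours, gb_per_hour, cost_per_tb)
    print(f"{label}: {hours:.2f} h, {gb_per_hour:.0f} GB/h, ${cost_per_tb:.2f} per TB")
```

Under these assumed prices the computed cost per TB comes out at about $12.27, $15.07, and $29.64, in line with the table, and the throughput lands in the tens to hundreds of GB per hour.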
7. Redshift Query Results
* The query used can be referenced in our Appendix
Processing Time
Trial     XL x1       XL x2      8XL x2
1         2m 28.65s   1m 1.44s   39.11s
2         2m 37.65s   52.89s     26.77s
3         2m 35.91s   53.76s     29.9s
4         2m 29.04s   53.52s     27.51s
5         2m 36.9s    52.22s     29.75s
Average   2m 35s      54.77s     30.61s
8. Discussion
• Redshift can load data in parallel
– Load speed is almost proportional to the number of nodes
• Redshift query speed is strongly affected by the number of nodes
– Big difference between single-node and multi-node clusters
– Two XL nodes ran the query more than twice as fast as one (55 vs. 155 seconds)
– Additional experimentation may be helpful in the future
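To put numbers on these observations, the following sketch computes the speedup over the single XL node and a simple scaling efficiency (speedup divided by XL equivalents) from the reported load times and average query times. The efficiency metric is an illustrative choice made here, not one used in the benchmark.

```python
# Reported figures from the results slides.
XL_EQUIVALENTS = {"XL x1": 1, "XL x2": 2, "8XL x2": 16}
LOAD_HOURS = {"XL x1": 17.32, "XL x2": 10.63, "8XL x2": 2.62}
QUERY_SECONDS = {"XL x1": 155.02, "XL x2": 54.77, "8XL x2": 30.61}


def speedups(times: dict) -> dict:
    """Speedup of each configuration relative to the single XL node."""
    baseline = times["XL x1"]
    return {label: baseline / elapsed for label, elapsed in times.items()}


def efficiencies(times: dict) -> dict:
    """Speedup divided by XL equivalents; 1.0 would mean perfectly linear scaling."""
    return {
        label: ratio / XL_EQUIVALENTS[label]
        for label, ratio in speedups(times).items()
    }


load_speedup = speedups(LOAD_HOURS)
query_speedup = speedups(QUERY_SECONDS)
load_efficiency = efficiencies(LOAD_HOURS)
query_efficiency = efficiencies(QUERY_SECONDS)

for label in XL_EQUIVALENTS:
    print(
        f"{label}: load {load_speedup[label]:.2f}x "
        f"(eff. {load_efficiency[label]:.2f}), "
        f"query {query_speedup[label]:.2f}x "
        f"(eff. {query_efficiency[label]:.2f})"
    )
```

An efficiency above 1.0 for the two-node XL query corresponds to the "more than twice as fast" observation, while the lower values for the 8XL cluster show where scaling flattens out.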
9. APPENDIX - Data
TSV files, gzip compressed
Imp_log (1.2TB / 1.2B records)
  date            datetime
  publisher_id    integer
  ad_campaign_id  integer
  bid_price       real
  country         varchar(30)
  attr1-4         varchar(255)
click_log (5.6GB / 6M records)
  date            datetime
  publisher_id    integer
  ad_campaign_id  integer
  country         varchar(30)
  attr1-4         varchar(255)
ad_campaign (100MB / 100k records)
publisher (10MB / 10k records)
advertiser (10MB / 10k records)
Using these 5 tables, we run a query that joins them and creates a report.
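For readers who want to inspect the input files locally, here is a minimal sketch of reading a gzip-compressed TSV with the imp_log layout. It assumes the file columns follow the order listed above and that attr1-4 stands for four separate columns; the sample row is invented for illustration.

```python
import csv
import gzip
import io

# Assumed column order, taken from the schema listing (attr1-4 expanded to four columns).
IMP_LOG_COLUMNS = [
    "date", "publisher_id", "ad_campaign_id", "bid_price",
    "country", "attr1", "attr2", "attr3", "attr4",
]


def read_imp_log(raw: bytes):
    """Yield one dict per record from gzip-compressed, tab-separated bytes."""
    with gzip.open(io.BytesIO(raw), mode="rt", encoding="utf-8", newline="") as handle:
        for fields in csv.reader(handle, delimiter="\t"):
            row = dict(zip(IMP_LOG_COLUMNS, fields))
            row["publisher_id"] = int(row["publisher_id"])
            row["ad_campaign_id"] = int(row["ad_campaign_id"])
            row["bid_price"] = float(row["bid_price"])
            yield row


# Invented sample record, compressed in memory so the sketch needs no files.
sample = "2013-02-01 00:00:00\t7\t42\t0.25\tUS\ta\tb\tc\td\n"
rows = list(read_imp_log(gzip.compress(sample.encode("utf-8"))))
print(rows[0])
```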
10. APPENDIX – Sample Query
select
    ac.ad_campaign_id as ad_campaign_id,
    adv.advertiser_id as advertiser_id,
    cs.spending as spending,
    ims.imp_total as imp_total,
    cs.click_total as click_total,
    click_total/imp_total as CTR,
    spending/click_total as CPC,
    spending/(imp_total/1000) as CPM
from
    ad_campaigns ac
join
    advertisers adv
    on (ac.advertiser_id = adv.advertiser_id)
join
    (select
        il.ad_campaign_id,
        count(*) as imp_total
    from
        imp_logs il
    group by
        il.ad_campaign_id
    ) ims on (ims.ad_campaign_id = ac.ad_campaign_id)
join
    (select
        cl.ad_campaign_id,
        sum(cl.bid_price) as spending,
        count(*) as click_total
    from
        click_logs cl
    group by
        cl.ad_campaign_id
    ) cs on (cs.ad_campaign_id = ac.ad_campaign_id);
The query generates a basic report on ad campaign performance: impression and click counts, advertiser spending, CTR, CPC and CPM.
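To show the shape of the report without a Redshift cluster, the sketch below runs a variant of the query on a few invented rows in an in-memory SQLite database. SQLite is only a stand-in here, the toy tables carry just the columns the query touches, and the ratios are explicitly computed in floating point (a change from the slide's query) so that small counts do not truncate to zero under integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table advertisers (advertiser_id integer);
create table ad_campaigns (ad_campaign_id integer, advertiser_id integer);
create table imp_logs (ad_campaign_id integer, bid_price real);
create table click_logs (ad_campaign_id integer, bid_price real);
insert into advertisers values (1);
insert into ad_campaigns values (10, 1);
""")
# Invented toy data: 2000 impressions and 40 clicks for campaign 10.
conn.executemany("insert into imp_logs values (?, ?)", [(10, 0.5)] * 2000)
conn.executemany("insert into click_logs values (?, ?)", [(10, 0.5)] * 40)

REPORT_SQL = """
select
    ac.ad_campaign_id,
    adv.advertiser_id,
    cs.spending,
    ims.imp_total,
    cs.click_total,
    cs.click_total * 1.0 / ims.imp_total as ctr,   -- clicks per impression
    cs.spending / cs.click_total as cpc,           -- cost per click
    cs.spending / (ims.imp_total / 1000.0) as cpm  -- cost per 1000 impressions
from ad_campaigns ac
join advertisers adv on (ac.advertiser_id = adv.advertiser_id)
join (select ad_campaign_id, count(*) as imp_total
      from imp_logs group by ad_campaign_id) ims
  on (ims.ad_campaign_id = ac.ad_campaign_id)
join (select ad_campaign_id, sum(bid_price) as spending, count(*) as click_total
      from click_logs group by ad_campaign_id) cs
  on (cs.ad_campaign_id = ac.ad_campaign_id)
"""

report_row = conn.execute(REPORT_SQL).fetchone()
print(report_row)
```

Each output row carries one campaign with its advertiser, total spending, impression and click counts, and the three derived ratios.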
11. APPENDIX – Additional Comments
• A Redshift 8XL instance cannot be selected as a single node; you must choose at least 2 nodes (as of Feb. 24, 2013)
12. APPENDIX – Additional Information
• All resources for our benchmark are in our GitHub repository
– https://github.com/hapyrus/redshift-benchmark
– The dataset we use is open on S3, so you can reproduce the benchmark
13. About Us - FlyData
• FlyData Enterprise
– Enables continuous loading to Amazon Redshift,
with real-time data loading
– Automated ETL process with multiple supported
data formats
– Auto scaling, data integrity and high durability
– FlyData Sync feature allows real-time replication
from RDBMS to Amazon Redshift
Contact us at: info@flydata.com
We are an official data integration partner of Amazon Redshift.
Formerly known as Hapyrus.
14. Check us out!
-> http://flydata.com
sales@flydata.com
Toll Free: 1-855-427-9787