
Scalability of Amazon Redshift Data Loading and Query Speed


Our blog post: http://www.flydata.com/blog/posts/scalability-of-amazon-redshift-data-loading-and-query-speeds

Published in: Technology, Business
Transcript

  • 1. Hapyrus: Amazon Redshift Benchmark Series 02. Scalability of Amazon Redshift Data Loading and Query Speed: comparisons between the performance of different instances. www.flydata.com
  • 2. Amazon Redshift can load 1.2TB of data using:
      • an XL instance, taking 17 hours
      • a two-node XL instance, taking 10 hours
      • a two-node 8XL instance (equivalent to XL x16), taking about 2.6 hours
    Load speed increases with the number of nodes: about 1.6x faster on two XL nodes and about 6.6x faster on two 8XL nodes.
    Running identical queries on Amazon Redshift:
      • an XL instance took 155 seconds
      • a two-node XL instance took 55 seconds
      • a two-node 8XL instance (equivalent to XL x16) took 31 seconds
    The more nodes, the faster the query runs, but the gain tapers off: two XL nodes are about 2.8x faster than one, while 16 XL equivalents are only about 5x faster.
  • 3. The most important feature of Amazon Redshift is its flexible scalability compared to other columnar data warehouses such as Vertica, Netezza and Teradata. We ran benchmarks to compare the cost and speed of Redshift instances using:
      • 1.2TB of data and identical queries, the same data as in the previous benchmark: http://www.slideshare.net/Hapyrus/amazon-redshift-is-10x-faster-and-cheaper-than-hadoop-hive
      • 3 different types of instances: single-node XL, two-node XL, and two-node 8XL
  • 4. 1. Data Loading
      • XL x1 instance took 17 hours (17h 19m)
      • XL x2 instance took 10 hours (10h 38m)
      • 8XL x2 instance (equivalent to XL x16) took 2 hours (2h 36m)
    Data was loaded with the COPY command. All the data was copied at once.
    * The query and data used can be referenced in our Appendix
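The slide states only that the load was a single COPY over all the data. As a rough illustration, the sketch below builds a COPY statement for gzip-compressed, tab-separated files like those described in the Appendix. The bucket path and the credential placeholders are hypothetical; the statement actually used in the benchmark is in the repository named in the Appendix and may differ.

```python
def build_copy_statement(table: str, s3_prefix: str, credentials: str) -> str:
    """Return a Redshift COPY statement for gzip-compressed, tab-separated files."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"CREDENTIALS '{credentials}'\n"
        "DELIMITER '\\t'\n"  # emits the two characters \t inside the SQL string
        "GZIP;"
    )


# Hypothetical bucket and placeholder credentials, for illustration only.
statement = build_copy_statement(
    table="imp_logs",
    s3_prefix="s3://example-bucket/imp_logs/",
    credentials="aws_access_key_id=<key>;aws_secret_access_key=<secret>",
)
print(statement)
```

Pointing COPY at a key prefix rather than a single file lets Redshift read every file under that prefix in one statement, which is one way to copy all the data at once.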
  • 5. 2. Query Speed
      • XL x1 instance took 155 seconds (155.02s)
      • XL x2 instance took 55 seconds (54.77s)
      • 8XL x2 instance (equivalent to XL x16) took 31 seconds (30.61s)
    * The query used can be referenced in our Appendix
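The scaling behaviour in these three numbers is easier to judge as a speedup over the single XL node, divided by the number of XL equivalents. A small sketch using only the averages reported on this slide (the function names are illustrative):

```python
# Average query times in seconds, as reported on the slide.
query_seconds = {"XL x1": 155.02, "XL x2": 54.77, "8XL x2": 30.61}

# Capacity in XL equivalents, as stated on the slide (8XL x2 = XL x16).
xl_equivalents = {"XL x1": 1, "XL x2": 2, "8XL x2": 16}


def speedup(config: str, baseline: str = "XL x1") -> float:
    """How many times faster `config` ran the query than the baseline."""
    return query_seconds[baseline] / query_seconds[config]


def efficiency(config: str) -> float:
    """Speedup per XL equivalent; 1.0 would be perfectly linear scaling."""
    return speedup(config) / xl_equivalents[config]


for config in query_seconds:
    print(f"{config}: speedup {speedup(config):.2f}x, "
          f"efficiency {efficiency(config):.2f}")
```

An efficiency above 1.0 for XL x2 means the two-node cluster did better than linear scaling, while the 8XL x2 cluster falls well below it.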
  • 6. Redshift Data Loading Results
    * The data used can be referenced in our Appendix
    The biggest data (imp_logs: 1.2TB)
      Instance Type     Nodes   Machine Spec (XL equivalents)   Time (hours)   Speed     Cost per TB
      dw.hs1.xlarge     1       1                               17.32          70GB/h    $12.27
      dw.hs1.xlarge     2       2                               10.63          110GB/h   $15.07
      dw.hs1.8xlarge    2       16                              2.62           460GB/h   $29.64
    Loading time by table (imp_logs and the other tables)
      Table Name             XL x1             XL x2            8XL x2
      ad_campaigns (100MB)   5.82s             8.97s            8.5s
      publishers (10MB)      552ms             2.83s            2.22s
      advertisers (10MB)     770ms             3.71s            4.75s
      imp_logs (1.2TB)       17h 19m 22.27s    10h 38m 5.66s    2h 36m 55.49s
      click_logs (6GB)       11m 31.37s        10m 6.24s        37.87s
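The speed and cost columns can be rechecked from the imp_logs loading times. The sketch below assumes 1 TB = 1000 GB and per-node hourly prices of $0.85 (dw.hs1.xlarge) and $6.80 (dw.hs1.8xlarge). The prices do not appear in the slides and are an assumption here; with them, the cost-per-TB figures come out the same as in the table.

```python
DATA_TB = 1.2  # size of imp_logs, from the slides


def hours(h: int, m: int, s: float) -> float:
    """Convert an h/m/s duration to fractional hours."""
    return h + m / 60 + s / 3600


# (label, nodes, imp_logs load time, ASSUMED USD per node-hour)
runs = [
    ("dw.hs1.xlarge x1", 1, hours(17, 19, 22.27), 0.85),
    ("dw.hs1.xlarge x2", 2, hours(10, 38, 5.66), 0.85),
    ("dw.hs1.8xlarge x2", 2, hours(2, 36, 55.49), 6.80),
]


def throughput_gb_per_hour(load_hours: float) -> float:
    """Load speed in GB/h, assuming 1 TB = 1000 GB."""
    return DATA_TB * 1000 / load_hours


def cost_per_tb(nodes: int, load_hours: float, price: float) -> float:
    """Cluster cost of the load divided by the data size in TB."""
    return nodes * load_hours * price / DATA_TB


for label, nodes, load_hours, price in runs:
    print(f"{label}: {throughput_gb_per_hour(load_hours):.0f} GB/h, "
          f"${cost_per_tb(nodes, load_hours, price):.2f} per TB")
```

Loading 1.2TB in roughly 17 hours works out to about 69 GB per hour, so the speed column is on the order of gigabytes per hour.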
  • 7. Redshift Query Results
    * The query used can be referenced in our Appendix
    Processing Time
      Trial     XL x1        XL x2       8XL x2
      1         2m 28.65s    1m 1.44s    39.11s
      2         2m 37.65s    52.89s      26.77s
      3         2m 35.91s    53.76s      29.9s
      4         2m 29.04s    53.52s      27.51s
      5         2m 36.9s     52.22s      29.75s
      Average   2m 35s       54.77s      30.61s
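The averages can be recomputed from the five trials. A short sketch using the trial times listed on this slide (the helper name is illustrative):

```python
from statistics import mean


def to_seconds(minutes: int, seconds: float) -> float:
    """Convert a minutes/seconds duration to seconds."""
    return minutes * 60 + seconds


# Trial times from the slide, converted to seconds.
trials = {
    "XL x1": [to_seconds(2, 28.65), to_seconds(2, 37.65), to_seconds(2, 35.91),
              to_seconds(2, 29.04), to_seconds(2, 36.9)],
    "XL x2": [to_seconds(1, 1.44), 52.89, 53.76, 53.52, 52.22],
    "8XL x2": [39.11, 26.77, 29.9, 27.51, 29.75],
}

averages = {config: mean(times) for config, times in trials.items()}

for config, avg in averages.items():
    print(f"{config}: {avg:.2f}s")
```

The XL x2 and 8XL x2 means come out at 54.77s and 30.61s. The XL x1 trials as listed average about 153.6s, slightly below the 155.02s reported on the Query Speed slide; the slides do not explain the difference.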
  • 8. Discussion
      • Redshift can load data in parallel
          – Load time drops as nodes are added: 17.3 hours on one XL node, 10.6 hours on two, 2.6 hours on two 8XL nodes
      • Redshift query speed is strongly affected by the number of nodes
          – Big difference between single-node and multi-node clusters
          – Two XL nodes ran the query about 2.8x faster than one, more than twice as fast
          – Additional experimentation may be helpful in future
  • 9. APPENDIX - Data
    TSV files, gzip compressed
      • imp_logs: 1.2TB / 1.2B records
          date datetime, publisher_id integer, ad_campaign_id integer, bid_price real, country varchar(30), attr1-4 varchar(255)
      • click_logs: 5.6GB / 6M records
          date datetime, publisher_id integer, ad_campaign_id integer, country varchar(30), attr1-4 varchar(255)
      • ad_campaigns: 100MB / 100k records
      • publishers: 10MB / 10k records
      • advertisers: 10MB / 10k records
    Using these 5 tables, we run a query which joins the tables and creates a report.
  • 10. APPENDIX – Sample Query
      select
        ac.ad_campaign_id as ad_campaign_id,
        adv.advertiser_id as advertiser_id,
        cs.spending as spending,
        ims.imp_total as imp_total,
        cs.click_total as click_total,
        click_total/imp_total as CTR,
        spending/click_total as CPC,
        spending/(imp_total/1000) as CPM
      from ad_campaigns ac
        join advertisers adv
          on (ac.advertiser_id = adv.advertiser_id)
        join (select il.ad_campaign_id, count(*) as imp_total
              from imp_logs il
              group by il.ad_campaign_id) ims
          on (ims.ad_campaign_id = ac.ad_campaign_id)
        join (select cl.ad_campaign_id, sum(cl.bid_price) as spending, count(*) as click_total
              from click_logs cl
              group by cl.ad_campaign_id) cs
          on (cs.ad_campaign_id = ac.ad_campaign_id);
    The query generates a basic report on ad campaign performance: impression and click counts, advertiser spending, CTR, CPC and CPM.
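To see the shape of the report, the same joins and aggregates can be run on a few toy rows. This is only a sketch: it uses Python's built-in sqlite3 instead of Redshift, the rows are invented, the tables carry only the columns the query touches, and the ratios use floating-point division (`1.0 *` and `1000.0`), which is an adaptation for this sketch and not part of the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE advertisers  (advertiser_id INTEGER);
CREATE TABLE ad_campaigns (ad_campaign_id INTEGER, advertiser_id INTEGER);
CREATE TABLE imp_logs     (ad_campaign_id INTEGER, bid_price REAL);
CREATE TABLE click_logs   (ad_campaign_id INTEGER, bid_price REAL);
INSERT INTO advertisers  VALUES (10);
INSERT INTO ad_campaigns VALUES (1, 10);
""")

# Invented data: 2000 impressions and 4 clicks for campaign 1, bid price 0.5.
conn.executemany("INSERT INTO imp_logs VALUES (?, ?)", [(1, 0.5)] * 2000)
conn.executemany("INSERT INTO click_logs VALUES (?, ?)", [(1, 0.5)] * 4)

QUERY = """
SELECT ac.ad_campaign_id, adv.advertiser_id,
       cs.spending, ims.imp_total, cs.click_total,
       1.0 * cs.click_total / ims.imp_total   AS ctr,
       cs.spending / cs.click_total           AS cpc,
       cs.spending / (ims.imp_total / 1000.0) AS cpm
FROM ad_campaigns ac
JOIN advertisers adv ON ac.advertiser_id = adv.advertiser_id
JOIN (SELECT ad_campaign_id, COUNT(*) AS imp_total
      FROM imp_logs GROUP BY ad_campaign_id) ims
  ON ims.ad_campaign_id = ac.ad_campaign_id
JOIN (SELECT ad_campaign_id, SUM(bid_price) AS spending,
             COUNT(*) AS click_total
      FROM click_logs GROUP BY ad_campaign_id) cs
  ON cs.ad_campaign_id = ac.ad_campaign_id
"""

row = conn.execute(QUERY).fetchone()
print(row)
```

Each campaign yields one row: the two subqueries first collapse the impression and click logs to per-campaign totals, so the joins against ad_campaigns and advertisers touch only aggregated rows.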
  • 11. APPENDIX – Additional Comments
      • A Redshift 8XL instance cannot be launched as a single node; you must choose at least 2 nodes (as of Feb. 24, 2013)
  • 12. APPENDIX – Additional Information
      • All resources for our benchmark are in our GitHub repository
          – https://github.com/hapyrus/redshift-benchmark
          – The dataset we use is open on S3, so you can reproduce the benchmark
  • 13. About Us - FlyData (formerly known as Hapyrus)
      • FlyData Enterprise
          – Enables continuous loading to Amazon Redshift, with real-time data loading
          – Automated ETL process with multiple supported data formats
          – Auto scaling, data integrity and high durability
          – FlyData Sync feature allows real-time replication from RDBMS to Amazon Redshift
    Contact us at: info@flydata.com
    We are an official data integration partner of Amazon Redshift
  • 14. Check us out! -> http://flydata.com
    sales@flydata.com
    Toll Free: 1-855-427-9787
    We are an official data integration partner of Amazon Redshift
