Scalability of Amazon Redshift Data Loading and Query Speed
 


Our blog post: http://www.flydata.com/blog/posts/scalability-of-amazon-redshift-data-loading-and-query-speeds



    Presentation Transcript

    • Hapyrus: Amazon Redshift Benchmark Series 02
      Scalability of Amazon Redshift Data Loading and Query Speed
      Comparisons between the performance of different instances
    • Amazon Redshift can load 1.2TB of data using:
        • an XL instance, taking 17h 19m
        • a two-node XL instance, taking 10h 38m
        • a two-node 8XL instance (equivalent to XL 16x), taking 2h 36m
      Load speed is almost proportional to the number of nodes.
      Running identical queries on Amazon Redshift:
        • an XL instance took 155 seconds
        • a two-node XL instance took 55 seconds
        • a two-node 8XL instance (equivalent to XL 16x) took 31 seconds
      The more nodes, the faster a query runs, but the gain is far from proportional: 16 times the capacity ran the query about 5 times faster.
    • The most important feature of Amazon Redshift is its flexible scalability compared to other columnar data warehouses such as Vertica, Netezza and Teradata.
      We have run benchmarks to compare the cost and speed of Redshift instances using:
        • 1.2TB of data and identical queries (the same data as in our previous benchmark): http://www.slideshare.net/Hapyrus/amazon-redshift-is-10xfaster-and-cheaper-than-hadoop-hive
        • 3 different types of instances: single-node XL, two-node XL, and two-node 8XL
    • 1. Data Loading
      Data is loaded with the COPY command. All the data is copied at once.
        • XL x1 instance took 17 hours (17h 19m)
        • XL x2 instance took 10 hours (10h 38m)
        • 8XL x2 instance (equivalent to XL 16x) took 2 hours (2h 36m)
      * The query and data used can be referenced in our Appendix
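The slides do not show the exact COPY statement that was run. As a minimal sketch, a load of gzip-compressed, tab-separated files (the format named in the Appendix) could be assembled as below. The bucket name, the role ARN and the choice of `IAM_ROLE` for authorization are assumptions made for illustration, not details taken from the benchmark.

```python
def build_copy_statement(table: str, s3_prefix: str, iam_role: str) -> str:
    """Return a Redshift COPY statement for gzip-compressed, tab-separated files."""
    return "\n".join([
        f"COPY {table}",
        f"FROM '{s3_prefix}'",
        f"IAM_ROLE '{iam_role}'",   # assumption: role-based authorization
        "DELIMITER '\\t'",          # TSV input
        "GZIP;",                    # input files are gzip compressed
    ])


# Hypothetical bucket and role, for illustration only.
statement = build_copy_statement(
    table="imp_logs",
    s3_prefix="s3://example-bucket/imp_logs/",
    iam_role="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(statement)
```

Pointing COPY at a prefix that holds many files, rather than at one large file, is what lets the nodes of a cluster share the load work.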
    • 2. Query Speed
        • XL x1 instance took 155 seconds (155.02s)
        • XL x2 instance took 55 seconds (54.77s)
        • 8XL x2 instance (equivalent to XL 16x) took 31 seconds (30.61s)
      * The query used can be referenced in our Appendix
    • Redshift Data Loading Results
      The biggest data (imp_logs: 1.2TB)

      Instance Type    Nodes  Machine Spec (XL equivalents)  Time (hours)  Speed    Cost per TB
      dw.hs1.xlarge    1      1                              17.32         70GB/h   $12.27
      dw.hs1.xlarge    2      2                              10.63         110GB/h  $15.07
      dw.hs1.8xlarge   2      16                             2.62          460GB/h  $29.64

      Other Tables Loading Time

      Table Name            XL x1            XL x2           8XL x2
      ad_campaigns (100MB)  5.82s            8.97s           8.5s
      publishers (10MB)     552ms            2.83s           2.22s
      advertisers (10MB)    770ms            3.71s           4.75s
      imp_logs (1.2TB)      17h 19m 22.27s   10h 38m 5.66s   2h 36m 55.49s
      click_logs (6GB)      11m 31.37s       10m 6.24s       37.87s

      * The data used can be referenced in our Appendix
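The Speed column can be checked by dividing the data size by the load time. The sketch below does this for the three imp_logs runs, assuming 1.2TB means 1200 GB in decimal units. The results are per-hour rates in gigabytes, close to the 70, 110 and 460 listed in the table.

```python
DATA_SIZE_GB = 1200.0  # assumption: 1.2TB taken as 1200 GB (decimal units)

# Load times in hours for imp_logs, as reported in the table.
LOAD_HOURS = {
    "XL x1": 17.32,
    "XL x2": 10.63,
    "8XL x2": 2.62,
}


def throughput_gb_per_hour(size_gb: float, hours: float) -> float:
    """Average load throughput over the whole COPY run."""
    return size_gb / hours


throughput = {
    name: throughput_gb_per_hour(DATA_SIZE_GB, hours)
    for name, hours in LOAD_HOURS.items()
}

for name, rate in throughput.items():
    print(f"{name}: {rate:.0f} GB/h")
```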
    • Redshift Query Results
      Processing Time

      Trial    XL x1       XL x2      8XL x2
      1        2m 28.65s   1m 1.44s   39.11s
      2        2m 37.65s   52.89s     26.77s
      3        2m 35.91s   53.76s     29.9s
      4        2m 29.04s   53.52s     27.51s
      5        2m 36.9s    52.22s     29.75s
      Average  2m 35s      54.77s     30.61s

      * The query used can be referenced in our Appendix
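The averages can be recomputed from the five trials. The sketch below parses the durations as printed on the slide and takes the arithmetic mean per cluster; the helper name `to_seconds` is ours. For XL x2 and 8XL x2 the means come out at about 54.77s and 30.61s. For XL x1 the listed trials average to about 153.6s, a little under the 155.02s shown on the slide, and the transcript does not say why.

```python
from statistics import mean


def to_seconds(text: str) -> float:
    """Convert a duration such as '2m 28.65s' or '39.11s' to seconds."""
    factors = {"m": 60.0, "s": 1.0}
    total = 0.0
    for token in text.split():
        value, unit = float(token[:-1]), token[-1]
        total += value * factors[unit]
    return total


# Trial timings exactly as listed in the table.
TRIALS = {
    "XL x1": ["2m 28.65s", "2m 37.65s", "2m 35.91s", "2m 29.04s", "2m 36.9s"],
    "XL x2": ["1m 1.44s", "52.89s", "53.76s", "53.52s", "52.22s"],
    "8XL x2": ["39.11s", "26.77s", "29.9s", "27.51s", "29.75s"],
}

averages = {
    name: mean(to_seconds(t) for t in timings)
    for name, timings in TRIALS.items()
}

for name, seconds in averages.items():
    print(f"{name}: {seconds:.2f}s")
```

In both multi-node columns the first trial is the slowest, so a median or a mean over trials 2 to 5 may be worth reporting alongside the plain average.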
    • Discussion
        • Redshift can load data in parallel
          – Load speed is almost proportional to the number of nodes
        • Redshift query speed is drastically affected by the number of nodes
          – There is a big difference between single-node and multi-node clusters
          – Two nodes run the query more than twice as fast as one node (54.77s vs. 155.02s)
          – Additional experimentation may be helpful in the future
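One way to put numbers on these points is to compute the speedup over the single XL node and divide it by the cluster's capacity in XL equivalents. The sketch below uses the load times and average query times reported in the results slides. The "efficiency" figure (speedup divided by capacity) is our own framing rather than a metric from the slides, and treating an 8XL node as 8 XL nodes follows the slides' "equivalent to XL 16x" wording.

```python
# name: (capacity in XL equivalents, load time in hours, average query time in seconds)
RUNS = {
    "XL x1": (1, 17.32, 155.02),
    "XL x2": (2, 10.63, 54.77),
    "8XL x2": (16, 2.62, 30.61),
}


def scaling(baseline: float, measured: float, capacity: int) -> tuple[float, float]:
    """Return (speedup over baseline, speedup per unit of capacity)."""
    speedup = baseline / measured
    return speedup, speedup / capacity


_, base_load, base_query = RUNS["XL x1"]

report = {}
for name, (capacity, load_hours, query_seconds) in RUNS.items():
    report[name] = {
        "load": scaling(base_load, load_hours, capacity),
        "query": scaling(base_query, query_seconds, capacity),
    }

for name, row in report.items():
    load_speedup, load_eff = row["load"]
    query_speedup, query_eff = row["query"]
    print(
        f"{name}: load {load_speedup:.2f}x (efficiency {load_eff:.2f}), "
        f"query {query_speedup:.2f}x (efficiency {query_eff:.2f})"
    )
```

An efficiency of 1.0 would mean perfectly proportional scaling. A value above 1.0, as for the query on two XL nodes, means the cluster did better than proportional.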
    • APPENDIX - Data
      Using 5 tables, we run a query that joins tables and creates a report.

      imp_log (1.2TB / 1.2B records)
        date            datetime
        publisher_id    integer
        ad_campaign_id  integer
        country         varchar(30)
        attr1-4         varchar(255)

      click_log (5.6GB / 6M records)
        date            datetime
        publisher_id    integer
        ad_campaign_id  integer
        bid_price       real
        country         varchar(30)
        attr1-4         varchar(255)

      ad_campaign (100MB / 100k records)
      publisher (10MB / 10k records)
      advertiser (10MB / 10k records)

      TSV files, gzip compressed
    • APPENDIX – Sample Query
      The query generates a basic report of ad campaign performance: impression and click counts, advertiser spending, CTR, CPC and CPM.

      select
        ac.ad_campaign_id as ad_campaign_id,
        adv.advertiser_id as advertiser_id,
        cs.spending as spending,
        ims.imp_total as imp_total,
        cs.click_total as click_total,
        click_total/imp_total as CTR,
        spending/click_total as CPC,
        spending/(imp_total/1000) as CPM
      from ad_campaigns ac
        join advertisers adv
          on (ac.advertiser_id = adv.advertiser_id)
        join (select il.ad_campaign_id,
                     count(*) as imp_total
              from imp_logs il
              group by il.ad_campaign_id
             ) ims
          on (ims.ad_campaign_id = ac.ad_campaign_id)
        join (select cl.ad_campaign_id,
                     sum(cl.bid_price) as spending,
                     count(*) as click_total
              from click_logs cl
              group by cl.ad_campaign_id
             ) cs
          on (cs.ad_campaign_id = ac.ad_campaign_id);
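The three ratio columns follow directly from the per-campaign totals. As a minimal sketch, the same formulas can be written in Python with floating-point division. The function name and the sample figures are invented for illustration and are not taken from the benchmark data.

```python
def campaign_metrics(spending: float, imp_total: int, click_total: int) -> dict:
    """Compute CTR, CPC and CPM with the same formulas as the sample query."""
    return {
        "CTR": click_total / imp_total,        # clicks per impression
        "CPC": spending / click_total,         # cost per click
        "CPM": spending / (imp_total / 1000),  # cost per thousand impressions
    }


# Made-up totals for one campaign, for illustration only.
metrics = campaign_metrics(spending=50.0, imp_total=100_000, click_total=250)
print(metrics)
```

Python's `/` always produces a float. In the SQL version both counts are integers, so depending on the engine's integer-division rules `click_total/imp_total` and `imp_total/1000` may truncate; casting one operand to a floating-point type avoids that if exact ratios are needed.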
    • APPENDIX – Additional Comments • A Redshift 8XL instance cannot be selected as single node (you must choose at least 2 nodes) (as of Feb. 24, 2013)
    • APPENDIX – Additional Information
        • All resources for our benchmark are in our GitHub repository
          – https://github.com/hapyrus/redshiftbenchmark
          – The dataset we use is open on S3, so you can reproduce the benchmark
    • About Us - FlyData (formerly known as Hapyrus)
      We are an official data integration partner of Amazon Redshift.
        • FlyData Enterprise
          – Enables continuous loading to Amazon Redshift, with real-time data loading
          – Automated ETL process with multiple supported data formats
          – Auto scaling, data integrity and high durability
          – FlyData Sync feature allows real-time replication from RDBMS to Amazon Redshift
      Contact us at: info@flydata.com