Amazon Redshift is 10x faster and cheaper than Hadoop + Hive

Our blog post: http://www.flydata.com/blog/posts/behind-amazon-redshift-is-10x-faster-and-cheaper-than-hadoop-hive-slides

Published in: Technology, Business

Transcript of "Amazon Redshift is 10x faster and cheaper than Hadoop + Hive"

  1. FlyData: Amazon Redshift BENCHMARK Series 01
     Amazon Redshift is 10x faster and cheaper than Hadoop + Hive
     Comparisons of speed and cost efficiency
     www.flydata.com
  2. Amazon Redshift took 155 seconds to run our queries for 1.2TB data
     Hadoop + Hive took 1491 seconds to run our queries for 1.2TB data
     Amazon Redshift was 10X faster
     Amazon Redshift cost $20 to run a query every 30 minutes
     Hadoop + Hive cost $210 to run a query every 30 minutes
     Amazon Redshift was 10X more cost effective
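As a quick check on the "10X" figures in the summary slide, the ratios can be computed directly from the reported numbers. This is only a minimal sketch; the variable names are illustrative and do not come from the benchmark repository.

```python
# Figures reported on the summary slide for the 1.2TB dataset.
redshift_seconds = 155
hadoop_seconds = 1491
redshift_cost_usd = 20
hadoop_cost_usd = 210

# How many times faster and cheaper Redshift was in this benchmark.
speedup = hadoop_seconds / redshift_seconds
cost_ratio = hadoop_cost_usd / redshift_cost_usd

print(f"speedup: {speedup:.1f}x")        # speedup: 9.6x
print(f"cost ratio: {cost_ratio:.1f}x")  # cost ratio: 10.5x
```

Both ratios round to roughly 10, which is where the headline claim comes from.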
  3. Amazon Redshift is a new data warehouse for big data on the cloud. Before Redshift, users had to turn to Hadoop to query terabytes of data. We ran benchmarks comparing Redshift to Hadoop (Amazon Elastic MapReduce), both on AWS, specifically to show the differences for advertising agencies.
     • From 100GB to ~50TB of data
     • Frequent queries (more than once an hour)
     • Short turnaround time required
  4. Prerequisite - Data
     TSV files, gzip compressed

     imp_log
       Size:    1) 300GB / 300M records   2) 1.2TB / 1.2B records
       Columns: date datetime, publisher_id integer, ad_campaign_id integer, bid_price real, country varchar(30), attr1-4 varchar(255)
     click_log
       Size:    1) 1.4GB / 1.5M records   2) 5.6GB / 6M records
       Columns: date datetime, publisher_id integer, ad_campaign_id integer, country varchar(30), attr1-4 varchar(255)
     (1) is for 1 month, 2) is for 4 months)

     ad_campaign   100MB / 100k records
     publisher     10MB / 10k records
     advertiser    10MB / 10k records

     We use 5 tables to run a query that joins tables and creates a report.
  5. 1. Query Speed
     • Redshift takes 155 seconds to complete our query for 1.2TB
     • Hadoop takes 1491 seconds to complete our query for 1.2TB
     • Redshift is about 10 times faster than Hadoop for this query
     Here, we are comparing Hadoop and Redshift servers of the same cost (Hadoop: c1.xlarge vs Redshift: dw.hs1.xlarge).
     [Chart: 300GB: Redshift 38sec vs Hadoop 672sec; 1.2TB: Redshift 155sec vs Hadoop 1491sec]
     * The query used can be referenced in our Appendix
  6. 2. Total Cost
     • Redshift costs $20 per month to run queries every 30 minutes
     • Hadoop costs $210 per month to run queries every 30 minutes
     • Redshift is about 10 times cheaper than Hadoop to run this job
     Here, we are comparing Hadoop and Redshift servers running the same query for the same duration of time.
     * The query used can be referenced in our Appendix
  7. Redshift Query Result

     Data Size  Instance Type  Number of Instances  Trial  Processing Time  Average  Server Cost Per Day
     300GB      dw.hs1.xlarge  1                    1      58s              38s      $20.40
                                                    2      43s
                                                    3      31s
                                                    4      30s
                                                    5      30s
     1.2TB      dw.hs1.xlarge  1                    1      164s             155s     $20.40
                                                    2      149s
                                                    3      158s
                                                    4      156s
                                                    5      150s

     * The query used can be referenced in our Appendix
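The "Average" column in the Redshift result table can be reproduced from the five trial times. The sketch below simply averages the reported trials; the dictionary layout is an illustrative choice, not something taken from the benchmark code.

```python
from statistics import mean

# Per-trial processing times in seconds, copied from the Redshift result table.
trial_seconds = {
    "300GB": [58, 43, 31, 30, 30],
    "1.2TB": [164, 149, 158, 156, 150],
}

# Arithmetic mean per data size; the slide reports these rounded to whole seconds.
averages = {size: mean(times) for size, times in trial_seconds.items()}

for size, avg in averages.items():
    print(f"{size}: {avg:.1f}s")
# 300GB: 38.4s
# 1.2TB: 155.4s
```

Rounded to whole seconds these are the 38s and 155s shown in the table. In the 300GB run the first trial (58s) is noticeably slower than the later ones.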
  8. Hadoop Query Result

     Data Size  Instance Type  Instance Number  Processing Time  Server Cost Per Day
     300GB      c1.xlarge      1                1h 23m 2s        $0.80
                c1.medium      10               37m 48s          $0.89
                c1.xlarge      10               11m 12s          $1.06
     1.2TB      m1.xlarge      1                6h 43m 24s       $3.22
                c1.medium      4                5h 14m 0s        $3.04
                c1.xlarge      10               37m 7s           $3.58
                c1.xlarge      20               24m 51s          $4.64

     * The query used can be referenced in our Appendix
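The Hadoop times are given as hours, minutes and seconds, while the Redshift times and the speed comparison use seconds. A small conversion makes the two comparable. This is a sketch only: `to_seconds` is a hypothetical helper written for this note, and the tuples are transcribed from the Hadoop result table.

```python
import re

def to_seconds(duration: str) -> int:
    """Convert a duration such as '1h 23m 2s' or '37m 48s' to whole seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)([hms])", duration))

# (data size, instance type, instance count, processing time) from the Hadoop table.
hadoop_runs = [
    ("300GB", "c1.xlarge", 1, "1h 23m 2s"),
    ("300GB", "c1.medium", 10, "37m 48s"),
    ("300GB", "c1.xlarge", 10, "11m 12s"),
    ("1.2TB", "m1.xlarge", 1, "6h 43m 24s"),
    ("1.2TB", "c1.medium", 4, "5h 14m 0s"),
    ("1.2TB", "c1.xlarge", 10, "37m 7s"),
    ("1.2TB", "c1.xlarge", 20, "24m 51s"),
]

seconds = [to_seconds(run[3]) for run in hadoop_runs]
print(seconds)  # [4982, 2268, 672, 24204, 18840, 2227, 1491]
```

The fastest run for each data size (672 seconds for 300GB on 10 c1.xlarge instances, 1491 seconds for 1.2TB on 20) matches the Hadoop figures quoted in the query speed comparison.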
  9. Discussion
     • Consider Redshift
       – If your data is big (>TB) and you need to run your queries more than once an hour
       – If you want to get quick results
     • Consider Hadoop (EMR)
       – If your data is too big (>PB)
       – If your job queries are once a day, week or month
       – If you already have invested in Hadoop technology specialists
  10. Appendix – Sample Query

      select
        ac.ad_campaign_id as ad_campaign_id,
        adv.advertiser_id as advertiser_id,
        cs.spending as spending,
        ims.imp_total as imp_total,
        cs.click_total as click_total,
        click_total/imp_total as CTR,
        spending/click_total as CPC,
        spending/(imp_total/1000) as CPM
      from ad_campaigns ac
        join advertisers adv
          on (ac.advertiser_id = adv.advertiser_id)
        join (select il.ad_campaign_id, count(*) as imp_total
              from imp_logs il
              group by il.ad_campaign_id
             ) ims
          on (ims.ad_campaign_id = ac.ad_campaign_id)
        join (select cl.ad_campaign_id, sum(cl.bid_price) as spending, count(*) as click_total
              from click_logs cl
              group by cl.ad_campaign_id
             ) cs
          on (cs.ad_campaign_id = ac.ad_campaign_id);

      The query generates a basic ad campaign performance report: impression and click counts, advertiser spending, CTR, CPC and CPM.
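To see what the sample query returns, it can be run on a tiny in-memory SQLite database. This is a sketch under assumptions, not the benchmark setup: SQLite stands in for Redshift and Hive, the tables contain only the columns the query touches, `click_logs` is given a `bid_price` column because the query sums it, and the rows are made-up toy data.

```python
import sqlite3

# The sample query from the appendix, unchanged apart from layout.
REPORT_SQL = """
select
  ac.ad_campaign_id as ad_campaign_id,
  adv.advertiser_id as advertiser_id,
  cs.spending as spending,
  ims.imp_total as imp_total,
  cs.click_total as click_total,
  click_total/imp_total as CTR,
  spending/click_total as CPC,
  spending/(imp_total/1000) as CPM
from ad_campaigns ac
  join advertisers adv on (ac.advertiser_id = adv.advertiser_id)
  join (select il.ad_campaign_id, count(*) as imp_total
        from imp_logs il group by il.ad_campaign_id) ims
    on (ims.ad_campaign_id = ac.ad_campaign_id)
  join (select cl.ad_campaign_id, sum(cl.bid_price) as spending,
               count(*) as click_total
        from click_logs cl group by cl.ad_campaign_id) cs
    on (cs.ad_campaign_id = ac.ad_campaign_id)
"""

conn = sqlite3.connect(":memory:")
# Assumed minimal schemas: only the columns referenced by the query.
conn.executescript("""
create table advertisers (advertiser_id integer);
create table ad_campaigns (ad_campaign_id integer, advertiser_id integer);
create table imp_logs (ad_campaign_id integer, bid_price real);
create table click_logs (ad_campaign_id integer, bid_price real);
""")

# Toy data: one advertiser, one campaign, 2000 impressions, 40 clicks at 0.5 each.
conn.execute("insert into advertisers values (1)")
conn.execute("insert into ad_campaigns values (10, 1)")
conn.executemany("insert into imp_logs values (10, ?)", [(0.5,)] * 2000)
conn.executemany("insert into click_logs values (10, ?)", [(0.5,)] * 40)

row = conn.execute(REPORT_SQL).fetchone()
print(row)  # (10, 1, 20.0, 2000, 40, 0, 0.5, 10.0)
```

On this toy data CTR comes out as 0 because SQLite truncates when dividing two integers. On an engine with the same integer division semantics, a cast such as `1.0 * click_total / imp_total` would be needed to get a fractional CTR.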
  11. APPENDIX - Additional Comments
      • Redshift is good for aggregate calculations such as sum, average, max, min, etc. because it is a columnar database
      • Importing large amounts of data takes a lot of time
        – 17 hours for 1.2TB in our case
        – Continuous importing is useful
      • Redshift supports only delimiter-separated formats like CSV, TSV
        – JSON is not supported
      • Redshift supports only primitive data types
        – 11 types: INT, DOUBLE, BOOLEAN, VARCHAR, DATE, ... (as of Feb. 17, 2013)
  12. APPENDIX – Additional Information
      • All resources for our benchmark are on our github repository
        – https://github.com/hapyrus/redshift-benchmark
        – The dataset we use is open on S3, so you can reproduce the benchmark
  13. About Us - FlyData
      • FlyData Enterprise
        – Enables continuous loading to Amazon Redshift, with real-time data loading
        – Automated ETL process with multiple supported data formats
        – Auto scaling, data integrity and high durability
        – FlyData Sync feature allows real-time replication from RDBMS to Amazon Redshift
      Contact us at: info@flydata.com
      We are an official data integration partner of Amazon Redshift
      Formerly known as Hapyrus
  14. Check us out! -> http://flydata.com
      sales@flydata.com
      Toll Free: 1-855-427-9787
      We are an official data integration partner of Amazon Redshift