Amazon Redshift is 10x faster and cheaper than Hadoop + Hive


Our blog post: http://www.flydata.com/blog/posts/behind-amazon-redshift-is-10x-faster-and-cheaper-than-hadoop-hive-slides


Published in Technology, Business
  • Does anyone here have information comparing Amazon Redshift and Google BigQuery?
  • @SeanScott5 amen
  • If interested in the Redshift technology, join our new discussion group.
    www.linkedin.com/groups/Redshift-Professionals-4884099
    Alex Friedgan, group moderator
  • This is the same tired story with all the column databases that say they are better than Hive.
    'Our queries take 1 second and ARE SUPER FAST' (it only takes us 17 hours to insert the data into a columnar format). That is the problem: data usually arrives in a row-oriented format.
  • Thanks a lot for your help... I am almost there :) I was able to download the 'ad_campaigns', 'advertisers' and 'click_logs' zipped data files successfully, but I don't see the 'publishers' and 'imp_logs' files.

Transcript

  • 1. FlyData: Amazon Redshift BENCHMARK Series 01
    Amazon Redshift is 10x faster and cheaper than Hadoop + Hive
    Comparisons of speed and cost efficiency
    www.flydata.com
  • 2. Speed: Amazon Redshift took 155 seconds to run our queries on 1.2TB of data. Hadoop + Hive took 1491 seconds to run the same queries on the same data. Amazon Redshift was about 10x faster.
    Cost: Amazon Redshift cost $20 to run a query every 30 minutes. Hadoop + Hive cost $210 to run a query every 30 minutes. Amazon Redshift was about 10x more cost-effective.
  • 3. Amazon Redshift is a new data warehouse for big data in the cloud. Before Redshift, users had to turn to Hadoop to query terabytes of data. We ran benchmarks comparing Redshift with Hadoop (Amazon Elastic MapReduce), both in AWS environments, specifically to show the differences for advertising agencies:
    • Data size between 100GB and ~50TB
    • Frequent queries (more than once an hour)
    • Short turnaround time required
  • 4. Prerequisite - Data
    TSV files, gzip compressed. We use 5 tables to run a query which joins the tables and creates a report.
    Sizes marked 1) are for 1 month of data; sizes marked 2) are for 4 months.
    Imp_log: 1) 300GB / 300M records, 2) 1.2TB / 1.2B records
      date            datetime
      publisher_id    integer
      ad_campaign_id  integer
      bid_price       real
      country         varchar(30)
      attr1-4         varchar(255)
    click_log: 1) 1.4GB / 1.5M records, 2) 5.6GB / 6M records
      date            datetime
      publisher_id    integer
      ad_campaign_id  integer
      country         varchar(30)
      attr1-4         varchar(255)
    ad_campaign: 100MB / 100k records
    publisher: 10MB / 10k records
    advertiser: 10MB / 10k records
  • 5. 1. Query Speed
    • Redshift takes 155 seconds to complete our query for 1.2TB
    • Hadoop takes 1491 seconds to complete our query for 1.2TB
    • Redshift is about 10 times faster than Hadoop for this query
    Here, we are comparing Hadoop and Redshift servers of the same cost (Hadoop: c1.xlarge vs Redshift: dw.hs1.xlarge).
    Chart values: 300GB: Hadoop 672 sec, Redshift 38 sec. 1.2TB: Hadoop 1491 sec, Redshift 155 sec.
    * The query used can be found in our Appendix
  • 6. 2. Total Cost
    • Redshift costs $20 per day to run queries every 30 minutes
    • Hadoop costs $210 per day to run queries every 30 minutes
    • Redshift is about 10 times cheaper than Hadoop to run this job
    Here, we are comparing Hadoop and Redshift servers running the same query for the same duration of time.
    * The query used can be found in our Appendix
  • 7. Redshift Query Result
    Data Size  Instance Type  Number of Instances  Processing Time (Trials 1-5)  Average  Server Cost Per Day
    300GB      dw.hs1.xlarge  1                    58s, 43s, 31s, 30s, 30s       38s      $20.40
    1.2TB      dw.hs1.xlarge  1                    164s, 149s, 158s, 156s, 150s  155s     $20.40
    * The query used can be found in our Appendix
  • 8. Hadoop Query Result
    Data Size  Instance Type  Instance Number  Processing Time  Server Cost Per Day
    300GB      c1.xlarge      1                1h 23m 2s        $0.80
    300GB      c1.medium      10               37m 48s          $0.89
    300GB      c1.xlarge      10               11m 12s          $1.06
    1.2TB      m1.xlarge      1                6h 43m 24s       $3.22
    1.2TB      c1.medium      4                5h 14m 0s        $3.04
    1.2TB      c1.xlarge      10               37m 7s           $3.58
    1.2TB      c1.xlarge      20               24m 51s          $4.64
    * The query used can be found in our Appendix
  • 9. Discussion
    • Consider Redshift
      – If your data is big (>TB) and you need to run your queries more than once an hour
      – If you want to get quick results
    • Consider Hadoop (EMR)
      – If your data is too big (>PB)
      – If your jobs run once a day, week or month
      – If you have already invested in Hadoop technology specialists
  • 10. Appendix – Sample Query
    select
      ac.ad_campaign_id as ad_campaign_id,
      adv.advertiser_id as advertiser_id,
      cs.spending as spending,
      ims.imp_total as imp_total,
      cs.click_total as click_total,
      click_total/imp_total as CTR,
      spending/click_total as CPC,
      spending/(imp_total/1000) as CPM
    from ad_campaigns ac
    join advertisers adv
      on (ac.advertiser_id = adv.advertiser_id)
    join (select il.ad_campaign_id, count(*) as imp_total
          from imp_logs il
          group by il.ad_campaign_id) ims
      on (ims.ad_campaign_id = ac.ad_campaign_id)
    join (select cl.ad_campaign_id, sum(cl.bid_price) as spending, count(*) as click_total
          from click_logs cl
          group by cl.ad_campaign_id) cs
      on (cs.ad_campaign_id = ac.ad_campaign_id);
    The query generates a basic report on ad campaign performance: impression and click numbers, advertiser spending, CTR, CPC and CPM.
  • 11. Appendix - Additional Comments
    • Redshift is good for aggregate calculations such as sum, average, max, min, etc. because it is a columnar database
    • Importing large amounts of data takes a lot of time
      – 17 hours for 1.2TB in our case
      – Continuous importing is useful
    • Redshift supports only delimiter-separated formats like CSV and TSV
      – JSON is not supported
    • Redshift supports only primitive data types
      – 11 types: INT, DOUBLE, BOOLEAN, VARCHAR, DATE.. (as of Feb. 17, 2013)
  • 12. Appendix – Additional Information
    • All resources for our benchmark are in our GitHub repository
      – https://github.com/hapyrus/redshift-benchmark
      – The dataset we use is open on S3, so you can reproduce the benchmark
  • 13. About Us - FlyData
    • FlyData Enterprise
      – Enables continuous loading to Amazon Redshift, with real-time data loading
      – Automated ETL process with multiple supported data formats
      – Auto scaling, data integrity and high durability
      – FlyData Sync feature allows real-time replication from RDBMS to Amazon Redshift
    Contact us at: info@flydata.com
    We are an official data integration partner of Amazon Redshift
    Formerly known as Hapyrus
  • 14. Check us out! -> http://flydata.com
    sales@flydata.com
    Toll Free: 1-855-427-9787
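
The headline numbers on slides 2 and 5 can be re-derived from the result tables on slides 7 and 8. The following is a minimal Python sketch of that arithmetic, not part of the original benchmark code: the timings are copied from the slides, while the dictionary and function names are assumptions made for this illustration. The Hadoop entries use the fastest configuration reported for each data size, which matches the chart values on slide 5.

```python
# Redshift trial timings in seconds, copied from the result table on slide 7.
redshift_trials = {
    "300GB": [58, 43, 31, 30, 30],
    "1.2TB": [164, 149, 158, 156, 150],
}

# Fastest Hadoop + Hive runs from slide 8, as (hours, minutes, seconds).
hadoop_best = {
    "300GB": (0, 11, 12),  # c1.xlarge x 10
    "1.2TB": (0, 24, 51),  # c1.xlarge x 20
}


def to_seconds(hours, minutes, seconds):
    """Convert an h/m/s processing time into seconds."""
    return hours * 3600 + minutes * 60 + seconds


def mean(values):
    """Arithmetic mean of the trial timings."""
    return sum(values) / len(values)


def speedup(size):
    """Hadoop processing time divided by the average Redshift time."""
    return to_seconds(*hadoop_best[size]) / mean(redshift_trials[size])


for size in redshift_trials:
    print(
        size,
        "Redshift avg (s):", round(mean(redshift_trials[size])),
        "Hadoop (s):", to_seconds(*hadoop_best[size]),
        "speedup:", round(speedup(size), 1),
    )
```

For the 1.2TB data set the ratio comes out at roughly 9.6, which the slides round to "about 10 times faster". For 300GB the same calculation gives about 17.5.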
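
The three report metrics computed by the sample query on slide 10 (CTR, CPC and CPM) are plain ratios of the per-campaign totals. The sketch below restates those formulas in Python so they can be checked by hand. The function name and the campaign totals are hypothetical, invented for this illustration, and are not taken from the benchmark data set.

```python
def campaign_metrics(spending, imp_total, click_total):
    """Return (CTR, CPC, CPM) using the same formulas as the sample query."""
    ctr = click_total / imp_total        # clicks per impression
    cpc = spending / click_total         # spending per click
    cpm = spending / (imp_total / 1000)  # spending per 1000 impressions
    return ctr, cpc, cpm


# Hypothetical totals for one campaign (illustrative values only).
ctr, cpc, cpm = campaign_metrics(spending=250.0, imp_total=100_000, click_total=500)
print(ctr, cpc, cpm)

# Python's "/" is floating-point division. With truncating integer division
# ("//" here), the same CTR expression collapses to 0 whenever a campaign
# has fewer clicks than impressions.
truncated_ctr = 500 // 100_000
print(truncated_ctr)
```

Some SQL engines (PostgreSQL, for example) truncate when both operands are integers, so `click_total/imp_total` on two `count(*)` results may need a cast to a floating-point type to produce a fractional CTR there. Because the sample query uses inner joins on the grouped subqueries, campaigns with no impressions or no clicks do not appear in the report, so the sketch does not guard against division by zero.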