Amazon Redshift is 10x faster and cheaper than Hadoop + Hive

Our blog post: http://www.flydata.com/blog/posts/behind-amazon-redshift-is-10x-faster-and-cheaper-than-hadoop-hive-slides

  • Does anyone here have info comparing Amazon Redshift and Google BigQuery?
  • @SeanScott5 amen
  • If interested in the Redshift technology, join our new discussion group.
    www.linkedin.com/groups/Redshift-Professionals-4884099
    Alex Friedgan, group moderator
  • This is the same tired story with all the columnar databases that say they are better than Hive.
    'Our queries take 1 second and ARE SUPER FAST' (it only takes us 17 hours to insert the data into a columnar format). That is the problem: data usually comes in row-oriented format.
  • Thanks a lot for your help... I am almost there :) I was able to download the 'ad_campaigns', 'advertisers' and 'click_logs' zipped data files successfully, but I don't see the 'publishers' and 'imp_logs' ..

Transcript

  • 1. FlyData: Amazon Redshift Benchmark Series 01. Comparisons of speed and cost efficiency: Amazon Redshift is 10x faster and cheaper than Hadoop + Hive
  • 2. Amazon Redshift took 155 seconds to run our queries on 1.2TB of data; Hadoop + Hive took 1491 seconds to run our queries on 1.2TB of data. Amazon Redshift was 10x faster. Amazon Redshift cost $20 to run a query every 30 minutes; Hadoop + Hive cost $210 to run a query every 30 minutes. Amazon Redshift was 10x more cost-effective.
  • 3. Amazon Redshift is a new data warehouse for big data on the cloud. Before Redshift, users had to turn to Hadoop to query over TBs of data. We ran benchmarks comparing Redshift to Hadoop (Amazon Elastic MapReduce), both in AWS environments, specifically to show the differences for advertising agencies with: • Between 100GB and ~50TB of data • Frequent queries (more than once an hour) • Short turnaround time required
  • 4. Prerequisite - Data. We use five tables and run a query that joins them to create a report. Dataset 1) covers 1 month of logs and dataset 2) covers 4 months; all data is stored as TSV files, gzip compressed (a DDL sketch for the two log tables follows the transcript).
    – imp_log: 1) 300GB / 300M records, 2) 1.2TB / 1.2B records; columns: date (datetime), publisher_id (integer), ad_campaign_id (integer), country (varchar(30)), attr1-4 (varchar(255))
    – click_log: 1) 1.4GB / 1.5M records, 2) 5.6GB / 6M records; columns: date (datetime), publisher_id (integer), ad_campaign_id (integer), bid_price (real), country (varchar(30)), attr1-4 (varchar(255))
    – ad_campaign: 100MB / 100k records
    – publisher: 10MB / 10k records
    – advertiser: 10MB / 10k records
  • 5. 1. Query Speed. Here, we are comparing Hadoop and Redshift servers of the same cost (Hadoop: c1.xlarge vs. Redshift: dw.hs1.xlarge). [Chart: query times. Hadoop: 672 sec (300GB), 1491 sec (1.2TB). Redshift: 38 sec (300GB), 155 sec (1.2TB).] * The query used can be referenced in our Appendix. • Redshift takes 155 seconds to complete our query for 1.2TB • Hadoop takes 1491 seconds to complete our query for 1.2TB • Redshift is about 10 times faster than Hadoop for this query
  • 6. 2. Total Cost. Here, we are comparing Hadoop and Redshift servers running the same query for the same duration of time. • Redshift costs $20 per month to run queries every 30 minutes • Hadoop costs $210 per month to run queries every 30 minutes • Redshift is about 10 times cheaper than Hadoop to run this job. * The query used can be referenced in our Appendix
  • 7. Redshift Query Result (* the query used can be referenced in our Appendix; the arithmetic behind the $20.40 per-day cost is sketched after the transcript)
    – 300GB, dw.hs1.xlarge x 1 node, 5 trials: 30s, 30s, 31s, 43s, 58s; average: 38s; server cost per day: $20.40
    – 1.2TB, dw.hs1.xlarge x 1 node, 5 trials: 164s, 149s, 158s, 156s, 150s; average: 155s; server cost per day: $20.40
  • 8. Hadoop Query Result (* the query used can be referenced in our Appendix; a Hive table sketch for the S3 data follows the transcript)
    – 300GB: c1.xlarge x 1: 1h 23m 2s, $0.80 per day; c1.medium x 10: 37m 48s, $0.89 per day; c1.xlarge x 10: 11m 12s, $1.06 per day
    – 1.2TB: m1.xlarge x 1: 6h 43m 24s, $3.22 per day; c1.medium x 4: 5h 14m 0s, $3.04 per day; c1.xlarge x 10: 37m 7s, $3.58 per day; c1.xlarge x 20: 24m 51s, $4.64 per day
  • 9. Discussion
    • Consider Redshift
      – If your data is big (>TB) and you need to run your queries more than once an hour
      – If you want to get quick results
    • Consider Hadoop (EMR)
      – If your data is too big (>PB)
      – If your queries run only once a day, week or month
      – If you have already invested in Hadoop technology and specialists
  • 10. Appendix – Sample Query. The query generates a basic report of ad campaign performance: imp and click numbers, advertiser spending, CTR, CPC and CPM (a note on the CTR division follows the transcript).
    select
      ac.ad_campaign_id as ad_campaign_id,
      adv.advertiser_id as advertiser_id,
      cs.spending as spending,
      ims.imp_total as imp_total,
      cs.click_total as click_total,
      click_total/imp_total as CTR,
      spending/click_total as CPC,
      spending/(imp_total/1000) as CPM
    from ad_campaigns ac
    join advertisers adv
      on (ac.advertiser_id = adv.advertiser_id)
    join (select il.ad_campaign_id, count(*) as imp_total
          from imp_logs il
          group by il.ad_campaign_id) ims
      on (ims.ad_campaign_id = ac.ad_campaign_id)
    join (select cl.ad_campaign_id, sum(cl.bid_price) as spending, count(*) as click_total
          from click_logs cl
          group by cl.ad_campaign_id) cs
      on (cs.ad_campaign_id = ac.ad_campaign_id);
  • 11. APPENDIX - Additional Comments
    • Redshift is good for aggregate calculations such as sum, average, max, min, etc., because it is a columnar database
    • Importing large amounts of data takes a lot of time – 17 hours for 1.2TB in our case – so continuous importing is useful (a sample COPY command follows the transcript)
    • Redshift supports only “separated” formats like CSV and TSV – JSON is not supported
    • Redshift supports only primitive data types – 11 types: INT, DOUBLE, BOOLEAN, VARCHAR, DATE, etc. (as of Feb. 17, 2013)
  • 12. APPENDIX – Additional Information
    • All resources for our benchmark are in our GitHub repository
      – https://github.com/hapyrus/redshift-benchmark
      – The dataset we use is open on S3, so you can reproduce the benchmark
  • 13. About Us - FlyData (formerly known as Hapyrus)
    • FlyData Enterprise – We are an official data integration partner of Amazon Redshift
      – Enables continuous loading to Amazon Redshift, with real-time data loading
      – Automated ETL process with multiple supported data formats
      – Auto scaling, data integrity and high durability
      – The FlyData Sync feature allows real-time replication from an RDBMS to Amazon Redshift
    Contact us at: info@flydata.com
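
For the data layout on slide 4, a minimal DDL sketch, assuming Amazon Redshift as the target: the table names follow the sample query on slide 10 (imp_logs, click_logs), the slide's "datetime" type is mapped to Redshift's TIMESTAMP, and the DISTKEY/SORTKEY choices are illustrative guesses rather than FlyData's actual settings.

    -- Impression log: one row per ad impression (columns from slide 4)
    create table imp_logs (
      "date"          timestamp,   -- quoted since date is also a type name
      publisher_id    integer,
      ad_campaign_id  integer,
      country         varchar(30),
      attr1           varchar(255),
      attr2           varchar(255),
      attr3           varchar(255),
      attr4           varchar(255)
    )
    distkey (ad_campaign_id)       -- co-locate rows that join/group on campaign
    sortkey ("date");

    -- Click log: adds bid_price, which the report sums as advertiser spending
    create table click_logs (
      "date"          timestamp,
      publisher_id    integer,
      ad_campaign_id  integer,
      bid_price       real,
      country         varchar(30),
      attr1           varchar(255),
      attr2           varchar(255),
      attr3           varchar(255),
      attr4           varchar(255)
    )
    distkey (ad_campaign_id)
    sortkey ("date");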
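
The $20.40 server cost per day in the Redshift table on slide 7 is consistent with a single dw.hs1.xlarge node running around the clock, assuming the early-2013 on-demand rate of $0.85 per hour (the hourly rate itself is not stated in the deck):

    \[ 1 \text{ node} \times \$0.85/\text{hour} \times 24 \text{ hours} = \$20.40 \text{ per day} \]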
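
For the Hadoop (EMR) runs on slide 8, Hive would typically read the gzipped TSV files straight from S3 through an external table. A minimal sketch, assuming a placeholder bucket path (not FlyData's actual S3 location) and the imp_log columns from slide 4:

    -- HiveQL: external table over gzipped, tab-separated impression logs in S3;
    -- Hadoop decompresses .gz text files transparently at read time.
    create external table imp_logs (
      `date`          string,      -- backquoted because DATE is a keyword in HiveQL
      publisher_id    int,
      ad_campaign_id  int,
      country         string,
      attr1           string,
      attr2           string,
      attr3           string,
      attr4           string
    )
    row format delimited
    fields terminated by '\t'
    location 's3://your-bucket/imp_logs/';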
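
One detail to watch in the sample query on slide 10: CTR is computed as click_total/imp_total, where both operands are integer counts. In Redshift (as in PostgreSQL) integer division truncates toward zero, so the ratio needs a cast to come out as a fraction, while Hive's / operator returns a double; the two systems can therefore report different values for the same expression. A tiny illustration, separate from the benchmarked query:

    -- Redshift/PostgreSQL: integer division truncates; an explicit cast keeps the fraction
    select 3 / 1000          as ctr_integer_division,   -- 0
           3::float / 1000   as ctr_with_cast;          -- 0.003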
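
Slide 11 notes that loading 1.2TB took about 17 hours and that Redshift accepts delimiter-separated formats only. Such a load is normally done with the COPY command reading the gzipped TSV files from S3; a hedged sketch with a placeholder bucket and placeholder credentials (the CREDENTIALS string form was the common style in early 2013):

    -- Load gzipped, tab-separated impression logs from S3 into Redshift
    copy imp_logs
    from 's3://your-bucket/imp_logs/'
    credentials 'aws_access_key_id=<your-key>;aws_secret_access_key=<your-secret>'
    delimiter '\t'
    gzip;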