afriedgan: If interested in the Redshift technology, join our new discussion group: www.linkedin.com/groups/Redshift-Professionals-4884099. Alex Friedgan, group moderator (3 months ago)
Edward Capriolo: This is the same tired story with all the column databases that say they are better than Hive. 'Our queries take 1 second and ARE SUPER FAST' (it only takes us 17 hours to insert the data into a columnar format). That is the problem: data usually comes in a row-oriented format. (3 months ago)
dasudayan: Thanks a lot for your help... I am almost there :) I was able to download the 'ad_campaigns', 'advertisers' and 'click_logs' zipped data files successfully, but I don't see the 'publishers' and 'imp_logs' files. (3 months ago)
Hapyrus at Hapyrus: @dasudayan Your error message (SignatureDoesNotMatch) is displayed if you don't use a correct AWS access key ID and secret key pair. The bucket is open to everyone, so we think you can access it without AWS keys using your own S3 tools. For example, you can download http://hapyrus-examples.s3.amazonaws.com/redshift-benchmark/ad-network-examples/case-01/ad_campaigns/ad_campaigns.gz in your browser. Could you try it again without keys? We checked again that the list/view permission of the bucket (hapyrus-examples) and the objects under the path is open to everyone. (3 months ago)
dasudayan: Thanks. I still get HTTP Status Code: 403, AWS Error Code: SignatureDoesNotMatch, e.g. when I try to access the bucket 'hapyrus-examples' or list all buckets (the AWS doc says '..because its owned by some other account'). Are your S3 objects set to be publicly accessible (using SDKs, APIs, etc.)? Thanks again for your help. (3 months ago)
Hapyrus at Hapyrus: @dasudayan Thank you for trying it. Actually, the path you tried is not the correct one. Please try the S3 objects under http://hapyrus-examples.s3.amazonaws.com/redshift-benchmark/ad-network-examples/case-01/. We'll update our documents on GitHub to clarify. (3 months ago)
dasudayan: I am trying to download the dataset from S3 using the AWS Java SDK and getting access denied (I do have an S3 account): https://s3.amazonaws.com/hapyrus-examples/redshift-benchmark/1month-multi-table/ad_campaigns/ad_campaigns.gz Any idea? (3 months ago)
Hapyrus at Hapyrus: Thank you for your interest. We have tried to make sure there is enough text and data in our slides and have proofread them many times, but we are only human and mistakes are made. If you can, could you please show us our mistakes so that others will not suffer from our grammar mistakes as you have? As for the actual data, we have prepared the necessary data, queries, and scripts for you to reproduce this benchmark on AWS S3 at our GitHub repository (https://github.com/hapyrus/redshift-benchmark), as described on the slide. The benchmark is easily duplicated, so any user can make sure our results are real. (3 months ago)
Sean Scott: I realize this might be quaint and judgmental, but I place very little credibility in a presentation full of poor grammar. If there was no apparent effort made to proofread the publication, I can't help but wonder how careful the methodology was in the underlying tests. (3 months ago)
Amazon Redshift is 10x faster and cheaper than Hadoop + Hive
Presentation Transcript
Hapyrus: Amazon Redshift BENCHMARK Series 01
Amazon Redshift is 10x faster and cheaper than Hadoop + Hive
COMPARISONS OF SPEED AND COST EFFICIENCY
Amazon Redshift took 155 seconds to run our queries for 1.2TB data
Hadoop + Hive took 1491 seconds to run our queries for 1.2TB data
Amazon Redshift was 10X faster
Amazon Redshift cost $20 to run a query every 30 minutes
Hadoop + Hive cost $210 to run a query every 30 minutes
Amazon Redshift was 10X more cost effective
Amazon Redshift is a new data warehouse for big data on the cloud. Before Redshift, users had to turn to Hadoop for querying over TBs of data. We have run benchmarks to compare Redshift to Hadoop (Amazon Elastic MapReduce), both on AWS environments, specifically to show differences for advertisement agencies.
› Between 100GB and ~50TB
› Frequent queries (more than once an hour)
› Short turnaround time required
Prerequisite - Data
We use 5 tables to run a query which joins the tables and creates a report.

Table        Size (1: for 1 month)    Size (2: for 4 months)
imp_log      300GB / 300M records     1.2TB / 1.2B records
click_log    1.4GB / 1.5M records     5.6GB / 6M records
ad_campaign  100MB / 100k records
publisher    10MB / 10k records
advertiser   10MB / 10k records

imp_log columns:
  date            datetime
  publisher_id    integer
  ad_campaign_id  integer
  country         varchar(30)
  attr1-4         varchar(255)

click_log columns:
  date            datetime
  publisher_id    integer
  ad_campaign_id  integer
  bid_price       real
  country         varchar(30)
  attr1-4         varchar(255)

All data is stored as TSV files, gzip compressed.
1. Query Speed
Here, we are comparing Hadoop and Redshift servers of the same cost (Hadoop: c1.xlarge vs Redshift: dw.hs1.xlarge).
• Redshift takes 155 seconds to complete our query for 1.2TB
• Hadoop takes 1491 seconds to complete our query for 1.2TB
• Redshift is about 10 times faster than Hadoop for this query
[Chart: Query Speed. Processing Time (seconds) by Data Size. 300GB: Redshift 38sec, Hadoop 672sec. 1.2TB: Redshift 155sec, Hadoop 1491sec.]
* The query used can be referenced in our Appendix
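As a quick check of the "about 10 times faster" claim, the ratios can be recomputed from the four timings reported on this slide. This is a minimal sketch in Python; the variable and function names are ours, and only the numbers come from the slide.

```python
# Processing times in seconds, as reported on the Query Speed slide.
timings = {
    "300GB": {"redshift": 38, "hadoop": 672},
    "1.2TB": {"redshift": 155, "hadoop": 1491},
}


def speedup(hadoop_seconds, redshift_seconds):
    """Return how many times faster Redshift was than Hadoop."""
    return hadoop_seconds / redshift_seconds


speedups = {
    size: speedup(t["hadoop"], t["redshift"]) for size, t in timings.items()
}

for size, ratio in speedups.items():
    print(f"{size}: Redshift is about {ratio:.1f}x faster")
```

On these figures the ratio is roughly 9.6 for the 1.2TB dataset and about 17.7 for the 300GB dataset.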
2. Total Cost
Here, we are comparing Hadoop and Redshift servers running the same query for the same duration of time.
• Redshift costs $20 per day to run queries every 30 minutes
• Hadoop costs $210 per day to run queries every 30 minutes
• Redshift is about 10 times cheaper than Hadoop to run this job
[Chart: Cost Per Day (query for 300GB data size). Cost Per Day (US$) by Query Per Day (0 to 250), Redshift vs Hadoop.]
* The query used can be referenced in our Appendix
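The arithmetic behind this slide can be sketched in a few lines of Python. The snippet only derives how many runs "every 30 minutes" implies and the ratio between the two quoted dollar figures; it does not reproduce the pricing model behind them, which the slide does not spell out. The names are ours.

```python
MINUTES_PER_DAY = 24 * 60
QUERY_INTERVAL_MINUTES = 30  # "queries every 30 minutes"

# Number of query runs in one day at that interval.
queries_per_day = MINUTES_PER_DAY // QUERY_INTERVAL_MINUTES

redshift_cost = 20.0   # US$, as quoted on the slide
hadoop_cost = 210.0    # US$, as quoted on the slide

# How many times more the Hadoop setup costs for the same schedule.
cost_ratio = hadoop_cost / redshift_cost

print(f"{queries_per_day} queries per day")
print(f"Hadoop costs about {cost_ratio:.1f}x as much as Redshift")
```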
Redshift Query Result

Data Size  Instance Type  Number of Instances  Trial  Processing Time  Average  Server Cost Per Day
300GB      dw.hs1.xlarge  1                    1      58s              38s      $20.40
                                               2      43s
                                               3      31s
                                               4      30s
                                               5      30s
1.2TB      dw.hs1.xlarge  1                    1      164s             155s     $20.40
                                               2      149s
                                               3      158s
                                               4      156s
                                               5      150s

* The query used can be referenced in our Appendix
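The averages in this table can be reproduced from the five trial times for each data size. A minimal Python sketch, using only the trial values listed in the table (the dictionary layout is ours):

```python
from statistics import mean

# Trial processing times in seconds, as listed in the Redshift table.
trials = {
    "300GB": [58, 43, 31, 30, 30],
    "1.2TB": [164, 149, 158, 156, 150],
}

# Arithmetic mean of the five trials for each data size.
averages = {size: mean(times) for size, times in trials.items()}

for size, avg in averages.items():
    print(f"{size}: average {avg:.1f}s over {len(trials[size])} trials")
```

Rounded to whole seconds, these match the 38s and 155s averages reported on the slide.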
Hadoop Query Result

Data Size  Instance Type  Instance Number  Processing Time  Server Cost Per Day
300GB      c1.xlarge      1                1h 23m 2s        $0.80
300GB      c1.medium      10               37m 48s          $0.89
300GB      c1.xlarge      10               11m 12s          $1.06
1.2TB      m1.xlarge      1                6h 43m 24s       $3.22
1.2TB      c1.medium      4                5h 14m 0s        $3.04
1.2TB      c1.xlarge      10               37m 7s           $3.58
1.2TB      c1.xlarge      20               24m 51s          $4.64

* The query used can be referenced in our Appendix
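The Hadoop times in this table are written as hours, minutes and seconds, while the Query Speed slide quotes plain seconds. A small Python helper (hypothetical, not part of the benchmark repository) makes the two comparable:

```python
import re

# Seconds represented by each unit suffix used in the table.
UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def to_seconds(duration):
    """Convert a duration such as '1h 23m 2s' or '37m 48s' to seconds."""
    parts = re.findall(r"(\d+)([hms])", duration)
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in parts)


for label in ["11m 12s", "24m 51s", "1h 23m 2s"]:
    print(f"{label} = {to_seconds(label)} seconds")
```

Converted this way, the fastest Hadoop runs (11m 12s for 300GB and 24m 51s for 1.2TB) correspond to the 672 and 1491 seconds quoted on the Query Speed slide.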
Discussion
› Consider Redshift
  » If your data is big (>TB) and you need to run your queries more than once an hour
  » If you want to get quick results
› Consider Hadoop (EMR)
  » If your data is too big (>PB)
  » If your job queries are once a day, week or month
  » If you have already invested in Hadoop technology specialists
APPENDIX – Sample Query
The query generates a basic report for ad campaign performance: imp and click numbers, advertiser spending, CTR, CPC and CPM.

    select
      ac.ad_campaign_id as ad_campaign_id,
      adv.advertiser_id as advertiser_id,
      cs.spending as spending,
      ims.imp_total as imp_total,
      cs.click_total as click_total,
      click_total/imp_total as CTR,
      spending/click_total as CPC,
      spending/(imp_total/1000) as CPM
    from
      ad_campaigns ac
    join
      advertisers adv
      on (ac.advertiser_id = adv.advertiser_id)
    join
      (select
         il.ad_campaign_id,
         count(*) as imp_total
       from
         imp_logs il
       group by
         il.ad_campaign_id
      ) ims on (ims.ad_campaign_id = ac.ad_campaign_id)
    join
      (select
         cl.ad_campaign_id,
         sum(cl.bid_price) as spending,
         count(*) as click_total
       from
         click_logs cl
       group by
         cl.ad_campaign_id
      ) cs on (cs.ad_campaign_id = ac.ad_campaign_id);
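To see what the report looks like, the join structure of the sample query can be tried locally on a tiny dataset. The sketch below uses Python's built-in sqlite3 module rather than Redshift or Hive, and the rows are invented for illustration. It also departs from the slide's query in two labeled ways: the ratio columns are qualified with their subquery aliases, and the counts are multiplied by 1.0 (or divided by 1000.0) so that the ratios stay fractional on such a small dataset.

```python
import sqlite3

# In-memory database with made-up rows (not the benchmark data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table advertisers (advertiser_id integer);
    create table ad_campaigns (ad_campaign_id integer, advertiser_id integer);
    create table imp_logs (ad_campaign_id integer);
    create table click_logs (ad_campaign_id integer, bid_price real);

    insert into advertisers values (10);
    insert into ad_campaigns values (1, 10);
    insert into imp_logs values (1), (1), (1), (1);
    insert into click_logs values (1, 0.5), (1, 1.5);
""")

# Same joins as the slide's query; the 1.0 and 1000.0 factors are
# this sketch's addition to avoid integer division.
REPORT_QUERY = """
select
    ac.ad_campaign_id as ad_campaign_id,
    adv.advertiser_id as advertiser_id,
    cs.spending as spending,
    ims.imp_total as imp_total,
    cs.click_total as click_total,
    1.0 * cs.click_total / ims.imp_total as CTR,
    cs.spending / cs.click_total as CPC,
    cs.spending / (ims.imp_total / 1000.0) as CPM
from ad_campaigns ac
join advertisers adv
  on (ac.advertiser_id = adv.advertiser_id)
join (select il.ad_campaign_id as ad_campaign_id,
             count(*) as imp_total
      from imp_logs il
      group by il.ad_campaign_id) ims
  on (ims.ad_campaign_id = ac.ad_campaign_id)
join (select cl.ad_campaign_id as ad_campaign_id,
             sum(cl.bid_price) as spending,
             count(*) as click_total
      from click_logs cl
      group by cl.ad_campaign_id) cs
  on (cs.ad_campaign_id = ac.ad_campaign_id)
"""

rows = conn.execute(REPORT_QUERY).fetchall()
for row in rows:
    print(row)
```

With four impressions and two clicks priced at 0.5 and 1.5, the single report row has a spending of 2.0, a CTR of 0.5 and a CPC of 1.0.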
APPENDIX - Additional Comments
› Redshift is good for aggregate calculations such as sum, average, max, min, etc. because it is a columnar database
› Importing large amounts of data takes a lot of time
  » 17 hours for 1.2TB in our case
  » Continuous importing is useful
› Redshift supports only “Separated” formats like CSV, TSV
  » JSON is not supported
› Redshift supports only primitive data types
  » 11 types: INT, DOUBLE, BOOLEAN, VARCHAR, DATE.. (as of Feb. 17, 2013)
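The reported import time implies a rough sustained load rate, which can be estimated as follows. This Python sketch assumes decimal units (1 TB = 10^12 bytes) and treats 1.2TB as the amount of data loaded; the slide does not say whether that size is measured before or after gzip compression, so the result is only an order-of-magnitude figure.

```python
TB = 10**12  # assumption: decimal terabytes
MB = 10**6
GB = 10**9

data_bytes = 1.2 * TB        # dataset size from the slide
import_seconds = 17 * 3600   # 17 hours from the slide

# Average sustained load rate over the whole import.
throughput_mb_per_s = data_bytes / import_seconds / MB
throughput_gb_per_hour = data_bytes / 17 / GB

print(f"about {throughput_mb_per_s:.1f} MB/s")
print(f"about {throughput_gb_per_hour:.1f} GB/hour")
```

Under those assumptions the import averaged just under 20 MB per second, or roughly 70 GB per hour.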
APPENDIX – Additional Information
› All resources for our benchmark are in our GitHub repository
  » https://github.com/hapyrus/redshift-benchmark
  » The dataset we use is open on S3, so you can reproduce the benchmark
About Us - Hapyrus
› FlyData for Amazon Redshift
  » Enables Redshift users to start on their own data from Day 1
  » Near-realtime data transfer to Redshift
  » Auto scaling, data integrity, and designed for high durability
› We also provide a Redshift introduction consulting service
http://hapyrus.com/ or info@hapyrus.com