Amazon Redshift is 10x faster and cheaper than Hadoop + Hive
1. FlyData: Amazon Redshift BENCHMARK Series 01
Amazon Redshift is 10x faster and cheaper than Hadoop + Hive
Comparisons of speed and cost efficiency
www.flydata.com
2. Amazon Redshift took 155 seconds to run our query on 1.2TB of data
Hadoop + Hive took 1491 seconds to run our query on 1.2TB of data
Amazon Redshift was about 10X faster
Amazon Redshift cost $20 to run a query every 30 minutes
Hadoop + Hive cost $210 to run a query every 30 minutes
Amazon Redshift was about 10X more cost effective
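The "10X" figures can be checked directly from the numbers quoted on this slide. A minimal sketch of that arithmetic (the variable names are ours, not part of the benchmark):

```python
# Figures quoted on the summary slide.
redshift_seconds = 155
hadoop_seconds = 1491
redshift_cost_usd = 20
hadoop_cost_usd = 210

# How many times faster Redshift completed the query.
speedup = hadoop_seconds / redshift_seconds
# How many times cheaper Redshift was for the same schedule.
cost_ratio = hadoop_cost_usd / redshift_cost_usd

print(f"speedup: {speedup:.1f}x, cost ratio: {cost_ratio:.1f}x")
```

Both ratios land close to 10, which is what the headline rounds to.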
3. Amazon Redshift is a new data warehouse for big data in the cloud. Before Redshift, users had to turn to Hadoop to query terabytes of data.
We ran benchmarks comparing Redshift to Hadoop (Amazon Elastic MapReduce), both on AWS, for a workload typical of advertising agencies:
• Data size between 100GB and ~50TB
• Frequent queries (more than once an hour)
• Short turnaround time required
4. Prerequisite - Data
TSV files, gzip compressed
Dataset 1) covers 1 month; dataset 2) covers 4 months.
• imp_log
– 1) 300GB / 300M records
– 2) 1.2TB / 1.2B records
– Columns: date datetime, publisher_id integer, ad_campaign_id integer, bid_price real, country varchar(30), attr1-4 varchar(255)
• click_log
– 1) 1.4GB / 1.5M records
– 2) 5.6GB / 6M records
– Columns: date datetime, publisher_id integer, ad_campaign_id integer, country varchar(30), attr1-4 varchar(255)
• ad_campaign
– 100MB / 100k records
• publisher
– 10MB / 10k records
• advertiser
– 10MB / 10k records
We use 5 tables to run a query that joins tables and creates a report.
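To make the file format concrete, here is a minimal sketch of writing and reading a gzip-compressed TSV row with the imp_log columns listed above. The helper names and the sample row are invented for illustration, and we assume the files carry no header line.

```python
import csv
import gzip
import io

# Column order taken from the imp_log description; attr1-4 expanded to four columns.
IMP_LOG_COLUMNS = ["date", "publisher_id", "ad_campaign_id", "bid_price",
                   "country", "attr1", "attr2", "attr3", "attr4"]

def write_tsv_gz(rows):
    """Serialize rows to gzip-compressed TSV bytes (assumed: no header line)."""
    buf = io.BytesIO()
    with gzip.open(buf, "wt", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()

def read_tsv_gz(data):
    """Parse gzip-compressed TSV bytes into dicts keyed by column name."""
    with gzip.open(io.BytesIO(data), "rt", newline="", encoding="utf-8") as f:
        return [dict(zip(IMP_LOG_COLUMNS, row))
                for row in csv.reader(f, delimiter="\t")]

# Hypothetical sample record, not taken from the benchmark dataset.
sample = [["2013-02-01 00:00:00", 12, 345, 0.25, "US", "a", "b", "c", "d"]]
compressed = write_tsv_gz(sample)
records = read_tsv_gz(compressed)
print(records[0]["ad_campaign_id"], records[0]["country"])
```

Values come back as strings; type conversion is left to the loader.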
5. 1. Query Speed
• Redshift takes 155 seconds to complete our query for 1.2TB
• Hadoop takes 1491 seconds to complete our query for 1.2TB
• Redshift is about 10 times faster than Hadoop for this query
Here, we are comparing Hadoop and Redshift servers of the same cost (Hadoop: c1.xlarge vs Redshift: dw.hs1.xlarge).
[Chart: query time by data size. 300GB: Hadoop 672sec, Redshift 38sec. 1.2TB: Hadoop 1491sec, Redshift 155sec.]
* The query used is listed in the Appendix
6. 2. Total Cost
• Redshift costs $20 per day to run queries every 30 minutes
• Hadoop costs $210 per day to run queries every 30 minutes
• Redshift is about 10 times cheaper than Hadoop to run this job
Here, we are comparing Hadoop and Redshift servers running the same query for the same duration of time.
* The query used is listed in the Appendix
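As a rough illustration of what "a query every 30 minutes" means in cost terms, the sketch below starts from the $20.40 server cost per day shown in the Redshift result table and derives the number of runs per day, the implied hourly rate, and the cost per run. It assumes a single node kept running all day; the variable names are ours.

```python
# Server cost per day for one dw.hs1.xlarge node, from the Redshift result table.
redshift_cost_per_day = 20.40

# One query every 30 minutes.
runs_per_day = 24 * 60 // 30

# Implied hourly rate and cost attributable to each run (assumes the node runs all day).
hourly_rate = redshift_cost_per_day / 24
cost_per_run = redshift_cost_per_day / runs_per_day

print(runs_per_day, round(hourly_rate, 2), round(cost_per_run, 3))
```

Because the node is paid for whether or not it is busy, running queries more often lowers the cost per run.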
7. Redshift Query Result
Data Size  Instance Type  Number of Instances  Trial  Processing Time  Average  Server Cost Per Day
300GB      dw.hs1.xlarge  1                    1      58s              38s      $20.40
                                               2      43s
                                               3      31s
                                               4      30s
                                               5      30s
1.2TB      dw.hs1.xlarge  1                    1      164s             155s     $20.40
                                               2      149s
                                               3      158s
                                               4      156s
                                               5      150s
* The query used is listed in the Appendix
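The Average column follows from the five trial times. A short sketch that recomputes it (the dictionary layout is ours):

```python
from statistics import mean

# Processing times in seconds for the five trials of each data size.
trials = {
    "300GB": [58, 43, 31, 30, 30],
    "1.2TB": [164, 149, 158, 156, 150],
}

# Arithmetic mean per data size; the slide reports these rounded to whole seconds.
averages = {size: mean(times) for size, times in trials.items()}

for size, avg in averages.items():
    print(f"{size}: {avg:.1f}s (reported as {round(avg)}s)")
```

For 300GB the first trial is noticeably slower than the rest, so the mean sits above the typical run time of about 30 seconds.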
8. Hadoop Query Result
Data Size  Instance Type  Instance Number  Processing Time  Server Cost Per Day
300GB      c1.xlarge      1                1h 23m 2s        $0.80
300GB      c1.medium      10               37m 48s          $0.89
300GB      c1.xlarge      10               11m 12s          $1.06
1.2TB      m1.xlarge      1                6h 43m 24s       $3.22
1.2TB      c1.medium      4                5h 14m 0s        $3.04
1.2TB      c1.xlarge      10               37m 7s           $3.58
1.2TB      c1.xlarge      20               24m 51s          $4.64
* The query used is listed in the Appendix
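The Hadoop times are given as hours, minutes, and seconds, while the speed comparison uses seconds. The sketch below converts between the two and shows how the 1.2TB run scales from 10 to 20 c1.xlarge instances. The helper `to_seconds` is a hypothetical name of ours.

```python
import re

def to_seconds(duration):
    """Convert a string such as '1h 23m 2s' to a number of seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)([hms])", duration))

# c1.xlarge rows for 1.2TB from the table above, keyed by instance count.
c1_xlarge_1_2tb = {10: "37m 7s", 20: "24m 51s"}
seconds = {count: to_seconds(t) for count, t in c1_xlarge_1_2tb.items()}

# Speedup gained by doubling the cluster from 10 to 20 instances.
scaling = seconds[10] / seconds[20]

print(seconds, round(scaling, 2))
```

Doubling the instance count yields roughly a 1.5x speedup here, not 2x, so adding nodes raises the cost of each run.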
9. Discussion
• Consider Redshift
– If your data is big (>TB) and you need to run your queries more than once an hour
– If you want to get quick results
• Consider Hadoop (EMR)
– If your data is extremely big (>PB)
– If your queries run once a day, week, or month
– If you have already invested in Hadoop technology specialists
10. APPENDIX – Sample Query
select
    ac.ad_campaign_id as ad_campaign_id,
    adv.advertiser_id as advertiser_id,
    cs.spending as spending,
    ims.imp_total as imp_total,
    cs.click_total as click_total,
    click_total/imp_total as CTR,
    spending/click_total as CPC,
    spending/(imp_total/1000) as CPM
from
    ad_campaigns ac
join
    advertisers adv
    on (ac.advertiser_id = adv.advertiser_id)
join
    (select
        il.ad_campaign_id,
        count(*) as imp_total
    from
        imp_logs il
    group by
        il.ad_campaign_id
    ) ims on (ims.ad_campaign_id = ac.ad_campaign_id)
join
    (select
        cl.ad_campaign_id,
        sum(cl.bid_price) as spending,
        count(*) as click_total
    from
        click_logs cl
    group by
        cl.ad_campaign_id
    ) cs on (cs.ad_campaign_id = ac.ad_campaign_id);
The query generates a basic ad campaign performance report: impression and click counts, advertiser spending, CTR, CPC, and CPM.
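To see what the report looks like, the sketch below runs the same join structure on a few toy rows using Python's built-in sqlite3 module. This is a stand-in, not Redshift or Hive, and several things are assumptions of ours: the rows are invented, only the columns the query touches are created, click_logs is given a bid_price column because the query sums it, and the ratios are computed with floating-point factors so the toy counts do not collapse under integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table advertisers (advertiser_id integer);
    create table ad_campaigns (ad_campaign_id integer, advertiser_id integer);
    create table imp_logs (ad_campaign_id integer);
    create table click_logs (ad_campaign_id integer, bid_price real);
    -- Toy data: one campaign with 4 impressions and 2 clicks.
    insert into advertisers values (1);
    insert into ad_campaigns values (10, 1);
    insert into imp_logs values (10), (10), (10), (10);
    insert into click_logs values (10, 0.5), (10, 1.5);
""")

rows = conn.execute("""
    select
        ac.ad_campaign_id,
        adv.advertiser_id,
        cs.spending,
        ims.imp_total,
        cs.click_total,
        cs.click_total * 1.0 / ims.imp_total as ctr,
        cs.spending / cs.click_total as cpc,
        cs.spending / (ims.imp_total / 1000.0) as cpm
    from ad_campaigns ac
    join advertisers adv on (ac.advertiser_id = adv.advertiser_id)
    join (select il.ad_campaign_id, count(*) as imp_total
          from imp_logs il group by il.ad_campaign_id
         ) ims on (ims.ad_campaign_id = ac.ad_campaign_id)
    join (select cl.ad_campaign_id, sum(cl.bid_price) as spending,
                 count(*) as click_total
          from click_logs cl group by cl.ad_campaign_id
         ) cs on (cs.ad_campaign_id = ac.ad_campaign_id)
""").fetchall()

print(rows)
```

Each output row is one campaign: two pre-aggregated subqueries (impressions and clicks) joined to the small dimension tables.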
11. APPENDIX - Additional Comments
• Redshift is good at aggregate calculations such as sum, average, max, and min because it is a columnar database
• Importing large amounts of data takes a lot of time
– 17 hours for 1.2TB in our case
– Continuous importing is useful
• Redshift supports only delimited formats like CSV and TSV
– JSON is not supported
• Redshift supports only primitive data types
– 11 types: INT, DOUBLE, BOOLEAN, VARCHAR, DATE, etc.
(as of Feb. 17, 2013)
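Since only delimited formats could be loaded at the time of the benchmark, JSON records had to be flattened first. A minimal sketch of that conversion, assuming flat JSON objects; the function name, column list, and sample record are hypothetical:

```python
import json

# Hypothetical column order for the target table.
COLUMNS = ["date", "publisher_id", "ad_campaign_id", "country"]

def json_to_tsv_line(json_text, columns=COLUMNS):
    """Flatten one JSON object into a tab-separated line in a fixed column order."""
    record = json.loads(json_text)
    fields = []
    for col in columns:
        value = record.get(col, "")
        text = "" if value is None else str(value)
        # Tabs and newlines inside a value would break the delimited format.
        fields.append(text.replace("\t", " ").replace("\n", " "))
    return "\t".join(fields)

line = json_to_tsv_line(
    '{"date": "2013-02-01 00:00:00", "publisher_id": 7, '
    '"ad_campaign_id": 42, "country": "JP"}'
)
print(line)
```

Missing keys become empty fields, so every line has the same number of columns.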
12. APPENDIX – Additional Information
• All resources for our benchmark are in our GitHub repository
– https://github.com/hapyrus/redshift-benchmark
– The dataset we use is open on S3, so you can reproduce the benchmark
13. About Us - FlyData
• FlyData Enterprise
– Enables continuous loading to Amazon Redshift, with real-time data loading
– Automated ETL process with multiple supported data formats
– Auto scaling, data integrity, and high durability
– FlyData Sync feature allows real-time replication from RDBMS to Amazon Redshift
Contact us at: info@flydata.com
We are an official data integration partner of Amazon Redshift
Formerly known as Hapyrus
14. Check us out!
-> http://flydata.com
sales@flydata.com
Toll Free: 1-855-427-9787