Submit Search
Upload
COMPARING BIG DATA ANALYTICS: REDSHIFT VS HIVE
•
0 likes
•
282 views
AI-enhanced title
Prashant Ratnaparkhi
Follow
Redshift & Hive Comparison
Read less
Read more
Data & Analytics
Report
Share
Report
Share
1 of 21
Recommended
20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL
Kohei KaiGai
20181016_pgconfeu_ssd2gpu_multi
20181016_pgconfeu_ssd2gpu_multi
Kohei KaiGai
Ordered Record Collection
Ordered Record Collection
Hadoop User Group
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
Kohei KaiGai
Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce API
Tom Croucher
ちょっとHadoopについて語ってみるか(仮題)
ちょっとHadoopについて語ってみるか(仮題)
moai kids
Intro To PostGIS
Intro To PostGIS
mleslie
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
Michael Stack
Recommended
20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL
Kohei KaiGai
20181016_pgconfeu_ssd2gpu_multi
20181016_pgconfeu_ssd2gpu_multi
Kohei KaiGai
Ordered Record Collection
Ordered Record Collection
Hadoop User Group
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
Kohei KaiGai
Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce API
Tom Croucher
ちょっとHadoopについて語ってみるか(仮題)
ちょっとHadoopについて語ってみるか(仮題)
moai kids
Intro To PostGIS
Intro To PostGIS
mleslie
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
Michael Stack
scalable machine learning
scalable machine learning
Samir Bessalah
WMS Performance Shootout 2009
WMS Performance Shootout 2009
Jeff McKenna
Wms Performance Tests Map Server Vs Geo Server
Wms Performance Tests Map Server Vs Geo Server
DonnyV
Supporting HDF5 in GrADS
Supporting HDF5 in GrADS
The HDF-EOS Tools and Information Center
Mapserver vs. geoserver
Mapserver vs. geoserver
鸣 饶
WMS Performance Shootout 2010
WMS Performance Shootout 2010
Jeff McKenna
20210928_pgunconf_hll_count
20210928_pgunconf_hll_count
Kohei KaiGai
Guide for visualizing JMA's GSM outputs using GrADS
Guide for visualizing JMA's GSM outputs using GrADS
JMA_447
hadoop
hadoop
longhao
Aws Quick Dirty Hadoop Mapreduce Ec2 S3
Aws Quick Dirty Hadoop Mapreduce Ec2 S3
Skills Matter
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Shuai Yuan
Cloud flare jgc bigo meetup rolling hashes
Cloud flare jgc bigo meetup rolling hashes
Cloudflare
Jma hr gsm_data_gr_ads_20130529
Jma hr gsm_data_gr_ads_20130529
JMA_447
A Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAIN
EDB
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Altinity Ltd
CGPOP Analysis and Optimization
CGPOP Analysis and Optimization
Hongtao Cai
Obtaining the Perfect Smoke By Monitoring Your BBQ with InfluxDB and Telegraf
Obtaining the Perfect Smoke By Monitoring Your BBQ with InfluxDB and Telegraf
InfluxData
WMS Performance Shootout 2011
WMS Performance Shootout 2011
Jeff McKenna
MapServer #ProTips 2015
MapServer #ProTips 2015
Jeff McKenna
Day 6 - PostGIS
Day 6 - PostGIS
Barry Jones
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
Masayuki Matsushita
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmark
Dongwon Kim
More Related Content
What's hot
scalable machine learning
scalable machine learning
Samir Bessalah
WMS Performance Shootout 2009
WMS Performance Shootout 2009
Jeff McKenna
Wms Performance Tests Map Server Vs Geo Server
Wms Performance Tests Map Server Vs Geo Server
DonnyV
Supporting HDF5 in GrADS
Supporting HDF5 in GrADS
The HDF-EOS Tools and Information Center
Mapserver vs. geoserver
Mapserver vs. geoserver
鸣 饶
WMS Performance Shootout 2010
WMS Performance Shootout 2010
Jeff McKenna
20210928_pgunconf_hll_count
20210928_pgunconf_hll_count
Kohei KaiGai
Guide for visualizing JMA's GSM outputs using GrADS
Guide for visualizing JMA's GSM outputs using GrADS
JMA_447
hadoop
hadoop
longhao
Aws Quick Dirty Hadoop Mapreduce Ec2 S3
Aws Quick Dirty Hadoop Mapreduce Ec2 S3
Skills Matter
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Shuai Yuan
Cloud flare jgc bigo meetup rolling hashes
Cloud flare jgc bigo meetup rolling hashes
Cloudflare
Jma hr gsm_data_gr_ads_20130529
Jma hr gsm_data_gr_ads_20130529
JMA_447
A Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAIN
EDB
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Altinity Ltd
CGPOP Analysis and Optimization
CGPOP Analysis and Optimization
Hongtao Cai
Obtaining the Perfect Smoke By Monitoring Your BBQ with InfluxDB and Telegraf
Obtaining the Perfect Smoke By Monitoring Your BBQ with InfluxDB and Telegraf
InfluxData
WMS Performance Shootout 2011
WMS Performance Shootout 2011
Jeff McKenna
MapServer #ProTips 2015
MapServer #ProTips 2015
Jeff McKenna
Day 6 - PostGIS
Day 6 - PostGIS
Barry Jones
What's hot
(20)
scalable machine learning
scalable machine learning
WMS Performance Shootout 2009
WMS Performance Shootout 2009
Wms Performance Tests Map Server Vs Geo Server
Wms Performance Tests Map Server Vs Geo Server
Supporting HDF5 in GrADS
Supporting HDF5 in GrADS
Mapserver vs. geoserver
Mapserver vs. geoserver
WMS Performance Shootout 2010
WMS Performance Shootout 2010
20210928_pgunconf_hll_count
20210928_pgunconf_hll_count
Guide for visualizing JMA's GSM outputs using GrADS
Guide for visualizing JMA's GSM outputs using GrADS
hadoop
hadoop
Aws Quick Dirty Hadoop Mapreduce Ec2 S3
Aws Quick Dirty Hadoop Mapreduce Ec2 S3
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Cloud flare jgc bigo meetup rolling hashes
Cloud flare jgc bigo meetup rolling hashes
Jma hr gsm_data_gr_ads_20130529
Jma hr gsm_data_gr_ads_20130529
A Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAIN
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
CGPOP Analysis and Optimization
CGPOP Analysis and Optimization
Obtaining the Perfect Smoke By Monitoring Your BBQ with InfluxDB and Telegraf
Obtaining the Perfect Smoke By Monitoring Your BBQ with InfluxDB and Telegraf
WMS Performance Shootout 2011
WMS Performance Shootout 2011
MapServer #ProTips 2015
MapServer #ProTips 2015
Day 6 - PostGIS
Day 6 - PostGIS
Similar to COMPARING BIG DATA ANALYTICS: REDSHIFT VS HIVE
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
Masayuki Matsushita
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmark
Dongwon Kim
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
t_ivanov
20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN
Kohei KaiGai
Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache Spark
IndicThreads
A TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with Presto
Yu Liu
G-Store: High-Performance Graph Store for Trillion-Edge Processing
G-Store: High-Performance Graph Store for Trillion-Edge Processing
Pradeep Kumar
Golang in TiDB (GopherChina 2017)
Golang in TiDB (GopherChina 2017)
PingCAP
PgconfSV compression
PgconfSV compression
Anastasia Lubennikova
하이퍼커넥트 데이터 팀이 데이터 증가에 대처해온 기록
하이퍼커넥트 데이터 팀이 데이터 증가에 대처해온 기록
Jaehyeuk Oh
Oracle Basics and Architecture
Oracle Basics and Architecture
Sidney Chen
Am I reading GC logs Correctly?
Am I reading GC logs Correctly?
Tier1 App
Hypertable - massively scalable nosql database
Hypertable - massively scalable nosql database
bigdatagurus_meetup
Large scale data-parsing with Hadoop in Bioinformatics
Large scale data-parsing with Hadoop in Bioinformatics
Ntino Krampis
SequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational Database
wangzhonnew
RxJava applied [JavaDay Kyiv 2016]
RxJava applied [JavaDay Kyiv 2016]
Igor Lozynskyi
Apache Cassandra at Macys
Apache Cassandra at Macys
DataStax Academy
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
Command Prompt., Inc
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
Mark Wong
Explain this!
Explain this!
Fabio Telles Rodriguez
Similar to COMPARING BIG DATA ANALYTICS: REDSHIFT VS HIVE
(20)
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmark
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN
Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache Spark
A TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with Presto
G-Store: High-Performance Graph Store for Trillion-Edge Processing
G-Store: High-Performance Graph Store for Trillion-Edge Processing
Golang in TiDB (GopherChina 2017)
Golang in TiDB (GopherChina 2017)
PgconfSV compression
PgconfSV compression
하이퍼커넥트 데이터 팀이 데이터 증가에 대처해온 기록
하이퍼커넥트 데이터 팀이 데이터 증가에 대처해온 기록
Oracle Basics and Architecture
Oracle Basics and Architecture
Am I reading GC logs Correctly?
Am I reading GC logs Correctly?
Hypertable - massively scalable nosql database
Hypertable - massively scalable nosql database
Large scale data-parsing with Hadoop in Bioinformatics
Large scale data-parsing with Hadoop in Bioinformatics
SequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational Database
RxJava applied [JavaDay Kyiv 2016]
RxJava applied [JavaDay Kyiv 2016]
Apache Cassandra at Macys
Apache Cassandra at Macys
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
Explain this!
Explain this!
Recently uploaded
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
ranjana rawat
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
shivangimorya083
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Universitat Politècnica de Catalunya
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
jennyeacort
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
thyngster
How we prevented account sharing with MFA
How we prevented account sharing with MFA
Andrei Kaleshka
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
FurkanTasci3
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Sapana Sha
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
Lars Albertsson
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
Boston Institute of Analytics
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
Human37
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
Sapana Sha
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
Emmanuel Dauda
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
Florian Roscheck
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
Pramod Kumar Srivastava
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
F La
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
ThinkInnovation
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
ccctableauusergroup
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
Rafezzaman
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
sapnasaifi408
Recently uploaded
(20)
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
How we prevented account sharing with MFA
How we prevented account sharing with MFA
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
COMPARING BIG DATA ANALYTICS: REDSHIFT VS HIVE
1.
BDAS:&Comparison&of&Big&Data&Analytics& Systems:&Redshift&&&Hive Prashant(Ratnaparkhi,(Insight(Data(Engineering( Fellowship,(New(York
2.
Hive%setup:%2%AWS%clusters%with: Redshift%setup:%One%Amazon%Redshift%cluster%with r3.Large((2vCPU,(6.5(ECU,(15(GiB memory)(6(instances 1(Name(&(5(Data(nodes 50(GB r3.Large((2vCPU,(6.5(ECU,(15(GiB memory)(6(instances 1(Name(&(5(Data(nodes 500(GB dc1.Large((2vCPU,(7(ECU,(15(GiB memory)(6(instances 1(Leader(&(5(Compute(nodes 50(GB 500(GB
3.
Schema&and&table&size&summary • Schema(based(on(TPC(Big(data(Benchmark((TPCRBB) • 23(tables(representing(a(data(warehouse(of(a(large(retailer( Table
# Records-in-millions [Total-dataset-50-GB] #-Records-in-millions [Total-dataset-500-GB] customer 0.70 2.2 item 0.12 0.40 web_sales 50.1 595.6 store_sales 50.1 595.6 web_clickstreams 512.9 6035.5 …
4.
Two&Sample&Queries # Query Operations 1
CustomerRpurchase(behavior: List(customers,(who(view(online,( followed(by(an(inRstore(purchase • JOIN(web_clickstreams,(item)(! Web • JOIN(store,items)R>(Store • JOIN(Web,(Store) • ORDER(BY • DISTINCT 2 Market(basket analysis:(List(the(items( sold(frequently(together,(in(certain( stores,(categories • JOIN((store_sales,store_sales)(R>(Pairs • JOIN((Pairs,(item) • COUNT • GROUP(BY,(HAVING • ORDER BY
5.
Summary&of&Results&of&selected&queries Execution&time&in&seconds&on&log&scale,&for&outGofGbox&setup: 0 1 2 3 Query(1 Query(2 50(GB(Dataset Hive Redshift 0 2 4 Query(1
Query(2 500(GB(Dataset Hive Redshift 0 2 4 Query(1 Query(2 500(GB(Dataset Hive Redshift
6.
Query&Processing&in&a&nutshell Client( Query Execution(Engine MapRReduce( jobs( NameNode DataNodes Meta( store Execution(Engine Compiled(C++ Compute%Nodes Leader%Node Use%of%MPP Hive Redshift
7.
Redshift&Performance&Improvement • ‘Explain(‘sqlRstatement’(– gives(query(plan •
Look(for(DS_BCAST_*,(DS_DIST_*(steps,(to(identify(data(movement • Use(distkey while(table(creation,(to(indicate(data(coRlocation([Only(one(distkey per( table(allowed] Table-store_sales Table-Item Tableweb_clickstreams ss_item_sk(distkey) i_item_sk(distkey,(sortkey) wcs_item_sk(distkey) ss_customer_sk i_item_id wcs_sales_sk ss_ticket_number(sortkey) i_brand wcs_click_date_sk(sortkey) …. …. …
8.
Redshift&distkey illustration Execution(Engine Optimizer Compute%Nodes:%Data%distribution%KEY {store_sales.ss_item_sk=item.i_item_sk}% {web_clickstreams.wss_item_sk=item.i_item_sk}% CoOlocated%together%during%load At%run%time: No%data%movement%on%join%of%store_sales,%item No%data%movement%on%join%of%web_clickstreams,%item Leader%Node With(item_sk as(distkey Execution(Engine Optimizer: code(for(data( movement Compute%Nodes:%Data%distribution%EVEN%(default),% round%robin%during%load At%run%time: Data%movement%on%join%of%store_sales,%item Data%movement%on%join%of%web_clickstreams,%item Leader%Node With(no(distkey Query
9.
Hive&Performance&Improvement:&sparkGsql Client( Query NameNode DataNodes Hive Meta( storeSpark%core sparkOsql
10.
Summary&of&Results&of&selected&queries Execution&time&in&seconds&on&log&scale,&for&outGofGbox&&setup&&&after&improvements: 0 1 2 3 Query(1 Query(2 50(GB(Dataset Hive SparkRSQL
Redshift RedshiftRD 0 1 2 3 4 Query(1 Query(2 500(GB(Dataset Hive SparkRSQL Redshift RedshiftRD
11.
Cost&of&this&benchmark&execution # Item Duration-in- hours Price
node-per- hour-in-USD Cost-in- USD 1 Redshift(cluster:(6(dc1.large(nodes(1(leader,(5( compute) 408 $0.250 $102.00 2 Hive(Cluster:(12(m3.large(nodes((2(master,6( data)( 384 $0.133 $51.07 3 Hive(Cluster:(12(r3.large(nodes((2(master,6( data)( 96 $0.166 $15.94 Total(cost(of(Hiveclusters $67.01 Total(cost of(Redshift(cluster $102.00 GrandTotal $169.01
12.
Redshift&&&Hive&Use&cases If… Then use
Examples Requirements: • ease(of(use • high performance(outRofRbox • scalability(in(TB(range(good(priceRperformance • low(maintenance(overhead Redshift • Companies(with(SaaS(offeringsin( Analytics(space( • Most(startups Requirements:( • replacement(of(traditional(DWplatforms • replacement(of(existing(infrastructure(with(cloud • good(priceRperformance(in(TB(range Redshift • Large(enterprises(–Finra,(NTT( Docomo Requirements:( • massive(amount(of(data(–in(PB(range • customization/flexibility • no(vendor(lockRin(( • inRhouse(data(due(to(regulatory(reasons Hive • Facebook • Sears( • companies(with(existing(big data( infrastructure,(staff(
13.
Prashant&Ratnaparkhi:&Technologist&with&experience&in&software& product&development • Masters((Systems(&(Control(Engineering),(IIT,(Mumbai • Data(Scientist(Certification(from(Johns(Hopkins/Coursera •
Enjoy(running,(skiing,(swimming,(biking,(sudoku &(learning
14.
Appendix:&Supporting&slides
15.
Summary&of&Results&of&selected&queries Task&execution&time,&for&outGofGtheGbox&setup,&is&listed&in&the&table&below& Task Hive-(seconds) RedShift-(seconds)
Dataset-(GB) Query1 301 3.00 50(GB Query1 2215 19.18 500(GB Query2 346 4.15 50(GB Query2 1639 22.52 500(GB
16.
Results&of&selected&queries&after&tuning Task&execution&time&is&listed&in&the&table&below& Task Hive-(seconds) RedShift- (seconds) Dataset-(GB)
SparkOsql (seconds) Redshift-(with-distkey,- sortkey)-seconds Query1 301 3.00 50(GB 203((down(33%) 3.00((no change)( Query1 2215 19.18 500(GB 1901((down(14%) 14.86(((down 22%) Query2 346 4.15 50(GB 79((down(77%) 2.25((down(46%) Query2 1639 22.52 500(GB 729((down(56%) 13.95((down(38%)
17.
Summary&of&Results Task(execution(time(in(seconds,( for(outRofRtheRbox(setup,(is(listed(in(the(table(below((except(for(‘Load(Data’)( Task Hive
RedShift Dataset LoadData* 34(Min:(44(Sec 41 Min:(35(Sec ** 50(GB LoadData* 3 Hr:(33(Min:(14(Sec 8Hr:25(Min(** 500(GB Query1 691 4.69 50(GB Query1 4701 4.71 500(GB Query2 301 3.00 50(GB Query2 2215 19.18 500(GB Query3 659 9.01 50(GB Query3 4285 14.52 500(GB Query4 612 3.66 50(GB Query4 4563 8.12 500(GB Query5 346 4.15 50(GB Query5 1639 22.52 500(GB *Load(Data(means(populating(metastore for(Hive(&(copying(data(from(S3(in(case(of(RedShift.( **(With(three(tables((store_sales,(item,(web_clickstream using(distkey &(sortkey,(the(‘Load( Data’(time(increased(to(approx.(50(Min(&(11(Hours(respectively(for(Redshift.(
18.
Query&2 • Q2(RR Find(all(customers(who(viewed(items(of(a(given(category(on(the(webRR
in(a(given(month( and(year(that(was(followed( by(an(inRstore( purchase(of(an(item( from(the(same(category(in(the(threeRR consecutive( months. • SELECT(DISTINCT(wcs_user_sk RR Find(all(customers • FROM(((RR web_clicks viewed(items(in(date(range(with(items(from(specified(categories( ( • SELECT((((wcs_user_sk,((((wcs_click_date_sk • FROM(web_clickstreams,(item(( • WHERE(wcs_click_date_sk BETWEEN(37134(AND((37134(+(30)(RR in(a(given(month(and(year(( • AND(i_category IN(('Books',('Electronics')( RR filter(given(category(( • AND(wcs_item_sk =(i_item_sk AND(wcs_user_sk IS(NOT(NULL( (AND(wcs_sales_sk IS(NULL(RRonly(views,(not(purchases • )((webInRange,(((RR store(sales(in(date(range(with(items(from(specified( categories(( • SELECT((((ss_customer_sk,((((ss_sold_date_sk • FROM(store_sales,(item(( • WHERE(ss_sold_date_sk BETWEEN(37134(AND((37134(+(90)(RR in(the(three(consecutive(months.(( • AND(i_category IN(('Books',('Electronics')( RR filter(given(category(( • AND(ss_item_sk =(i_item_sk AND(ss_customer_sk IS(NOT(NULL • )(storeInRange RR join(web(and(store • WHERE(wcs_user_sk =(ss_customer_sk AND(wcs_click_date_sk <(ss_sold_date_sk RR buy(AFTER(viewed(on(website • ORDER(BY(wcs_user_sk;
19.
Query&5 • RR Q5(Items(sold(together(
(frequently( i.e.(more(that(50(times),(in(certain(store,RR with(specified( categories(&(limit(to(100. • select(pairs.itm1,( pairs.itm2,(count(*)( as(cnt • from(( • ( • select(soldTogether.itm1,( soldTogether.itm2( ( • from(((( • ( • select(ss1.ss_item_sk(as(itm1,(ss2.ss_item_sk(as(itm2((((( • from(store_sales ss1,(store_sales ss2((((( • where((ss1.ss_ticket_number(=(ss2.ss_ticket_number( and(ss1.ss_item_sk(<(ss2.ss_item_sk( • AND(ss1.ss_store_sk(IN((10,(20,(33,(40,(50)) • )(as(soldTogether,( item(i • where(soldTogether.itm1( =(i.i_item_sk AND(i.i_category_id IN((1,(2,(3))(as(pairs( • group(by(pairs.itm1,( pairs.itm2 • having(cnt >(50( • order( by(cnt desc,(pairs.itm1,( pairs.itm2 • limit(100;
20.
Description&of&Schema&&&Queries • Schema(&(Queries(are(based(on(TPC(Bigdata Benchmark((TPCRBB) •
Schema(consists(of(23(tables(representing(a(data(warehouse(of(a(large( retailer(with(multiple(stores(and(large(online(sales(presence,(with(both( structured(&(semiRunstructured(data(such(as(web_clickstreams • Find(all(customers(who(viewed(items(of(a(given(category(on(the(webRR in( a(given(month(and(year(that(was(followed(by(an(inRstore(purchase(of(an( item(from(the(same(category(in(the(three(RR consecutive(months((Q2).( • List(items(sold(together((frequently(i.e.(more(that(50(times),(in(certain( stores((with(specified(categories)(&(limit(to(100((Q5)
21.
Redshift&&&Hive&Features&comparison • Coming(up