Submit Search
Upload
COMPARING BIG DATA ANALYTICS: REDSHIFT VS HIVE
•
0 likes
•
282 views
AI-enhanced title
Prashant Ratnaparkhi
Follow
Redshift & Hive Comparison
Read less
Read more
Data & Analytics
Report
Share
Report
Share
1 of 21
Recommended
20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL
Kohei KaiGai
20181016_pgconfeu_ssd2gpu_multi
20181016_pgconfeu_ssd2gpu_multi
Kohei KaiGai
Ordered Record Collection
Ordered Record Collection
Hadoop User Group
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
Kohei KaiGai
Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce API
Tom Croucher
ちょっとHadoopについて語ってみるか(仮題)
ちょっとHadoopについて語ってみるか(仮題)
moai kids
Intro To PostGIS
Intro To PostGIS
mleslie
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
Michael Stack
Recommended
20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL
Kohei KaiGai
20181016_pgconfeu_ssd2gpu_multi
20181016_pgconfeu_ssd2gpu_multi
Kohei KaiGai
Ordered Record Collection
Ordered Record Collection
Hadoop User Group
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
Kohei KaiGai
Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce API
Tom Croucher
ちょっとHadoopについて語ってみるか(仮題)
ちょっとHadoopについて語ってみるか(仮題)
moai kids
Intro To PostGIS
Intro To PostGIS
mleslie
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and...
Michael Stack
scalable machine learning
scalable machine learning
Samir Bessalah
WMS Performance Shootout 2009
WMS Performance Shootout 2009
Jeff McKenna
Wms Performance Tests Map Server Vs Geo Server
Wms Performance Tests Map Server Vs Geo Server
DonnyV
Supporting HDF5 in GrADS
Supporting HDF5 in GrADS
The HDF-EOS Tools and Information Center
Mapserver vs. geoserver
Mapserver vs. geoserver
鸣 饶
WMS Performance Shootout 2010
WMS Performance Shootout 2010
Jeff McKenna
20210928_pgunconf_hll_count
20210928_pgunconf_hll_count
Kohei KaiGai
Guide for visualizing JMA's GSM outputs using GrADS
Guide for visualizing JMA's GSM outputs using GrADS
JMA_447
hadoop
hadoop
longhao
Aws Quick Dirty Hadoop Mapreduce Ec2 S3
Aws Quick Dirty Hadoop Mapreduce Ec2 S3
Skills Matter
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Shuai Yuan
Cloud flare jgc bigo meetup rolling hashes
Cloud flare jgc bigo meetup rolling hashes
Cloudflare
Jma hr gsm_data_gr_ads_20130529
Jma hr gsm_data_gr_ads_20130529
JMA_447
A Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAIN
EDB
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Altinity Ltd
CGPOP Analysis and Optimization
CGPOP Analysis and Optimization
Hongtao Cai
Obtaining the Perfect Smoke By Monitoring Your BBQ with InfluxDB and Telegraf
Obtaining the Perfect Smoke By Monitoring Your BBQ with InfluxDB and Telegraf
InfluxData
WMS Performance Shootout 2011
WMS Performance Shootout 2011
Jeff McKenna
MapServer #ProTips 2015
MapServer #ProTips 2015
Jeff McKenna
Day 6 - PostGIS
Day 6 - PostGIS
Barry Jones
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
Masayuki Matsushita
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmark
Dongwon Kim
More Related Content
What's hot
scalable machine learning
scalable machine learning
Samir Bessalah
WMS Performance Shootout 2009
WMS Performance Shootout 2009
Jeff McKenna
Wms Performance Tests Map Server Vs Geo Server
Wms Performance Tests Map Server Vs Geo Server
DonnyV
Supporting HDF5 in GrADS
Supporting HDF5 in GrADS
The HDF-EOS Tools and Information Center
Mapserver vs. geoserver
Mapserver vs. geoserver
鸣 饶
WMS Performance Shootout 2010
WMS Performance Shootout 2010
Jeff McKenna
20210928_pgunconf_hll_count
20210928_pgunconf_hll_count
Kohei KaiGai
Guide for visualizing JMA's GSM outputs using GrADS
Guide for visualizing JMA's GSM outputs using GrADS
JMA_447
hadoop
hadoop
longhao
Aws Quick Dirty Hadoop Mapreduce Ec2 S3
Aws Quick Dirty Hadoop Mapreduce Ec2 S3
Skills Matter
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Shuai Yuan
Cloud flare jgc bigo meetup rolling hashes
Cloud flare jgc bigo meetup rolling hashes
Cloudflare
Jma hr gsm_data_gr_ads_20130529
Jma hr gsm_data_gr_ads_20130529
JMA_447
A Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAIN
EDB
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Altinity Ltd
CGPOP Analysis and Optimization
CGPOP Analysis and Optimization
Hongtao Cai
Obtaining the Perfect Smoke By Monitoring Your BBQ with InfluxDB and Telegraf
Obtaining the Perfect Smoke By Monitoring Your BBQ with InfluxDB and Telegraf
InfluxData
WMS Performance Shootout 2011
WMS Performance Shootout 2011
Jeff McKenna
MapServer #ProTips 2015
MapServer #ProTips 2015
Jeff McKenna
Day 6 - PostGIS
Day 6 - PostGIS
Barry Jones
What's hot
(20)
scalable machine learning
scalable machine learning
WMS Performance Shootout 2009
WMS Performance Shootout 2009
Wms Performance Tests Map Server Vs Geo Server
Wms Performance Tests Map Server Vs Geo Server
Supporting HDF5 in GrADS
Supporting HDF5 in GrADS
Mapserver vs. geoserver
Mapserver vs. geoserver
WMS Performance Shootout 2010
WMS Performance Shootout 2010
20210928_pgunconf_hll_count
20210928_pgunconf_hll_count
Guide for visualizing JMA's GSM outputs using GrADS
Guide for visualizing JMA's GSM outputs using GrADS
hadoop
hadoop
Aws Quick Dirty Hadoop Mapreduce Ec2 S3
Aws Quick Dirty Hadoop Mapreduce Ec2 S3
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Cloud flare jgc bigo meetup rolling hashes
Cloud flare jgc bigo meetup rolling hashes
Jma hr gsm_data_gr_ads_20130529
Jma hr gsm_data_gr_ads_20130529
A Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAIN
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
CGPOP Analysis and Optimization
CGPOP Analysis and Optimization
Obtaining the Perfect Smoke By Monitoring Your BBQ with InfluxDB and Telegraf
Obtaining the Perfect Smoke By Monitoring Your BBQ with InfluxDB and Telegraf
WMS Performance Shootout 2011
WMS Performance Shootout 2011
MapServer #ProTips 2015
MapServer #ProTips 2015
Day 6 - PostGIS
Day 6 - PostGIS
Similar to COMPARING BIG DATA ANALYTICS: REDSHIFT VS HIVE
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
Masayuki Matsushita
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmark
Dongwon Kim
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
t_ivanov
20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN
Kohei KaiGai
Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache Spark
IndicThreads
A TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with Presto
Yu Liu
G-Store: High-Performance Graph Store for Trillion-Edge Processing
G-Store: High-Performance Graph Store for Trillion-Edge Processing
Pradeep Kumar
Golang in TiDB (GopherChina 2017)
Golang in TiDB (GopherChina 2017)
PingCAP
PgconfSV compression
PgconfSV compression
Anastasia Lubennikova
하이퍼커넥트 데이터 팀이 데이터 증가에 대처해온 기록
하이퍼커넥트 데이터 팀이 데이터 증가에 대처해온 기록
Jaehyeuk Oh
Oracle Basics and Architecture
Oracle Basics and Architecture
Sidney Chen
Am I reading GC logs Correctly?
Am I reading GC logs Correctly?
Tier1 App
Hypertable - massively scalable nosql database
Hypertable - massively scalable nosql database
bigdatagurus_meetup
Large scale data-parsing with Hadoop in Bioinformatics
Large scale data-parsing with Hadoop in Bioinformatics
Ntino Krampis
SequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational Database
wangzhonnew
RxJava applied [JavaDay Kyiv 2016]
RxJava applied [JavaDay Kyiv 2016]
Igor Lozynskyi
Apache Cassandra at Macys
Apache Cassandra at Macys
DataStax Academy
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
Command Prompt., Inc
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
Mark Wong
Explain this!
Explain this!
Fabio Telles Rodriguez
Similar to COMPARING BIG DATA ANALYTICS: REDSHIFT VS HIVE
(20)
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmark
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN
Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache Spark
A TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with Presto
G-Store: High-Performance Graph Store for Trillion-Edge Processing
G-Store: High-Performance Graph Store for Trillion-Edge Processing
Golang in TiDB (GopherChina 2017)
Golang in TiDB (GopherChina 2017)
PgconfSV compression
PgconfSV compression
하이퍼커넥트 데이터 팀이 데이터 증가에 대처해온 기록
하이퍼커넥트 데이터 팀이 데이터 증가에 대처해온 기록
Oracle Basics and Architecture
Oracle Basics and Architecture
Am I reading GC logs Correctly?
Am I reading GC logs Correctly?
Hypertable - massively scalable nosql database
Hypertable - massively scalable nosql database
Large scale data-parsing with Hadoop in Bioinformatics
Large scale data-parsing with Hadoop in Bioinformatics
SequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational Database
RxJava applied [JavaDay Kyiv 2016]
RxJava applied [JavaDay Kyiv 2016]
Apache Cassandra at Macys
Apache Cassandra at Macys
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
Explain this!
Explain this!
Recently uploaded
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
Neil Barnes
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
hf8803863
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
Suhani Kapoor
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
Sonatrach
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
soniya singh
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
atducpo
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
Lars Albertsson
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
Sapana Sha
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
thyngster
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
9953056974 Low Rate Call Girls In Saket, Delhi NCR
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
Emmanuel Dauda
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
shivangimorya083
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
Suhani Kapoor
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Social Samosa
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Jack DiGiovanna
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
Stephen266013
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
Invezz1
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
shivangimorya083
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
Suhani Kapoor
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
YohFuh
Recently uploaded
(20)
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
COMPARING BIG DATA ANALYTICS: REDSHIFT VS HIVE
1.
BDAS:&Comparison&of&Big&Data&Analytics& Systems:&Redshift&&&Hive Prashant(Ratnaparkhi,(Insight(Data(Engineering( Fellowship,(New(York
2.
Hive%setup:%2%AWS%clusters%with: Redshift%setup:%One%Amazon%Redshift%cluster%with r3.Large((2vCPU,(6.5(ECU,(15(GiB memory)(6(instances 1(Name(&(5(Data(nodes 50(GB r3.Large((2vCPU,(6.5(ECU,(15(GiB memory)(6(instances 1(Name(&(5(Data(nodes 500(GB dc1.Large((2vCPU,(7(ECU,(15(GiB memory)(6(instances 1(Leader(&(5(Compute(nodes 50(GB 500(GB
3.
Schema&and&table&size&summary • Schema(based(on(TPC(Big(data(Benchmark((TPCRBB) • 23(tables(representing(a(data(warehouse(of(a(large(retailer( Table
# Records-in-millions [Total-dataset-50-GB] #-Records-in-millions [Total-dataset-500-GB] customer 0.70 2.2 item 0.12 0.40 web_sales 50.1 595.6 store_sales 50.1 595.6 web_clickstreams 512.9 6035.5 …
4.
Two&Sample&Queries # Query Operations 1
CustomerRpurchase(behavior: List(customers,(who(view(online,( followed(by(an(inRstore(purchase • JOIN(web_clickstreams,(item)(! Web • JOIN(store,items)R>(Store • JOIN(Web,(Store) • ORDER(BY • DISTINCT 2 Market(basket analysis:(List(the(items( sold(frequently(together,(in(certain( stores,(categories • JOIN((store_sales,store_sales)(R>(Pairs • JOIN((Pairs,(item) • COUNT • GROUP(BY,(HAVING • ORDER BY
5.
Summary&of&Results&of&selected&queries Execution&time&in&seconds&on&log&scale,&for&outGofGbox&setup: 0 1 2 3 Query(1 Query(2 50(GB(Dataset Hive Redshift 0 2 4 Query(1
Query(2 500(GB(Dataset Hive Redshift 0 2 4 Query(1 Query(2 500(GB(Dataset Hive Redshift
6.
Query&Processing&in&a&nutshell Client( Query Execution(Engine MapRReduce( jobs( NameNode DataNodes Meta( store Execution(Engine Compiled(C++ Compute%Nodes Leader%Node Use%of%MPP Hive Redshift
7.
Redshift&Performance&Improvement • ‘Explain(‘sqlRstatement’(– gives(query(plan •
Look(for(DS_BCAST_*,(DS_DIST_*(steps,(to(identify(data(movement • Use(distkey while(table(creation,(to(indicate(data(coRlocation([Only(one(distkey per( table(allowed] Table-store_sales Table-Item Tableweb_clickstreams ss_item_sk(distkey) i_item_sk(distkey,(sortkey) wcs_item_sk(distkey) ss_customer_sk i_item_id wcs_sales_sk ss_ticket_number(sortkey) i_brand wcs_click_date_sk(sortkey) …. …. …
8.
Redshift&distkey illustration Execution(Engine Optimizer Compute%Nodes:%Data%distribution%KEY {store_sales.ss_item_sk=item.i_item_sk}% {web_clickstreams.wss_item_sk=item.i_item_sk}% CoOlocated%together%during%load At%run%time: No%data%movement%on%join%of%store_sales,%item No%data%movement%on%join%of%web_clickstreams,%item Leader%Node With(item_sk as(distkey Execution(Engine Optimizer: code(for(data( movement Compute%Nodes:%Data%distribution%EVEN%(default),% round%robin%during%load At%run%time: Data%movement%on%join%of%store_sales,%item Data%movement%on%join%of%web_clickstreams,%item Leader%Node With(no(distkey Query
9.
Hive&Performance&Improvement:&sparkGsql Client( Query NameNode DataNodes Hive Meta( storeSpark%core sparkOsql
10.
Summary&of&Results&of&selected&queries Execution&time&in&seconds&on&log&scale,&for&outGofGbox&&setup&&&after&improvements: 0 1 2 3 Query(1 Query(2 50(GB(Dataset Hive SparkRSQL
Redshift RedshiftRD 0 1 2 3 4 Query(1 Query(2 500(GB(Dataset Hive SparkRSQL Redshift RedshiftRD
11.
Cost&of&this&benchmark&execution # Item Duration-in- hours Price
node-per- hour-in-USD Cost-in- USD 1 Redshift(cluster:(6(dc1.large(nodes(1(leader,(5( compute) 408 $0.250 $102.00 2 Hive(Cluster:(12(m3.large(nodes((2(master,6( data)( 384 $0.133 $51.07 3 Hive(Cluster:(12(r3.large(nodes((2(master,6( data)( 96 $0.166 $15.94 Total(cost(of(Hiveclusters $67.01 Total(cost of(Redshift(cluster $102.00 GrandTotal $169.01
12.
Redshift&&&Hive&Use&cases If… Then use
Examples Requirements: • ease(of(use • high performance(outRofRbox • scalability(in(TB(range(good(priceRperformance • low(maintenance(overhead Redshift • Companies(with(SaaS(offeringsin( Analytics(space( • Most(startups Requirements:( • replacement(of(traditional(DWplatforms • replacement(of(existing(infrastructure(with(cloud • good(priceRperformance(in(TB(range Redshift • Large(enterprises(–Finra,(NTT( Docomo Requirements:( • massive(amount(of(data(–in(PB(range • customization/flexibility • no(vendor(lockRin(( • inRhouse(data(due(to(regulatory(reasons Hive • Facebook • Sears( • companies(with(existing(big data( infrastructure,(staff(
13.
Prashant&Ratnaparkhi:&Technologist&with&experience&in&software& product&development • Masters((Systems(&(Control(Engineering),(IIT,(Mumbai • Data(Scientist(Certification(from(Johns(Hopkins/Coursera •
Enjoy(running,(skiing,(swimming,(biking,(sudoku &(learning
14.
Appendix:&Supporting&slides
15.
Summary&of&Results&of&selected&queries Task&execution&time,&for&outGofGtheGbox&setup,&is&listed&in&the&table&below& Task Hive-(seconds) RedShift-(seconds)
Dataset-(GB) Query1 301 3.00 50(GB Query1 2215 19.18 500(GB Query2 346 4.15 50(GB Query2 1639 22.52 500(GB
16.
Results&of&selected&queries&after&tuning Task&execution&time&is&listed&in&the&table&below& Task Hive-(seconds) RedShift- (seconds) Dataset-(GB)
SparkOsql (seconds) Redshift-(with-distkey,- sortkey)-seconds Query1 301 3.00 50(GB 203((down(33%) 3.00((no change)( Query1 2215 19.18 500(GB 1901((down(14%) 14.86(((down 22%) Query2 346 4.15 50(GB 79((down(77%) 2.25((down(46%) Query2 1639 22.52 500(GB 729((down(56%) 13.95((down(38%)
17.
Summary&of&Results Task(execution(time(in(seconds,( for(outRofRtheRbox(setup,(is(listed(in(the(table(below((except(for(‘Load(Data’)( Task Hive
RedShift Dataset LoadData* 34(Min:(44(Sec 41 Min:(35(Sec ** 50(GB LoadData* 3 Hr:(33(Min:(14(Sec 8Hr:25(Min(** 500(GB Query1 691 4.69 50(GB Query1 4701 4.71 500(GB Query2 301 3.00 50(GB Query2 2215 19.18 500(GB Query3 659 9.01 50(GB Query3 4285 14.52 500(GB Query4 612 3.66 50(GB Query4 4563 8.12 500(GB Query5 346 4.15 50(GB Query5 1639 22.52 500(GB *Load(Data(means(populating(metastore for(Hive(&(copying(data(from(S3(in(case(of(RedShift.( **(With(three(tables((store_sales,(item,(web_clickstream using(distkey &(sortkey,(the(‘Load( Data’(time(increased(to(approx.(50(Min(&(11(Hours(respectively(for(Redshift.(
18.
Query&2 • Q2(RR Find(all(customers(who(viewed(items(of(a(given(category(on(the(webRR
in(a(given(month( and(year(that(was(followed( by(an(inRstore( purchase(of(an(item( from(the(same(category(in(the(threeRR consecutive( months. • SELECT(DISTINCT(wcs_user_sk RR Find(all(customers • FROM(((RR web_clicks viewed(items(in(date(range(with(items(from(specified(categories( ( • SELECT((((wcs_user_sk,((((wcs_click_date_sk • FROM(web_clickstreams,(item(( • WHERE(wcs_click_date_sk BETWEEN(37134(AND((37134(+(30)(RR in(a(given(month(and(year(( • AND(i_category IN(('Books',('Electronics')( RR filter(given(category(( • AND(wcs_item_sk =(i_item_sk AND(wcs_user_sk IS(NOT(NULL( (AND(wcs_sales_sk IS(NULL(RRonly(views,(not(purchases • )((webInRange,(((RR store(sales(in(date(range(with(items(from(specified( categories(( • SELECT((((ss_customer_sk,((((ss_sold_date_sk • FROM(store_sales,(item(( • WHERE(ss_sold_date_sk BETWEEN(37134(AND((37134(+(90)(RR in(the(three(consecutive(months.(( • AND(i_category IN(('Books',('Electronics')( RR filter(given(category(( • AND(ss_item_sk =(i_item_sk AND(ss_customer_sk IS(NOT(NULL • )(storeInRange RR join(web(and(store • WHERE(wcs_user_sk =(ss_customer_sk AND(wcs_click_date_sk <(ss_sold_date_sk RR buy(AFTER(viewed(on(website • ORDER(BY(wcs_user_sk;
19.
Query&5 • RR Q5(Items(sold(together(
(frequently( i.e.(more(that(50(times),(in(certain(store,RR with(specified( categories(&(limit(to(100. • select(pairs.itm1,( pairs.itm2,(count(*)( as(cnt • from(( • ( • select(soldTogether.itm1,( soldTogether.itm2( ( • from(((( • ( • select(ss1.ss_item_sk(as(itm1,(ss2.ss_item_sk(as(itm2((((( • from(store_sales ss1,(store_sales ss2((((( • where((ss1.ss_ticket_number(=(ss2.ss_ticket_number( and(ss1.ss_item_sk(<(ss2.ss_item_sk( • AND(ss1.ss_store_sk(IN((10,(20,(33,(40,(50)) • )(as(soldTogether,( item(i • where(soldTogether.itm1( =(i.i_item_sk AND(i.i_category_id IN((1,(2,(3))(as(pairs( • group(by(pairs.itm1,( pairs.itm2 • having(cnt >(50( • order( by(cnt desc,(pairs.itm1,( pairs.itm2 • limit(100;
20.
Description&of&Schema&&&Queries • Schema(&(Queries(are(based(on(TPC(Bigdata Benchmark((TPCRBB) •
Schema(consists(of(23(tables(representing(a(data(warehouse(of(a(large( retailer(with(multiple(stores(and(large(online(sales(presence,(with(both( structured(&(semiRunstructured(data(such(as(web_clickstreams • Find(all(customers(who(viewed(items(of(a(given(category(on(the(webRR in( a(given(month(and(year(that(was(followed(by(an(inRstore(purchase(of(an( item(from(the(same(category(in(the(three(RR consecutive(months((Q2).( • List(items(sold(together((frequently(i.e.(more(that(50(times),(in(certain( stores((with(specified(categories)(&(limit(to(100((Q5)
21.
Redshift&&&Hive&Features&comparison • Coming(up