Hive Correlation Optimizer

Presented at Hadoop Summit 2013 Hive User Group Meetup

Transcript

  • 1. Hive Correlation Optimizer. Yin Huai (yhuai@hortonworks.com, huai@cse.ohio-state.edu). Hadoop Summit 2013 Hive User Group Meetup. © Hortonworks Inc. 2011
  • 2. About me
       • Hive contributor
       • Summer intern at Hortonworks
       • 4th year Ph.D. student at The Ohio State University
       • Research interests: query optimizations, file formats, distributed systems, and storage systems
       Architecting the Future of Big Data
  • 3. Outline
       • Query planning in Hive
       • Correlations in a query (intra-query correlations)
       • Case studies
       • Automatically exploiting correlations (HIVE-2206: Correlation Optimizer)
  • 4. Query planning
       SELECT t1.c2, count(*)
       FROM t1 JOIN t2 ON (t1.c1=t2.c1)
       GROUP BY t1.c2
       [Diagram: t1 and t2 feed JOIN (t1.c1=t2.c1), which feeds AGG (calculate count(*) for every group of t1.c2)]
  • 5. Query planning
       SELECT t1.c2, count(*)
       FROM t1 JOIN t2 ON (t1.c1=t2.c1)
       GROUP BY t1.c2
       Evaluate this query in distributed systems.
       How to shuffle? Use the key column(s).
       [Diagram: t1 and t2 are shuffled on c1 for JOIN; the JOIN output is shuffled on c2 for AGG]
  • 6. Generating MapReduce jobs
       1 MR job can shuffle data once.
       [Diagram: Job 1 shuffles t1 and t2 on c1, runs JOIN, and writes tmp; Job 2 shuffles tmp on c2 and runs AGG]
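The rule on slide 6 (one MapReduce job can shuffle data once) can be illustrated with a small sketch. This is not Hive code: `Op` and `naive_jobs` are hypothetical names introduced here, and the plan is reduced to operators annotated with the key they need their input shuffled on. Under those assumptions, a correlation-unaware planner simply opens one job per shuffling operator.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Op:
    """Hypothetical plan node for illustration; not a Hive class."""
    name: str
    shuffle_key: Optional[str] = None  # column(s) the input must be shuffled on
    inputs: List["Op"] = field(default_factory=list)


def naive_jobs(root: Op) -> List[str]:
    """Assign one MapReduce job to every operator that needs a shuffle."""
    jobs: List[str] = []

    def visit(op: Op) -> None:
        for child in op.inputs:  # plan inputs first, so upstream jobs come first
            visit(child)
        if op.shuffle_key is not None:
            jobs.append(f"Job {len(jobs) + 1}: shuffle on {op.shuffle_key}, then {op.name}")

    visit(root)
    return jobs


# The plan from the slide: JOIN on c1, then AGG grouped by c2.
join = Op("JOIN", "c1", [Op("t1"), Op("t2")])
agg = Op("AGG", "c2", [join])
for job in naive_jobs(agg):
    print(job)
```

Because the two operators shuffle on different keys (c1, then c2), two jobs are genuinely needed here; the later slides look at plans where the keys coincide.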
  • 7. Generating MapReduce jobs
       MapReduce will shuffle data for us; we just need to emit outputs from the Map phase.
       We use ReduceSinkOperator (RS) to emit Map outputs. RSs are the end of a Map phase.
       [Diagram: Job 1 Map: t1 and t2 each end in an RS (RS1, RS2); Job 1 Reduce: JOIN, written to tmp; Job 2 Map: tmp ends in an RS (labeled RS2 on the slide); Job 2 Reduce: AGG]
  • 8. Outline
       • Query planning in Hive
       • Correlations in a query (intra-query correlations)
       • Case studies
       • Automatically exploiting correlations (HIVE-2206: Correlation Optimizer)
  • 9. Intra-query correlations
       SELECT x.c1, count(*)
       FROM t1 x JOIN t1 y ON (x.c1=y.c1)
       GROUP BY x.c1
       [Diagram: t1 as x and t1 as y feed JOIN (x.c1=y.c1), which feeds AGG (calculate count(*) for every group of x.c1)]
       Correlations:
       1. Same input tables
       2. JOIN and AGG using the same key
  • 10. Intra-query correlations
       SELECT p.c1, q.c2, q.cnt
       FROM (SELECT x.c1 AS c1 FROM t1 x JOIN t2 y ON (x.c1=y.c1)) p
       JOIN (SELECT z.c1 AS c1, count(*) AS cnt FROM t1 z GROUP BY z.c1) q
       ON (p.c1=q.c1)
       [Diagram: t1 as x and t2 as y feed JOIN1 (x.c1=y.c1); t1 as z feeds AGG1 (calculate count(*) for every group of z.c1); JOIN1 and AGG1 feed JOIN2 (p.c1=q.c1)]
       Correlations:
       1. Same input tables (t1)
       2. JOIN1 and AGG1 using the same key
       3. JOIN2 and all of its parents using the same key
  • 11. Intra-query correlations
       • Defined in "YSmart: Yet Another SQL-to-MapReduce Translator"
         – http://ysmart.cse.ohio-state.edu/
         – http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
       • Targets operators that need to shuffle data, and their inputs
       • Three kinds of correlations
         – Input correlation (IC): independent operators share the same input tables
         – Transit correlation (TC): independent operators have input correlation and also shuffle the data in the same way (e.g. using the same keys)
         – Job flow correlation (JFC): two dependent operators shuffle the data in the same way
       [Diagrams: IC: JOIN1 (t1 as x, t2 as y) and AGG1 (t1 as z) both read t1. TC: JOIN1 (x.c1=y.c1) and AGG1 (group by z.c1) both read t1 and use the same key. JFC: JOIN (x.c1=y.c1) feeds AGG (group by z.c1)]
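The three definitions on slide 11 can be written down as predicates. The following is only a sketch of the definitions, not YSmart's or Hive's implementation: `ShuffleOp` and the three function names are hypothetical, and shuffle keys are assumed to be already normalized to base-table column names (so x.c1, y.c1, and z.c1 all appear as "c1"). Whether two operators are independent or dependent is left to the caller.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class ShuffleOp:
    """Hypothetical description of an operator that needs a shuffle."""
    name: str
    tables: FrozenSet[str]  # input tables the operator reads directly
    keys: Tuple[str, ...]   # shuffle key column(s), normalized (assumption)


def input_correlation(a: ShuffleOp, b: ShuffleOp) -> bool:
    """IC: two independent operators share at least one input table."""
    return bool(a.tables & b.tables)


def transit_correlation(a: ShuffleOp, b: ShuffleOp) -> bool:
    """TC: input correlation plus the same shuffle keys."""
    return input_correlation(a, b) and a.keys == b.keys


def job_flow_correlation(parent: ShuffleOp, child: ShuffleOp) -> bool:
    """JFC: a dependent pair (child consumes parent's output) with the same keys."""
    return parent.keys == child.keys


# Operators from the query on slide 10.
join1 = ShuffleOp("JOIN1", frozenset({"t1", "t2"}), ("c1",))
agg1 = ShuffleOp("AGG1", frozenset({"t1"}), ("c1",))
join2 = ShuffleOp("JOIN2", frozenset(), ("c1",))  # reads JOIN1 and AGG1 outputs

print(input_correlation(join1, agg1))     # JOIN1 and AGG1 both read t1
print(transit_correlation(join1, agg1))   # ... and both shuffle on c1
print(job_flow_correlation(join1, join2))
print(job_flow_correlation(agg1, join2))
```

All four checks hold for this query, which matches the three correlations listed on slide 10.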
  • 12. Correlation-unaware query planning
       Hive does not care:
       1. If a table has been used multiple times
       2. If data really needs to be shuffled
       [Diagram: t1 and t1 are shuffled on c1 for JOIN, and the JOIN output is shuffled on c1 again for AGG. As jobs: Job 1 shuffles both t1 inputs on c1, runs JOIN, and writes tmp; Job 2 shuffles tmp on c1 and runs AGG]
       Drawbacks:
       1. Unnecessary data loading
       2. Unnecessary data shuffling
       3. Unnecessary data materialization
  • 13. Outline
       • Query planning in Hive
       • Correlations in a query (intra-query correlations)
       • Case studies
       • Automatically exploiting correlations (HIVE-2206: Correlation Optimizer)
  • 14. Case studies: TPC-H Q17 (Flattened)
       SELECT sum(l_extendedprice) / 7.0 as avg_yearly
       FROM (SELECT l_partkey, l_quantity, l_extendedprice
             FROM lineitem JOIN part ON (p_partkey=l_partkey)
             WHERE p_brand='Brand#35' AND
                   p_container = 'MED PKG') touter
       JOIN (SELECT l_partkey as lp, 0.2 * avg(l_quantity) as lq
             FROM lineitem
             GROUP BY l_partkey) tinner
       ON (touter.l_partkey = tinner.lp)
       WHERE touter.l_quantity < tinner.lq
  • 15. Case studies: TPC-H Q17 (Flattened)
       [Diagram: lineitem and part feed JOIN1; lineitem feeds AGG1; JOIN1 and AGG1 feed JOIN2, which feeds AGG2]
       lineitem is used by JOIN1 and AGG1.
       JOIN1, AGG1, and JOIN2 share the same key.
  • 16. Case studies: TPC-H Q17 (Flattened)
       Without Correlation Optimizer: the same plan runs as four MapReduce jobs (Job 1 to Job 4).
  • 17. Case studies: TPC-H Q17 (Flattened)
       Without Correlation Optimizer: four MapReduce jobs (Job 1 to Job 4), with lineitem appearing twice in the plan.
       With Correlation Optimizer: two MapReduce jobs (Job 1 and Job 2), with lineitem appearing once in the plan.
  • 18. Case studies: TPC-DS Q95 (Flattened)
       SELECT count(distinct ws1.ws_order_number) as order_count,
              sum(ws1.ws_ext_ship_cost) as total_shipping_cost,
              sum(ws1.ws_net_profit) as total_net_profit
       FROM web_sales ws1
       JOIN customer_address ca ON (ws1.ws_ship_addr_sk = ca.ca_address_sk)
       JOIN web_site s ON (ws1.ws_web_site_sk = s.web_site_sk)
       JOIN date_dim d ON (ws1.ws_ship_date_sk = d.d_date_sk)
       LEFT SEMI JOIN (SELECT ws2.ws_order_number as ws_order_number
                       FROM web_sales ws2 JOIN web_sales ws3
                       ON (ws2.ws_order_number = ws3.ws_order_number)
                       WHERE ws2.ws_warehouse_sk <> ws3.ws_warehouse_sk) ws_wh1
       ON (ws1.ws_order_number = ws_wh1.ws_order_number)
       LEFT SEMI JOIN (SELECT wr_order_number
                       FROM web_returns wr
                       JOIN (SELECT ws4.ws_order_number as ws_order_number
                             FROM web_sales ws4 JOIN web_sales ws5
                             ON (ws4.ws_order_number = ws5.ws_order_number)
                             WHERE ws4.ws_warehouse_sk <> ws5.ws_warehouse_sk) ws_wh2
                       ON (wr.wr_order_number = ws_wh2.ws_order_number)) tmp1
       ON (ws1.ws_order_number = tmp1.wr_order_number)
       WHERE d.d_date >= '2001-05-01' AND
             d.d_date <= '2001-06-30' AND
             ca.ca_state = 'NC' AND
             s.web_company_name = 'pri'
  • 19. Case studies: TPC-DS Q95 (Flattened)
       [Diagram: web_sales, customer_address, web_site, and date_dim feed MapJoin; two JOIN1 operators each join two web_sales inputs; one JOIN1 and web_returns feed JOIN2; these feed SemiJoin, then AGG]
  • 20. Case studies: TPC-DS Q95 (Flattened)
       Without Correlation Optimizer
       • 6 MapReduce jobs (Job 1 to Job 6)
       • Unnecessary data loading (black web_sales nodes)
       • Unnecessary data shuffling
  • 21. Case studies: TPC-DS Q95 (Flattened)
       With Correlation Optimizer
       • Black web_sales nodes share the same data loading
  • 22. Case studies: TPC-DS Q95 (Flattened)
       With Correlation Optimizer
       • Black web_sales nodes share the same data loading
       • 3 MapReduce jobs (Job 1 to Job 3)
  • 23. Case studies: TPC-DS Q95 (Flattened)
       Follow-up work
       • Evaluate JOIN1 only once without materializing a temporary table
  • 24. Case studies: TPC-DS Q95 (Flattened)
       Follow-up work
       • Evaluate JOIN1 only once without materializing a temporary table
       • Only use 2 MapReduce jobs (Job 1 and Job 2)
  • 25. Outline
       • Query planning in Hive
       • Correlations in a query (intra-query correlations)
       • Case studies
       • Automatically exploiting correlations (HIVE-2206: Correlation Optimizer)
  • 26. Objectives
       • Eliminate unnecessary data loading
         – Query planner will be aware of what data will be loaded
         – Do as many things as possible for loaded data
       • Eliminate unnecessary data shuffling
         – Query planner will be aware of when data really needs to be shuffled
         – Do as many things as possible before shuffling the data again
  • 27. ReduceSink Deduplication
       • HIVE-2340
       • Handles chained Job Flow Correlations
         – e.g. generating a single job for both Group By and Order By
       • Cannot handle complex patterns
         – e.g. patterns involving multiple joins
       • Need a fundamental solution
       • Need to exploit shared input tables
       [Diagram: t1, RS1, AGG1, RS2, ... becomes t1, RS1, AGG1, ... (RS2 removed)]
  • 28. Correlation Optimizer
       • 2-phase optimizer
         – Phase 1: Correlation detection
         – Phase 2: Query plan tree transformation
       • This work is not just about the optimizer
         – New operators to support the execution of an optimized plan
         – A mechanism to coordinate the operator tree inside the Reduce phase
  • 29. Correlation detection
       SELECT p.c1, q.c2, q.cnt
       FROM (SELECT x.c1 AS c1 FROM t1 x JOIN t2 y ON (x.c1=y.c1)) p
       JOIN (SELECT z.c1 AS c1, count(*) AS cnt FROM t1 z GROUP BY z.c1) q
       ON (p.c1=q.c1)
       1. Traverse the tree all the way down to find matching keys in ReduceSinkOperators
       2. Then, check input tables to find shared data loading opportunities
       [Diagram: t1 as x ends in RS1 (key: x.c1) and t2 as y ends in RS2 (key: y.c1), both feeding JOIN1; t1 as z ends in RS3 (key: z.c1), feeding AGG1; JOIN1 ends in RS4 (key: p.c1) and AGG1 ends in RS5 (key: q.c1), both feeding JOIN2]
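The two detection steps on slide 29 can be sketched as a tree walk. This is a simplified illustration, not the HIVE-2206 code: `Node` and `detect` are hypothetical names, each shuffling operator carries the key of the ReduceSinkOperators feeding it, and keys are assumed to be already resolved to a common column name ("c1"), whereas a real optimizer would have to trace p.c1 and q.c1 back to x.c1 and z.c1.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Node:
    """Hypothetical plan node: either a table scan or a shuffling operator."""
    name: str
    key: Optional[str] = None    # shuffle key, set for JOIN/AGG operators
    table: Optional[str] = None  # base table, set for table scans
    children: List["Node"] = field(default_factory=list)  # upstream inputs


def detect(root: Node) -> Tuple[List[str], Set[str]]:
    """Return (operators sharing root's key, tables scanned more than once)."""
    correlated: List[str] = []
    scans: List[str] = []

    def walk(node: Node) -> None:
        if node.table is not None:  # step 2 input: remember every scan reached
            scans.append(node.table)
            return
        if node.key != root.key:    # step 1: stop where the key stops matching
            return
        correlated.append(node.name)
        for child in node.children:
            walk(child)

    walk(root)
    shared = {t for t in scans if scans.count(t) > 1}
    return correlated, shared


# Plan for the query on the slide, with keys normalized to "c1".
join1 = Node("JOIN1", key="c1", children=[Node("x", table="t1"), Node("y", table="t2")])
agg1 = Node("AGG1", key="c1", children=[Node("z", table="t1")])
join2 = Node("JOIN2", key="c1", children=[join1, agg1])

operators, shared_tables = detect(join2)
print(operators)      # operators that could run in one Reduce phase
print(shared_tables)  # tables that could be loaded once
```

For this plan the walk reaches all three operators and finds t1 scanned twice, which is the situation the plan tree transformation then exploits.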
  • 30. Query plan tree transformation
       SELECT p.c1, q.c2, q.cnt
       FROM (SELECT x.c1 AS c1 FROM t1 x JOIN t2 y ON (x.c1=y.c1)) p
       JOIN (SELECT z.c1 AS c1, count(*) AS cnt FROM t1 z GROUP BY z.c1) q
       ON (p.c1=q.c1)
       [Diagram, before: t1 as x (RS1, key: x.c1) and t2 as y (RS2, key: y.c1) feed JOIN1; t1 as z (RS3, key: z.c1) feeds AGG1; JOIN1 (RS4, key: p.c1) and AGG1 (RS5, key: q.c1) feed JOIN2]
       [Diagram, after: t1 is scanned once for both x and z, alongside t2 as y; only RS1, RS2, and RS3 remain, feeding JOIN1, AGG1, and JOIN2]
  • 31. Thanks