• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Indexed Hive
 

Indexed Hive

on

  • 15,550 views

Accelerating Hive queries with indexes.

Accelerating Hive queries with indexes.

Statistics

Views

Total Views
15,550
Views on SlideShare
14,434
Embed Views
1,116

Actions

Likes
21
Downloads
450
Comments
2

22 Embeds 1,116

http://socurites.com 499
http://jugnu-life.blogspot.com 393
http://jugnu-life.blogspot.in 140
http://socurites.tistory.com 14
http://jugnu-life.blogspot.kr 10
http://jugnu-life.blogspot.fr 8
http://jugnu-life.blogspot.tw 8
http://jugnu-life.blogspot.jp 8
http://jugnu-life.blogspot.co.uk 6
http://jugnu-life.blogspot.sg 5
http://jugnu-life.blogspot.hk 4
http://jugnu-life.blogspot.ca 4
https://twitter.com 3
http://jugnu-life.blogspot.com.au 3
http://jugnu-life.blogspot.de 3
http://devcafe.nhncorp.com 2
http://www.socurites.com 1
http://jugnu-life.blogspot.ru 1
http://jugnu-life.blogspot.com.es 1
http://jugnu-life.blogspot.nl 1
http://jugnu-life.blogspot.com.br 1
https://www.google.co.jp 1
More...

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel

12 of 2 previous next

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
  • These queries are not equivalent. The index does not guarantee that there will be only 1 row for each 'l_shipdate' value (I have seen this in practise with hive 0.11). Hence an equivalent query to one on slide 11 would be

    SELECT l_shipdate, sum(size(`_offsets`)
    FROM __lineitem_shipdate_idx__
    GROUP BY l_shipdate;
    Are you sure you want to
    Your message goes here
    Processing…
  • Hi Nikhil, Prafulla,
    Very cool work you guys!
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Indexed Hive Indexed Hive Presentation Transcript

    • Indexed Hive A quick demonstration of Hive performance acceleration using indexes By: Prafulla Tekawade Nikhil Deshpande www.persistentsys.com
    • Summary • This presentation describes the performance experiment based on Hive using indexes to accelerate query execution. • The slides include information on • Indexes • A specific set of Group By queries • Rewrite technique • Performance experiment and results © 2010 Persistent Systems Ltd www.persistentsys.com 2
    • Hive usage • HDFS spreads and scatters the data to different locations (data nodes). • Data dumped & loaded into HDFS ‘as it is’. • Only one view to the data, original data structure & layout • Typically data is append-only • Processing times dominated by full data scan times Can the data access times be better? © 2010 Persistent Systems Ltd www.persistentsys.com 3
    • Hive usage What can be done to speed-up queries? Cut down the data I/O. Lesser data means faster processing. Different ways to get performance • Columnar storage • Data partitioning • Indexing (different view of same data) • … © 2010 Persistent Systems Ltd www.persistentsys.com 4
    • Hive Indexing • Provides key-based data view • Keys data duplicated • Storage layout favors search & lookup performance • Provided better data access for certain operations • A cheaper alternative to full data scans! How cheap? An order of magnitude better in certain cases! © 2010 Persistent Systems Ltd www.persistentsys.com 5
    • How does the index look like? An index is a table with 3 columns hive> describe default__tpch1m_lineitem_tpch1m_lineitem_shipdate_idx __; OK l_shipdate string Key _bucketname string References to _offsets array<string> values Data in index looks like hive> select * from default__tpch1m_lineitem_tpch1m_lineitem_shipdate_idx__ limit 2; OK 1992-01-08 hdfs://hadoop1:54310/user/…/lineitem.tbl ["662368"] 1992-01-16 hdfs://hadoop1:54310/user/…/lineitem.tbl ["143623","390763","637910"] © 2010 Persistent Systems Ltd www.persistentsys.com 6
    • Hive index in HQL • SELECT (mapping, projection, association, given key, fetch value) • WHERE (filters on keys) • GROUP BY (grouping on keys) • JOIN (join key as index key) Indexes have high potential for accelerating wide range of queries. © 2010 Persistent Systems Ltd www.persistentsys.com 7
    • Hive Index • Index as Reference • Index as Data This demonstration uses Index as Data technique to show order of magnitude performance gain! • Uses Query Rewrite technique to transform queries on base table to index table. • Limited applicability currently (e.g. demo based on GB) but technique itself has wide potential. • Also a very quick way to demonstrate importance of index for performance (no deep optimizer/execution engine modifications). © 2010 Persistent Systems Ltd www.persistentsys.com 8
    • Indexes and Query Rewrites Demo targeting: • GROUP BY, aggregation • Index as Data • Group By Key = Index Key • Query rewritten to use indexes, but still a valid query (nothing special in it!) © 2010 Persistent Systems Ltd www.persistentsys.com 9
    • Query Rewrites: simple gb SELECT DISTINCT l_shipdate FROM lineitem; SELECT l_shipdate FROM __lineitem_shipdate_idx__; © 2010 Persistent Systems Ltd www.persistentsys.com 10
    • Query Rewrites: simple agg SELECT l_shipdate, COUNT(1) FROM lineitem GROUP BY l_shipdate; SELECT l_shipdate, size(`_offsets`) FROM __lineitem_shipdate_idx__; © 2010 Persistent Systems Ltd www.persistentsys.com 11
    • Query Rewrites: gb + where SELECT l_shipdate, COUNT(1) FROM lineitem WHERE YEAR(l_shipdate) >= 1992 AND YEAR(l_shipdate) <= 1996 GROUP BY l_shipdate; SELECT l_shipdate, size(` _offsets `) FROM __lineitem_shipdate_idx__ WHERE YEAR(l_shipdate) >= 1992 AND YEAR(l_shipdate) <= 1996; © 2010 Persistent Systems Ltd www.persistentsys.com 12
    • Query Rewrites: gb on func(key) SELECT YEAR(l_shipdate) AS Year, COUNT(1) AS Total FROM lineitem GROUP BY YEAR(l_shipdate); SELECT Year, SUM(cnt) AS Total FROM (SELECT YEAR(l_shipdate) AS Year, size(`_offsets`) AS cnt FROM __lineitem_shipdate_idx__) AS t GROUP BY Year; © 2010 Persistent Systems Ltd www.persistentsys.com 13
    • Histogram Query SELECT YEAR(l_shipdate) AS Year, MONTH(l_shipdate) AS Month, COUNT(1) AS Monthly_shipments FROM lineitem GROUP BY YEAR(l_shipdate), MONTH(l_shipdate); SELECT YEAR(l_shipdate) AS Year, MONTH(l_shipdate) AS Month, SUM(sz) AS Monthly_shipments FROM (SELECT l_shipdate, SIZE(`_offsets`) AS sz FROM __lineitem_shipdate_idx__) AS t GROUP BY YEAR(l_shipdate), MONTH(l_shipdate); © 2010 Persistent Systems Ltd www.persistentsys.com 14
    • Year on Year Query SELECT y1.Month AS Month, y1.shipments AS Y1_shipments, y2.shipments AS Y2_shipments, (y2_shipments-y1_shipments)/y1_shipments AS Delta FROM (SELECT YEAR(l_shipdate) AS Year, MONTH(l_shipdate) AS Month, COUNT(1) AS Shipments FROM lineitem WHERE YEAR(l_shipdate) = 1997 GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y1 JOIN (SELECT YEAR(l_shipdate) AS Year, MONTH(l_shipdate) AS Month, COUNT(1) AS Shipments FROM lineitem WHERE YEAR(l_shipdate) = 1998 GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y2 ON y1.Month = y2.Month; © 2010 Persistent Systems Ltd www.persistentsys.com 15
    • Year on Year Query SELECT y1.Month AS Month, y1.shipments AS y1_shipments, y2.shipments AS y2_shipments, ( y2_shipments - y1_shipments ) / y1_shipments AS delta FROM (SELECT YEAR(l_shipdate) AS Year, MONTH(l_shipdate) AS Month, SUM(sz) AS shipments FROM (SELECT l_shipdate, size(` _offsets `) AS sz FROM __lineitem_shipdate_idx__) AS t1 WHERE YEAR(l_shipdate) = 1997 GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y1 JOIN (SELECT YEAR(l_shipdate) AS Year, MONTH(l_shipdate) AS Month, SUM(sz) AS shipments FROM (SELECT l_shipdate, size(` _offsets `) AS sz FROM __lineitem_shipdate_idx__) AS t WHERE YEAR(l_shipdate) = 1998 GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y2 ON y1.Month = y2.Month; © 2010 Persistent Systems Ltd www.persistentsys.com 16
    • Performance tests Hardware and software configuration: • 2 server class machines (each box: CentOS 5.x Linux, 5 SAS disks in RAID5, 16GB RAM) • 2-node Hadoop cluster (0.20.2), un-tuned and un-optimized, data not partitioned and clustered, Hive tables stored in row- store format, HDFS replication factor: 2 • Hive development branch (~0.5) • Sun JDK 1.6 (server mode JVM, JVM_HEAP_SIZE:4GB RAM) • Queries on TPC-H Data (lineitem table: 70% of TPC-H data size, e.g. TPC-H 30GB data: 21GB lineitem, ~180Million tuples) © 2010 Persistent Systems Ltd www.persistentsys.com 17
    • Perf gain for Histogram Query Graphs not to scale (sec) 1M 1G 10G 30G q1_noidx 24.161 76.79 506.005 1551.555 q1_idx 21.268 27.292 35.502 86.133 © 2010 Persistent Systems Ltd www.persistentsys.com 18
    • Perf gain for Year on Year Query Graphs not to scale (sec) 1M 1G 10G 30G q1_noidx 73.66 130.587 764.619 2146.423 q1_idx 69.393 75.493 92.867 190.619 © 2010 Persistent Systems Ltd www.persistentsys.com 19
    • Why index performs better? Reducing data increases I/O efficiency Exploiting storage layout optimization  If you need only X, separate X from  “Right tool for the job”, e.g. two ways the rest to do GROUP BY  Lesser data to process, better  sort + agg or memory footprint, better locality of  hash & agg reference…  Sort step already done in index! Parallelization • Process the index data in same manner as base table, distribute the processing across nodes • Scalable! © 2010 Persistent Systems Ltd www.persistentsys.com 20
    • Near-by future More rewrites Partitioning Index data per key. Run-time operators for index usage (lookup, join, filter etc., since rewrites only a partial solution). Optimizer support for index operators. Cost based optimizer to choose index and non-index plans. … © 2010 Persistent Systems Ltd www.persistentsys.com 21
    • Index Design Hive Hive Query DDL Index Query Rewrite Compiler Builder Compiler Engine Hive Hive DDL Query Engine Engine Hadoop MR HDFS © 2010 Persistent Systems Ltd www.persistentsys.com 22
    • Hive Compiler Parser / AST Generator Semantic Analyzer Optimizer / Operator Query Plan Rewrite Generator Execution Engine Plan Generator To Hadoop MR © 2010 Persistent Systems Ltd www.persistentsys.com 23
    • Query Rewrite Engine Rule Engine Rewritten Query Tree Query Tree Rewrite Rules Repository Rewrite Rule Rewrite Rewrite Rule Rewrite Rule Rewrite Trigger Rewrite Rule Rewrite Action Condition Rewrite Rewrite Trigger Rewrite Rewrite Rule Action Rewrite Condition Trigger Rewrite Rewrite Rule Trigger Rewrite Condition Action Rewrite Action Trigger Rewrite Action Condition Condition Rewrite Trigger Action Condition © 2010 Persistent Systems Ltd www.persistentsys.com 24
    • Learning Hive • Hive compiler is not ‘Syntax Directed Translation’ driven • Tree visitor based, separation of data structs and compiler logic • Tree is immutable (harder to change, harder to rewrite) • Query semantic information is separately maintained from the query lexical/parse tree, in different data structures, which are loosely bound in a Query Block data structure, which itself is loosely bound to parse tree, yet there doesn’t exist a bigger data flow graph off which everything is hung. This makes it very difficult to rewrite queries. • Optimizer is not yet mature • Doesn’t handle many ‘obvious’ opportunities (e.g. sort group by for cases other than base table scans) • Optimizer is rule-based, not cost-based, no stats collected • Query tuning is harder job (requires special knowledge of the optimizer guts, what works and what doesn’t) • Setting up development environment is tedious (build system heavily relies on internet connection, troublesome behind restrictive firewalls). • Folks in the community are very active, dependent JIRAs are fast moving target and development-wise, we need to keep up with them actively (e.g. if branching, need to frequently refresh from trunk). © 2010 Persistent Systems Ltd www.persistentsys.com 25
    • How to get it? • Needs a working Hadoop cluster (tested with 0.20.2) • For the Hive with Indexing support: • Hive Index DDL patch (JIRA 417) now part of hive trunk https://issues.apache.org/jira/browse/HIVE-417 • Get the Hive branch with Index Query Rewrite patch applied from Github (a fork/branch of Hive development tree, a snapshot of Hive + Index DDL source tree, not latest, but single place to get all) http://github.com/prafullat/hive Refer Hive documentation for building http://wiki.apache.org/hadoop/Hive/GettingStarted#Downloading_an d_building See the ql/src/test/queries/client/positive/ql_rewrite_gbtoidx.q test. © 2010 Persistent Systems Ltd www.persistentsys.com 26
    • Thank You! prafulla_tekawade at persistent dot co dot in nikhil_deshpande at persistent dot co dot in © 2010 Persistent Systems Ltd www.persistentsys.com 27