Column Statistics in Hive


  • Explain what "cost" means in the context of this discussion: the CPU and I/O cost of executing a query plan
  • Talk about how each of the statistics will be useful
  • Talk about the algorithms used: Flajolet-Martin, histogram construction

    1. Column Statistics Project
       Shreepadma Venugopalan | Platform Engineering
       09/27/2012
    2. Outline
       • Motivation
       • New Statistics
       • Computing and Persisting Statistics
       • Summary
       • Further Readings
       ©2012 Cloudera, Inc. All Rights Reserved.
    3. Why Column Statistics?
       • RDBMS query optimizer
         • Is cost-based
         • Uses the statistical properties of the data to cost alternative execution plans
         • Picks the execution plan with the lowest cost
       • Hive query optimizer
         • Is rule-based
         • Uses rules of thumb to optimize the execution plan
         • Cannot always pick the most efficient execution plan
    4. Why Column Statistics?
       • Statistics in an RDBMS
         • Maintained on a per-table, per-partition, and per-column basis
         • Used for a wide range of cost-based query optimizations
       • Statistics in Hive
         • Maintained at the per-table and per-partition level
         • Can be used for some cost-based optimizations, such as choosing the join method
         • Insufficient for other cost-based optimizations, such as join reordering and two-stage aggregation
       • Solution: maintain statistics on columns in Hive
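To illustrate why join reordering needs column-level statistics, here is a minimal sketch of a textbook join-size estimate. The formula (rows_left × rows_right / max NDV of the join column) is a standard approximation that assumes uniformly distributed values; it is not taken from the slides, and the names `ColumnStats` and `estimate_join_rows` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical container for the statistics of one join column."""
    num_rows: int  # number of rows in the table
    ndv: int       # number of distinct values in the join column


def estimate_join_rows(left: ColumnStats, right: ColumnStats) -> float:
    """Estimate the row count of an equi-join on the described columns.

    Textbook approximation assuming uniformly distributed values:
    |L join R| ~= |L| * |R| / max(NDV_L, NDV_R).
    """
    return left.num_rows * right.num_rows / max(left.ndv, right.ndv)


# Illustrative numbers only: a cost-based optimizer could compare the
# estimated intermediate result sizes and run the smaller join first.
orders = ColumnStats(num_rows=1_000_000, ndv=50_000)
customers = ColumnStats(num_rows=50_000, ndv=50_000)
clicks = ColumnStats(num_rows=20_000_000, ndv=40_000)

orders_customers = estimate_join_rows(orders, customers)
orders_clicks = estimate_join_rows(orders, clicks)
```

Without the per-column NDV, neither estimate can be computed, which is the gap the table-level and partition-level statistics leave open.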
    5. What are the New Statistics?
       • Min column value
       • Max column value
       • Average length of column value
       • Max length of column value
       • Number of distinct values (NDV) in a column
       • Number of null values in a column
       • Equi-height histograms
    6. How to Compute Column Statistics?
       • Explicit computation
         • Triggered through an ANALYZE command
         • Pros: the admin has fine-grained control over the stats job
         • Cons: doesn't piggyback on other operations such as scans
       • Implicit computation
         • Statistics are computed incrementally while loading data
         • Pros: avoids an additional table scan; more efficient than explicit computation
         • Cons: impacts LOAD performance
    7. How to Compute Column Statistics?
       • Each statistic is an aggregate function of the column data, rolled up by table/partition
       • Fits nicely into Hive's UDAF framework
       • Expect to scan TBs of data at a time
         • Requirement #1: memory usage has to scale sub-linearly with data size
         • Requirement #2: the stats task has to complete in a reasonable amount of time
       • Given these requirements, some statistics, such as NDV and histograms, are hard to compute!
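The scalar statistics are the easy case: each can be kept in constant memory and partial results can be merged. The following is a minimal sketch of that idea for a string column. It only mirrors the update/merge shape of an aggregate function; it is not Hive's UDAF API, and the class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StringColumnStats:
    """Constant-size running statistics for one string column (hypothetical)."""
    num_values: int = 0    # non-null values seen
    num_nulls: int = 0     # null values seen
    total_length: int = 0  # sum of lengths, kept to derive the average
    max_length: int = 0

    def update(self, value: Optional[str]) -> None:
        """Fold one column value into the running statistics."""
        if value is None:
            self.num_nulls += 1
            return
        self.num_values += 1
        self.total_length += len(value)
        self.max_length = max(self.max_length, len(value))

    def merge(self, other: "StringColumnStats") -> None:
        """Combine with a partial result computed over another data split."""
        self.num_values += other.num_values
        self.num_nulls += other.num_nulls
        self.total_length += other.total_length
        self.max_length = max(self.max_length, other.max_length)

    @property
    def avg_length(self) -> float:
        """Average length of the non-null values (0.0 if there are none)."""
        return self.total_length / self.num_values if self.num_values else 0.0


def summarize(values: Iterable[Optional[str]]) -> StringColumnStats:
    """Compute the statistics for one split of the data."""
    stats = StringColumnStats()
    for value in values:
        stats.update(value)
    return stats
```

Because the state is a handful of counters, memory use does not grow with the data size. NDV and histograms have no such exact constant-size state, which is why they need the approximate techniques on the following slides.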
    8. How to Compute NDVs?
       • Naïve approach
         • Maintain the set of distinct values seen in a column and count them
         • Impractical given memory requirements
       • Flajolet-Martin approach
         • Use probabilistic sketches to estimate NDV
         • Memory required is logarithmic in the size of the data
         • Estimates are within 10% of the actual value
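The sketch below illustrates the Flajolet-Martin idea in simplified form. It averages over several salted hash functions rather than using the stochastic averaging of the original paper, and it is not the implementation in the Hive patch. The class name `FMSketch`, the use of SHA-1 as the hash, and the default of 64 bitmaps are all assumptions made for illustration.

```python
import hashlib

PHI = 0.77351  # bias-correction constant from the Flajolet-Martin analysis


class FMSketch:
    """Simplified Flajolet-Martin distinct-value sketch (illustrative only)."""

    def __init__(self, num_bitmaps: int = 64) -> None:
        # One small integer bitmap per salted hash function.
        self.bitmaps = [0] * num_bitmaps

    @staticmethod
    def _hash(value: str, salt: int) -> int:
        digest = hashlib.sha1(f"{salt}:{value}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value: str) -> None:
        """Record one value; adding the same value again changes nothing."""
        for salt in range(len(self.bitmaps)):
            h = self._hash(value, salt)
            # Position of the least-significant 1 bit of the hash.
            # A zero hash is vanishingly unlikely; map it to bit 63.
            rho = (h & -h).bit_length() - 1 if h else 63
            self.bitmaps[salt] |= 1 << rho

    def merge(self, other: "FMSketch") -> None:
        """Combine with a sketch built over another split (bitwise OR)."""
        self.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]

    def estimate(self) -> float:
        """Estimate the number of distinct values added so far."""
        total = 0
        for bitmap in self.bitmaps:
            r = 0
            while bitmap & (1 << r):  # index of the lowest unset bit
                r += 1
            total += r
        mean_r = total / len(self.bitmaps)
        return 2 ** mean_r / PHI
```

Each bitmap needs only about log2(n) bits for n distinct values, and sketches from different mappers merge with a bitwise OR, so duplicates across splits are not double counted. Accuracy in this simplified form depends on the number of bitmaps; it is not tuned to reproduce the 10% figure quoted on the slide.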
    9. How to Compute Histograms?
       • Computing equi-height histograms is a quantile computation/estimation problem
       • Merging the quantiles computed at the mappers is non-trivial
       • Deterministic parallel algorithms such as QDigest are prohibitive in terms of memory required
       • Probabilistic stream-counting algorithms such as Count-Min Sketch can be adapted to estimate quantiles
         • Memory required is logarithmic in the size of the data
         • Computationally expensive!
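To make the link between equi-height histograms and quantiles concrete, here is a minimal sketch that computes exact bucket boundaries for data that fits in memory. It sorts the values, which is exactly what is not feasible at the scale the slides discuss, so it shows the target output rather than the streaming algorithm. The function name `equi_height_boundaries` is hypothetical.

```python
from typing import List, Sequence


def equi_height_boundaries(values: Sequence[float], num_bins: int) -> List[float]:
    """Return num_bins + 1 boundaries so each bin holds about the same number of values.

    Boundary i is the (i / num_bins) quantile of the data; the first and
    last boundaries are the minimum and maximum values.
    """
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    n = len(ordered)
    boundaries = []
    for i in range(num_bins + 1):
        # Clamp the last index so the final boundary is the maximum value.
        index = min(n - 1, (i * n) // num_bins)
        boundaries.append(ordered[index])
    return boundaries


# The values 1..100 split into 4 bins of 25 values each.
print(equi_height_boundaries(list(range(1, 101)), 4))  # [1, 26, 51, 76, 100]
```

A distributed version has to approximate these quantiles from per-mapper summaries, which is where the merging difficulty mentioned on the slide comes from.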
    10. How to Store Column Statistics?
       • Extend the metastore schema to store the new statistics
       • Extend the metastore Thrift API to update, query, and delete the new statistics
       • Size of the column statistics record in the metastore is independent of table/partition size
         • ~32 bytes/column if histograms are not computed
         • ~320 bytes/column for a 20-bin histogram
    11. Summary
       • Scalar statistics have been implemented for primitive-type columns in both tables and partitions
       • Patch is available on JIRA (HIVE-1362)
       • Computing equi-height histograms is a work in progress
    12. Questions?
    13. Further Readings
       • Blog
         • in-hive/
       • Academic
         • A. Gruenheid et al., Query Optimization using Column Statistics in Hive.
         • S. Chaudhuri, An Overview of Query Optimization in Relational Systems.
         • P. Flajolet and G. N. Martin, Probabilistic Counting Algorithms for Database Applications.
       Contact: shreepadma@cloudera.com