Hive User Meeting March 2010 - Hive Team
New features and APIs in Hive


  1. Hive New Features and API (Facebook Hive Team, March 2010)
  2. JDBC/ODBC and CTAS
  3. Hive ODBC Driver
     - Architecture:
       - Client/DriverManager -> dynamic libraries (local call)
       - unixODBC + hiveclient + thriftclient -> network socket
       - HiveServer (in Java) -> Hive + Hadoop (local call)
     - unixODBC is not part of Hive open source, so you need to build it yourself:
       - 32-bit/64-bit architecture
       - Thrift has to be r790732
       - Boost libraries
       - Linking with 3rd-party Driver Manager
  4. Facebook Use Case
     - Hive integration with MicroStrategy 8.1.2 (HIVE-187) and 9.0.1 (HIVE-1101)
       - FreeForm SQL (reports generated from user-input queries)
       - Reports generated daily
     - All servers (MSTR IS server, HiveServer) are running on Linux
       - ODBC driver needs to be 32-bit
  5. Hive JDBC
     - Embedded mode:
       jdbc:hive://
     - Client/server mode:
       jdbc:hive://host:port/dbname
       - host:port is where the Hive server is listening
       - Architecture is similar to ODBC
  6. Create Table As Select (CTAS)
     - New feature in branch 0.5
     - Example:
       CREATE TABLE T STORED AS TEXTFILE AS
       SELECT a+1 a1, concat(b,c,d) b2
       FROM S
       WHERE ...
     - Resulting schema: T (a1 double, b2 string)
     - The create clause can take all table properties except EXTERNAL and PARTITIONED BY (on the roadmap)
     - Atomicity: T will not be created if the SELECT statement fails
  7. Join Strategies
  8. Left Semi Join
     - Implements IN/EXISTS subquery semantics:
       SELECT A.*
       FROM A WHERE A.KEY IN
       (SELECT B.KEY FROM B WHERE B.VALUE > 100);
       is equivalent to:
       SELECT A.*
       FROM A LEFT SEMI JOIN B
       ON (A.KEY = B.KEY AND B.VALUE > 100);
     - Optimizations:
       - Map-side group-by to reduce data flowing to the reducers
       - Early exit on the first match in the join
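The rewrite above can be sketched in Python (a hypothetical in-memory model, not Hive code): build the set of qualifying keys from the filtered B side, then emit each row of A at most once on membership, which captures both optimizations.

```python
def left_semi_join(a_rows, b_rows):
    """Emit each row of A at most once if B has a matching key with value > 100."""
    # The subquery side: the set of distinct qualifying keys from B.
    # (Mirrors the map-side group-by: only distinct keys flow to the join.)
    b_keys = {r["key"] for r in b_rows if r["value"] > 100}
    # Early exit per row: a set-membership test stops at the first match.
    return [r for r in a_rows if r["key"] in b_keys]

a = [{"key": 1, "x": "a"}, {"key": 2, "x": "b"}, {"key": 3, "x": "c"}]
b = [{"key": 1, "value": 150}, {"key": 1, "value": 200}, {"key": 3, "value": 50}]
print(left_semi_join(a, b))  # only key 1 qualifies, and its row is emitted once
```

Note that key 1 matches two rows of B yet produces a single output row, which is exactly what distinguishes a semi join from an inner join.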
  9. Map Join Implementation
       SELECT /*+ MAPJOIN(a,c) */ a.*, b.*, c.*
       FROM a JOIN b ON a.key = b.key JOIN c ON a.key = c.key;
     - Spawn mappers based on the big table (b)
     - All files of all small tables are replicated onto each mapper
     [Diagram: mappers 1-3 each read a split of table b and hold full copies of files a1, a2, and c1]
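A minimal Python sketch of this replicated hash join, assuming the small tables fit in memory; the tuple layout (join key in field 0) and sample data are illustrative, not Hive's actual representation.

```python
from collections import defaultdict
from itertools import product

def map_join(big_rows, small_tables):
    """Join the streamed big table against in-memory copies of the small tables."""
    # Every "mapper" gets a full copy of each small table, hashed on the join key.
    hashed = []
    for table in small_tables:
        h = defaultdict(list)
        for row in table:
            h[row[0]].append(row)
        hashed.append(h)
    out = []
    for row in big_rows:  # stream the big table; no shuffle is needed
        matches = [h.get(row[0], []) for h in hashed]
        if all(matches):  # inner join: every small table must have a match
            for combo in product(*matches):
                out.append((row,) + combo)
    return out

b = [(1, "b1"), (2, "b2")]   # big table
a = [(1, "a1")]              # small table a
c = [(1, "c1"), (1, "c2")]   # small table c
print(map_join(b, [a, c]))   # key 2 has no match in a, so it is dropped
```

The shuffle phase disappears entirely, which is why the map join wins whenever the small tables genuinely fit in each mapper's memory.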
  10. Bucket Map Join
     - set hive.optimize.bucketmapjoin = true;
     - Works together with map join
     - All join tables are bucketized, and each small table's bucket count must divide the big table's bucket count
     - Bucket columns == join columns
  11. Bucket Map Join Implementation
       SELECT /*+ MAPJOIN(a,c) */ a.*, b.*, c.*
       FROM a JOIN b ON a.key = b.key JOIN c ON a.key = c.key;
     - Tables a, b, c are all bucketized by key; a has 2 buckets, b has 2, and c has 1
     - Spawn mappers based on the big table
     - Only the matching buckets of all small tables are replicated onto each mapper
     - Normally in production there will be thousands of buckets!
     [Diagram: mappers 1 and 2 read splits of bucket b1 and load buckets a1 and c1; mapper 3 reads bucket b2 and loads a2 and c1]
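The divisibility requirement is what makes bucket pruning sound. A small Python sketch of the invariant, using integer keys as a stand-in for Hive's hash bucketing:

```python
def bucket_of(key, n_buckets):
    # Hive assigns a row to bucket hash(key) mod n_buckets; plain ints suffice here.
    return hash(key) % n_buckets

# Big table has 4 buckets, small table has 2. Because 2 divides 4,
# every key found in big-table bucket i lives in small-table bucket i % 2,
# so a mapper reading big bucket i needs to load only that one small bucket.
for key in range(100):          # hypothetical integer keys
    big = bucket_of(key, 4)
    small = bucket_of(key, 2)
    assert small == big % 2
print("bucket invariant holds for all sample keys")
```

If the small table's bucket count did not divide the big table's, keys from one big bucket could scatter across several small buckets and the per-mapper pruning would miss matches.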
  12. Sort Merge Bucket Map Join
     - set hive.optimize.bucketmapjoin = true;
       set hive.optimize.bucketmapjoin.sortedmerge = true;
       set ...;
     - Works together with bucket map join
     - Bucket columns == join columns == sort columns
     - If partitioned, only the big table may span multiple partitions; small tables must be restricted to a single partition by the query
  13. Sort Merge Bucket Map Join
     - Small tables are read on demand; entire small tables are NOT held in memory
     - Can perform outer join
     [Diagram: matching buckets of tables A, B, and C, each sorted on the join key, merged in a single streaming pass]
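The merge itself can be sketched in Python: two bucket files already sorted on the join key are streamed in lockstep, and nothing beyond the current group of equal keys is buffered. Row layout (key in field 0) is illustrative.

```python
def sort_merge_join(left, right):
    """Inner join of two row lists already sorted on the join key (field 0)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1                      # advance the side with the smaller key
        elif lk > rk:
            j += 1
        else:
            # Collect the group of equal keys on each side, cross them, move on.
            i2 = i
            while i2 < len(left) and left[i2][0] == lk:
                i2 += 1
            j2 = j
            while j2 < len(right) and right[j2][0] == rk:
                j2 += 1
            for l in left[i:i2]:
                for r in right[j:j2]:
                    out.append(l + r)
            i, j = i2, j2
    return out

a = [(1, "val_1"), (3, "val_3"), (5, "val_5")]
b = [(3, "w_3"), (4, "w_4"), (5, "w_5")]
print(sort_merge_join(a, b))  # keys 3 and 5 match
```

Because both cursors only move forward, memory stays bounded regardless of table size, which is why this variant can also support outer joins.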
  14. Skew Join
     - The join is bottlenecked on the reducer that gets the skewed key
     - set hive.optimize.skewjoin = true;
       set hive.skewjoin.key = skew_key_threshold;
  15. Skew Join Implementation
     - Job 1: an ordinary reduce-side join of A and B handles the non-skewed keys (e.g. K2, K3 on reducers 1 and 2); rows with a skewed key (e.g. K1) are not joined but written to HDFS files (a-K1, b-K1)
     - Job 2: a map join over the spilled files (a-K1 map-joined with b-K1)
     - The outputs of the two jobs are merged into the final result
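The two-job flow above can be condensed into a Python sketch; the threshold test, tuple layout, and skew detection on the B side are simplifying assumptions, not Hive's exact mechanics.

```python
from collections import Counter

def skew_join(a_rows, b_rows, threshold):
    """Two-pass inner join: ordinary join, with skewed keys deferred to a map join."""
    counts = Counter(k for k, _ in b_rows)
    skewed = {k for k, c in counts.items() if c > threshold}
    # Job 1: reduce-side join on non-skewed keys; skewed rows are "spilled to HDFS".
    a_spill = [r for r in a_rows if r[0] in skewed]
    b_spill = [r for r in b_rows if r[0] in skewed]
    out = [l + r for l in a_rows for r in b_rows
           if l[0] == r[0] and l[0] not in skewed]
    # Job 2: map join over just the spilled skewed keys.
    out += [l + r for l in a_spill for r in b_spill if l[0] == r[0]]
    return out

a = [(1, "a1"), (2, "a2")]
b = [(1, "x"), (1, "y"), (1, "z"), (2, "w")]   # key 1 is skewed
print(skew_join(a, b, threshold=2))
```

The point of the split is that the hot key no longer funnels through a single reducer; job 2 parallelizes it as a map join instead.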
  16. Future Work
     - Skew join with a replication algorithm
     - Memory footprint optimization
  17. Views, HBase Integration
  18. CREATE VIEW Syntax
       CREATE VIEW [IF NOT EXISTS] view_name
       [ (column_name [COMMENT column_comment], ... ) ]
       [COMMENT view_comment]
       AS SELECT ...
       [ ORDER BY ... LIMIT ... ]

       -- example
       CREATE VIEW pokebaz(baz COMMENT 'this column used to be bar')
       COMMENT 'views are good for layering on renaming'
       AS SELECT bar FROM pokes;
  19. View Features
     - Other commands:
       - SHOW TABLES: views show up too
       - DESCRIBE: see view column descriptions
       - DESCRIBE EXTENDED: retrieve the view definition
     - Enhancements on the way soon:
       - Dependency management (e.g. CASCADE/RESTRICT)
       - Partition awareness
     - Enhancements (long term):
       - Updatable views
       - Materialized views
  20. HBase Storage Handler
       CREATE TABLE users(
         userid int, name string, email string, notes string)
       STORED BY
         'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
       WITH SERDEPROPERTIES (
         "hbase.columns.mapping" =
         "small:name,small:email,large:notes");
  21. HBase Storage Handler Features
     - Commands supported:
       - CREATE EXTERNAL TABLE: register an existing HTable
       - SELECT: join, group by, union, etc., over multiple HBase tables, or mixed with native Hive tables
       - INSERT: from any Hive query
     - Enhancements needed (feedback on priority welcome):
       - More flexible column mapping, ALTER TABLE
       - Timestamp read/write/restrict
       - Filter pushdown
       - Partition support
       - Write atomicity
  22. UDF, UDAF and UDTF
  23. User-Defined Functions (UDF)
     - 1 input to 1 output
     - Typically used in SELECT:
       SELECT concat(first, ' ', last) AS full_name ...
     - See the Hive language wiki for the full list of built-in UDFs
     - Noteworthy features:
       - Sometimes you want to cast:
         SELECT CAST(5.0/2.0 AS INT) ...
       - Conditional functions:
         SELECT IF(boolean, if_true, if_not_true) ...
  24. User-Defined Aggregate Functions (UDAF)
     - N inputs to 1 output
     - Typically used with GROUP BY:
       SELECT count(1) FROM ... GROUP BY age
       SELECT count(DISTINCT first_name) ... GROUP BY last_name
       sum(), avg(), min(), max()
     - For skew:
       set hive.groupby.skewindata = true;
       set ... = <some lower value>;
  25. User-Defined Table-Generating Functions (UDTF)
     - 1 input to N outputs
     - explode(Array<?> arg)
       - Converts an array into multiple rows, one element per row
     - Transform-like syntax:
       SELECT udtf(col0, col1, ...) AS colAlias FROM srcTable
     - Lateral view syntax:
       ... FROM baseTable LATERAL VIEW udtf(col0, col1, ...) tableAlias AS colAlias
  26. UDTF Using Transform Syntax
       SELECT explode(group_ids) AS group_id FROM src
     [Diagram: input table src and the exploded output]
  27. UDTF Using Lateral View Syntax
       SELECT src.*, myTable.* FROM src LATERAL VIEW explode(group_ids) myTable AS group_id
     [Diagram: input table src]
  28. UDTF Using Lateral View Syntax (continued)
     [Diagram: explode(group_ids) myTable AS group_id joins each input row of src to its output rows, producing one result row per group_id (1, 2, 3)]
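The lateral view walked through above can be sketched in Python: each source row is joined to one output row per array element, with the source columns carried alongside. The `src` schema (`page`, `group_ids`) is hypothetical.

```python
def lateral_view_explode(rows, array_col, alias):
    """Join each source row to one output row per element of its array column."""
    out = []
    for row in rows:
        for elem in row[array_col]:          # explode: one row per array element
            new_row = {k: v for k, v in row.items() if k != array_col}
            new_row[alias] = elem            # the UDTF's output column
            out.append(new_row)
    return out

src = [{"page": "front", "group_ids": [1, 2, 3]}]   # hypothetical src schema
print(lateral_view_explode(src, "group_ids", "group_id"))
```

This is the difference from the transform syntax on slide 26: the lateral view keeps `src.*` next to the generated `group_id` column instead of discarding it.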
  29. SerDe – Serialization/Deserialization
  30. SerDe Examples
       CREATE TABLE mylog (
         user_id BIGINT,
         page_url STRING,
         unix_time INT)
       ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';

       CREATE TABLE mylog_rc (
         user_id BIGINT,
         page_url STRING,
         unix_time INT)
       ROW FORMAT SERDE
         'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
       STORED AS RCFILE;
  31. SerDe
     - SerDe is short for serialization/deserialization; it controls the format of a row
     - Serialized format:
       - Delimited format (tab, comma, Ctrl-A, ...)
       - Thrift protocols
       - ProtocolBuffer*
     - Deserialized (in-memory) format:
       - Java Integer/String/ArrayList/HashMap
       - Hadoop Writable classes
       - User-defined Java classes (Thrift, ProtocolBuffer*)
     - * ProtocolBuffer support not available yet
  32. Where is SerDe?
     - Read path: file on HDFS -> Writables (via FileFormat / Hadoop serialization) -> SerDe deserialize -> hierarchical object -> Hive operators; the write path runs the same steps in reverse, and user scripts exchange streams of rows
     - The SerDe's ObjectInspector describes the hierarchical object to the operators
     - A hierarchical object may be:
       - A Java object of a user class (e.g. thrift_record<...>)
       - A standard object: ArrayList for struct and array, HashMap for map
       - A LazyObject: lazily deserialized from a Writable such as Text('imp 1.0 3 54') or BytesWritable(x3Fx64x72x00)
  33. Object Inspector
     - A SerDe's deserialize() turns a Writable (e.g. BytesWritable(x3Fx64x72x00) or Text('a=av:b=bv 23 1:2=4:5 abcd')) into a hierarchical object; getOI() returns the ObjectInspector describing it, and serialize() goes the other way
     - Example layout (struct of map, int, list of struct, string):
       class HO { HashMap<String, String> a; Integer b; List<ClassC> c; String d; }
       class ClassC { Integer a; Integer b; }
       represented as List(HashMap("a" -> "av", "b" -> "bv"), 23, List(List(1, null), List(2, 4), List(5, null)), "abcd")
     - ObjectInspectors expose getType plus navigation such as getStructField/getFieldOI and getMapValue/getMapValueOI to walk struct and map fields (e.g. down to the String object "av")
  34. When to Add a New SerDe
     - The user has data in a special serialized format not yet supported by Hive, and does not want to convert the data before loading it into Hive
     - The user has a more efficient way of serializing the data on disk
  35. How to Add a New SerDe for Text Data
     - Follow the example in contrib/src/java/org/apache/hadoop/hive/contrib/serde2/
     - RegexSerDe uses a user-provided regular expression to deserialize data:
       CREATE TABLE apache_log(host STRING,
         identity STRING, user STRING, time STRING, request STRING,
         status STRING, size STRING, referer STRING, agent STRING)
       ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
       WITH SERDEPROPERTIES (
         "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
         "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s")
       STORED AS TEXTFILE;
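What RegexSerDe does at read time can be sketched in Python: apply the regex to each line and map capture groups to columns, one group per declared column. The pattern below is a simplified stand-in for the slide's input.regex, assuming combined-log-format input.

```python
import re

# Simplified combined-log-format pattern: one capture group per table column.
LOG_RE = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (\[[^\]]*\]) ("[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ("[^"]*") ("[^"]*"))?'
)

line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /index.html HTTP/1.0" 200 2326 "-" "Mozilla/4.08"')

m = LOG_RE.match(line)
# The nine groups line up with the nine STRING columns of apache_log.
host, identity, user, time, request, status, size, referer, agent = m.groups()
print(host, status, size)  # 127.0.0.1 200 2326
```

In the real SerDe the table schema supplies the column names and the regex must produce exactly as many groups as there are columns; a non-matching line yields NULLs.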
  36. How to Add a New SerDe for Binary Data
     - Follow the examples in contrib/src/java/org/apache/hadoop/hive/contrib/serde2/thrift (HIVE-706) and serde/src/java/org/apache/hadoop/hive/serde2/binarysortable
       CREATE TABLE mythrift_table
       ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.thrift.ThriftSerDe'
       WITH SERDEPROPERTIES (
         "serialization.class" = "com.facebook.serde.tprofiles.full",
         "serialization.format" = "com.facebook.thrift.protocol.TBinaryProtocol");
     - NOTE: column information is provided by the SerDe class
  37. Q & A