Your SlideShare is downloading. ×
0
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

HBaseCon 2013: Integration of Apache Hive and HBase

4,796

Published on

Presented by: Enis Söztutar (Hortonworks) and Ashtutosh Chahaun (Hortonworks)

Presented by: Enis Söztutar (Hortonworks) and Ashtutosh Chahaun (Hortonworks)

Published in: Technology, Design
0 Comments
35 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
4,796
On Slideshare
0
From Embeds
0
Number of Embeds
4
Actions
Shares
0
Downloads
0
Comments
0
Likes
35
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. © Hortonworks Inc. 2011 Integration of Apache Hive and HBase Enis Soztutar enis [at] apache [dot] org Page 1 Ashutosh Chauhan hashutosh [at] apache [dot] org
  • 2. © Hortonworks Inc. 2011 About Us Page 2 Architecting the Future of Big Data Enis Soztutar •  In the Hadoop space since 2007 •  Committer and PMC Member in Apache HBase and Hadoop •  Twitter: @enissoz Ashutosh Chauhan •  In the Hadoop space since 2009 •  Committer and PMC Member in Apache Hive and Pig
  • 3. © Hortonworks Inc. 2011 Agenda Page 3 Architecting the Future of Big Data •  Overview of Hive •  Hive + HBase Features and Improvements •  Future of Hive and HBase •  Q&A
  • 4. © Hortonworks Inc. 2011 Apache Hive Overview • Apache Hive is a data warehouse system for Hadoop • SQL-like query language called HiveQL • Built for PB scale data • Main purpose is analysis and ad hoc querying • Database / table / partition / bucket – DDL Operations • SQL Types + Complex Types (ARRAY, MAP, etc) • Very extensible • Not for : small data sets, OLTP Page 4 Architecting the Future of Big Data
  • 5. © Hortonworks Inc. 2011 Apache Hive Architecture Page 5 Architecting the Future of Big Data Metastore RDBMS Hive Thrift Server Driver CLI JDBC/ODBC Hive Web Interface HDFS MapReduce Execution Parser Planner Optimizer M S C l i e n t
  • 6. © Hortonworks Inc. 2011 Hive + HBase Features and Improvements Architecting the Future of Big Data Page 6
  • 7. © Hortonworks Inc. 2011 Hive + HBase Motivation • Hive over HDFS and HBase has different characteristics –  Batch Online –  Structured vs Unstructured – Analysts Programmers • Hive datawarehouses on HDFS are – Long ETL times – Access to real time data • Analyzing HBase data with MapReduce requires custom coding • Hive and SQL are already known by many analysts Page 7 Architecting the Future of Big Data
  • 8. © Hortonworks Inc. 2011 Use Case 1: HBase as ETL Data Sink Page 8 Architecting the Future of Big Data From HUG - Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010 HDFS Tables INSERT … SELECT …! FROM … ! HBase Online Queries
  • 9. © Hortonworks Inc. 2011 Use Case 2: HBase as Data Source Page 9 Architecting the Future of Big Data From HUG - Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010 HDFS Tables SELECT … JOIN …! GROUP BY … ! HBase Query Result
  • 10. © Hortonworks Inc. 2011 Use Case 3: Low Latency Warehouse Page 10 Architecting the Future of Big Data From HUG - Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010 HDFS Tables HBase Continuous Updates HIVE QUERIES! Periodic Dump
  • 11. © Hortonworks Inc. 2011 Hive + HBase Example (HBase table) hbase(main):001:0> create 'short_urls', {NAME => 'u'}, {NAME=>'s'} hbase(main):014:0> scan 'short_urls' ROW COLUMN+CELL bit.ly/aaaa column=s:hits, value=100 bit.ly/aaaa column=u:url, value=hbase.apache.org/ bit.ly/abcd column=s:hits, value=123 bit.ly/abcd column=u:url, value=example.com/foo Page 11 Architecting the Future of Big Data
  • 12. © Hortonworks Inc. 2011 Hive + HBase Example (Hive table) CREATE TABLE short_urls( short_url string, url string, hit_count int ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, u:url, s:hits") TBLPROPERTIES ("hbase.table.name" = ”short_urls"); Page 12 Architecting the Future of Big Data
  • 13. © Hortonworks Inc. 2011 Storage Handler • Hive defines HiveStorageHandler class for different storage backends: HBase/ Cassandra / MongoDB/ etc • Storage Handler has hooks for –  Getting input / output formats –  Meta data operations hook: CREATE TABLE, DROP TABLE, etc • Storage Handler is a table level concept –  Does not support Hive partitions, and buckets Page 13 Architecting the Future of Big Data
  • 14. © Hortonworks Inc. 2011 Apache Hive + HBase Architecture Page 14 Architecting the Future of Big Data Metastore RDBMS Hive Thrift Server Driver CLI Hive Web Interface HDFS MapReduce Execution Parser Planner Optimizer M S C l i e n t HBase StorageHandler
  • 15. © Hortonworks Inc. 2011 Hive + HBase Integration • For Input/OutputFormat, getSplits(), etc underlying HBase classes are used • Column selection and certain filters can be pushed down • HBase tables can be used with other(Hadoop native) tables and SQL constructs • Hive DDL operations are converted to HBase DDL operations via the client hook. – All operations are performed by the client – No two phase commit Page 15 Architecting the Future of Big Data
  • 16. © Hortonworks Inc. 2011 Schema / Type Mapping Architecting the Future of Big Data Page 16
  • 17. © Hortonworks Inc. 2011 Schema Mapping • Hive table + columns + column types <=> HBase table + column families (+ column qualifiers) • Every field in Hive table is mapped to either – The table key (using :key as selector) – A column family (cf:) -> MAP fields in Hive – A column (cf:cq) •  Hive table does not need to include all columns in Hbase Page 17 Architecting the Future of Big Data
  • 18. © Hortonworks Inc. 2011 Schema Mapping - Example Page 18 Architecting the Future of Big Data CREATE TABLE short_urls( short_url string, url string, hit_count int, props, map<string,string> ) WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, u:url, s:hits, p:")
  • 19. © Hortonworks Inc. 2011 Schema Mapping - Example Page 19 Architecting the Future of Big Data CREATE TABLE short_urls( short_url string, url string, hit_count int, props map<string,string> ) WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, u:url, s:hits, p:")
  • 20. © Hortonworks Inc. 2011 Type Mapping • Added in Hive (0.9.0) • Previously all types were being converted to strings in HBase • Hive has: – Primitive types: INT, STRING, BINARY, DOUBLE etc – ARRAY<Type> – MAP<PrimitiveType, Type> – STRUCT<a:INT, b:STRING, c:STRING> • HBase does not have types – Bytes.toBytes() Page 20 Architecting the Future of Big Data
  • 21. © Hortonworks Inc. 2011 Type Mapping • Table level property "hbase.table.default.storage.type” = “binary” • Type mapping can be given per column after # – Any prefix of “binary” , eg u:url#b – Any prefix of “string” , eg u:url#s – The dash char “-” , eg u:url#- Page 21
  • 22. © Hortonworks Inc. 2011 Type Mapping - Example Page 22 Architecting the Future of Big Data CREATE TABLE short_urls( short_url string, url string, hit_count int, props, map<string,string> ) WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b,u:url#b,s:hits#b,p:#s")
  • 23. © Hortonworks Inc. 2011 Type Mapping • If the type is not a primitive or Map, it is converted to a JSON string and serialized • Still a few rough edges for schema and type mapping: – No support for DECIMAL, BINARY Hive types – No mapping of HBase timestamp (can only provide put timestamp) – No arbitrary mapping of Structs / Arrays into HBase schema Page 23 Architecting the Future of Big Data
  • 24. © Hortonworks Inc. 2011 Bulk Load • Steps to bulk load: – Sample source data for range partitioning – Save sampling results to a file – Run CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner – Import Hfiles into HBase table • Ideal setup should be SET hive.hbase.bulk=true INSERT OVERWRITE TABLE web_table SELECT …. Page 24 Architecting the Future of Big Data
  • 25. © Hortonworks Inc. 2011 Filter Pushdown Architecting the Future of Big Data Page 25
  • 26. © Hortonworks Inc. 2011 Filter Pushdown • Idea is to pass down filter expressions to the storage layer to minimize scanned data • To access indexes at hdfs or hbase • Example: CREATE EXTERNAL TABLE users (userid LONG, email STRING, … ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler’ WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,…") SELECT ... FROM users WHERE userid > 1000000 and email LIKE ‘%@gmail.com’; -> scan.setStartRow(Bytes.toBytes(1000000)) Page 26 Architecting the Future of Big Data
  • 27. © Hortonworks Inc. 2011 Filter Decomposition • Optimizer pushes down the predicates to the query plan • Storage handlers can negotiate with the Hive optimizer to decompose the filter x > 3 AND upper(y) = 'XYZ’ • Handle x > 3, send upper(y) = ’XYZ’ as residual for Hive • Works with: key = 3, key > 3, etc key > 3 AND key < 100 • Only works against constant expressions Page 27 Architecting the Future of Big Data
  • 28. © Hortonworks Inc. 2011 Future of Hive + HBase • Improve on schema / type mapping • Fully secure Hive deployment options • HBase bulk import improvements • Filter pushdown: non key column filters • Sortable signed numeric types in HBase • Use HBase’s new typing API’s (upcoming in HBase) • Integration with Phoenix / extract common modules, hbase- sql ? Page 28 Architecting the Future of Big Data
  • 29. © Hortonworks Inc. 2011 References • Type mapping / Filter Pushdown – https://issues.apache.org/jira/browse/HIVE-1634 – https://issues.apache.org/jira/browse/HIVE-1226 – https://issues.apache.org/jira/browse/HIVE-1643 – https://issues.apache.org/jira/browse/HIVE-2815 Page 29 Architecting the Future of Big Data
  • 30. © Hortonworks Inc. 2011 Thanks Questions? Architecting the Future of Big Data Page 30

×