1. Integration of Apache Hive and HBase. Enis Soztutar, enis [at] apache [dot] org, @enissoz. Architecting the Future of Big Data. © Hortonworks Inc. 2011

2. About Me
• User and committer of Hadoop since 2007
• Contributor to Apache Hadoop, HBase, Hive and Gora
• Joined Hortonworks as Member of Technical Staff
• Twitter: @enissoz

3. Agenda
• Overview of Hive and HBase
• Hive + HBase Features and Improvements
• Future of Hive and HBase
• Q&A

4. Apache Hive Overview
• Apache Hive is a data warehouse system for Hadoop
• SQL-like query language called HiveQL
• Built for PB scale data
• Main purpose is analysis and ad hoc querying
• Database / table / partition / bucket – DDL operations
• SQL types + complex types (ARRAY, MAP, etc.)
• Very extensible
• Not for: small data sets, low latency queries, OLTP
5. Apache Hive Architecture
[Diagram: clients (CLI, JDBC/ODBC via the Hive Thrift Server, Hive Web Interface) talk to the Driver (Parser, Planner, Optimizer, Execution), which uses the Metastore (backed by an RDBMS) and runs MapReduce jobs over HDFS.]
6. Overview of Apache HBase
• Apache HBase is the Hadoop database
• Modeled after Google's BigTable
• A sparse, distributed, persistent multi-dimensional sorted map
• The map is indexed by a row key, column key, and a timestamp
• Each value in the map is an uninterpreted array of bytes
• Low latency random data access

7. Overview of Apache HBase
• Logical view (from: Bigtable: A Distributed Storage System for Structured Data, Chang, et al.)
8. Apache HBase Architecture
[Diagram: clients coordinate through Zookeeper and the HMaster; data is served by region servers, each hosting several regions, all backed by HDFS.]
9. Hive + HBase Features and Improvements
10. Hive + HBase Motivation
• Hive and HBase have different characteristics: high latency vs. low latency, structured vs. unstructured, analysts vs. programmers
• Hive data warehouses on Hadoop are high latency
  – Long ETL times
  – Access to real time data
• Analyzing HBase data with MapReduce requires custom coding
• Hive and SQL are already known by many analysts
11. Use Case 1: HBase as ETL Data Sink
From HUG - Hive/HBase Integration or, MaybeSQL? April 2010, John Sichi, Facebook: http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010

12. Use Case 2: HBase as Data Source
From HUG - Hive/HBase Integration or, MaybeSQL? April 2010, John Sichi, Facebook: http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010

13. Use Case 3: Low Latency Warehouse
From HUG - Hive/HBase Integration or, MaybeSQL? April 2010, John Sichi, Facebook: http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010
14. Example: Hive + HBase (HBase table)

hbase(main):001:0> create 'short_urls', {NAME => 'u'}, {NAME => 's'}
hbase(main):014:0> scan 'short_urls'
ROW          COLUMN+CELL
 bit.ly/aaaa column=s:hits, value=100
 bit.ly/aaaa column=u:url, value=hbase.apache.org/
 bit.ly/abcd column=s:hits, value=123
 bit.ly/abcd column=u:url, value=example.com/foo
15. Example: Hive + HBase (Hive table)

CREATE TABLE short_urls(
  short_url string,
  url string,
  hit_count int
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key, u:url, s:hits")
TBLPROPERTIES
("hbase.table.name" = "short_urls");
16. Storage Handler
• Hive defines the HiveStorageHandler class for different storage backends: HBase / Cassandra / MongoDB / etc.
• The storage handler has hooks for
  – Getting input / output formats
  – Metadata operations: CREATE TABLE, DROP TABLE, etc.
• The storage handler is a table level concept
  – Does not support Hive partitions and buckets
17. Apache Hive + HBase Architecture
[Diagram: the Hive architecture from slide 5, with a StorageHandler layer between the execution engine and MapReduce/HBase; HBase sits alongside HDFS and the Metastore RDBMS.]
18. Hive + HBase Integration
• For Input/OutputFormat, getSplits(), etc., the underlying HBase classes are used
• Column selection and certain filters can be pushed down
• HBase tables can be used with other (Hadoop native) tables and SQL constructs
• Hive DDL operations are converted to HBase DDL operations via the client hook
  – All operations are performed by the client
  – No two phase commit
19. Schema / Type Mapping
20. Schema Mapping
• Hive table + columns + column types <=> HBase table + column families (+ column qualifiers)
• Every field in the Hive table is mapped, in order, to either
  – the table key (using :key as selector)
  – a column family (cf:) -> MAP fields in Hive
  – a column (cf:cq)
• The Hive table does not need to include all columns in HBase
• Example:

CREATE TABLE short_urls(
  short_url string,
  url string,
  hit_count int,
  props map<string,string>
)
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key, u:url, s:hits, p:")
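To make the ordered mapping concrete, here is a small Python sketch (hypothetical; Hive's actual implementation is the Java HBaseSerDe) that pairs a Hive row's fields, in declaration order, with the targets named in an "hbase.columns.mapping" string:

```python
# Illustrative sketch (not Hive's code) of "hbase.columns.mapping":
# each Hive field maps, in order, to the row key (:key), a whole
# column family ("cf:", a Hive MAP), or a single column ("cf:cq").
def map_row(columns_mapping, hive_row):
    """hive_row is a list of Hive field values, in declaration order."""
    specs = [s.strip() for s in columns_mapping.split(",")]
    assert len(specs) == len(hive_row)
    key, cells = None, {}
    for spec, value in zip(specs, hive_row):
        if spec == ":key":
            key = value                      # row key
        elif spec.endswith(":"):             # whole family <- Hive MAP
            for qualifier, v in value.items():
                cells[spec + qualifier] = v
        else:                                # single column cf:cq
            cells[spec] = value
    return key, cells

key, cells = map_row(
    ":key, u:url, s:hits, p:",
    ["bit.ly/aaaa", "hbase.apache.org/", 100, {"owner": "enis"}],
)
print(key)    # bit.ly/aaaa
print(cells)  # {'u:url': 'hbase.apache.org/', 's:hits': 100, 'p:owner': 'enis'}
```

Note how the family spec p: absorbs the Hive MAP field: each map entry becomes one cell under that family.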
21. Type Mapping
• Recently added to Hive (0.9.0)
• Previously all types were converted to strings in HBase
• Hive has:
  – Primitive types: INT, STRING, BINARY, DATE, etc.
  – ARRAY<Type>
  – MAP<PrimitiveType, Type>
  – STRUCT<a:INT, b:STRING, c:STRING>
• HBase does not have types; every value is a byte array (Bytes.toBytes())
22. Type Mapping
• Table level property "hbase.table.default.storage.type" = "binary"
• Type mapping can be given per column after #
  – Any prefix of "binary", e.g. u:url#b
  – Any prefix of "string", e.g. u:url#s
  – The dash char "-", e.g. u:url#-

CREATE TABLE short_urls(
  short_url string,
  url string,
  hit_count int,
  props map<string,string>
)
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key#b,u:url#b,s:hits#b,p:#s")
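What the two storage types mean for an INT column can be sketched in plain Python (an illustration only; in Hive/HBase the binary form comes from Bytes.toBytes in Java): string storage writes the decimal text, binary storage writes 4 big-endian bytes. Raw two's-complement bytes do not order negative numbers correctly, which is one reason "sortable signed numeric types" appears later as future work.

```python
import struct

def encode_int(value, storage_type):
    # "string": decimal text, as pre-0.9.0 Hive always stored values.
    # "binary": 4-byte big-endian, like HBase's Bytes.toBytes(int).
    if storage_type == "string":
        return str(value).encode("utf-8")
    return struct.pack(">i", value)

print(encode_int(100, "string"))   # b'100'
print(encode_int(100, "binary"))   # b'\x00\x00\x00d'

# Lexicographic byte order matches numeric order for non-negative ints...
assert encode_int(2, "binary") < encode_int(100, "binary")
# ...while string encoding does not ("100" sorts before "2"):
assert encode_int(100, "string") < encode_int(2, "string")
# ...and raw two's complement sorts negatives after positives:
assert encode_int(-1, "binary") > encode_int(1, "binary")
```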
23. Type Mapping
• If the type is not a primitive or MAP, it is converted to a JSON string and serialized
• Still a few rough edges for schema and type mapping:
  – No Hive BINARY support in HBase mapping
  – No mapping of HBase timestamp (can only provide put timestamp)
  – No arbitrary mapping of STRUCTs / ARRAYs into HBase schema
24. Bulk Load
• Steps to bulk load:
  – Sample source data for range partitioning
  – Save sampling results to a file
  – Run a CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner
  – Import the HFiles into the HBase table
• Ideal setup should be:
  SET hive.hbase.bulk=true;
  INSERT OVERWRITE TABLE web_table SELECT …
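The sampling step above can be sketched as follows (a hypothetical Python illustration; the real pipeline feeds such split points to TotalOrderPartitioner): from a sorted sample of source row keys, choose num_regions - 1 evenly spaced split points so each reducer produces the HFile for one key range.

```python
# Hypothetical sketch of the sampling step: given sampled row keys,
# pick num_regions - 1 evenly spaced split points so the CLUSTER BY
# job writes one range-partitioned HFile per target region.
def split_points(sampled_keys, num_regions):
    keys = sorted(sampled_keys)
    step = len(keys) / num_regions
    return [keys[int(step * i)] for i in range(1, num_regions)]

sample = [f"bit.ly/{i:04d}" for i in range(100)]
print(split_points(sample, 4))
# ['bit.ly/0025', 'bit.ly/0050', 'bit.ly/0075']
```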
25. Filter Pushdown
26. Filter Pushdown
• The idea is to pass filter expressions down to the storage layer to minimize scanned data
• To access indexes at HDFS or HBase
• Example:

CREATE EXTERNAL TABLE users (userid LONG, email STRING, …)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,…")

SELECT ... FROM users WHERE userid > 1000000 AND email LIKE '%@gmail.com';
-> scan.setStartRow(Bytes.toBytes(1000000))
27. Filter Decomposition
• The optimizer pushes down the predicates to the query plan
• Storage handlers can negotiate with the Hive optimizer to decompose the filter:
  x > 3 AND upper(y) = 'XYZ'
• The handler handles x > 3 and sends upper(y) = 'XYZ' back as a residual for Hive
• Works with: key = 3, key > 3, etc., and key > 3 AND key < 100
• Only works against constant expressions
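A simplified Python model of this negotiation (not Hive's actual API, which does predicate decomposition in the Java storage handler): split the conjunction into pushable key-vs-constant comparisons and a residual that Hive evaluates after the scan.

```python
# Simplified model (not Hive's code) of filter decomposition: keep
# simple "key <op> constant" conjuncts for the storage layer and
# return everything else as a residual for Hive to evaluate.
import re

PUSHABLE = re.compile(r"^\s*key\s*(<=|>=|=|<|>)\s*\d+\s*$")

def decompose(filter_expr):
    conjuncts = filter_expr.split(" AND ")
    pushed = [c for c in conjuncts if PUSHABLE.match(c)]
    residual = [c for c in conjuncts if not PUSHABLE.match(c)]
    return pushed, residual

pushed, residual = decompose("key > 3 AND upper(y) = 'XYZ'")
print(pushed)    # ['key > 3']
print(residual)  # ["upper(y) = 'XYZ'"]
```

Only constant comparisons against the key are kept, mirroring the "only works against constant expressions" restriction above.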
28. Security Aspects: towards fully secure deployments
29. Security – Big Picture
• Security becomes more important to support enterprise level and multi-tenant applications
• 5 different components to ensure / impose security:
  – HDFS
  – MapReduce
  – HBase
  – Zookeeper
  – Hive
• Each component has:
  – Authentication
  – Authorization
30. HBase Security – Closer Look
• Released with HBase 0.92
• Fully optional module, disabled by default
• Needs an underlying secure Hadoop release
• SecureRPCEngine: optional engine enforcing SASL authentication
  – Kerberos
  – DIGEST-MD5 based tokens
  – TokenProvider coprocessor
• Access control is implemented as a coprocessor: AccessController
• Stores and distributes ACL data via Zookeeper
  – Sensitive data is only accessible by HBase daemons
  – The client does not need to authenticate to Zookeeper
31. Hive Security – Closer Look
• Hive has different deployment options; security considerations should take the different deployments into account
• Authentication is only supported at the Metastore, not on HiveServer, the web interface, or JDBC
• Authorization is enforced at the query layer (Driver)
• Pluggable authorization providers; the default one stores global/table/partition/column permissions in the Metastore

GRANT ALTER ON TABLE web_table TO USER bob;
CREATE ROLE db_reader;
GRANT SELECT, SHOW_DATABASE ON DATABASE mydb TO ROLE db_reader;
32. Hive Deployment Option 1
[Diagram: CLI and Driver (Parser, Planner, Optimizer, Execution) run on the client; authorization and authentication happen at the Metastore; the client talks to MapReduce, HBase, HDFS, and the Metastore RDBMS, each enforcing its own authentication and authorization.]

33. Hive Deployment Option 2
[Diagram: the same components as Option 1, with the Metastore deployed as a separate service.]

34. Hive Deployment Option 3
[Diagram: clients connect over JDBC/ODBC to a Hive Thrift Server or Hive Web Interface hosting the Driver; authorization and authentication happen at the Metastore; the server talks to MapReduce, HBase, HDFS, and the Metastore RDBMS, each enforcing its own authentication and authorization.]
35. Hive + HBase + Hadoop Security
• Regardless of Hive's own security, for Hive to work on secure Hadoop and HBase, we should:
  – Obtain delegation tokens for Hadoop and HBase jobs
  – Obey the storage level (HDFS, HBase) permission checks
  – In HiveServer deployments, authenticate and impersonate the user
• Delegation tokens for Hadoop are already working
• Obtaining HBase delegation tokens was released in Hive 0.9.0
36. Future of Hive + HBase
• Improve on schema / type mapping
• Fully secure Hive deployment options
• HBase bulk import improvements
• Sortable signed numeric types in HBase
• Filter pushdown: non-key column filters
• Hive random access support for HBase
  – https://cwiki.apache.org/HCATALOG/random-access-framework.html
37. References
• Security
  – https://issues.apache.org/jira/browse/HIVE-2764
  – https://issues.apache.org/jira/browse/HBASE-5371
  – https://issues.apache.org/jira/browse/HCATALOG-245
  – https://issues.apache.org/jira/browse/HCATALOG-260
  – https://issues.apache.org/jira/browse/HCATALOG-244
  – https://cwiki.apache.org/confluence/display/HCATALOG/Hcat+Security+Design
• Type mapping / Filter pushdown
  – https://issues.apache.org/jira/browse/HIVE-1634
  – https://issues.apache.org/jira/browse/HIVE-1226
  – https://issues.apache.org/jira/browse/HIVE-1643
  – https://issues.apache.org/jira/browse/HIVE-2815
38. Other Resources
• Hadoop Summit
  – June 13-14, San Jose, California
  – www.hadoopsummit.org
• Hadoop Training and Certification
  – Developing Solutions Using Apache Hadoop
  – Administering Apache Hadoop
  – Online classes available in the US, India, EMEA
  – http://hortonworks.com/training/
39. Thanks! Questions?