Mar 2012 HUG: Hive with HBase

Apache Hive and HBase are very popular projects in the Hadoop ecosystem. Using Hive with HBase was made possible by contributions from Facebook around 2010. In this talk, we will go over the details of how the integration works, and talk about recent improvements. Specifically, we will cover the basic architecture, schema and data type mappings, and recent filter pushdown optimizations. We will also go into detail about the security aspects of Hadoop/HBase related to Hive setups.

Presentation Transcript

  • Using Apache Hive with HBase, and Recent Improvements
    Enis Soztutar, enis [at] apache [dot] org, @enissoz
    © Hortonworks Inc. 2011
  • Outline of the talk
    • Apache Hive 101
    • Apache HBase 101
    • Hive + HBase Motivation
    • Schema Mapping
    • Type Mapping
    • Bulk Load
    • Filter Pushdown
    • Security Aspects
    • Future Work
  • Apache Hive 101
    • Hive is a data warehouse system for Hadoop
    • SQL-like query language called HiveQL
    • Built for PB-scale data
    • Main purpose is analysis and ad hoc querying
    • Database / table / partition / bucket – DDL operations
    • SQL types + complex types (ARRAY, MAP, etc.)
    • Very extensible
    • Not for: small data sets, low-latency queries, OLTP
  • Apache Hive Architecture
    [Diagram: JDBC/ODBC clients, the Hive Thrift Server, the Hive Web Interface, and the CLI all talk to the Driver (Parser, Planner, Optimizer, Execution); the Metastore client connects to the Metastore, backed by an RDBMS; query execution runs as MapReduce over HDFS.]
  • Apache HBase 101
    • Apache HBase is the Hadoop database
    • Modeled after Google’s BigTable
    • A sparse, distributed, persistent multi-dimensional sorted map
    • The map is indexed by a row key, column key, and a timestamp
    • Each value in the map is an uninterpreted array of bytes
    • Low-latency random data access
    • Logical view: from “Bigtable: A Distributed Storage System for Structured Data”, Chang et al.
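
    The logical view above maps directly onto the client API. A minimal sketch with the HBase 0.92-era Java client (the table, row, and column names here are made up for illustration): each cell is addressed by (row key, column family:qualifier, timestamp) and holds plain bytes.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.Get;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.Put;
      import org.apache.hadoop.hbase.client.Result;
      import org.apache.hadoop.hbase.util.Bytes;

      public class HBaseMapModel {
        public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTable table = new HTable(conf, "short_urls"); // hypothetical table

          // Write: the cell is addressed by (row key, family:qualifier);
          // the timestamp dimension defaults to the current time.
          Put put = new Put(Bytes.toBytes("bit.ly/abc"));
          put.add(Bytes.toBytes("u"), Bytes.toBytes("url"),
                  Bytes.toBytes("http://example.com"));
          table.put(put);

          // Read: the returned value is an uninterpreted array of bytes.
          Get get = new Get(Bytes.toBytes("bit.ly/abc"));
          Result result = table.get(get);
          byte[] raw = result.getValue(Bytes.toBytes("u"), Bytes.toBytes("url"));
          System.out.println(Bytes.toString(raw));

          table.close();
        }
      }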
  • Apache HBase Architecture
    [Diagram: clients talk to the HMaster and to RegionServers, coordinated through ZooKeeper; each RegionServer hosts multiple Regions, persisted on DFS.]
  • Hive + HBase
    HBase goes SQL
  • Hive + HBase motivation
    • Hive data warehouses on Hadoop are high latency
      – Long ETL times
      – No access to real-time data
    • Analyzing HBase data with MapReduce requires custom coding
    • Hive and SQL are already known by many analysts
  • Use Case 1: HBase as ETL data sink
    From “HUG – Hive/HBase Integration or, MaybeSQL?”, April 2010, John Sichi, Facebook
    http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010
  • Use Case 2: HBase as data source
    From “HUG – Hive/HBase Integration or, MaybeSQL?”, April 2010, John Sichi, Facebook
    http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010
  • Use Case 3: Low-latency warehouse
    From “HUG – Hive/HBase Integration or, MaybeSQL?”, April 2010, John Sichi, Facebook
    http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010
  • Hive + HBase Example

      CREATE TABLE short_urls(
        short_url string,
        url string,
        hit_count int
      )
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES (
        "hbase.columns.mapping" = ":key,u:url,s:hits"
      )
      TBLPROPERTIES ("hbase.table.name" = "short_urls");
  • Storage Handler
    • Hive defines the HiveStorageHandler class for different storage backends: HBase / Cassandra / MongoDB / etc.
    • The storage handler has hooks for:
      – getInput/OutputFormat()
      – getSerde()
      – configureTableJobProperties()
      – Metadata operations hook: CREATE TABLE, DROP TABLE, etc.
      – getAuthorizationProvider()
    • A storage handler is a table-level concept; it does not support Hive partitions and buckets
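
    For reference, the hooks above roughly correspond to the following interface shape (a paraphrase from memory of the Hive 0.8/0.9-era source; treat exact names, import paths, and signatures as approximate):

      import java.util.Map;
      import org.apache.hadoop.conf.Configurable;
      import org.apache.hadoop.hive.metastore.HiveMetaHook;
      import org.apache.hadoop.hive.ql.plan.TableDesc;
      import org.apache.hadoop.hive.ql.security.authorization.HiveAuthorizationProvider;
      import org.apache.hadoop.hive.serde2.SerDe;
      import org.apache.hadoop.mapred.InputFormat;
      import org.apache.hadoop.mapred.OutputFormat;

      // Approximate shape of HiveStorageHandler; backends like HBase plug in
      // by supplying their own formats, SerDe, and metadata/auth hooks.
      public interface HiveStorageHandler extends Configurable {
        Class<? extends InputFormat> getInputFormatClass();    // how rows are read
        Class<? extends OutputFormat> getOutputFormatClass();  // how rows are written
        Class<? extends SerDe> getSerDeClass();                // row <-> object mapping
        HiveMetaHook getMetaHook();                            // CREATE/DROP TABLE callbacks
        HiveAuthorizationProvider getAuthorizationProvider();  // pluggable authorization
        void configureTableJobProperties(TableDesc tableDesc,
                                         Map<String, String> jobProperties);
      }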
  • Apache Hive + HBase Architecture
    [Diagram: the Hive architecture from the earlier slide, with a StorageHandler layer between the execution engine and the storage backends, so that queries run against HBase as well as MapReduce over HDFS.]
  • Hive + HBase
    • For InputFormat/OutputFormat, getSplits(), etc., the underlying HBase classes are used
    • Column selection and certain filters can be pushed down
    • HBase tables can be used together with other (Hadoop-native) tables and SQL constructs
    • Hive DDL operations are converted to HBase DDL operations via the client hook (sketched below)
      – All operations are performed by the client
      – No two-phase commit
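
    The client hook mentioned in the last bullet is the metastore's HiveMetaHook callback, sketched below as recalled from the Hive metastore API of that era (signatures approximate). Hive invokes it around its own metastore operations, which is why everything runs client-side with no two-phase commit.

      import org.apache.hadoop.hive.metastore.api.MetaException;
      import org.apache.hadoop.hive.metastore.api.Table;

      // Approximate shape of HiveMetaHook. For the HBase handler, preCreateTable
      // creates (or verifies) the HBase table; the rollback methods try to undo
      // the backend change if the metastore side fails -- best effort only.
      public interface HiveMetaHook {
        void preCreateTable(Table table) throws MetaException;
        void rollbackCreateTable(Table table) throws MetaException;
        void commitCreateTable(Table table) throws MetaException;
        void preDropTable(Table table) throws MetaException;
        void rollbackDropTable(Table table) throws MetaException;
        void commitDropTable(Table table) throws MetaException;
      }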
  • Schema / Type Mapping
  • Schema Mapping
    • A Hive schema has table + columns + column types; HBase has table + column families (+ column qualifiers)
    • Hive’s table schema is mapped by: ("hbase.columns.mapping" = ":key,u:url,s:hits")
    • Every field in the Hive table is mapped, in order, to either:
      – The table key (using :key as selector)
      – A column family (cf:)
      – A column (cf:cq)
    • An HBase column family can only be mapped to the Hive MAP type
    • The Hive table does not need to include all columns in HBase
    • MAP<key, value> is converted to cf:key -> value
  • Type Mapping
    • Recently added to Hive
    • Previously all types were converted to strings in HBase
    • Hive has:
      – Primitive types: INT, STRING, BINARY, DATE, etc.
      – ARRAY<Type>
      – MAP<PrimitiveType, Type>
      – STRUCT<a:INT, b:STRING, c:STRING>
      – UNIONTYPE<INT, STRING>
    • HBase does not have types – everything goes through Bytes.toBytes()
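
    To make the string-vs-binary distinction on the next slide concrete, here is a small standalone sketch comparing the two serializations of the same value with HBase's Bytes utility:

      import org.apache.hadoop.hbase.util.Bytes;

      public class TypeMappingDemo {
        public static void main(String[] args) {
          int hits = 1000000;

          // "string" storage: the UTF-8 text of the value -- 7 bytes here.
          byte[] asString = Bytes.toBytes(Integer.toString(hits));

          // "binary" storage: a 4-byte big-endian int.
          byte[] asBinary = Bytes.toBytes(hits);

          System.out.println(asString.length); // 7
          System.out.println(asBinary.length); // 4

          // Big-endian binary encoding also preserves numeric order for
          // non-negative ints, which is what makes key range scans line up
          // with numeric comparisons.
        }
      }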
  • Type Mapping
    • Table-level property: "hbase.table.default.storage.type" = "binary"
    • Per column: ("hbase.columns.mapping" = ":key#binary,u:url#binary,s:hits#binary")
    • The type mapping is given after #, and can be:
      – Any prefix of “binary”, e.g. u:url#b
      – Any prefix of “string”, e.g. u:url#s
      – The dash char “-”, e.g. u:url#-
    • String means use UTF-8 serialization
    • Binary means use binary serialization for primitive types and maps
    • Dash means use the table-level default
  • Type Mapping
    • If the type is not a primitive or MAP, it is converted to a JSON string and serialized
    • Still a few rough edges for schema and type mapping:
      – No Hive BINARY support in the HBase mapping
      – No mapping of the HBase timestamp (can only provide the put timestamp)
      – No arbitrary mapping of STRUCTs / ARRAYs into the HBase schema
  • Bulk Load
    • Steps to bulk load:
      – Sample source data for range partitioning
      – Save sampling results to a file
      – Run a CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner
      – Import the HFiles into the HBase table (see the sketch below)
    • The ideal setup would be:

      SET hive.hbase.bulk=true;
      INSERT OVERWRITE TABLE web_table SELECT ...;
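
    For the final import step, a sketch of driving HBase's bulk-load tool from Java (the HFile directory is a hypothetical output path of the CLUSTER BY query):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

      public class BulkImport {
        public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTable table = new HTable(conf, "web_table");

          // Moves the prepared HFiles directly into the table's regions,
          // bypassing the RegionServer write path entirely.
          LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
          loader.doBulkLoad(new Path("/tmp/web_table_hfiles"), table); // hypothetical path
        }
      }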
  • Filter Pushdown
  • Filter Pushdown
    • The idea is to pass filter expressions down to the storage layer, to minimize scanned data
    • Use cases:
      – Pushing filters into access plans, so that indexes can be used
      – Pushing filters to RCFiles
      – Pushing filters to storage handlers (HBase)
    • Example:

      CREATE EXTERNAL TABLE users (userid LONG, email STRING, ...)
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,...");

      SELECT ... FROM users
      WHERE userid > 1000000 AND email LIKE '%@gmail.com';
      -> scan.setStartRow(Bytes.toBytes(1000000))
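
    On the HBase side, the pushed key predicate becomes a scan bound, roughly as in this sketch (the residual email filter is left for Hive to evaluate):

      import org.apache.hadoop.hbase.client.Scan;
      import org.apache.hadoop.hbase.util.Bytes;

      public class PushdownScan {
        public static Scan forUseridGreaterThan(long userid) {
          Scan scan = new Scan();
          // The row-key predicate (userid > 1000000) becomes the scan's start
          // row, so HBase never returns the earlier rows to Hive at all.
          scan.setStartRow(Bytes.toBytes(userid));
          // email LIKE '%@gmail.com' cannot be expressed as a key range; it
          // stays in the plan as a residual predicate applied by Hive.
          return scan;
        }
      }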
  • Filter Decomposition
    • The optimizer pushes the predicates down to the query plan
    • Storage handlers can negotiate with the Hive optimizer to decompose the filter:
      – x > 3 AND upper(y) = 'XYZ'
      – Handle x > 3; send upper(y) = 'XYZ' as a residual for Hive to deal with
    • The optional HiveStoragePredicateHandler interface defines decomposePredicate(…, ExprNodeDesc predicate)
    • Works with key = 3, key > 3, etc.
    • key > 3 AND key < 100 is in Patch Available
    • Only works against constant expressions
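
    The negotiation interface looks roughly like this (recalled from the Hive source of that era; treat the import paths and signatures as approximate):

      import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
      import org.apache.hadoop.hive.serde2.Deserializer;
      import org.apache.hadoop.mapred.JobConf;

      // The handler receives the full predicate and splits it into the part
      // it can evaluate natively and the part Hive must still apply.
      public interface HiveStoragePredicateHandler {
        DecomposedPredicate decomposePredicate(JobConf jobConf,
                                               Deserializer deserializer,
                                               ExprNodeDesc predicate);

        public static class DecomposedPredicate {
          public ExprNodeDesc pushedPredicate;   // e.g., x > 3, becomes a scan range
          public ExprNodeDesc residualPredicate; // e.g., upper(y) = 'XYZ'
        }
      }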
  • Security Aspects
    Towards fully secure deployments
  • Security – Big Picture
    • Security becomes more important to support enterprise-level and multi-tenant applications
    • 5 different components to ensure / impose security:
      – HDFS
      – MapReduce
      – HBase
      – ZooKeeper
      – Hive
    • Each component has:
      – Authentication
      – Authorization
    • What about HCatalog?
  • Security Components
    • HDFS
      – Authentication: Kerberos; NN + block access tokens
      – Authorization: POSIX-like file permissions
    • MapReduce
      – Authentication: Kerberos; tokens stored in the Job; MapReduce delegation tokens (Oozie)
      – Authorization: job / queue ACLs; service-level ACLs
    • HBase
      – Authentication: Kerberos; HBase delegation tokens
      – Authorization: global / table / CF / column-level ACLs
    • ZooKeeper
      – Authentication: pluggable; Kerberos authentication (SASL)
      – Authorization: znode ACLs
    • Hive
      – Authentication: metastore (MS) authentication via Kerberos; MS delegation tokens
      – Authorization: pluggable; MySQL-like role-based access control
  • HBase Security – Closer Look
    • Released with HBase 0.92
    • Fully optional module, disabled by default
    • Needs an underlying secure Hadoop release
    • SecureRPCEngine: optional engine enforcing SASL authentication
      – Kerberos
      – DIGEST-MD5 based tokens
      – TokenProvider coprocessor
    • Access control is implemented as a coprocessor: AccessController
    • Stores and distributes ACL data via ZooKeeper
      – Sensitive data is only accessible by HBase daemons
      – Clients do not need to authenticate to ZK
  • HBase Security – Closer Look
    • ACLs can be defined at the global, table, column-family, or column-qualifier level, for users and groups
    • No roles yet
    • There are 5 actions (privileges): READ, WRITE, EXEC, CREATE, and ADMIN (RWXCA)

      grant bobsmith, RW, t1 [,f1][,col1]
      revoke bobsmith, t1 [,f1][,col1]
      user_permission table1
  • HBase Security – Closer Look
    • create/drop/alter table operations are associated with global-level CREATE/DROP or ADMIN permissions
    • put/get/scan operations are checked per table/CF/CQ (restated as code after this list):
      1. All users need read access to the .META. and -ROOT- tables
      2. The table owner has full privileges
      3. Check at the table level; if successful, we can short-circuit
      4. Check permissions against the requested families; for all families:
         a) check for family-level access
         b) if no family-level grant is found, check for qualifier-level access
      5. If there are no families to check and table-level access failed, deny the request
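
    The same cascade, restated as code (a simplified sketch with stand-in types and abstract permission lookups, not the real AccessController internals):

      import java.util.Map;
      import java.util.Set;

      // Steps 2-5 from the list above, for a put/get/scan request.
      abstract class AccessCheckSketch {
        abstract boolean isTableOwner(String user, String table);
        abstract boolean hasTablePerm(String user, String table, char action);
        abstract boolean hasFamilyPerm(String user, String table, String family, char action);
        abstract boolean hasQualifierPerm(String user, String table, String family,
                                          String qualifier, char action);

        boolean permitted(String user, String table,
                          Map<String, Set<String>> requestedFamilies, char action) {
          if (isTableOwner(user, table)) return true;         // 2: owner has full privileges
          if (hasTablePerm(user, table, action)) return true; // 3: table-level short-circuit
          if (requestedFamilies.isEmpty()) return false;      // 5: nothing finer to check
          for (Map.Entry<String, Set<String>> e : requestedFamilies.entrySet()) {
            if (hasFamilyPerm(user, table, e.getKey(), action)) continue;   // 4a
            for (String qualifier : e.getValue()) {                         // 4b
              if (!hasQualifierPerm(user, table, e.getKey(), qualifier, action)) return false;
            }
          }
          return true;
        }
      }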
  • Hive Security – Closer Look
    • Hive has different deployment options; security considerations should take the different deployments into account
    • Authentication is only supported at the Metastore, not on HiveServer, the web interface, or JDBC
    • Authorization is enforced at the query layer (Driver)
    • Pluggable authorization providers; the default one stores global/table/partition/column permissions in the metastore:

      GRANT ALTER ON TABLE web_table TO USER bob;
      CREATE ROLE db_reader;
      GRANT SELECT, SHOW_DATABASE ON DATABASE mydb TO ROLE db_reader;
  • Hive Deployment Option 1
    [Diagram: the client runs the CLI directly against the Driver (Parser, Planner, Optimizer, Execution); authorization is enforced at the Driver and authentication at the Metastore, while MapReduce/HDFS, HBase, and the metastore RDBMS each apply their own authentication/authorization.]
  • Hive Deployment Option 2
    [Diagram: the same components as Option 1 with a different client/trust boundary; only the component list survives in the transcript.]
  • Hive Deployment Option 3
    [Diagram: clients connect via JDBC/ODBC to the Hive Thrift Server or the Hive Web Interface in front of the Driver; as before, authorization at the Driver, authentication at the Metastore, and storage-level authentication/authorization underneath.]
  • Hive + HBase + Hadoop Security
    • Regardless of Hive’s own security, for Hive to work on secure Hadoop and HBase, we should:
      – Obtain delegation tokens for Hadoop and HBase jobs
      – Obey the storage-level (HDFS, HBase) permission checks
      – In HiveServer deployments, authenticate and impersonate the user
    • Delegation tokens for Hadoop are already working; obtaining HBase delegation tokens is in “Patch Available” state
    • We should keep metadata and data permissions in sync
    • Proposal: StorageHandlerAuthorizationProviders
      – HdfsAuthorizationProvider
      – HBaseAuthorizationProvider
  • Storage Handler Authorization Providers
    • Allow read/write/modify access to metadata only if the user has access to the underlying data
    • The warehouse admin is only concerned with access control at the data layer; metadata access control is obtained for free
    • Treat HDFS as a HiveStorageHandler
    • HDFS and HBase have different ACL models, but both can be mapped to Hive
  • Future Work
    • Improve on schema / type mapping
    • Fully secure Hive deployment options
    • HBase bulk import improvements
    • Filter pushdown: non-key column filters, BETWEEN 10 AND 20 (Patch Available)
    • Hive random access support for HBase
      – https://cwiki.apache.org/HCATALOG/random-access-framework.html
  • References
    • Security
      – https://issues.apache.org/jira/browse/HIVE-2764
      – https://issues.apache.org/jira/browse/HBASE-5371
      – https://issues.apache.org/jira/browse/HCATALOG-245
      – https://issues.apache.org/jira/browse/HCATALOG-260
      – https://issues.apache.org/jira/browse/HCATALOG-244
      – https://cwiki.apache.org/confluence/display/HCATALOG/Hcat+Security+Design
    • Type mapping / Filter pushdown
      – https://issues.apache.org/jira/browse/HIVE-1634
      – https://issues.apache.org/jira/browse/HIVE-1226
      – https://issues.apache.org/jira/browse/HIVE-1643
      – https://issues.apache.org/jira/browse/HIVE-2815
    • Misc
      – https://issues.apache.org/jira/browse/HIVE-2748
  • Thanks! Questions?
    © Hortonworks Inc. 2011