SlideShare a Scribd company logo
1 of 39
Using Apache Hive
with HBase
And Recent Improvements
Enis Soztutar
enis [at] apache [dot] org
@enissoz




© Hortonworks Inc. 2011      Page 1
Outline of the talk
• Apache Hive 101
• Apache Hbase 101
• Hive + Hbase Motivation
• Schema Mapping
• Type Mapping
• Bulk Load
• Filter Pushdown
• Security Aspects
• Future Work




     Architecting the Future of Big Data
                                           Page 2
     © Hortonworks Inc. 2011
Apache Hive 101
• Hive is a data warehouse system for Hadoop
• SQL-like query language called HiveQL
• Built for PB scale data
• Main purpose is analysis and ad hoc querying
• Database / table / partition / bucket – DDL Operations
• SQL Types + Complex Types (ARRAY, MAP, etc)
• Very extensible
• Not for : small data sets, low latency queries, OLTP




      Architecting the Future of Big Data
                                                       Page 3
      © Hortonworks Inc. 2011
Apache Hive Architecture
                                     JDBC/
                                     ODBC


                                 Hive Thrift       Hive Web
     CLI
                                  Server           Interface

                                                               M
  Driver                                                       S
             Parser                            Planner         C
                                                               li   Metastore
                                                               e
          Execution                            Optimizer       n
                                                               t
    MapReduce

                                      HDFS                           RDBMS

     Architecting the Future of Big Data
                                                                                Page 4
     © Hortonworks Inc. 2011
Apache HBase101
•  Apache Hbase is the Hadoop database
•  Modeled after Google’s BigTable
•  A sparse, distributed, persistent multi- dimensional sorted map.
•  The map is indexed by a row key, column key, and a timestamp.
•  Each value in the map is an uninterpreted array of bytes.
•  Low latency random data access
•  Logical view:




                                             From: Bigtable: A Distributed Storage System for Structured Data, Chang, et al.




       Architecting the Future of Big Data
                                                                                                                               Page 5
       © Hortonworks Inc. 2011
Apache HBase Architecture

   Client

                                               HMaster


                                                                   Zooke
Region                          Region                   Region     eper
server                          server                   server
 Region                              Region               Region


 Region                              Region               Region



                                                  DFS

         Architecting the Future of Big Data
                                                                           Page 6
         © Hortonworks Inc. 2011
Hive + HBase
HBase goes SQL




Architecting the Future of Big Data
                                      Page 7
© Hortonworks Inc. 2011
Hive + HBase motivation
• Hive datawarehouses on Hadoop are high latency
   – Long ETL times
   – Access to real time data
• Analyzing HBase data with MapReduce requires
  custom coding
• Hive and SQL are already known by many analysts




     Architecting the Future of Big Data
                                                    Page 8
     © Hortonworks Inc. 2011
Use Case 1: Hbase as ETL Data Sink




From HUG - Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook
http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010

            Architecting the Future of Big Data
                                                                                 Page 9
            © Hortonworks Inc. 2011
Use Case 2: Hbase as data source




From HUG - Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook
http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010

            Architecting the Future of Big Data
                                                                                 Page 10
            © Hortonworks Inc. 2011
Use Case 3: Low latency warehouse




From HUG - Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook
http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010

            Architecting the Future of Big Data
                                                                                 Page 11
            © Hortonworks Inc. 2011
Hive + HBase Example
CREATE TABLE short_urls(
   short_url string,
   url string,
   hit_count int
)
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key, u:url, s:hits")

TBLPROPERTIES
("hbase.table.name" = ”short_urls");


     Architecting the Future of Big Data
                                                     Page 12
     © Hortonworks Inc. 2011
Storage Handler
• Hive defines HiveStorageHandler class for different
  storage backends: Hbase/ Cassandra / MongoDB/ etc
• Storage Handler has hooks for
  – getInput/OutputFormat()
  – getSerde()
  – configureTableJobProperties()
  – Meta data operations hook: CREATE TABLE, DROP
    TABLE, etc
  – getAuthorizationProvider()
• Storage Handler is a table level concept. Does not
  support Hive partitions, and buckets.


     Architecting the Future of Big Data
                                                       Page 13
     © Hortonworks Inc. 2011
Apache Hive+HBase Architecture
                                 Hive Thrift       Hive Web
     CLI
                                  Server           Interface

                                                               M
  Driver                                                       S
             Parser                            Planner         C
                                                               li   Metastore
                                                               e
          Execution                            Optimizer       n
                                                               t


                                             StorageHandler

    MapReduce                                    HBase

                                      HDFS                           RDBMS

     Architecting the Future of Big Data
                                                                                Page 14
     © Hortonworks Inc. 2011
Hive + HBase
• For Input/OutputFormat, getSplits(), etc underlying
  Hbase classes are used
• Column selection and certain filters can be pushed
  down
• Hbase tables can be used with other(Hadoop native)
  tables and SQL constructs
• Hive DDL operations are converted to Hbase DDL
  operations via the client hook.
  – All operations are performed by the client
  – No two phase commit




     Architecting the Future of Big Data
                                                    Page 15
     © Hortonworks Inc. 2011
Schema / Type Mapping




Architecting the Future of Big Data
                                      Page 16
© Hortonworks Inc. 2011
Schema Mapping
•  Hive schema has table + columns + column types, Hbase
   has table + column families (+ column qualifiers)
•  Hive’s table schema is mapped by:
("hbase.columns.mapping" = ":key, u:url, s:hits")
•  Every field in Hive table is mapped in order to either
   – The table key (using :key as selector)
   – A column family (cf:)
   – A column (cf:cq)
•  Hbase column family can only be mapped to Hive MAP type
•  Hive table does not need to include all columns in Hbase
•  Map<Key,Value> is converted to CF:Key->Value



       Architecting the Future of Big Data
                                                            Page 17
       © Hortonworks Inc. 2011
Type Mapping
•  Recently added to Hive
•  Previously all types were being converted to strings in
   Hbase
•  Hive has:
   – Primitive types: INT, STRING, BINARY, DATE, etc
   – ARRAY<Type>
   – MAP<PrimitiveType, Type>
   – STRUCT<a:INT, b:STRING, c:STRING>
   – UNIONTYPE<INT,STRING>
•  Hbase does not have types
   – Bytes.toBytes()




       Architecting the Future of Big Data
                                                             Page 18
       © Hortonworks Inc. 2011
Type Mapping
• Table level property
  "hbase.table.default.storage.type” =
  “binary”
• ("hbase.columns.mapping" = ":key#binary,
  u:url#binary, s:hits#binary")
•  Type mapping is given after #, and can be
   – Any prefix of “binary” , eg u:url#b
   – Any prefix of “string” , eg u:url#s
   – The dash char “-” , eg u:url#-
•  String means use UTF8 serialization
•  Binary means use binary serialization for primitive types
   and maps
• Dash means use table level
      Architecting the Future of Big Data
                                                               Page 19
      © Hortonworks Inc. 2011
Type Mapping
•  If the type is not a primitive or Map, it is converted to a
   JSON string and serialized
•  Still a few rough edges for schema and type mapping:
   – No Hive BINARY support in Hbase mapping
   – No mapping of Hbase timestamp (can only provide put timestamp)
   – No arbitrary mapping of Structs / Arrays into hbase schema




       Architecting the Future of Big Data
                                                                 Page 20
       © Hortonworks Inc. 2011
Bulk Load
•  Steps to bulk load:
   – Sample source data for range partitioning
   – Save sampling results to a file
   – Run CLUSTER BY query using HiveHFileOutputFormat and
     TotalOrderPartitioner
   – Import Hfiles into Hbase table
• Ideal setup should be
   SET hive.hbase.bulk=true
   INSERT OVERWRITE TABLE web_table SELECT ….




      Architecting the Future of Big Data
                                                            Page 21
      © Hortonworks Inc. 2011
Filter Pushdown




Architecting the Future of Big Data
                                      Page 22
© Hortonworks Inc. 2011
Filter Pushdown
•  Idea is to pass down filter expressions to the storage layer
   to minimize scanned data
•  Use cases are:
    – Pushing filters to access plans so that indexes can be used
    – Pushing filters to RCFiles
    – Pushing filters to StorageHandlers (HBase)
•  Example:
   CREATE EXTERNAL TABLE users (userid LONG, email STRING, … )
   STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler’
   WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,…")

   SELECT ... FROM users WHERE userid > 1000000 and email LIKE
‘%@gmail.com’;

-> scan.setStartRow(Bytes.toBytes(1000000))


       Architecting the Future of Big Data
                                                                    Page 23
       © Hortonworks Inc. 2011
Filter Decomposition
•  Optimizer pushes down the predicates to the query plan
•  Storage handlers can negotiate with the Hive optimizer to
   decompose the filter.
• x > 3 AND upper(y) = 'XYZ’
• Handle x > 3, send upper(y) = ’XYZ’ as residual
   for Hive to deal with.
•  Optional HiveStoragePredicateHandler interface,
   defines decomposePredicate(…, ExprNodeDesc
   predicate)
•  Works with
 key = 3, key > 3, etc
•  key > 3 AND key < 100 in Patch Available
•  Only works against constant expressions

       Architecting the Future of Big Data
                                                           Page 24
       © Hortonworks Inc. 2011
Security Aspects
Towards fully secure deployments




Architecting the Future of Big Data
                                      Page 25
© Hortonworks Inc. 2011
Security – Big Picture
• Security becomes more important to support
  enterprise level and multi tenant applications
• 5 Different Components to ensure / impose security
    – HDFS
    – MapReduce
    – Hbase
    – Zookeeper
    – Hive
• Each component has
    – Authentication
    – Authorization
• What about Hcatalog ?
     Architecting the Future of Big Data
                                                       Page 26
     © Hortonworks Inc. 2011
Security Components
•  HDFS                                         •  HBase
   –  Authentication                                –  Authentication
           –  Kerberos                                     –  Kerberos

           –  NN + Block Access tokens                     –  Hbase delegation tokens

   –  Posix-like File permissions                   –  Global / Table / CF / Column level ACLs
•  MapReduce                                    •  Zookeeper
   –  Authentication                                –  Pluggable authentication
           –  Kerberos                              –  Kerberos authentication (SASL)
           –  Tokens stored in Job                  –  Znode ACLs
                                                •  Hive
           –  Mapreduce delegation tokens
              (Oozie)
                                                    –  MS authentication
                                                           –  Kerberos
   –  Authorization
                                                           –  MS Delegation tokens
           –  Job / Queue ACLs

           –  Service-Level ACLs                    –  Pluggable authorization
                                                           –  Mysql-like Role based access control


          Architecting the Future of Big Data
                                                                                            Page 27
          © Hortonworks Inc. 2011
Hbase Security – Closer look
•  Released with Hbase 0.92
•  Fully optional module, disabled by default
•  Needs an underlying secure Hadoop release
•  SecureRPCEngine: optional engine enforcing SASL
   authentication
   –  Kerberos
   –  DIGEST-MD5 based tokens
   –  TokenProvider coprocessor
•  Access control is implemented as a Coprocessor:
   AccessController
•  Stores and distributes ACL data via Zookeeper
     – Sensitive data is only accessible by hbase daemons
     – Client does not need to authenticate to zk


       Architecting the Future of Big Data
                                                            Page 28
       © Hortonworks Inc. 2011
Hbase Security – Closer look
•  The ACL lists can be defined at the global/table/column
   family or column qualifier level for users and groups
•  No roles yet
•  There are 5 actions (privileges): READ, WRITE, EXEC,
   CREATE, and ADMIN (RWXCA)

grant 'bobsmith', 'RW', 't1' [,'f1']
[,'col1']
revoke 'bobsmith', 't1', 'f1' [,'f1']
[,'col1']
user_permission 'table1'



      Architecting the Future of Big Data
                                                             Page 29
      © Hortonworks Inc. 2011
Hbase Security – Closer look
• create/drop/alter table operations are associated with
  global level CREATE/DROP or ADMIN permissions.
• put/get/scan operations are defined as per table/cf/cq
  1. All users need read access to .META. and ROOT tables.
  2. The table owner has full privileges
  3. check for the table-level, if successful we can short-circuit
  4. check permissions against the requested families
  for all families
  a) check for family level access
  b) if no family level grant is found, check for qualifier level
  access
  5. no families to check and table level access failed, deny the
  request.


     Architecting the Future of Big Data
                                                               Page 30
     © Hortonworks Inc. 2011
Hive Security – Closer look
•  Hive has different deployment options, security
   considerations should take into account different
   deployments
•  Authentication is only supported at Metastore, not on
   HiveServer, web interface, JDBC
•  Authorization is enforced at the query layer (Driver)
•  Pluggable authorization providers. Default one stores
   global/table/partiton/column permissions in metastore.
GRANT ALTER ON TABLE web_table TO USER bob;
CREATE ROLE db_reader
GRANT SELECT, SHOW_DATABASE ON DATABASE mydb TO
ROLE db_reader


      Architecting the Future of Big Data
                                                       Page 31
      © Hortonworks Inc. 2011
Hive Deployment Option 1


client

         CLI


                                                                     M
   Driver                                        Authorization
                                                                     S
                 Parser                              Planner         C    Authentication
                                                                     li   Metastore
                                                                     e
              Execution                              Optimizer       n
                                                                     t
   A/A                                         A/A
         MapReduce                                    HBase
                                                                           A12n/A11N
                                                         A12n/A11N         RDBMS
                                          HDFS

         Architecting the Future of Big Data
                                                                                           Page 32
         © Hortonworks Inc. 2011
Hive Deployment Option 2


client

         CLI


                                                                     M
   Driver                                        Authorization
                                                                     S
                 Parser                              Planner         C    Authentication
                                                                     li   Metastore
                                                                     e
              Execution                              Optimizer       n
                                                                     t
   A/A                                         A/A
         MapReduce                                    HBase
                                                                           A12n/A11N
                                                         A12n/A11N         RDBMS
                                          HDFS

         Architecting the Future of Big Data
                                                                                           Page 33
         © Hortonworks Inc. 2011
Hive Deployment Option 3
client
                                         JDBC/
                                         ODBC


                                     Hive Thrift         Hive Web
         CLI
                                      Server             Interface

                                                                     M
   Driver                                        Authorization
                                                                     S
                 Parser                              Planner         C    Authentication
                                                                     li   Metastore
                                                                     e
              Execution                              Optimizer       n
                                                                     t
   A/A                                         A/A
         MapReduce                                    HBase
                                                                           A12n/A11N
                                                         A12n/A11N
                                          HDFS                             RDBMS

         Architecting the Future of Big Data
                                                                                           Page 34
         © Hortonworks Inc. 2011
Hive + Hbase + Hadoop Security
•  So, regardless of Hive’s own security, for Hive to work on
   secure Hadoop and Hbase, we should:
   – Obtain delegation tokens for Hadoop and HBase jobs
   – Ensure to obey the storage level (hdfs,hbase) permission checks
   – In HiveServer deployments, authenticate and impersonate the
     user
•  Delegation tokens for Hadoop are already working.
   Obtaining Hbase delegation tokens are in “patch available”
   state.
•  We should keep metadata and data permissions in sync
•  Proposal: StorageHandlerAuthorizationProviders
   – HdfsAuthorizationProvider
   – HBaseAuthorizationProvider


      Architecting the Future of Big Data
                                                                  Page 35
      © Hortonworks Inc. 2011
Storage Handler Authorization Providers

•  Allow read/write/modify access to metadata only if user
   has access to underlying data
•  Warehouse admin is only concerned with access control
   on the data layer, metadata access control is obtained for
   free
•  Treat HDFS as a HiveStorageHandler
•  Hdfs and Hbase have different ACL models, but can be
   mapped to Hive




      Architecting the Future of Big Data
                                                            Page 36
      © Hortonworks Inc. 2011
Future Work
•  Improve on schema / type mapping
•  Fully secure Hive deployment options
•  Hbase bulk import improvements
•  Filter pushdown: non key column filters, BETWEEN 10 and 20 (PA)
•  Hive random access support for Hbase
   –  https://cwiki.apache.org/HCATALOG/random-access-framework.html




       Architecting the Future of Big Data
                                                                       Page 37
       © Hortonworks Inc. 2011
References
•  Security
    –  https://issues.apache.org/jira/browse/HIVE-2764
    –  https://issues.apache.org/jira/browse/HBASE-5371
    –  https://issues.apache.org/jira/browse/HCATALOG-245
    –  https://issues.apache.org/jira/browse/HCATALOG-260
    –  https://issues.apache.org/jira/browse/HCATALOG-244
    –  https://cwiki.apache.org/confluence/display/HCATALOG/Hcat+Security+Design
•  Type mapping / Filter Pushdown
    –  https://issues.apache.org/jira/browse/HIVE-1634
    –  https://issues.apache.org/jira/browse/HIVE-1226
    –  https://issues.apache.org/jira/browse/HIVE-1643
    –  https://issues.apache.org/jira/browse/HIVE-2815
    –  https://issues.apache.org/jira/browse/HIVE-1643
•  Misc
    –  https://issues.apache.org/jira/browse/HIVE-2748




          Architecting the Future of Big Data
                                                                                   Page 38
          © Hortonworks Inc. 2011
Thanks
Questions?




    Architecting the Future of Big Data
                                          Page 39
    © Hortonworks Inc. 2011

More Related Content

What's hot

Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)Takrim Ul Islam Laskar
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBaseHortonworks
 
HBase: Just the Basics
HBase: Just the BasicsHBase: Just the Basics
HBase: Just the BasicsHBaseCon
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentationArvind Kumar
 
Meet hbase 2.0
Meet hbase 2.0Meet hbase 2.0
Meet hbase 2.0enissoz
 
YARN - Strata 2014
YARN - Strata 2014YARN - Strata 2014
YARN - Strata 2014Hortonworks
 
Batch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application AdoptionBatch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application AdoptionDataWorks Summit/Hadoop Summit
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Hadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User GroupHadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User GroupHadoop User Group
 
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBaseCon
 
Apache HBase - Just the Basics
Apache HBase - Just the BasicsApache HBase - Just the Basics
Apache HBase - Just the BasicsHBaseCon
 
HiveServer2 for Apache Hive
HiveServer2 for Apache HiveHiveServer2 for Apache Hive
HiveServer2 for Apache HiveCarl Steinbach
 

What's hot (20)

Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBase
 
HBase: Just the Basics
HBase: Just the BasicsHBase: Just the Basics
HBase: Just the Basics
 
August 2014 HUG : Hive 13 Security
August 2014 HUG : Hive 13 SecurityAugust 2014 HUG : Hive 13 Security
August 2014 HUG : Hive 13 Security
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
Meet hbase 2.0
Meet hbase 2.0Meet hbase 2.0
Meet hbase 2.0
 
YARN - Strata 2014
YARN - Strata 2014YARN - Strata 2014
YARN - Strata 2014
 
Introduction to Hive
Introduction to HiveIntroduction to Hive
Introduction to Hive
 
6.hive
6.hive6.hive
6.hive
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Batch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application AdoptionBatch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application Adoption
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Hadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User GroupHadoop, Hbase and Hive- Bay area Hadoop User Group
Hadoop, Hbase and Hive- Bay area Hadoop User Group
 
Hive Hadoop
Hive HadoopHive Hadoop
Hive Hadoop
 
Apache hive
Apache hiveApache hive
Apache hive
 
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region Replicas
 
Apache HBase - Just the Basics
Apache HBase - Just the BasicsApache HBase - Just the Basics
Apache HBase - Just the Basics
 
HiveServer2 for Apache Hive
HiveServer2 for Apache HiveHiveServer2 for Apache Hive
HiveServer2 for Apache Hive
 
Session 14 - Hive
Session 14 - HiveSession 14 - Hive
Session 14 - Hive
 

Viewers also liked

HBaseCon 2015: Analyzing HBase Data with Apache Hive
HBaseCon 2015: Analyzing HBase Data with Apache  HiveHBaseCon 2015: Analyzing HBase Data with Apache  Hive
HBaseCon 2015: Analyzing HBase Data with Apache HiveHBaseCon
 
Mapreduce over snapshots
Mapreduce over snapshotsMapreduce over snapshots
Mapreduce over snapshotsenissoz
 
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storagehive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata StorageDataWorks Summit/Hadoop Summit
 
Hadoop crashcourse v3
Hadoop crashcourse v3Hadoop crashcourse v3
Hadoop crashcourse v3Hortonworks
 
HBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsHBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsCloudera, Inc.
 
Hortonworks technical workshop operations with ambari
Hortonworks technical workshop   operations with ambariHortonworks technical workshop   operations with ambari
Hortonworks technical workshop operations with ambariHortonworks
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBaseHortonworks
 

Viewers also liked (8)

HBaseCon 2015: Analyzing HBase Data with Apache Hive
HBaseCon 2015: Analyzing HBase Data with Apache  HiveHBaseCon 2015: Analyzing HBase Data with Apache  Hive
HBaseCon 2015: Analyzing HBase Data with Apache Hive
 
Mapreduce over snapshots
Mapreduce over snapshotsMapreduce over snapshots
Mapreduce over snapshots
 
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storagehive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
 
Structured streaming in Spark
Structured streaming in SparkStructured streaming in Spark
Structured streaming in Spark
 
Hadoop crashcourse v3
Hadoop crashcourse v3Hadoop crashcourse v3
Hadoop crashcourse v3
 
HBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsHBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table Snapshots
 
Hortonworks technical workshop operations with ambari
Hortonworks technical workshop   operations with ambariHortonworks technical workshop   operations with ambari
Hortonworks technical workshop operations with ambari
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 

Similar to Using Apache Hive with HBase: Schema Mapping and Bulk Load

HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for ArchitectsNick Dimiduk
 
Techincal Talk Hbase-Ditributed,no-sql database
Techincal Talk Hbase-Ditributed,no-sql databaseTechincal Talk Hbase-Ditributed,no-sql database
Techincal Talk Hbase-Ditributed,no-sql databaseRishabh Dugar
 
Hortonworks Technical Workshop: HBase and Apache Phoenix
Hortonworks Technical Workshop: HBase and Apache Phoenix Hortonworks Technical Workshop: HBase and Apache Phoenix
Hortonworks Technical Workshop: HBase and Apache Phoenix Hortonworks
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionAdam Muise
 
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善HortonworksJapan
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwielerlucenerevolution
 
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...DataWorks Summit/Hadoop Summit
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizonArtem Ervits
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championAmeet Paranjape
 
Unit II Hadoop Ecosystem_Updated.pptx
Unit II Hadoop Ecosystem_Updated.pptxUnit II Hadoop Ecosystem_Updated.pptx
Unit II Hadoop Ecosystem_Updated.pptxBhavanaHotchandani
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Ashish Narasimham
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony NguyenThanh Nguyen
 

Similar to Using Apache Hive with HBase: Schema Mapping and Bulk Load (20)

HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for Architects
 
Techincal Talk Hbase-Ditributed,no-sql database
Techincal Talk Hbase-Ditributed,no-sql databaseTechincal Talk Hbase-Ditributed,no-sql database
Techincal Talk Hbase-Ditributed,no-sql database
 
Hbase mhug 2015
Hbase mhug 2015Hbase mhug 2015
Hbase mhug 2015
 
Hortonworks Technical Workshop: HBase and Apache Phoenix
Hortonworks Technical Workshop: HBase and Apache Phoenix Hortonworks Technical Workshop: HBase and Apache Phoenix
Hortonworks Technical Workshop: HBase and Apache Phoenix
 
Hive
HiveHive
Hive
 
BIGDATA ppts
BIGDATA pptsBIGDATA ppts
BIGDATA ppts
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical Introduction
 
Hive with HDInsight
Hive with HDInsightHive with HDInsight
Hive with HDInsight
 
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
 
Hive and querying data
Hive and querying dataHive and querying data
Hive and querying data
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a champion
 
Unit II Hadoop Ecosystem_Updated.pptx
Unit II Hadoop Ecosystem_Updated.pptxUnit II Hadoop Ecosystem_Updated.pptx
Unit II Hadoop Ecosystem_Updated.pptx
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
מיכאל
מיכאלמיכאל
מיכאל
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
 

More from Yahoo Developer Network

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaYahoo Developer Network
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Yahoo Developer Network
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanYahoo Developer Network
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Yahoo Developer Network
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuYahoo Developer Network
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolYahoo Developer Network
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Yahoo Developer Network
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Yahoo Developer Network
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathYahoo Developer Network
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Yahoo Developer Network
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathYahoo Developer Network
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsYahoo Developer Network
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondYahoo Developer Network
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network
 

More from Yahoo Developer Network (20)

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
 
CICD at Oath using Screwdriver
CICD at Oath using ScrewdriverCICD at Oath using Screwdriver
CICD at Oath using Screwdriver
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, Oath
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI Applications
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step Beyond
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 

Recently uploaded

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 

Recently uploaded (20)

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 

Using Apache Hive with HBase: Schema Mapping and Bulk Load

  • 1. Using Apache Hive with HBase And Recent Improvements Enis Soztutar enis [at] apache [dot] org @enissoz © Hortonworks Inc. 2011 Page 1
  • 2. Outline of the talk • Apache Hive 101 • Apache Hbase 101 • Hive + Hbase Motivation • Schema Mapping • Type Mapping • Bulk Load • Filter Pushdown • Security Aspects • Future Work Architecting the Future of Big Data Page 2 © Hortonworks Inc. 2011
  • 3. Apache Hive 101 • Hive is a data warehouse system for Hadoop • SQL-like query language called HiveQL • Built for PB scale data • Main purpose is analysis and ad hoc querying • Database / table / partition / bucket – DDL Operations • SQL Types + Complex Types (ARRAY, MAP, etc) • Very extensible • Not for : small data sets, low latency queries, OLTP Architecting the Future of Big Data Page 3 © Hortonworks Inc. 2011
  • 4. Apache Hive Architecture JDBC/ ODBC Hive Thrift Hive Web CLI Server Interface M Driver S Parser Planner C li Metastore e Execution Optimizer n t MapReduce HDFS RDBMS Architecting the Future of Big Data Page 4 © Hortonworks Inc. 2011
  • 5. Apache HBase101 •  Apache Hbase is the Hadoop database •  Modeled after Google’s BigTable •  A sparse, distributed, persistent multi- dimensional sorted map. •  The map is indexed by a row key, column key, and a timestamp. •  Each value in the map is an uninterpreted array of bytes. •  Low latency random data access •  Logical view: From: Bigtable: A Distributed Storage System for Structured Data, Chang, et al. Architecting the Future of Big Data Page 5 © Hortonworks Inc. 2011
  • 6. Apache HBase Architecture Client HMaster Zooke Region Region Region eper server server server Region Region Region Region Region Region DFS Architecting the Future of Big Data Page 6 © Hortonworks Inc. 2011
  • 7. Hive + HBase HBase goes SQL Architecting the Future of Big Data Page 7 © Hortonworks Inc. 2011
  • 8. Hive + HBase motivation • Hive datawarehouses on Hadoop are high latency – Long ETL times – Access to real time data • Analyzing HBase data with MapReduce requires custom coding • Hive and SQL are already known by many analysts Architecting the Future of Big Data Page 8 © Hortonworks Inc. 2011
  • 9. Use Case 1: Hbase as ETL Data Sink From HUG - Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010 Architecting the Future of Big Data Page 9 © Hortonworks Inc. 2011
  • 10. Use Case 2: Hbase as data source From HUG - Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010 Architecting the Future of Big Data Page 10 © Hortonworks Inc. 2011
  • 11. Use Case 3: Low latency warehouse From HUG - Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010 Architecting the Future of Big Data Page 11 © Hortonworks Inc. 2011
  • 12. Hive + HBase Example CREATE TABLE short_urls( short_url string, url string, hit_count int ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, u:url, s:hits") TBLPROPERTIES ("hbase.table.name" = ”short_urls"); Architecting the Future of Big Data Page 12 © Hortonworks Inc. 2011
  • 13. Storage Handler • Hive defines HiveStorageHandler class for different storage backends: Hbase/ Cassandra / MongoDB/ etc • Storage Handler has hooks for – getInput/OutputFormat() – getSerde() – configureTableJobProperties() – Meta data operations hook: CREATE TABLE, DROP TABLE, etc – getAuthorizationProvider() • Storage Handler is a table level concept. Does not support Hive partitions, and buckets. Architecting the Future of Big Data Page 13 © Hortonworks Inc. 2011
  • 14. Apache Hive+HBase Architecture Hive Thrift Hive Web CLI Server Interface M Driver S Parser Planner C li Metastore e Execution Optimizer n t StorageHandler MapReduce HBase HDFS RDBMS Architecting the Future of Big Data Page 14 © Hortonworks Inc. 2011
  • 15. Hive + HBase • For Input/OutputFormat, getSplits(), etc underlying Hbase classes are used • Column selection and certain filters can be pushed down • Hbase tables can be used with other(Hadoop native) tables and SQL constructs • Hive DDL operations are converted to Hbase DDL operations via the client hook. – All operations are performed by the client – No two phase commit Architecting the Future of Big Data Page 15 © Hortonworks Inc. 2011
  • 16. Schema / Type Mapping Architecting the Future of Big Data Page 16 © Hortonworks Inc. 2011
  • 17. Schema Mapping •  Hive schema has table + columns + column types, Hbase has table + column families (+ column qualifiers) •  Hive’s table schema is mapped by: ("hbase.columns.mapping" = ":key, u:url, s:hits") •  Every field in Hive table is mapped in order to either – The table key (using :key as selector) – A column family (cf:) – A column (cf:cq) •  Hbase column family can only be mapped to Hive MAP type •  Hive table does not need to include all columns in Hbase •  Map<Key,Value> is converted to CF:Key->Value Architecting the Future of Big Data Page 17 © Hortonworks Inc. 2011
  • 18. Type Mapping •  Recently added to Hive •  Previously all types were being converted to strings in Hbase •  Hive has: – Primitive types: INT, STRING, BINARY, DATE, etc – ARRAY<Type> – MAP<PrimitiveType, Type> – STRUCT<a:INT, b:STRING, c:STRING> – UNIONTYPE<INT,STRING> •  Hbase does not have types – Bytes.toBytes() Architecting the Future of Big Data Page 18 © Hortonworks Inc. 2011
  • 19. Type Mapping • Table level property "hbase.table.default.storage.type” = “binary” • ("hbase.columns.mapping" = ":key#binary, u:url#binary, s:hits#binary") •  Type mapping is given after #, and can be – Any prefix of “binary” , eg u:url#b – Any prefix of “string” , eg u:url#s – The dash char “-” , eg u:url#- •  String means use UTF8 serialization •  Binary means use binary serialization for primitive types and maps • Dash means use table level Architecting the Future of Big Data Page 19 © Hortonworks Inc. 2011
  • 20. Type Mapping •  If the type is not a primitive or Map, it is converted to a JSON string and serialized •  Still a few rough edges for schema and type mapping: – No Hive BINARY support in Hbase mapping – No mapping of Hbase timestamp (can only provide put timestamp) – No arbitrary mapping of Structs / Arrays into hbase schema Architecting the Future of Big Data Page 20 © Hortonworks Inc. 2011
  • 21. Bulk Load •  Steps to bulk load: – Sample source data for range partitioning – Save sampling results to a file – Run CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner – Import Hfiles into Hbase table • Ideal setup should be SET hive.hbase.bulk=true INSERT OVERWRITE TABLE web_table SELECT …. Architecting the Future of Big Data Page 21 © Hortonworks Inc. 2011
  • 22. Filter Pushdown Architecting the Future of Big Data Page 22 © Hortonworks Inc. 2011
  • 23. Filter Pushdown •  Idea is to pass down filter expressions to the storage layer to minimize scanned data •  Use cases are: – Pushing filters to access plans so that indexes can be used – Pushing filters to RCFiles – Pushing filters to StorageHandlers (HBase) •  Example: CREATE EXTERNAL TABLE users (userid LONG, email STRING, … ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler’ WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,…") SELECT ... FROM users WHERE userid > 1000000 and email LIKE ‘%@gmail.com’; -> scan.setStartRow(Bytes.toBytes(1000000)) Architecting the Future of Big Data Page 23 © Hortonworks Inc. 2011
  • 24. Filter Decomposition •  Optimizer pushes down the predicates to the query plan •  Storage handlers can negotiate with the Hive optimizer to decompose the filter. • x > 3 AND upper(y) = 'XYZ’ • Handle x > 3, send upper(y) = ’XYZ’ as residual for Hive to deal with. •  Optional HiveStoragePredicateHandler interface, defines decomposePredicate(…, ExprNodeDesc predicate) •  Works with key = 3, key > 3, etc •  key > 3 AND key < 100 in Patch Available •  Only works against constant expressions Architecting the Future of Big Data Page 24 © Hortonworks Inc. 2011
  • 25. Security Aspects Towards fully secure deployments Architecting the Future of Big Data Page 25 © Hortonworks Inc. 2011
  • 26. Security – Big Picture • Security becomes more important to support enterprise level and multi tenant applications • 5 Different Components to ensure / impose security – HDFS – MapReduce – Hbase – Zookeeper – Hive • Each component has – Authentication – Authorization • What about Hcatalog ? Architecting the Future of Big Data Page 26 © Hortonworks Inc. 2011
  • 27. Security Components •  HDFS •  HBase –  Authentication –  Authentication –  Kerberos –  Kerberos –  NN + Block Access tokens –  Hbase delegation tokens –  Posix-like File permissions –  Global / Table / CF / Column level ACLs •  MapReduce •  Zookeeper –  Authentication –  Pluggable authentication –  Kerberos –  Kerberos authentication (SASL) –  Tokens stored in Job –  Znode ACLs •  Hive –  Mapreduce delegation tokens (Oozie) –  MS authentication –  Kerberos –  Authorization –  MS Delegation tokens –  Job / Queue ACLs –  Service-Level ACLs –  Pluggable authorization –  Mysql-like Role based access control Architecting the Future of Big Data Page 27 © Hortonworks Inc. 2011
  • 28. Hbase Security – Closer look •  Released with Hbase 0.92 •  Fully optional module, disabled by default •  Needs an underlying secure Hadoop release •  SecureRPCEngine: optional engine enforcing SASL authentication –  Kerberos –  DIGEST-MD5 based tokens –  TokenProvider coprocessor •  Access control is implemented as a Coprocessor: AccessController •  Stores and distributes ACL data via Zookeeper – Sensitive data is only accessible by hbase daemons – Client does not need to authenticate to zk Architecting the Future of Big Data Page 28 © Hortonworks Inc. 2011
  • 29. Hbase Security – Closer look •  The ACL lists can be defined at the global/table/column family or column qualifier level for users and groups •  No roles yet •  There are 5 actions (privileges): READ, WRITE, EXEC, CREATE, and ADMIN (RWXCA) grant 'bobsmith', 'RW', 't1' [,'f1'] [,'col1'] revoke 'bobsmith', 't1', 'f1' [,'f1'] [,'col1'] user_permission 'table1' Architecting the Future of Big Data Page 29 © Hortonworks Inc. 2011
  • 30. Hbase Security – Closer look • create/drop/alter table operations are associated with global level CREATE/DROP or ADMIN permissions. • put/get/scan operations are defined as per table/cf/cq 1. All users need read access to .META. and ROOT tables. 2. The table owner has full privileges 3. check for the table-level, if successful we can short-circuit 4. check permissions against the requested families for all families a) check for family level access b) if no family level grant is found, check for qualifier level access 5. no families to check and table level access failed, deny the request. Architecting the Future of Big Data Page 30 © Hortonworks Inc. 2011
  • 31. Hive Security – Closer look •  Hive has different deployment options, security considerations should take into account different deployments •  Authentication is only supported at Metastore, not on HiveServer, web interface, JDBC •  Authorization is enforced at the query layer (Driver) •  Pluggable authorization providers. Default one stores global/table/partiton/column permissions in metastore. GRANT ALTER ON TABLE web_table TO USER bob; CREATE ROLE db_reader GRANT SELECT, SHOW_DATABASE ON DATABASE mydb TO ROLE db_reader Architecting the Future of Big Data Page 31 © Hortonworks Inc. 2011
  • 32. Hive Deployment Option 1 client CLI M Driver Authorization S Parser Planner C Authentication li Metastore e Execution Optimizer n t A/A A/A MapReduce HBase A12n/A11N A12n/A11N RDBMS HDFS Architecting the Future of Big Data Page 32 © Hortonworks Inc. 2011
  • 33. Hive Deployment Option 2 client CLI M Driver Authorization S Parser Planner C Authentication li Metastore e Execution Optimizer n t A/A A/A MapReduce HBase A12n/A11N A12n/A11N RDBMS HDFS Architecting the Future of Big Data Page 33 © Hortonworks Inc. 2011
  • 34. Hive Deployment Option 3 client JDBC/ ODBC Hive Thrift Hive Web CLI Server Interface M Driver Authorization S Parser Planner C Authentication li Metastore e Execution Optimizer n t A/A A/A MapReduce HBase A12n/A11N A12n/A11N HDFS RDBMS Architecting the Future of Big Data Page 34 © Hortonworks Inc. 2011
  • 35. Hive + Hbase + Hadoop Security •  So, regardless of Hive’s own security, for Hive to work on secure Hadoop and Hbase, we should: – Obtain delegation tokens for Hadoop and HBase jobs – Ensure to obey the storage level (hdfs,hbase) permission checks – In HiveServer deployments, authenticate and impersonate the user •  Delegation tokens for Hadoop are already working. Obtaining Hbase delegation tokens are in “patch available” state. •  We should keep metadata and data permissions in sync •  Proposal: StorageHandlerAuthorizationProviders – HdfsAuthorizationProvider – HBaseAuthorizationProvider Architecting the Future of Big Data Page 35 © Hortonworks Inc. 2011
  • 36. Storage Handler Authorization Providers •  Allow read/write/modify access to metadata only if user has access to underlying data •  Warehouse admin is only concerned with access control on the data layer, metadata access control is obtained for free •  Treat HDFS as a HiveStorageHandler •  Hdfs and Hbase have different ACL models, but can be mapped to Hive Architecting the Future of Big Data Page 36 © Hortonworks Inc. 2011
  • 37. Future Work •  Improve on schema / type mapping •  Fully secure Hive deployment options •  Hbase bulk import improvements •  Filter pushdown: non key column filters, BETWEEN 10 and 20 (PA) •  Hive random access support for Hbase –  https://cwiki.apache.org/HCATALOG/random-access-framework.html Architecting the Future of Big Data Page 37 © Hortonworks Inc. 2011
  • 38. References •  Security –  https://issues.apache.org/jira/browse/HIVE-2764 –  https://issues.apache.org/jira/browse/HBASE-5371 –  https://issues.apache.org/jira/browse/HCATALOG-245 –  https://issues.apache.org/jira/browse/HCATALOG-260 –  https://issues.apache.org/jira/browse/HCATALOG-244 –  https://cwiki.apache.org/confluence/display/HCATALOG/Hcat+Security+Design •  Type mapping / Filter Pushdown –  https://issues.apache.org/jira/browse/HIVE-1634 –  https://issues.apache.org/jira/browse/HIVE-1226 –  https://issues.apache.org/jira/browse/HIVE-1643 –  https://issues.apache.org/jira/browse/HIVE-2815 –  https://issues.apache.org/jira/browse/HIVE-1643 •  Misc –  https://issues.apache.org/jira/browse/HIVE-2748 Architecting the Future of Big Data Page 38 © Hortonworks Inc. 2011
  • 39. Thanks Questions? Architecting the Future of Big Data Page 39 © Hortonworks Inc. 2011