Integration of Apache Hive
and HBase
Enis Soztutar
enis [at] apache [dot] org
@enissoz




Architecting the Future of Big Data
 © Hortonworks Inc. 2011              Page 1
About Me

•  User and committer of Hadoop since 2007
•  Contributor to Apache Hadoop, HBase, Hive and Gora
•  Joined Hortonworks as a Member of Technical Staff
•  Twitter: @enissoz




Agenda

•  Overview of Hive and HBase
•  Hive + HBase Features and Improvements
•  Future of Hive and HBase
•  Q&A




Apache Hive Overview
• Apache Hive is a data warehouse system for Hadoop
• SQL-like query language called HiveQL
• Built for petabyte-scale data
• Main purpose is analysis and ad hoc querying
• Database / table / partition / bucket – DDL operations
• SQL types + complex types (ARRAY, MAP, etc.)
• Very extensible
• Not for: small data sets, low-latency queries, OLTP
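A minimal HiveQL sketch of this kind of workload (the table, columns, and partition value are illustrative, not from the talk):

```sql
-- Hypothetical partitioned warehouse table
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (dt STRING);

-- Ad hoc analysis over a single partition
SELECT url, COUNT(*) AS views
FROM page_views
WHERE dt = '2012-05-01'
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```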



Apache Hive Architecture
[Architecture diagram: clients reach Hive through the CLI, the Hive Thrift Server (JDBC/ODBC), or the Hive Web Interface; the Driver (Parser, Planner, Optimizer, Execution) runs MapReduce jobs over HDFS; the Metastore client talks to the Metastore, which is backed by an RDBMS.]

Overview of Apache HBase
• Apache HBase is the Hadoop database
• Modeled after Google’s BigTable
• A sparse, distributed, persistent multi-dimensional sorted map
• The map is indexed by a row key, column key, and a
  timestamp
• Each value in the map is an uninterpreted array of bytes
• Low latency random data access




Overview of Apache HBase
• Logical view:

  [Figure omitted. From: Bigtable: A Distributed Storage System for Structured Data, Chang et al.]




Apache HBase Architecture

            Client

                                                HMaster



                                                                     Zookeeper
    Region                                 Region         Region
    server                                 server         server
       Region                               Region          Region


       Region                               Region          Region



                                                     HDFS

Hive + HBase Features and
Improvements




Hive + HBase Motivation
• Hive and HBase have different characteristics:
  – High latency vs. low latency
  – Structured vs. unstructured
  – Analysts vs. programmers

• Hive data warehouses on Hadoop are high latency
  – Long ETL times
  – No access to real-time data
• Analyzing HBase data with MapReduce requires custom
  coding
• Hive and SQL are already known by many analysts

Use Case 1: HBase as ETL Data Sink




[Figure omitted. From HUG: Hive/HBase Integration, or, MaybeSQL? (John Sichi, Facebook, April 2010)]
http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010


Use Case 2: HBase as Data Source




[Figure omitted. From HUG: Hive/HBase Integration, or, MaybeSQL? (John Sichi, Facebook, April 2010)]
http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010


Use Case 3: Low Latency Warehouse




[Figure omitted. From HUG: Hive/HBase Integration, or, MaybeSQL? (John Sichi, Facebook, April 2010)]
http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010


Example: Hive + HBase (HBase table)
hbase(main):001:0> create 'short_urls', {NAME => 'u'}, {NAME => 's'}

hbase(main):014:0> scan 'short_urls'
ROW                  COLUMN+CELL
 bit.ly/aaaa         column=s:hits, value=100
 bit.ly/aaaa         column=u:url, value=hbase.apache.org/
 bit.ly/abcd         column=s:hits, value=123
 bit.ly/abcd         column=u:url, value=example.com/foo
Example: Hive + HBase (Hive table)
CREATE TABLE short_urls(
   short_url string,
   url string,
   hit_count int
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key, u:url, s:hits")
TBLPROPERTIES
("hbase.table.name" = "short_urls");
Storage Handler
• Hive defines the HiveStorageHandler interface for different storage
  backends: HBase / Cassandra / MongoDB / etc.
• The storage handler has hooks for:
  –  Getting input / output formats
  –  Metadata operations: CREATE TABLE, DROP TABLE, etc.
• Storage handler is a table-level concept
  –  Does not support Hive partitions and buckets




Apache Hive + HBase Architecture
[Architecture diagram: the same Hive stack (CLI / Thrift Server / Web Interface, Driver with Parser, Planner, Optimizer, Execution, Metastore client) with a StorageHandler between the execution layer and HBase; MapReduce jobs read and write HBase alongside HDFS, and the Metastore is backed by an RDBMS.]

Hive + HBase Integration
• For InputFormat/OutputFormat, getSplits(), etc., the underlying
  HBase classes are used
• Column selection and certain filters can be pushed down
• HBase tables can be used with other (Hadoop-native) tables and
  SQL constructs
• Hive DDL operations are converted to HBase DDL operations via
  the client hook
  – All operations are performed by the client
  – No two-phase commit
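For instance, a mapped HBase table can be joined against a native Hive table like any other table (a sketch; short_urls is the HBase-backed table from the earlier example, and clicks is a hypothetical native Hive table):

```sql
-- Join an HBase-backed table with a native Hive table
SELECT s.short_url, s.url, COUNT(*) AS recent_hits
FROM short_urls s
JOIN clicks c ON (c.short_url = s.short_url)
GROUP BY s.short_url, s.url;
```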




Schema / Type Mapping




Schema Mapping
•  Hive table + columns + column types <=> HBase table + column
   families (+ column qualifiers)
•  Every field in the Hive table is mapped, in order, to either
   – The table key (using :key as selector)
   – A column family (cf:) -> MAP fields in Hive
   – A column (cf:cq)
•  The Hive table does not need to include all columns in HBase
•  CREATE TABLE short_urls(
       short_url string,
       url string,
       hit_count int,
       props map<string,string>
   )
   STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
   WITH SERDEPROPERTIES
   ("hbase.columns.mapping" = ":key, u:url, s:hits, p:")

Type Mapping
• Recently added to Hive (0.9.0)
• Previously, all types were converted to strings in HBase
• Hive has:
  – Primitive types: INT, STRING, BINARY, DATE, etc.
  – ARRAY<Type>
  – MAP<PrimitiveType, Type>
  – STRUCT<a:INT, b:STRING, c:STRING>
• HBase does not have types
  – everything is an uninterpreted byte array (Bytes.toBytes())




Type Mapping
• Table-level property:
  "hbase.table.default.storage.type" = "binary"
• Type mapping can be given per column after #
  – Any prefix of "binary", e.g. u:url#b
  – Any prefix of "string", e.g. u:url#s
  – The dash char "-", e.g. u:url#-

CREATE TABLE short_urls(
   short_url string,
   url string,
   hit_count int,
   props map<string,string>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key#b,u:url#b,s:hits#b,p:#s")

Type Mapping
• If the type is not a primitive or a MAP, it is converted to a JSON
  string and serialized
• Still a few rough edges for schema and type mapping:
  – No Hive BINARY support in the HBase mapping
  – No mapping of the HBase timestamp (can only provide the put
    timestamp)
  – No arbitrary mapping of STRUCTs / ARRAYs into the HBase schema




Bulk Load
• Steps to bulk load:
   – Sample source data for range partitioning
   – Save sampling results to a file
   – Run a CLUSTER BY query using HiveHFileOutputFormat and
     TotalOrderPartitioner
   – Import HFiles into the HBase table
• The ideal setup would be:
   SET hive.hbase.bulk=true;
   INSERT OVERWRITE TABLE web_table SELECT ….
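Until that single-statement path is available, the manual steps above can be sketched roughly as follows (a sketch based on the Hive HBase bulk-load recipe; table names, paths, and property names are illustrative and may differ by version):

```sql
-- Configure the total-order partitioner with sampled split points
SET mapred.reduce.tasks=16;
SET hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
SET total.order.partitioner.path=/tmp/hb_range_key_list;

-- Staging table whose output format writes HFiles for family 'cf'
CREATE TABLE hbsort(rowkey STRING, cf_val STRING)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
TBLPROPERTIES ('hfile.family.path' = '/tmp/hbsort/cf');

-- Write totally ordered HFiles; then import them into HBase with
-- HBase's completebulkload tool, outside of Hive
INSERT OVERWRITE TABLE hbsort
SELECT rowkey, cf_val FROM web_source CLUSTER BY rowkey;
```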




Filter Pushdown




Filter Pushdown
• The idea is to push filter expressions down to the storage layer to
  minimize scanned data
• Enables use of indexes in HDFS or HBase
• Example:
   CREATE EXTERNAL TABLE users (userid BIGINT, email STRING, … )
   STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
   WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,…")

   SELECT ... FROM users WHERE userid > 1000000 AND email LIKE '%@gmail.com';

-> scan.setStartRow(Bytes.toBytes(1000000))

Filter Decomposition
• The optimizer pushes predicates down into the query plan
• Storage handlers can negotiate with the Hive optimizer to
  decompose the filter:
  x > 3 AND upper(y) = 'XYZ'
• The handler takes x > 3 and sends upper(y) = 'XYZ' back to Hive
  as the residual
• Works with:
  key = 3, key > 3, etc.
  key > 3 AND key < 100
• Only works against constant expressions
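For instance, in a query like the following (table and column names are illustrative), the key predicate can become part of the HBase scan while the residual is evaluated by Hive:

```sql
-- key > 3 can be translated to a scan start row;
-- upper(y) = 'XYZ' is returned to Hive as the residual predicate.
SELECT *
FROM hbase_table
WHERE key > 3 AND upper(y) = 'XYZ';
```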


Security Aspects
Towards fully secure deployments




Security – Big Picture
• Security becomes more important to support enterprise-level
  and multi-tenant applications
• Five different components to ensure / impose security:
  – HDFS
  – MapReduce
  – HBase
  – ZooKeeper
  – Hive
• Each component has:
  – Authentication
  – Authorization


HBase Security – Closer look
• Released with HBase 0.92
• Fully optional module, disabled by default
• Needs an underlying secure Hadoop release
• SecureRPCEngine: optional engine enforcing SASL
  authentication
  – Kerberos
  – DIGEST-MD5 based tokens
  – TokenProvider coprocessor
• Access control is implemented as a coprocessor: AccessController
• Stores and distributes ACL data via ZooKeeper
  – Sensitive data is only accessible by HBase daemons
  – Clients do not need to authenticate to ZooKeeper
Hive Security – Closer look
• Hive has different deployment options; security considerations
  should take the different deployments into account
• Authentication is only supported at the Metastore, not at
  HiveServer, the web interface, or JDBC
• Authorization is enforced at the query layer (Driver)
• Pluggable authorization providers. The default one stores global/
  table/partition/column permissions in the Metastore

GRANT ALTER ON TABLE web_table TO USER bob;
CREATE ROLE db_reader;
GRANT SELECT, SHOW_DATABASE ON DATABASE mydb TO ROLE db_reader;

Hive Deployment Option 1
[Deployment diagram: the client runs the CLI and the Driver locally; Authorization is enforced in the Driver and Authentication at the Metastore; authentication/authorization (A12n/A11n) also applies between the client and MapReduce, HBase, HDFS, and the Metastore RDBMS.]
Hive Deployment Option 2
[Deployment diagram: as in Option 1, the client runs the CLI and the Driver, with Authorization in the Driver and Authentication at the Metastore; A12n/A11n at MapReduce, HBase, HDFS, and the Metastore RDBMS.]
Hive Deployment Option 3
[Deployment diagram: clients connect via JDBC/ODBC to the Hive Thrift Server or the Hive Web Interface in front of the Driver; Authorization in the Driver, Authentication at the Metastore; A12n/A11n at MapReduce, HBase, HDFS, and the Metastore RDBMS.]
Hive + HBase + Hadoop Security
• Regardless of Hive's own security, for Hive to work on
  secure Hadoop and HBase, we should:
  – Obtain delegation tokens for Hadoop and HBase jobs
  – Obey the storage-level (HDFS, HBase) permission checks
  – In HiveServer deployments, authenticate and impersonate the user

• Delegation tokens for Hadoop are already working
• Support for obtaining HBase delegation tokens was released in
  Hive 0.9.0




Future of Hive + HBase
• Improve on schema / type mapping
• Fully secure Hive deployment options
• HBase bulk import improvements
• Sortable signed numeric types in HBase
• Filter pushdown: non key column filters
• Hive random access support for HBase
  – https://cwiki.apache.org/HCATALOG/random-access-framework.html




References
• Security
  – https://issues.apache.org/jira/browse/HIVE-2764
  – https://issues.apache.org/jira/browse/HBASE-5371
  – https://issues.apache.org/jira/browse/HCATALOG-245
  – https://issues.apache.org/jira/browse/HCATALOG-260
  – https://issues.apache.org/jira/browse/HCATALOG-244
  – https://cwiki.apache.org/confluence/display/HCATALOG/Hcat+Security+Design
• Type mapping / Filter pushdown
  – https://issues.apache.org/jira/browse/HIVE-1634
  – https://issues.apache.org/jira/browse/HIVE-1226
  – https://issues.apache.org/jira/browse/HIVE-1643
  – https://issues.apache.org/jira/browse/HIVE-2815
Other Resources

• Hadoop Summit
  – June 13-14
  – San Jose, California
  – www.Hadoopsummit.org

• Hadoop Training and Certification
  – Developing Solutions Using Apache Hadoop
  – Administering Apache Hadoop
  – Online classes available in the US, India, and EMEA
  – http://hortonworks.com/training/




Thanks
Questions?





CORS (Kitworks Team Study 양다윗 발표자료 240510)
Wonjun Hwang
 

Recently uploaded (20)

Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptx
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform Engineering
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 
How to Check GPS Location with a Live Tracker in Pakistan
How to Check GPS Location with a Live Tracker in PakistanHow to Check GPS Location with a Live Tracker in Pakistan
How to Check GPS Location with a Live Tracker in Pakistan
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
 
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
 
CORS (Kitworks Team Study 양다윗 발표자료 240510)
CORS (Kitworks Team Study 양다윗 발표자료 240510)CORS (Kitworks Team Study 양다윗 발표자료 240510)
CORS (Kitworks Team Study 양다윗 발표자료 240510)
 
ChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps Productivity
 
Cyber Insurance - RalphGilot - Embry-Riddle Aeronautical University.pptx
Cyber Insurance - RalphGilot - Embry-Riddle Aeronautical University.pptxCyber Insurance - RalphGilot - Embry-Riddle Aeronautical University.pptx
Cyber Insurance - RalphGilot - Embry-Riddle Aeronautical University.pptx
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptx
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties ReimaginedEasier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
 

Integration of Hive and HBase

  • 1. Integration of Apache Hive and HBase Enis Soztutar enis [at] apache [dot] org @enissoz Architecting the Future of Big Data © Hortonworks Inc. 2011 Page 1
  • 2. About Me •  User and committer of Hadoop since 2007 •  Contributor to Apache Hadoop, HBase, Hive and Gora •  Joined Hortonworks as Member of Technical Staff •  Twitter: @enissoz Architecting the Future of Big Data Page 2 © Hortonworks Inc. 2011
  • 3. Agenda •  Overview of Hive and HBase •  Hive + HBase Features and Improvements •  Future of Hive and HBase •  Q&A Architecting the Future of Big Data Page 3 © Hortonworks Inc. 2011
  • 4. Apache Hive Overview • Apache Hive is a data warehouse system for Hadoop • SQL-like query language called HiveQL • Built for PB scale data • Main purpose is analysis and ad hoc querying • Database / table / partition / bucket – DDL Operations • SQL Types + Complex Types (ARRAY, MAP, etc) • Very extensible • Not for : small data sets, low latency queries, OLTP Architecting the Future of Big Data Page 4 © Hortonworks Inc. 2011
  • 5. Apache Hive Architecture JDBC/ODBC Hive Thrift Hive Web CLI Server Interface Driver M S C Parser Planner l Metastore i e Execution Optimizer n t MapReduce HDFS RDBMS Architecting the Future of Big Data Page 5 © Hortonworks Inc. 2011
  • 6. Overview of Apache HBase • Apache HBase is the Hadoop database • Modeled after Google’s BigTable • A sparse, distributed, persistent multi-dimensional sorted map • The map is indexed by a row key, column key, and a timestamp • Each value in the map is an un-interpreted array of bytes • Low latency random data access Architecting the Future of Big Data Page 6 © Hortonworks Inc. 2011
  • 7. Overview of Apache HBase • Logical view: From: Bigtable: A Distributed Storage System for Structured Data, Chang, et al. Architecting the Future of Big Data Page 7 © Hortonworks Inc. 2011
  • 8. Apache HBase Architecture Client HMaster Zookeeper Region Region Region server server server Region Region Region Region Region Region HDFS Architecting the Future of Big Data Page 8 © Hortonworks Inc. 2011
  • 9. Hive + HBase Features and Improvements Architecting the Future of Big Data Page 9 © Hortonworks Inc. 2011
  • 10. Hive + HBase Motivation • Hive and HBase have different characteristics: High latency Low latency Structured vs. Unstructured Analysts Programmers • Hive data warehouses on Hadoop are high latency – Long ETL times – Access to real time data • Analyzing HBase data with MapReduce requires custom coding • Hive and SQL are already known by many analysts Architecting the Future of Big Data Page 10 © Hortonworks Inc. 2011
  • 11. Use Case 1: HBase as ETL Data Sink From HUG - Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010 Architecting the Future of Big Data Page 11 © Hortonworks Inc. 2011
  • 12. Use Case 2: HBase as Data Source From HUG - Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010 Architecting the Future of Big Data Page 12 © Hortonworks Inc. 2011
  • 13. Use Case 3: Low Latency Warehouse From HUG - Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010 Architecting the Future of Big Data Page 13 © Hortonworks Inc. 2011
  • 14. Example: Hive + HBase (HBase table) hbase(main):001:0> create 'short_urls', {NAME => 'u'}, {NAME=>'s'} hbase(main):014:0> scan 'short_urls' ROW COLUMN+CELL bit.ly/aaaa column=s:hits, value=100 bit.ly/aaaa column=u:url, value=hbase.apache.org/ bit.ly/abcd column=s:hits, value=123 bit.ly/abcd column=u:url, value=example.com/foo Architecting the Future of Big Data Page 14 © Hortonworks Inc. 2011
  • 15. Example: Hive + HBase (Hive table) CREATE TABLE short_urls( short_url string, url string, hit_count int ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, u:url, s:hits") TBLPROPERTIES ("hbase.table.name" = "short_urls"); Architecting the Future of Big Data Page 15 © Hortonworks Inc. 2011
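The Hive DDL from the slide above, laid out readably (same table name and column mapping as given there):

```sql
-- Hive table backed by the HBase table 'short_urls';
-- :key maps to the HBase row key, u:url and s:hits to HBase columns.
CREATE TABLE short_urls(
  short_url string,
  url       string,
  hit_count int
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,u:url,s:hits")
TBLPROPERTIES ("hbase.table.name" = "short_urls");
```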
  • 16. Storage Handler • Hive defines HiveStorageHandler class for different storage backends: HBase/ Cassandra / MongoDB/ etc • Storage Handler has hooks for –  Getting input / output formats –  Meta data operations hook: CREATE TABLE, DROP TABLE, etc • Storage Handler is a table level concept –  Does not support Hive partitions, and buckets Architecting the Future of Big Data Page 16 © Hortonworks Inc. 2011
  • 17. Apache Hive + HBase Architecture Hive Thrift Hive Web CLI Server Interface Driver M S Parser Planner C l Metastore i Execution Optimizer e n t StorageHandler MapReduce HBase HDFS RDBMS Architecting the Future of Big Data Page 17 © Hortonworks Inc. 2011
  • 18. Hive + HBase Integration • For Input/OutputFormat, getSplits(), etc underlying HBase classes are used • Column selection and certain filters can be pushed down • HBase tables can be used with other (Hadoop native) tables and SQL constructs • Hive DDL operations are converted to HBase DDL operations via the client hook. – All operations are performed by the client – No two phase commit Architecting the Future of Big Data Page 18 © Hortonworks Inc. 2011
  • 19. Schema / Type Mapping Architecting the Future of Big Data Page 19 © Hortonworks Inc. 2011
  • 20. Schema Mapping •  Hive table + columns + column types <=> HBase table + column families (+ column qualifiers) •  Every field in Hive table is mapped in order to either – The table key (using :key as selector) – A column family (cf:) -> MAP fields in Hive – A column (cf:cq) •  Hive table does not need to include all columns in HBase •  CREATE TABLE short_urls( short_url string, url string, hit_count int, props map<string,string> ) WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, u:url, s:hits, p:") Architecting the Future of Big Data Page 20 © Hortonworks Inc. 2011
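The column-family mapping described above, as a readable sketch of the slide's DDL (the trailing `p:` entry maps a whole column family into a Hive MAP column):

```sql
CREATE TABLE short_urls(
  short_url string,
  url       string,
  hit_count int,
  props     map<string,string>  -- whole column family p: as key/value pairs
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,u:url,s:hits,p:");
```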
  • 21. Type Mapping • Recently added to Hive (0.9.0) • Previously all types were being converted to strings in HBase • Hive has: – Primitive types: INT, STRING, BINARY, DATE, etc – ARRAY<Type> – MAP<PrimitiveType, Type> – STRUCT<a:INT, b:STRING, c:STRING> • HBase does not have types – Bytes.toBytes() Architecting the Future of Big Data Page 21 © Hortonworks Inc. 2011
  • 22. Type Mapping • Table level property "hbase.table.default.storage.type" = "binary" • Type mapping can be given per column after # – Any prefix of "binary", e.g. u:url#b – Any prefix of "string", e.g. u:url#s – The dash char "-", e.g. u:url#- CREATE TABLE short_urls( short_url string, url string, hit_count int, props map<string,string> ) WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b,u:url#b,s:hits#b,p:#s") Architecting the Future of Big Data Page 22 © Hortonworks Inc. 2011
  • 23. Type Mapping • If the type is not a primitive or Map, it is converted to a JSON string and serialized • Still a few rough edges for schema and type mapping: – No Hive BINARY support in HBase mapping – No mapping of HBase timestamp (can only provide put timestamp) – No arbitrary mapping of Structs / Arrays into HBase schema Architecting the Future of Big Data Page 23 © Hortonworks Inc. 2011
  • 24. Bulk Load • Steps to bulk load: – Sample source data for range partitioning – Save sampling results to a file – Run CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner – Import HFiles into the HBase table • Ideal setup should be SET hive.hbase.bulk=true INSERT OVERWRITE TABLE web_table SELECT …. Architecting the Future of Big Data Page 24 © Hortonworks Inc. 2011
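The "ideal setup" from the slide, laid out readably. Note this shows the setting as the slide proposes it: `hive.hbase.bulk` was an aspirational switch at the time, and the manual HiveHFileOutputFormat path is what actually shipped. The source query is elided on the slide.

```sql
SET hive.hbase.bulk=true;

INSERT OVERWRITE TABLE web_table
SELECT ...;  -- source query elided on the slide
```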
  • 25. Filter Pushdown Architecting the Future of Big Data Page 25 © Hortonworks Inc. 2011
  • 26. Filter Pushdown • Idea is to pass down filter expressions to the storage layer to minimize scanned data • To access indexes at HDFS or HBase • Example: CREATE EXTERNAL TABLE users (userid LONG, email STRING, … ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,…") SELECT ... FROM users WHERE userid > 1000000 and email LIKE '%@gmail.com'; -> scan.setStartRow(Bytes.toBytes(1000000)) Architecting the Future of Big Data Page 26 © Hortonworks Inc. 2011
  • 27. Filter Decomposition • Optimizer pushes down the predicates to the query plan • Storage handlers can negotiate with the Hive optimizer to decompose the filter x > 3 AND upper(y) = 'XYZ' • Handle x > 3, send upper(y) = 'XYZ' as residual for Hive • Works with: key = 3, key > 3, etc key > 3 AND key < 100 • Only works against constant expressions Architecting the Future of Big Data Page 27 © Hortonworks Inc. 2011
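A sketch of the decomposition above, with the slide's abstract `x` and `y` stood in for by the `userid` key and `email` column of the users table from the previous slide (that substitution is illustrative, not from the slide):

```sql
SELECT *
FROM   users
WHERE  userid > 3            -- pushed down: handler turns this into a row-key range scan
  AND  upper(email) = 'XYZ'; -- residual: returned to Hive and evaluated after the scan
```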
  • 28. Security Aspects Towards fully secure deployments Architecting the Future of Big Data Page 28 © Hortonworks Inc. 2011
  • 29. Security – Big Picture • Security becomes more important to support enterprise level and multi tenant applications • 5 Different Components to ensure / impose security – HDFS – MapReduce – HBase – Zookeeper – Hive • Each component has: – Authentication – Authorization Architecting the Future of Big Data Page 29 © Hortonworks Inc. 2011
  • 30. HBase Security – Closer look • Released with HBase 0.92 • Fully optional module, disabled by default • Needs an underlying secure Hadoop release • SecureRPCEngine: optional engine enforcing SASL authentication – Kerberos – DIGEST-MD5 based tokens – TokenProvider coprocessor • Access control is implemented as a Coprocessor: AccessController • Stores and distributes ACL data via Zookeeper – Sensitive data is only accessible by HBase daemons – Client does not need to authenticate to zk Architecting the Future of Big Data Page 30 © Hortonworks Inc. 2011
  • 31. Hive Security – Closer look • Hive has different deployment options, security considerations should take into account different deployments • Authentication is only supported at Metastore, not on HiveServer, web interface, JDBC • Authorization is enforced at the query layer (Driver) • Pluggable authorization providers. Default one stores global/table/partition/column permissions in Metastore GRANT ALTER ON TABLE web_table TO USER bob; CREATE ROLE db_reader GRANT SELECT, SHOW_DATABASE ON DATABASE mydb TO ROLE db_reader Architecting the Future of Big Data Page 31 © Hortonworks Inc. 2011
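The authorization statements on the slide, laid out readably (statement terminators added):

```sql
GRANT ALTER ON TABLE web_table TO USER bob;

CREATE ROLE db_reader;
GRANT SELECT, SHOW_DATABASE ON DATABASE mydb TO ROLE db_reader;
```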
  • 32. Hive Deployment Option 1 Client CLI Driver M Authorization S C Parser Planner l Authentication i Metastore e Execution Optimizer n t A/A A/A MapReduce HBase A12n/A11N A12n/A11N HDFS RDBMS Architecting the Future of Big Data Page 32 © Hortonworks Inc. 2011
  • 33. Hive Deployment Option 2 Client CLI Driver M Authorization S C Parser Planner l Authentication i e Metastore n Execution Optimizer t A/A A/A MapReduce HBase A12n/A11N A12n/A11N HDFS RDBMS Architecting the Future of Big Data Page 33 © Hortonworks Inc. 2011
  • 34. Hive Deployment Option 3 Client JDBC/ODBC Hive Thrift Hive Web CLI Server Interface M Driver Authorization S C Parser Planner l Authentication i Metastore e Execution Optimizer n t A/A A/A MapReduce HBase A12n/A11N HDFS A12n/A11N RDBMS Architecting the Future of Big Data Page 34 © Hortonworks Inc. 2011
  • 35. Hive + HBase + Hadoop Security • Regardless of Hive’s own security, for Hive to work on secure Hadoop and HBase, we should: – Obtain delegation tokens for Hadoop and HBase jobs – Ensure to obey the storage level (HDFS, HBase) permission checks – In HiveServer deployments, authenticate and impersonate the user • Delegation tokens for Hadoop are already working • Obtaining HBase delegation tokens are released in Hive 0.9.0 Architecting the Future of Big Data Page 35 © Hortonworks Inc. 2011
  • 36. Future of Hive + HBase • Improve on schema / type mapping • Fully secure Hive deployment options • HBase bulk import improvements • Sortable signed numeric types in HBase • Filter pushdown: non key column filters • Hive random access support for HBase – https://cwiki.apache.org/HCATALOG/random-access- framework.html Architecting the Future of Big Data Page 36 © Hortonworks Inc. 2011
  • 37. References • Security – https://issues.apache.org/jira/browse/HIVE-2764 – https://issues.apache.org/jira/browse/HBASE-5371 – https://issues.apache.org/jira/browse/HCATALOG-245 – https://issues.apache.org/jira/browse/HCATALOG-260 – https://issues.apache.org/jira/browse/HCATALOG-244 – https://cwiki.apache.org/confluence/display/HCATALOG/Hcat+Security +Design • Type mapping / Filter Pushdown – https://issues.apache.org/jira/browse/HIVE-1634 – https://issues.apache.org/jira/browse/HIVE-1226 – https://issues.apache.org/jira/browse/HIVE-1643 – https://issues.apache.org/jira/browse/HIVE-2815 Architecting the Future of Big Data Page 37 © Hortonworks Inc. 2011
  • 38. Other Resources • Hadoop Summit – June 13-14 – San Jose, California – www.Hadoopsummit.org • Hadoop Training and Certification – Developing Solutions Using Apache Hadoop – Administering Apache Hadoop – Online classes available US, India, EMEA – http://hortonworks.com/training/ © Hortonworks Inc. 2012 Page 38
  • 39. Thanks Questions? Architecting the Future of Big Data Page 39 © Hortonworks Inc. 2011