Apache Hive 
August 2014 
© Hortonworks Inc. 2014 
Page 1 
Use cases and security solutions 
Thejas Nair 
@thejasn
What are we talking about?
• Introduce key security concepts 
• Use cases 
• Authorization solutions 
• Followed by specific use cases and 
experience at Yahoo! 
Authentication vs Authorization 
• Authentication 
– Verifying your identity 
– Enabled in Hadoop using Kerberos 
– More options with HiveServer2 
• Authorization 
– Verifying if you have permissions to perform this action 
Pic1 – https://flic.kr/p/5qQiJR Pic2 – https://flic.kr/p/3i4SW
What is Apache Hive? 
It depends on who you ask! 
https://flic.kr/p/nff9gY 
What is Apache Hive? 
It's a table-oriented storage layer!
It is a SQL database!
Components - The table storage layer 
[Diagram: Pig/MR on top of HCatalog; Data is stored in HDFS, Metadata in the Metastore]
Authorization – The table storage layer 
• Data – FileSystem 
– /hive/warehouse/…/table1/ 
– Traditional POSIX permissions 
– rwxr-x--- owner: thejas, group: dev 
– More flexibility with Access Control Lists 
– More flexibility with Apache Argus (incubating) 
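As a rough illustration of the POSIX model above, the following Python sketch evaluates a mode string such as rwxr-x--- for a given user. The function name and arguments are hypothetical, and ACLs and superuser handling are deliberately left out.

```python
def can_access(mode, owner, group, user, user_groups, want):
    """Check a POSIX-style mode string such as 'rwxr-x---'.

    want is one of 'r', 'w', 'x'. Exactly one permission class applies:
    owner first, then group, then other. ACLs are not modeled here.
    """
    if len(mode) != 9:
        raise ValueError("mode must have 9 characters, e.g. 'rwxr-x---'")
    if want not in ("r", "w", "x"):
        raise ValueError("want must be 'r', 'w' or 'x'")
    if user == owner:
        bits = mode[0:3]      # owner class
    elif group in user_groups:
        bits = mode[3:6]      # group class
    else:
        bits = mode[6:9]      # other class
    return want in bits


# Directory from the slide: rwxr-x--- owner: thejas, group: dev
print(can_access("rwxr-x---", "thejas", "dev", "thejas", set(), "w"))
```

With this mode, the owner can write, members of dev can read and list the directory, and everyone else is denied.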
Authorization – The table storage layer 
• Metadata 
– {name : table1, storage_info : {dir : /hive/…/table1}, columns: {..}, .. }
– Authorization ? 
Storage Based Authorization 
• Don’t add another source of truth for 
authorization! 
• Metadata access based on 
corresponding data access. 
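The idea can be sketched in Python as follows. This is an illustration only, not Hive's implementation: the operation names, the mapping from metadata operation to filesystem permission, and the shape of the permissions table are all assumptions made for the example.

```python
# Hypothetical mapping from a metadata operation to the filesystem
# permission required on the table's storage directory.
REQUIRED_FS_PERMISSION = {
    "describe_table": "r",
    "alter_table": "w",
    "drop_table": "w",
}


def authorize_metadata_op(op, table_dir, user, fs_permissions):
    """Allow a metadata operation only if the user holds the matching
    filesystem permission on the table directory.

    fs_permissions maps (directory, user) to a string such as 'rw'.
    """
    needed = REQUIRED_FS_PERMISSION[op]
    granted = fs_permissions.get((table_dir, user), "")
    return needed in granted


# Example data (made up): alice may only read the table directory.
perms = {("/warehouse/table1", "alice"): "r",
         ("/warehouse/table1", "thejas"): "rw"}
print(authorize_metadata_op("drop_table", "/warehouse/table1", "alice", perms))
```

The point of the design is that the permissions table is the filesystem itself, so there is no second policy store to keep in sync.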
Enabling Storage Based Authorization 
• Update configuration in metastore 
– http://s.apache.org/SBA 
– Ensure that only metastore server has access 
to its RDBMS 
Hive as a SQL query engine 
• Hive command line 
– bin/hive -e 'select * from ..'
– Same use case as Pig, MapReduce 
– Storage Based Authorization applicable here 
Hive as a SQL query engine 
• ODBC/JDBC application/tools 
– Adds HiveServer2 at the front 
– Query processing – same way as the command line
– Storage Based Authorization applicable here 
– Have query run as end user 
– Default configuration 
hive.server2.enable.doAs=true 
SBA: What is great about it?
• Simple. 
– One source of truth. Just manage the 
FileSystem permissions. 
• Flexible HDFS ACL support 
– Requires upcoming Hive 0.14 release.
SBA: What is missing?
• Access control at row and column level 
– FileSystem permissions are at the level of directories and files
Fine grained control: prerequisites
• Data access API should be fine grained
– API needs support for row/column concept
• HiveServer2?
– Data server for ODBC/JDBC
– SQL as API supports selecting rows, columns
SQL standards based authorization 
• Fine grained authorization with 
HiveServer2 
• Grant/Revoke statements 
• Based on SQL standard 
SQL std based auth: How it works 
• Compile Query 
• -> Query Plan 
• -> Actions required on objects 
– (e.g. READ : table1, DROP : table2)
• -> Privileges on objects
– (e.g. SELECT : table1, OWNER : table2)
• Check if user has required privileges 
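The pipeline above can be sketched in Python. The two action-to-privilege pairs (READ to SELECT, DROP to OWNER) come from the slide; the function names and the shape of the grants table are hypothetical, and query compilation itself is not modeled.

```python
# Action -> privilege pairs taken from the slide's examples.
ACTION_TO_PRIVILEGE = {"READ": "SELECT", "DROP": "OWNER"}


def required_privileges(actions):
    """Translate (action, object) pairs from the query plan into
    (privilege, object) pairs."""
    return {(ACTION_TO_PRIVILEGE[action], obj) for action, obj in actions}


def is_authorized(user, actions, grants):
    """grants maps a user to the set of (privilege, object) pairs held.
    The query is allowed only if every required privilege is held."""
    return required_privileges(actions) <= grants.get(user, set())


# Actions a compiled query might need (illustrative input).
plan_actions = [("READ", "table1"), ("DROP", "table2")]
grants = {"alice": {("SELECT", "table1")},
          "thejas": {("SELECT", "table1"), ("OWNER", "table2")}}
print(is_authorized("thejas", plan_actions, grants))
```

Checking the whole set before execution means a query is rejected up front rather than failing partway through.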
Authorization Policy 
• GRANT <PRIVILEGE> ON <OBJECT> TO <USERS>
• REVOKE <PRIVILEGE> ON <OBJECT> FROM <USERS>
• <USERS> can be a user or a role
• Delegate management of privileges/roles
• Hive ‘DBA’ can be added to ‘ADMIN’ role 
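A minimal Python sketch of such a policy store, assuming privileges can be granted to either a user or a role and that a user inherits the privileges of every role held. The class and method names are hypothetical; users and roles share one namespace here for brevity, and delegation (grant option) is not modeled.

```python
from collections import defaultdict


class Policy:
    """Toy grant/revoke store keyed by principal (user or role)."""

    def __init__(self):
        self.grants = defaultdict(set)   # principal -> {(privilege, object)}
        self.roles = defaultdict(set)    # user -> {role names}

    def grant(self, privilege, obj, principal):
        self.grants[principal].add((privilege, obj))

    def revoke(self, privilege, obj, principal):
        self.grants[principal].discard((privilege, obj))

    def add_to_role(self, user, role):
        self.roles[user].add(role)

    def has_privilege(self, user, privilege, obj):
        # A user holds a privilege directly or through any of its roles.
        principals = {user} | self.roles.get(user, set())
        return any((privilege, obj) in self.grants.get(p, set())
                   for p in principals)


policy = Policy()
policy.grant("SELECT", "table1", "analysts")   # grant to a role
policy.add_to_role("alice", "analysts")
print(policy.has_privilege("alice", "SELECT", "table1"))
```

Granting to roles rather than individual users keeps the number of grants small when many users need the same access.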
Fine Grained Authorization 
• Supported using views 
– Grant access to view, not base table 
– Select clause – select columns 
– Where clause – select rows 
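To show how a view narrows both columns and rows, here is a small sketch using Python's built-in sqlite3 module as a stand-in for Hive. SQLite has no GRANT statement, so only the view mechanics are shown; the table, view, and data are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("ann", "dev", 100), ("bob", "ops", 90), ("cat", "dev", 120)],
)

# The SELECT list hides the salary column; the WHERE clause hides
# rows from other departments.
conn.execute(
    "CREATE VIEW dev_names AS "
    "SELECT name, dept FROM employees WHERE dept = 'dev'"
)

cursor = conn.execute("SELECT * FROM dev_names ORDER BY name")
view_columns = [d[0] for d in cursor.description]
view_rows = cursor.fetchall()
print(view_columns, view_rows)
```

In Hive the final step would be granting SELECT on the view, and not on the base table, to the users who should see only this slice.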
Restrictions 
• Disallows features that bypass the fine 
grained authorization checks. 
• dfs commands, transform clause, creating UDFs
• Admin can add permanent UDFs
SQL std based auth: Query processing 
• Grant access on files for HiveServer2 
process user 
• Run queries as this user 
– Configure hive.server2.enable.doAs=false 
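The effect of the hive.server2.enable.doAs setting on whose filesystem permissions matter can be summarized in a short Python sketch. The function and the user names are hypothetical; only the true/false behavior is taken from the slides.

```python
def effective_user(connecting_user, server_user, do_as):
    """Return the identity the query runs as.

    do_as=True  -> query runs as the connecting end user
                   (suits Storage Based Authorization).
    do_as=False -> query runs as the HiveServer2 process user
                   (suits SQL standards based authorization).
    """
    return connecting_user if do_as else server_user


# 'alice' and 'hive' are placeholder names for this example.
print(effective_user("alice", "hive", do_as=False))
```

With doAs disabled, end users need no direct access to the underlying files, which is what lets HiveServer2 enforce row and column restrictions.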
Extending Hive Authorization 
• Authorization plugin API 
• Apache Argus is the first user
Hive default authorization 
• Grant/revoke based access control 
• Insecure/incomplete model
• Insecure model for Hive command line
Conclusion 
• Playing well with each other 
1. Metadata authorization using Storage Based 
Authorization 
2. Fine grained authorization options in 
HiveServer2 
3. Both 1 & 2 
Use Cases at Yahoo! 
Presented by Chris Drome | August 20, 2014
Overview of Use Cases 
§ Column and row level access controls 
› Hive 0.13 SQL Standards Based Hive Authorization 
• Authorization model managed by metastore 
› HiveServer2 
• Serving engine with authorization plugin 
› Views 
• Fine grain authorization on a table 
§ (Limited) Authorization for Hive CLI 
› HCatalog server-side security 
› HDFS file permission based authorization (StorageBasedAuthorizationProvider) 
› HiveMetastoreAuthorizationProvider plugin 
26 Yahoo Confidential & Proprietary
The Players 
§ Producers 
› ETL jobs load data to grid 
› Primarily Pig jobs 
› Some MR jobs 
› Owners of the data (read/write file permissions) 
• Owner of directories and files 
§ Consumers 
› Consume some subset of the data
› Readers of the data (read-only file permissions) 
• Member of group with read-only permissions 
The Challenges 
§ Producers 
› Latency SLAs on a large volume of data 
› Responsible for managing data 
• Reloading data, rolling up data, archiving data 
› Responsible for managing access to data (groups) 
§ Consumers 
› Access controlled by membership in consumer group 
› Access controls at column or row level not possible 
› Limited to one group per table 
› Access may be through Pig, Hive, MR, BI tools, etc. 
Fine Grain Access Control with HiveServer2 
§ HiveServer2 as query execution engine 
§ HiveServer2 responsible for verifying authorization 
§ HiveServer2 runs as “super-user” with read privileges 
› Connecting user doesn’t have access permissions on underlying files 
› Executes query on behalf of connecting user 
§ Define arbitrary access controls with views on tables 
› Able to restrict by columns and/or rows 
› Grant access to individual users 
§ Prototype with Sentry as proof-of-concept 
(Limited) Authorization for Hive CLI 
§ Not practical to prevent use of Hive CLI 
§ Hive CLI could be used to circumvent HS2-based authorization 
§ HCatalog server-side security uses StorageBasedAuthorizationProvider 
to check HDFS access permissions 
› Chain with an authorization plugin (HiveMetastoreAuthorizationProvider) 
§ Perform HCatalog-based authorization of DDL tasks 
› Prevent users from creating/dropping objects in databases without authorization 
§ Perform HCatalog-based authorization for data access 
§ Simple prototype as proof-of-concept 
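Chaining a storage-based check with a second authorization plugin can be sketched in Python as below. This assumes every provider in the chain must approve a request; the provider functions, request fields, and policy data are hypothetical and do not mirror the actual HiveMetastoreAuthorizationProvider interface.

```python
def make_storage_provider(fs_permissions):
    """fs_permissions maps (path, user) to a string such as 'rw'."""
    def provider(request):
        granted = fs_permissions.get((request["path"], request["user"]), "")
        return request["fs_permission"] in granted
    return provider


def make_ddl_provider(allowed):
    """allowed is a set of (user, database) pairs permitted to
    create or drop objects in that database."""
    def provider(request):
        if request["operation"] not in ("create", "drop"):
            return True  # this provider only restricts DDL
        return (request["user"], request["database"]) in allowed
    return provider


def chain(providers):
    """Authorize only if every provider in the chain approves."""
    def authorize(request):
        return all(p(request) for p in providers)
    return authorize


authorize = chain([
    make_storage_provider({("/warehouse/db1", "alice"): "rw"}),
    make_ddl_provider({("thejas", "db1")}),
])
request = {"user": "alice", "operation": "drop", "database": "db1",
           "path": "/warehouse/db1", "fs_permission": "w"}
print(authorize(request))
```

Here alice has write access on the directory but is not on the DDL allow-list, so the chain rejects the drop even though the storage check alone would pass.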

August 2014 HUG : Hive 13 Security
