Secure Search - Using Apache Sentry to Add Authentication and Authorization Support to Solr: Presented by Gregory Chanan, Cloudera

Presented at Lucene/Solr Revolution 2014


  1. Secure Solr With Apache Sentry
     Gregory Chanan, Engineer @ Cloudera
     gchanan AT cloudera.com
  2. Who Am I?
     •  Software Engineer at Cloudera
     •  Apache Solr Committer
     •  Apache Sentry Committer (incubating)
     •  Apache HBase Committer
  3. Overview
     •  Motivation
        •  Why security for Solr / SolrCloud?
        •  Why Apache Sentry?
     •  Authentication
     •  Authorization
        •  Collection-level
        •  Document-level
     •  Secure Impersonation
     •  Performance
     •  Future Work
  5. Why Security?
     •  Apache Solr only provides minimal security features
        “Solr allows any client with access to it to add, update, and delete documents (and of course search/read too), including access to the Solr configuration and schema files and the administrative user interface.” [1]
     •  In the past, deployed as a single server
        “It is strongly recommended that the application server containing Solr be firewalled such the only clients with access to Solr are your own.” [1]
  6. Why Security?
     •  SolrCloud driving adoption in Big Data space
     •  Now, a component of a multi-tenant Hadoop cluster
        •  Non-Solr users on cluster
        •  Solr communicates across machines and services
  8. Why Apache Sentry?
     •  Sentry already established in Hadoop ecosystem
        •  Has a well-understood authentication model (Kerberos)
        •  Has a well-understood privilege/action model
     •  Security-focused project
        •  Solr focus on Search Engine
        •  Sentry focus on Security
  10. Authentication
      •  Authentication: Verifying identity of a user or service
      •  Solr supports authenticating with dependent services (i.e. HDFS and ZooKeeper*)
      •  Sentry goal: support other services / users authenticating with Solr
      •  Consistent with other HTTP-level Hadoop services (e.g. Oozie and HttpFs), Apache Sentry uses:
         •  Kerberos: a mutual authentication protocol that works on the basis of “tickets”
         •  SPNego: a negotiation mechanism for selecting an underlying authentication protocol
  11. SPNego advantages
      •  HTTP tools have built-in support for SPNego/Kerberos
         •  Web browsers
         •  curl (with --negotiate)
         •  HTTP libraries, including Apache HttpClient (used by solrj)
      •  Although an authentication (not authorization) protocol, can be used for cluster-level access control
         •  Only grant kerberos credentials to users who should have access to the cluster
  12. Authentication Setup
      •  Server side: use Sentry-provided web.xml which has a kerberos/SPNego aware filter
         •  Have to set up keytabs/principals/JAAS configurations
      •  Client side: Sentry provides HttpClient / HttpSolrServer configuration for communicating with kerberos/SPNego aware Solr servers
         •  Have to set up keytabs/principals/JAAS configurations
      •  Cloudera Manager can do setup for you
  14. Authorization
      •  Authorization: Controlling access to resources
      •  Solr does not provide collection/document authorization support
         •  Does support “hooks” via solr.xml and solrconfig.xml to override request handler implementation
      •  Sentry uses these “hooks” to implement collection and document level authorization
  16. Collection-level Authorization
      •  Sentry supports role-based granting of privileges
         •  Each role can be granted QUERY, UPDATE, and/or administrative privileges on a collection
      •  Privileges stored in a “policy file” on HDFS:

         [groups]
         # Assigns each Hadoop group to its set of roles
         dev_ops = engineer_role, ops_role
         [roles]
         # Assigns each role to its set of privileges
         engineer_role = collection = source_code->action=Query, collection = source_code->action=Update
         ops_role = collection = hbase_logs->action=Query
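To make the group -> role -> privilege chain in the policy file concrete, here is a minimal Python sketch that parses the example policy from the slide and answers "may a member of these groups perform this action on this collection?". It is an illustration only, not Sentry's actual parser: the function names `parse_policy` and `is_allowed` are hypothetical, and the sketch assumes each privilege has exactly the `collection = NAME->action=ACTION` shape shown above.

```python
import re

# The example policy file from the slide, reproduced as a string.
POLICY = """\
[groups]
# Assigns each Hadoop group to its set of roles
dev_ops = engineer_role, ops_role
[roles]
# Assigns each role to its set of privileges
engineer_role = collection = source_code->action=Query, collection = source_code->action=Update
ops_role = collection = hbase_logs->action=Query
"""

# Matches one privilege of the form "collection = NAME->action=ACTION".
PRIVILEGE = re.compile(r"collection\s*=\s*(\w+)\s*->\s*action\s*=\s*(\w+)")


def parse_policy(text):
    """Return (groups, roles): group -> set of roles, role -> set of (collection, action)."""
    groups, roles = {}, {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        # Split on the first "=" only; the value itself contains "=" signs.
        name, _, value = line.partition("=")
        name = name.strip()
        if section == "groups":
            groups[name] = {r.strip() for r in value.split(",") if r.strip()}
        elif section == "roles":
            roles[name] = {(c, a.lower()) for c, a in PRIVILEGE.findall(value)}
    return groups, roles


def is_allowed(groups, roles, user_groups, collection, action):
    """True if any role of any of the user's groups grants the action on the collection."""
    wanted = (collection, action.lower())
    return any(
        wanted in roles.get(role, ())
        for group in user_groups
        for role in groups.get(group, ())
    )


GROUPS, ROLES = parse_policy(POLICY)
```

With the slide's policy, a member of `dev_ops` can query and update `source_code` and query `hbase_logs`, but cannot update `hbase_logs`; a user in no listed group gets nothing.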
  17. Integrating Sentry and Solr
      •  Sentry integrated via “hooks” in request handlers:
         •  Specified per collection in solrconfig.xml:
      •  Sentry ships with its own version of solrconfig.xml with secure handlers, called solrconfig.xml.secure
  18. Administrative requests
      •  That covers queries/updates of collections, but what about administrative actions such as getting the status of the cores?
      •  In SolrCloud, admin looks like a collection:
         http://localhost:8983/solr/admin/cores?action=STATUS
      •  Can just follow this structure in Sentry:
         sample_role = collection = admin->action=Query,
      •  Secure Admin Handlers controlled via cluster-wide “solr.xml” in ZooKeeper. By default, you get Secure Admin Handlers if Sentry is enabled
  19. Administrative requests
      •  Full privilege model documented here
      •  Examples (collection1 = arbitrary collection name):

         Action              Required Privilege   Collection
         select              QUERY                collection1
         update/json         UPDATE               collection1
         ThreadDumpHandler   QUERY                admin
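The idea that "admin looks like a collection" can be sketched as a small lookup from request URL to the (collection, privilege) pair that has to be checked. The Python below is a hedged illustration covering only the rows shown on these slides (`select`, `update/json`, and the core STATUS request); the table contents beyond those rows, the function name `required_privilege`, and the choice to return `None` for anything unknown are assumptions, not Sentry's real privilege model.

```python
from urllib.parse import urlparse, parse_qs

# Handler -> privilege for ordinary collections (only the rows from the slide).
COLLECTION_HANDLERS = {"select": "QUERY", "update/json": "UPDATE"}

# (admin handler, action) -> privilege on the pseudo-collection "admin".
# Only the STATUS example from the slide is listed; other actions are
# deliberately left out rather than guessed.
ADMIN_ACTIONS = {("cores", "STATUS"): "QUERY"}


def required_privilege(url):
    """Return (collection, privilege) for a Solr request URL, or None if unknown."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    # Expect /solr/<collection>/<handler...>
    if len(parts) < 3 or parts[0] != "solr":
        return None
    collection, handler = parts[1], "/".join(parts[2:])
    if collection == "admin":
        action = parse_qs(parsed.query).get("action", [""])[0].upper()
        privilege = ADMIN_ACTIONS.get((handler, action))
    else:
        privilege = COLLECTION_HANDLERS.get(handler)
    return (collection, privilege) if privilege else None
```

A caller would treat `None` as "deny", so that a request the table does not recognize is never allowed by accident.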
  21. Document-level authorization motivation
      •  Collection-level authorization useful when access control requirements for documents are homogeneous
      •  Security requirements may require restricting access to a subset of documents
      •  Consider “Confidential” and “Secret” documents. How to store with only collection-level authorization?
         •  Pushes complexity to application
  22. Document-level authorization model
      •  Instead of Policy File in HDFS:

         [groups]
         # Assigns each Hadoop group to its set of roles
         dev_ops = engineer_role, ops_role
         [roles]
         # Assigns each role to its set of privileges
         engineer_role = collection = source_code->action=Query, collection = source_code->action=Update
         ops_role = collection = hbase_logs->action=Query

      •  Store authorization tokens in each document
         •  Many more documents than collections; doesn’t scale to store document-level info in Policy File
         •  Can use Solr’s built-in filtering capabilities to restrict access
  23. Document-level authorization model
      •  A configurable field stores the authorization tokens
      •  The authorization tokens are Sentry roles, e.g. “ops_role”

         [roles]
         ops_role = collection = hbase_logs->action=Query

      •  Represents the roles that are allowed to view the document. To view a document, the querying user must belong to at least one role whose token is stored in the token field
      •  Can modify document permissions without restarting Solr
      •  Can modify role memberships without reindexing
  24. Document-level authorization impl
      •  Intercepts the request via a SearchComponent
      •  SearchComponent adds an “fq” or FilterQuery
         •  Filter out all documents that don’t have “role1” or “role2” in authField
      •  Filters are cached, so the construction cost is paid only once
      •  Note: does not supersede collection-level authorization
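A rough sketch of the filter-query construction described on this slide: given the querying user's roles, build an `fq` string that keeps only documents whose auth field contains at least one of them. This is Python pseudocode for the idea, not the Sentry SearchComponent itself; the function name `build_role_filter` and the exact query syntax Sentry emits are assumptions.

```python
def build_role_filter(auth_field, roles):
    """Build a Solr filter query matching documents tagged with any of the roles.

    Roles are quoted (with backslashes and quotes escaped) so that unusual
    role names cannot change the structure of the query, and sorted so that
    the same set of roles always produces the same string.
    """
    if not roles:
        # No roles means no tokens can match; refuse rather than build a
        # filter that might accidentally match everything.
        raise PermissionError("user has no roles; no documents are visible")

    def quote(role):
        escaped = role.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

    clauses = " OR ".join(quote(r) for r in sorted(roles))
    return f"{auth_field}:({clauses})"
```

Sorting the roles is a small design choice that fits the caching point above: two requests from users with the same roles yield an identical filter string, so a cache keyed on the filter can be reused.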
  25. Document-level authorization config
      •  Configuration via solrconfig.xml.secure (per collection):

         <!-- Set to true to enable document-level authorization -->
         <bool name="enabled">false</bool>
         <!-- Field where the auth tokens are stored in the document -->
         <str name="sentryAuthField">sentry_auth</str>
         <!-- Auth token defined to allow any role to access the document.
              Uncomment to enable. -->
         <!--<str name="allRolesToken">*</str>-->

      •  No tokens = no access. To allow all users to access a document, use the allRolesToken. Useful for getting started
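The visibility rule implied by this configuration ("no tokens = no access", plus an optional token that admits every role) can be modeled in a few lines of Python. This is a toy model of the semantics stated on the slide, assuming the `sentry_auth` field holds a list of role tokens; the function name `can_view` and the sample documents are hypothetical.

```python
def can_view(doc, user_roles, auth_field="sentry_auth", all_roles_token=None):
    """Return True if a user holding user_roles may see doc.

    all_roles_token is None when the allRolesToken setting is left
    commented out, as in the default configuration above.
    """
    tokens = set(doc.get(auth_field, ()))
    if not tokens:
        return False  # no tokens = no access
    if all_roles_token is not None and all_roles_token in tokens:
        return True  # document is open to every role
    return bool(tokens & set(user_roles))


# Hypothetical sample documents.
tagged = {"id": "1", "sentry_auth": ["ops_role"]}
untagged = {"id": "2"}
public = {"id": "3", "sentry_auth": ["*"]}
```

Note that with the setting commented out, a document tagged `*` is visible only to a user who literally holds a role named `*`; enabling `allRolesToken` is what gives the token its special meaning.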
  27. Secure Impersonation
      •  But wait! My users don’t interact with Solr directly
         •  Custom web UI, load balancer, etc.
      •  Authorization won’t work!
         •  “user” is forgotten, request to Solr from “UI”
  28. Secure Impersonation
      •  Secure impersonation: the ability of a “super-user” to submit requests on behalf of another user
         •  Conceptually similar to “sudo” on Unix
         •  Limited to only groups/hosts that are explicitly configured to support it
         •  Identical to functionality provided by HDFS, Oozie
  29. Hue Search App UI
      •  Uses Secure Impersonation to integrate with its own security mechanisms
         •  Users can login to Hue via LDAP or other auth mechanism
         •  Hue makes requests on behalf of logged in user
         •  Only Hue user requires kerberos keytab
      •  Seamlessly integrates with the collection and document-level access control mechanisms
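The impersonation check described above (a super-user may act for another user, but only from configured hosts and only for members of configured groups) might look roughly like the following Python sketch. The configuration layout, the host and group names, and the `do_as` argument are all hypothetical stand-ins chosen for illustration; the real settings and request parameter are defined by the Sentry/Solr integration, not by this sketch.

```python
# Hypothetical proxy-user configuration: which super-users may impersonate,
# from which hosts, and on behalf of members of which groups.
PROXY_USERS = {
    "hue": {"hosts": {"hue.example.com"}, "groups": {"analysts", "dev_ops"}},
}


def effective_user(authenticated_user, remote_host, do_as=None, do_as_groups=()):
    """Return the user that authorization checks should be run against.

    authenticated_user is the Kerberos-authenticated caller (e.g. the Hue
    service); do_as is the end user it claims to act for, if any.
    """
    if do_as is None or do_as == authenticated_user:
        return authenticated_user  # ordinary, non-impersonated request
    rule = PROXY_USERS.get(authenticated_user)
    if rule is None:
        raise PermissionError(f"{authenticated_user} may not impersonate anyone")
    if remote_host not in rule["hosts"]:
        raise PermissionError(f"impersonation not allowed from {remote_host}")
    if not rule["groups"] & set(do_as_groups):
        raise PermissionError(f"{authenticated_user} may not impersonate {do_as}")
    return do_as
```

The returned name is what the collection-level and document-level checks would then use, which is why only the Hue service itself needs a keytab while its end users are still authorized individually.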
  31. Performance Testing
      •  Goal is to measure overhead of:
         •  Kerberos Authentication
         •  Sentry Collection-Level Authorization
      •  Measure index, query overhead separately
  32. Index Test Setup
      •  20-node cluster: 12 cores, 96 GB RAM, 12x 2TB disks, 10G Ethernet
      •  Cloudera Search-1.2.0, CDH 4.6, MR1, CentOS 6.4
      •  260M tweets/docs, indexed across 17 fields
      •  116 GB, ~800 JSON .gz files, ~130MB per file, 3-fold HDFS replication
      •  1 Solr server and 1 shard per node (44M docs per shard), no Solr replication
      •  Uses MapReduceIndexerTool contrib; mapper/reducer slots = 2x/1x number of cores
      •  Solr heap size = 20GB
      •  Record end-to-end indexing time, i.e., indexing + mtree merge + go live
      •  Record average from 3 repeats
  33. Index Performance Testing
      •  Left column is unsecured baseline.
      •  Center column is ~20% lower → HDFS security introduces ~20% performance overhead.
      •  Right column is ~same as center column → Solr security introduces no additional overhead.
  34. Query Test Setup
      •  Same setup as MapReduce batch indexing
      •  Uses the output of MapReduce batch indexing
      •  1 client, 30 threads per client
      •  Uses internal tool - QueryRunner
         •  Similar to SolrMeter and JMeter
      •  Query randomly sampled from fixed set of 10,000 strings
      •  Record per thread query throughput for 5 runs of 30 min each
  35. Query Performance Testing
      •  Left column is unsecured baseline.
      •  Center column is ~13% lower → HDFS security introduces ~13% performance overhead.
      •  Right column is same as center column → Solr security introduces no additional overhead.
  37. Future Work
      •  Support for Sentry service with improved APIs / performance / integration
         •  Already supported for Hive/Impala
         •  Currently in development upstream
      •  “Lineage” security: data flows from one system to another and retains security criteria
         •  Example: Index HBase data for full-text queries in Solr. HBase Table and Cell-level security tags automatically applied to Solr Collections, Documents, and Fields
  38. Questions?
      •  Thanks for listening!
      •  More information / Want to contribute?
         http://sentry.incubator.apache.org/
      •  Questions?
