6. Why Security?
• Apache Solr only provides minimal security features
“Solr allows any client with access to it to add, update, and delete documents (and of course search/read too), including access to the Solr configuration and schema files and the administrative user interface.” [1]
• In the past, deployed as a single server
“It is strongly recommended that the application server containing Solr be firewalled such the only clients with access to Solr are your own.” [1]
7. Why Security?
• SolrCloud driving adoption in Big Data space
• Now, a component of a multi-tenant Hadoop cluster
• Non-Solr users on cluster
• Solr communicates across machines and services
9. Why Apache Sentry?
• Sentry already established in Hadoop ecosystem
• Has a well-understood authentication model (Kerberos)
• Has a well-understood privilege/action model
• Security-focused project
• Solr focuses on the search engine
• Sentry focuses on security
11. Authentication
• Authentication: Verifying identity of a user or service
• Solr supports authenticating with dependent services (i.e. HDFS
and ZooKeeper*)
• Sentry goal: support other services / users authenticating with
Solr
• Consistent with other HTTP-level Hadoop services (e.g. Oozie
and HttpFs), Apache Sentry uses:
• Kerberos: a mutual authentication protocol that works on the
basis of “tickets”
• SPNego: a negotiation mechanism for selecting an underlying
authentication protocol
12. SPNego advantages
• HTTP Tools have built-in support for SPNego/Kerberos
• Web browsers
• curl (with --negotiate)
• HTTP libraries, including Apache HttpClient (used by solrj)
• Although an authentication (not authorization) protocol, can be
used for cluster-level access control
• Only grant Kerberos credentials to users who should have access to the cluster
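The HTTP side of SPNego can be sketched in a few lines: the server answers an unauthenticated request with `401` and a `WWW-Authenticate: Negotiate` challenge, and the client retries with an `Authorization: Negotiate <base64 token>` header. This is a minimal sketch only; the function names are hypothetical, and the token is a placeholder, since a real client obtains it from Kerberos via GSS-API (tools such as curl and HttpClient do this for you).

```python
import base64


def needs_negotiate(status, headers):
    """Return True when a response is a SPNego challenge (401 + Negotiate)."""
    challenge = headers.get("WWW-Authenticate", "")
    return status == 401 and challenge.split(" ")[0] == "Negotiate"


def negotiate_header(token):
    """Build the retry header; `token` stands in for a real GSS-API token."""
    encoded = base64.b64encode(token).decode("ascii")
    return {"Authorization": "Negotiate " + encoded}
```

A client would call `needs_negotiate` on the first response and, if it returns True, resend the request with the header from `negotiate_header`.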
13. Authentication Setup
• Server side: use the Sentry-provided web.xml, which has a Kerberos/SPNego-aware filter
• Have to set up keytabs/principals/JAAS configurations
• Client side: Sentry provides HttpClient / HttpSolrServer configuration for communicating with Kerberos/SPNego-aware Solr servers
• Have to set up keytabs/principals/JAAS configurations
• Cloudera Manager can do setup for you
15. Authorization
• Authorization: Controlling access to resources
• Solr does not provide collection/document authorization support
• Does support “hooks” via solr.xml and solrconfig.xml to override
request handler implementation
• Sentry uses these “hooks” to implement collection and document level
authorization
17. Collection-level Authorization
• Sentry supports role-based granting of privileges
• Each role can be granted QUERY, UPDATE, and/or administrative privileges on a collection
• Privileges stored in a “policy file” on HDFS:
[groups]
# Assigns each Hadoop group to its set of roles
dev_ops = engineer_role, ops_role
[roles]
# Assigns each role to its set of privileges
engineer_role = collection = source_code->action=Query, collection = source_code->action=Update
ops_role = collection = hbase_logs->action=Query
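To make the group → role → privilege chain concrete, here is a minimal sketch that parses a policy file of the shape shown above and answers "may this group perform this action on this collection?". It is an illustration only, not Sentry's parser: `parse_policy` and `is_allowed` are hypothetical names, and the sketch ignores anything the slide does not show (wildcards, admin privileges, multiple policy files).

```python
SAMPLE_POLICY = """
[groups]
# Assigns each Hadoop group to its set of roles
dev_ops = engineer_role, ops_role
[roles]
# Assigns each role to its set of privileges
engineer_role = collection = source_code->action=Query, collection = source_code->action=Update
ops_role = collection = hbase_logs->action=Query
"""


def parse_policy(text):
    """Parse policy text into {"groups": {group: [roles]}, "roles": {role: {(collection, action)}}}."""
    policy = {"groups": {}, "roles": {}}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            policy.setdefault(section, {})
            continue
        if section is None:
            continue  # ignore entries that appear before any section header
        name, _, value = line.partition("=")
        items = [item.strip() for item in value.split(",") if item.strip()]
        if section == "roles":
            privileges = set()
            for item in items:
                # Each item looks like: collection = source_code->action=Query
                resource, _, action = item.partition("->")
                collection = resource.partition("=")[2].strip()
                verb = action.partition("=")[2].strip().lower()
                privileges.add((collection, verb))
            policy["roles"][name.strip()] = privileges
        else:
            policy[section][name.strip()] = items
    return policy


def is_allowed(policy, group, collection, action):
    """True if any role assigned to `group` grants `action` on `collection`."""
    for role in policy["groups"].get(group, []):
        if (collection, action.lower()) in policy["roles"].get(role, set()):
            return True
    return False
```

With the sample policy, `is_allowed(parse_policy(SAMPLE_POLICY), "dev_ops", "source_code", "Update")` is True, while the same check for `"hbase_logs"` with `"Update"` is False because `ops_role` only grants Query there.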
18. Integrating Sentry and Solr
• Sentry integrated via “hooks” in request handlers:
• Specified per collection in solrconfig.xml:
• Sentry ships with its own version of solrconfig.xml with secure handlers,
called solrconfig.xml.secure
19. Administrative requests
• That covers queries/updates of collections, but what about administrative
actions such as getting the status of the cores?
• In SolrCloud, admin looks like a collection:
http://localhost:8983/solr/admin/cores?action=STATUS
• Can just follow this structure in Sentry:
sample_role = collection = admin->action=Query,
• Secure Admin Handlers controlled via cluster-wide “solr.xml” in
ZooKeeper. By default, you get Secure Admin Handlers if Sentry is
enabled
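The "admin looks like a collection" observation can be illustrated with a small sketch that pulls out the path segment following `/solr/` in a request URL; for an admin URL that segment is `admin`, so the same `collection = <name>->action=...` privilege shape applies. The function name is hypothetical and this is not how Sentry's handlers are implemented; it only shows the URL structure the slide relies on.

```python
from urllib.parse import urlsplit


def target_collection(url):
    """Return the path segment right after /solr/, or None if the URL has no such segment."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) >= 2 and segments[0] == "solr":
        return segments[1]
    return None
```

For `http://localhost:8983/solr/admin/cores?action=STATUS` this yields `admin`, which is the name used in the `sample_role` entry.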
20. Administrative requests
• Full privilege model documented here
• Examples (collection1 = arbitrary collection name):
Action             Required Privilege   Collection
select             QUERY                collection1
update/json        UPDATE               collection1
ThreadDumpHandler  QUERY                admin
22. Document-level authorization motivation
• Collection-level authorization useful when access control requirements
for documents are homogeneous
• Security requirements may require restricting access to a subset of
documents
• Consider “Confidential” and “Secret” documents. How to store with only
collection-level authorization?
• Pushes complexity to application
23. Document-level authorization model
• Instead of Policy File in HDFS:
[groups]
# Assigns each Hadoop group to its set of roles
dev_ops = engineer_role, ops_role
[roles]
# Assigns each role to its set of privileges
engineer_role = collection = source_code->action=Query, collection = source_code->action=Update
ops_role = collection = hbase_logs->action=Query
• Store authorization tokens in each document
• Many more documents than collections; doesn’t scale to store document-level info in Policy File
• Can use Solr’s built-in filtering capabilities to restrict access
24. Document-level authorization model
• A configurable field stores the authorization tokens
• The authorization tokens are Sentry roles, e.g. “ops_role”
[roles]
ops_role = collection = hbase_logs->action=Query
• Represents the roles that are allowed to view the document. To
view a document, the querying user must belong to at least one
role whose token is stored in the token field
• Can modify document permissions without restarting Solr
• Can modify role memberships without reindexing
25. Document-level authorization impl
• Intercepts the request via a SearchComponent
• SearchComponent adds an “fq” or FilterQuery
• Filter out all documents that don’t have “role1” or “role2” in authField
• Filters are cached, so the construction cost is paid only once
• Note: does not supersede collection-level authorization
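The filtering step can be sketched as a function that turns the querying user's roles into one extra "fq" parameter. This is a hedged illustration, not the SearchComponent's actual code: the function names are hypothetical, the exact query syntax Sentry generates may differ, and the `-*:*` "match nothing" filter for users without roles is an assumption of this sketch.

```python
def _quote(term):
    """Escape backslashes and double quotes, then wrap the role name in quotes."""
    return '"' + term.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_auth_filter(auth_field, roles):
    """Build a filter query keeping only documents whose auth field holds one of `roles`."""
    if not roles:
        # Assumption: a user with no roles should see nothing.
        return "-*:*"
    clause = " OR ".join(_quote(role) for role in sorted(roles))
    return f"{auth_field}:({clause})"


def add_auth_filter(params, auth_field, roles):
    """Return a copy of the request params with the auth filter appended to "fq"."""
    secured = dict(params)
    existing = secured.get("fq", [])
    if isinstance(existing, str):
        existing = [existing]
    secured["fq"] = list(existing) + [build_auth_filter(auth_field, roles)]
    return secured
```

Because the filter text depends only on the set of roles (sorted here for a stable string), users with the same roles produce the same "fq", which is what lets a filter cache reuse it.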
26. Document-level authorization config
• Configuration via solrconfig.xml.secure (per collection):
<!-- Set to true to enable document-level authorization -->
<bool name="enabled">false</bool>
<!-- Field where the auth tokens are stored in the document -->
<str name="sentryAuthField">sentry_auth</str>
<!-- Auth token defined to allow any role to access the document. Uncomment to enable. -->
<!--<str name="allRolesToken">*</str>-->
• No tokens = no access. To allow all users to access a document,
use the allRolesToken. Useful for getting started
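The access rule, including the "no tokens = no access" default and the optional allRolesToken, can be written as a small predicate. This is a sketch under stated assumptions: `can_view` is a hypothetical name, and treating the allRolesToken as granting access regardless of which roles the user holds is this sketch's reading of the slide, not confirmed Sentry behavior.

```python
def can_view(doc_tokens, user_roles, all_roles_token=None):
    """Decide whether a user with `user_roles` may see a document carrying `doc_tokens`."""
    tokens = set(doc_tokens)
    if not tokens:
        return False  # no tokens = no access
    if all_roles_token is not None and all_roles_token in tokens:
        return True  # document is explicitly open to every role
    # Otherwise the user needs at least one role stored in the token field.
    return bool(tokens & set(user_roles))
```

With `allRolesToken` left commented out (modelled here as `all_roles_token=None`), a `*` stored in the field is just another token and matches no role.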
28. Secure Impersonation
• But wait! My users don’t interact with Solr directly
• Custom web UI, load balancer, etc.
• Authorization won’t work!
• “user” is forgotten, request to Solr from “UI”
29. Secure Impersonation
• Secure impersonation: the ability of a “super-user” to submit
requests on behalf of another user
• Conceptually similar to “sudo” on Unix
• Limited to only groups/hosts that are explicitly configured to support it
• Identical to functionality provided by HDFS, Oozie
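The groups/hosts restriction can be sketched as a check that runs before a super-user's request is executed as another user. This is modelled loosely on Hadoop-style proxy-user rules (including `*` as a wildcard); the `ProxyRule` type, the `may_impersonate` function, and the example host and group names are all hypothetical, and the real configuration keys are not shown on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyRule:
    hosts: frozenset   # hosts the super-user may send requests from ("*" = any)
    groups: frozenset  # groups whose members may be impersonated ("*" = any)


def may_impersonate(rules, superuser, request_host, target_user_groups):
    """True if `superuser`, calling from `request_host`, may act for a user in `target_user_groups`."""
    rule = rules.get(superuser)
    if rule is None:
        return False  # only explicitly configured super-users may impersonate
    host_ok = "*" in rule.hosts or request_host in rule.hosts
    group_ok = "*" in rule.groups or bool(rule.groups & set(target_user_groups))
    return host_ok and group_ok


# Hypothetical configuration: a UI service user allowed from one host, for one group.
EXAMPLE_RULES = {
    "ui_service": ProxyRule(hosts=frozenset({"ui-host.example.com"}),
                            groups=frozenset({"dev_ops"})),
}
```

Both conditions must hold: a request from an unlisted host, or on behalf of a user outside the configured groups, is rejected even when the super-user itself authenticated successfully.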
30. Hue Search App UI
• Uses Secure Impersonation to integrate with its own security mechanisms
• Users can log in to Hue via LDAP or other auth mechanism
• Hue makes requests on behalf of logged-in user
• Only Hue user requires Kerberos keytab
• Seamlessly integrates with the collection and document-level access control
mechanisms
33. Index Test Setup
• 20-node cluster: 12 cores, 96 GB RAM, 12x 2TB disks, 10G Ethernet
• Cloudera Search-1.2.0, CDH 4.6, MR1, CentOS 6.4
• 260M tweets/docs, indexed across 17 fields
• 116 GB, ~800 JSON .gz files, ~130MB per file, 3-fold HDFS
replication
• 1 Solr server and 1 shard per node (44M docs per shard), no Solr
replication
• Uses MapReduceIndexerTool contrib. mapper/reducer slots = 2x/1x
number of cores
• Solr heap size = 20GB
• Record end-to-end indexing time, i.e., indexing + mtree merge + go
live
• Record average from 3 repeats
34. Index Performance Testing
• Left column is unsecured baseline.
• Center column is ~20% lower → HDFS security introduces ~20% performance overhead.
• Right column is ~same as center column → Solr security introduces no additional overhead.
35. Query Test Setup
• Same setup as MapReduce batch indexing
• Uses the output of MapReduce batch indexing
• 1 client, 30 threads per client
• Uses internal tool - QueryRunner
• Similar to SolrMeter and JMeter
• Query randomly sampled from fixed set of 10,000 strings
• Record per thread query throughput for 5 runs of 30 min each
36. Query Performance Testing
• Left column is unsecured baseline.
• Center column is ~13% lower → HDFS security introduces ~13% performance overhead.
• Right column is same as center column → Solr security introduces no additional overhead.
38. Future Work
• Support for Sentry service with improved APIs / performance /
integration
• Already supported for Hive/Impala
• Currently in development upstream
• “Lineage” security: data flows from one system to another and
retains security criteria
• Example: Index HBase data for full-text queries in Solr. HBase Table
and Cell-level security tags automatically applied to Solr Collections,
Documents, and Fields
39. Questions?
• Thanks for listening!
• More information / Want to contribute?
http://sentry.incubator.apache.org/
• Questions?