Access control in an open source search solution
Tom Mortimer, Flax
London Intranet show & tell: Intranet Search, 2010
  1. Access control in an open source search solution
     Tom Mortimer, Flax
     London Intranet show & tell: Intranet Search, 2010
  2. Flax
     • Search engine specialists
     • Formed in 2001 from the ashes of Muscat Ltd and Webtop as Lemur Consulting Ltd
     • Based in Cambridge UK
     • Contributors to and users of Xapian
     • Recently selected as UK Authorized Partner by Lucid Imagination
     • Customers include Mydeco, NLA, Durrants Ltd, Financial Times, MediaMiser, MySkreen
     Apache Lucene and Solr are trademarks of The Apache Software Foundation
  3. The customer: Tait Electronics Ltd.
     • A global leader in designing and delivering radio solutions
     • Customers include public safety agencies, government services, urban transport providers etc.
     • Corporate services based in New Zealand - network of worldwide offices and distributors
  4. The job
     • 12 million documents
     • Various formats (MS Office, OpenOffice, PDF etc.)
     • 99% English language
     • On three Sun Thumpers running Solaris/ZFS
     • Exported via CIFS to end users on Windows
     • User access using Unix permissions and ACLs
     • User authentication with LDAP
     • Available globally, but less than 1000 regular users
  5. Customer requirements
     • Search results in under 1s
     • Facets
     • User tagging
     • Results filtered by file permissions
     • Index kept up to date daily
     Customer had considered a variety of commercial search engines including search appliances, but rejected these in favour of an open source solution due to cost and flexibility
  6. Basics: Search engine
     • We chose Xapian
     • Open source (GPL)
     • Probabilistic ranking
     • Fast
     • Highly customisable with C++ API
     • We have over a decade of experience with it
     www.xapian.org
  7. Basics: Indexing
     • Implemented in Python
     • Rapid development
     • Readability and maintainability
     • Extractors use 'headless' OpenOffice.org processes
     • PDFs handled by pdftotext
     • Can scan and update entire corpus in 1 day
  8. Basics: Front end
     • Web app implemented in Python
     • WSGI: mod_wsgi on Apache 2
     • User authentication via mod_authnz_external / pam
     • Not very fast! but Xapian does all of the heavy lifting
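The front-end arrangement on this slide can be sketched roughly as follows. This is a minimal illustration, not the code from the talk: it assumes Apache has already authenticated the request and passes the user name in `REMOTE_USER` (the usual CGI/WSGI convention), and `run_search` is a hypothetical stand-in for the real Xapian-backed search.

```python
from urllib.parse import parse_qs


def run_search(query, username):
    """Hypothetical stand-in for the real search call; returns result lines."""
    return [f"result for {query!r} visible to {username}"]


def application(environ, start_response):
    """WSGI entry point; mod_wsgi looks for a callable named `application` by default."""
    # Assumption: the Apache auth module has set REMOTE_USER for this request.
    username = environ.get("REMOTE_USER")
    if not username:
        start_response("401 Unauthorized",
                       [("Content-Type", "text/plain; charset=utf-8")])
        return [b"authentication required\n"]

    # Read the search terms from a query-string parameter named "q" (an assumed name).
    params = parse_qs(environ.get("QUERY_STRING", ""))
    query = params.get("q", [""])[0]

    body = "\n".join(run_search(query, username)).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

The point of the sketch is that the web layer only needs the authenticated user name; the permission filtering itself happens in the search engine.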
  9. User tagging
     • Web app writes temporary file containing the tag info
     • Indexer watches directory with inotify
     • Indexer updates document terms immediately
     • Changes visible to search within seconds
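The tagging hand-off between web app and indexer might look something like the sketch below. Several things here are assumptions made for illustration: the JSON file format, the `.tag` suffix, the `XTAG` term prefix, and the in-memory `doc_terms` dict standing in for the Xapian index. The slide's inotify watch is replaced by a plain directory scan so that the example needs only the standard library; a real indexer would call `process_spool` whenever inotify reports a new file.

```python
import json
import os
import tempfile


def write_tag_request(spool_dir, doc_id, tag, user):
    """Web-app side: drop a small file describing the tag into the spool directory."""
    payload = {"doc_id": doc_id, "tag": tag, "user": user}
    # Write under a temporary name, then rename, so the indexer never reads a partial file.
    fd, tmp_path = tempfile.mkstemp(dir=spool_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    final_path = tmp_path[:-len(".tmp")] + ".tag"
    os.rename(tmp_path, final_path)
    return final_path


def process_spool(spool_dir, doc_terms):
    """Indexer side: apply every pending tag file to doc_terms, then delete it."""
    applied = 0
    for name in sorted(os.listdir(spool_dir)):
        if not name.endswith(".tag"):
            continue  # skip files that are still being written
        path = os.path.join(spool_dir, name)
        with open(path, encoding="utf-8") as handle:
            request = json.load(handle)
        # "XTAG" is a hypothetical prefix that keeps tag terms apart from text terms.
        term = "XTAG" + request["tag"].lower()
        doc_terms.setdefault(request["doc_id"], set()).add(term)
        os.remove(path)
        applied += 1
    return applied
```

Because the indexer only has to add one term to one document, the change can become searchable within seconds, without waiting for the daily scan.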
  11. Access Control: Plan A
     • Store ACLs and Unix permissions with each document in the index
     • Use a Xapian MatchDecider to filter search results by evaluating permissions for each user/document
     • (evaluation at search time)
     BUT:
     • This does not take account of permissions of parent directories
     • Noticeable overhead
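Plan A's search-time evaluation can be illustrated with a small sketch. It does not use the Xapian API: `filter_results` is a plain function playing the role that a MatchDecider would play, and `StoredPerms` is a hypothetical record of what might be stored with each document. The ACL handling is deliberately simplified to a set of user IDs granted read access; real ACL evaluation on Solaris/ZFS is considerably more involved.

```python
import stat
from dataclasses import dataclass, field


@dataclass
class StoredPerms:
    """Permissions captured at index time (hypothetical layout)."""
    owner_uid: int
    group_gid: int
    mode: int                                   # Unix mode bits, e.g. 0o640
    acl_read_uids: frozenset = field(default_factory=frozenset)  # simplified ACL


def can_read(perms, uid, gids):
    """Evaluate the stored permissions for one user against one document."""
    if uid in perms.acl_read_uids:
        return True
    if uid == perms.owner_uid:
        return bool(perms.mode & stat.S_IRUSR)
    if perms.group_gid in gids:
        return bool(perms.mode & stat.S_IRGRP)
    return bool(perms.mode & stat.S_IROTH)


def filter_results(results, uid, gids):
    """Keep only the (doc_id, perms) matches that the user may read."""
    return [doc_id for doc_id, perms in results if can_read(perms, uid, gids)]
```

Note that `can_read` looks only at the file's own permissions. A file with mode `0o644` inside a directory the user cannot traverse would still pass this check, which is the parent-directory problem named on the slide.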
  13. Access Control: Plan B
     • Check whether current user can read each file directly from the file server
     • This has the advantage of behaving exactly like the file system
     • No indexing lag
     • (evaluation at search time)
     BUT:
     • Very slow!
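A rough sketch of Plan B follows. How the original system checked readability on behalf of the searching user is not described in the slides, so the check is passed in as a callable; the default shown here uses `os.access`, which only tests the user the process itself runs as. The function name `first_readable` and its parameters are illustrative.

```python
import os


def _process_user_can_read(path):
    """Default check: can the user running this process read the path?"""
    return os.access(path, os.R_OK)


def first_readable(ranked_paths, limit, can_read=_process_user_can_read):
    """Walk the ranked matches and return up to `limit` paths that pass the check.

    Each call to `can_read` is assumed to hit the file server, so the cost
    grows with the number of candidates examined, not with `limit`.
    """
    readable = []
    for path in ranked_paths:
        if can_read(path):
            readable.append(path)
            if len(readable) == limit:
                break
    return readable
```

If a user can read only a small fraction of the matches, filling even one page of results may require checking a very large number of files, which is one plausible reason this approach turned out to be slow.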
  15. Access Control: Plan C
     • At indexing time, iterate user list for each document, and check readability with OS
     • Store a term for each user with each readable document
     • At search time, use this term as a Boolean filter
     • (evaluation at index time)
     • Very fast! Customer was happy with this solution.
     BUT:
     • Would be impractical for large user lists
     • Indexer lag of up to 1 day (in this installation)
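Plan C can be sketched with a toy in-memory inverted index. This is not the Xapian implementation: `TinyIndex`, the `XUSER` term prefix, and the `is_readable` callback (standing in for the operating-system check) are all assumptions for illustration. In Xapian the per-user terms would typically be added as Boolean terms and applied as a filter on the query.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add_document(self, doc_id, words, users, is_readable):
        """Index the text, then add one access term per user who can read the document."""
        for word in words:
            self.postings[word.lower()].add(doc_id)
        for user in users:
            # Permission is evaluated once, at index time.
            if is_readable(user, doc_id):
                self.postings["XUSER" + user].add(doc_id)

    def search(self, word, user):
        """Match the word, then intersect with the user's access term (Boolean filter)."""
        matches = self.postings.get(word.lower(), set())
        allowed = self.postings.get("XUSER" + user, set())
        return sorted(matches & allowed)
```

At search time the filter is a single set intersection, which is why this plan is fast. The cost moves to indexing: every document is checked against every user, so with 12 million documents and up to 1000 users that is on the order of 10^10 checks per full scan, and a permission change is not visible until the document is next indexed.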
  16. Hardware
     • 1 x Dell™ PowerEdge™ R710 Rack Mount Server
     • 2 x QuadCore E2550 Intel processors @ 2.26GHz
     • 6 x 300GB disks
     • Runs all indexing and search processes
  17. Result
  18. What did we learn?
     • Access Control is not trivial
     • The first approach isn't always (usually) the best
     • Compromise is essential unless the budget is infinite
     • There are many other possibilities we have not explored
     • Open Source can make a happy customer (but we knew that already!)
     [email_address] www.flax.co.uk @FlaxSearch
     Thank you!