Efficient Scalable Search in a Multi-Tenant Environment
Harry Hight
©2014 Bloomberg L.P.

Presented at Lucene/Solr Revolution 2014


  1. Efficient Scalable Search in a Multi-Tenant Environment
     Harry Hight
     ©2014 Bloomberg L.P.
  2. Overview
     • Background
     • Architecture
     • Scale
     • Security
     • Questions
  3. Background
     • Bloomberg Vault – hosted communication archive
     • Explosive growth of enterprise data communications
     • Compliance for Regulated Industries (e.g. e-mail, chat, mobile, voice, social media, files)
     • Private Cloud
     • E-Discovery - large historical data sets, but small query volume
     • Search to respond to litigation requests accurately and in a timely manner
     • Reconstruct communications across all channels and types
     • Extraction of large data sets from special storage (WORM)
     [Diagram labels: Query, User, Index, Results, Extraction]
  4. Sizing
     • 80 billion documents
       • And growing
     • Average document size is 50 KB
       • Large variance - 1 KB to hundreds of MB
     • Hundreds of indexed fields
       • There is a lot of metadata that goes along with communication
     • <10 searches/second
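For a rough sense of scale, the two headline figures on this slide can be multiplied out. The sketch below is only a back-of-the-envelope estimate: it assumes decimal units (1 KB = 1,000 bytes), treats 50 KB as a true mean, and measures raw document volume rather than index size, which the slides do not state. The function name is illustrative.

```python
# Back-of-the-envelope corpus size from the slide's figures.
# Assumption: decimal units (1 KB = 1,000 bytes); the slide does not say which.
DOCUMENTS = 80_000_000_000   # 80 billion documents
AVG_DOC_BYTES = 50 * 1_000   # 50 KB average document size


def raw_corpus_bytes(documents: int = DOCUMENTS, avg_bytes: int = AVG_DOC_BYTES) -> int:
    """Total raw document volume in bytes (not the size of the index)."""
    return documents * avg_bytes


if __name__ == "__main__":
    petabytes = raw_corpus_bytes() / 1_000**5
    print(f"~{petabytes:.0f} PB of raw documents")
```

Under these assumptions the raw corpus is on the order of 4 PB, which is consistent with the later point that shards cannot all be kept loaded.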
  6. Architecture
     • Massive scale - shards have to be left offline until needed
     • Load only the shards needed to serve a search request
     • Searches normally require ~30 shards, but can range from 1 to several hundred depending on application
     • Open shards cached in case they are needed again
     • Indexing is an external batch process
     [Diagram labels: Search Manager, Shard Mapping, Solr instances, Shards]
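One way to picture "load only the shards needed" is a lookup against the shard mapping before any shard is opened. The sketch below is illustrative only: the `Shard` record, the per-tenant and per-date-range partitioning, and the `shards_for_request` function are assumptions made for the example, not details given in the slides.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class Shard:
    name: str     # hypothetical shard identifier
    tenant: str   # assumption: each shard belongs to a single tenant
    start: date   # first day covered by the shard (inclusive)
    end: date     # last day covered by the shard (inclusive)


def shards_for_request(mapping: Iterable[Shard], tenant: str,
                       start: date, end: date) -> List[Shard]:
    """Return only the shards a request must touch, newest first."""
    hits = [s for s in mapping
            if s.tenant == tenant and s.start <= end and s.end >= start]
    # Newest first, so the most recent data can be searched before older data.
    return sorted(hits, key=lambda s: s.end, reverse=True)
```

Every shard outside the returned list can stay offline for this request, which is what keeps the per-search footprint near the ~30 shards mentioned above rather than the whole archive.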
  8. Incremental Search
     • Calculating the full result set is time consuming
     • Query cache usually cold due to unload
     • Shard loading takes time
     • Users want to review a subset before exporting
     • Shards and results are date sorted
     • Search shards sequentially, and return partial results as available
     • Creates a streaming interface
     [Diagram labels: Applications, Search Manager, Solr instances, Shards]
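The streaming behaviour described on this slide maps naturally onto a generator. The following is a minimal sketch, assuming the caller supplies shard names already ordered newest first and a `search_shard` callable that loads and queries one shard; both names are hypothetical stand-ins, not part of Solr or of the system described in the talk.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Hit = Dict[str, object]


def incremental_search(
    shards_newest_first: Iterable[str],
    search_shard: Callable[[str, str], List[Hit]],
    query: str,
) -> Iterator[Hit]:
    """Yield hits shard by shard instead of computing the full result set."""
    for shard in shards_newest_first:
        # A shard is only loaded and searched once the consumer has
        # exhausted the hits from every more recent shard.
        yield from search_shard(shard, query)
```

Because the generator is lazy, a user who reviews only the first page of results never pays the load time for older shards.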
  9. Pinned Shards
     • Incremental search starts with the most recent data
     • `Pin` shards for most recent data
     • Subset of shards to be kept loaded at all times
     • Shards already loaded for the beginning of the stream
     • User doesn’t see the load times for the rest since it happens while they review initial results
     • Allows query caches to be more effective
     • User sees results in seconds rather than minutes
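Pinning can be sketched as a cache of open shards in which some entries are exempt from eviction. The class below is an illustration under stated assumptions: least-recently-used eviction, a fixed capacity, and a `loader` callable that opens a shard are all choices made for the example, and `ShardCache` is a hypothetical name. The slides do not say how the real system evicts shards.

```python
from collections import OrderedDict
from typing import Callable, Iterable, List


class ShardCache:
    """LRU cache of open shards; pinned shards are never evicted."""

    def __init__(self, capacity: int, pinned: Iterable[str],
                 loader: Callable[[str], object]) -> None:
        self.pinned = set(pinned)
        if len(self.pinned) >= capacity:
            raise ValueError("capacity must leave room for an unpinned shard")
        self.capacity = capacity
        self.loader = loader
        self._open: "OrderedDict[str, object]" = OrderedDict()
        for name in sorted(self.pinned):
            self._open[name] = loader(name)   # pinned shards are loaded up front

    def get(self, name: str) -> object:
        if name in self._open:
            self._open.move_to_end(name)      # mark as most recently used
            return self._open[name]
        while len(self._open) >= self.capacity:
            # Evict the least recently used shard that is not pinned.
            victim = next(n for n in self._open if n not in self.pinned)
            del self._open[victim]
        self._open[name] = self.loader(name)
        return self._open[name]

    def open_shards(self) -> List[str]:
        return list(self._open)
```

With the most recent shards pinned, the first results of an incremental search come from shards that are already open, and their query caches stay warm between requests.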
  11. Security
      • What if each user has a different view of a document?
        • User 1 has permission to view the red text
        • User 2 has permission to view the green text
        • User 3 has permission to view everything
      Sample document (colored passages not preserved in this transcript):
      Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum
  12. Security
      • Post process each document
        • Ends up being horribly slow
        • Ties application logic to backend
      • Generate a unique document for each view
        • 1000s of unique views makes for an unmanageable index
        • Trillions of documents is a whole different problem!
      • Dynamic fields
        • text_view1:value1, text_view2:value2, text_view3:"value1 value2"
        • Solr doesn’t have a max number of fields, but string interning becomes an issue
      • Mangle field values
        • text:"view1_value1 view2_value2 view3_value1 view3_value2"
        • Works pretty well
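The "mangle field values" option can be illustrated with a few lines of string handling. This is a sketch, assuming whitespace tokenization and a `view_` prefix joined with an underscore as in the slide's example; the function names are hypothetical, and a real deployment would need the prefix to survive the field's analysis chain (lowercasing, stemming and so on), which is not covered here.

```python
from typing import Dict


def mangle(values_by_view: Dict[str, str]) -> str:
    """Prefix every term with the view that may see it; index the result in one field."""
    tokens = []
    for view, text in values_by_view.items():
        tokens.extend(f"{view}_{term}" for term in text.split())
    return " ".join(tokens)


def mangle_query(view: str, terms: str) -> str:
    """Rewrite a user's query terms so they only match that user's view."""
    return " ".join(f"{view}_{term}" for term in terms.split())


def matches(view: str, terms: str, indexed_text: str) -> bool:
    """Toy matcher: true if every mangled query term is present in the indexed field."""
    indexed = set(indexed_text.split())
    return all(token in indexed for token in mangle_query(view, terms).split())
```

For the slide's example, `mangle({"view1": "value1", "view2": "value2", "view3": "value1 value2"})` produces the same string shown in the `text` field above. A query for `value2` rewritten for `view1` becomes `view1_value2`, which is not in the field, so that user gets no match, while the same query for `view3` does match.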
  13. Questions?
