Faceted Search – the 120 Million Documents Story

Upayavira's presentation at Online Information 2010 in London: the case study of an Enterprise-critical migration from custom Lucene indexes to Apache Solr, with a significant focus on scalability.

The solution needed to provide search against rapidly changing data sets and multi-million-document indexes, enabling complex queries with sub-second responses while maintaining high availability.

Published in: Technology

Transcript of "Faceted Search – the 120 Million Documents Story"

  1. Faceted Search – the 120 Million Documents Story
  2. Who am I?
     - My (Buddhist) name is Upayavira
     - Consultant with Sourcesense, specialising in search and operational technologies
     - A member of the Apache Software Foundation
  3. Who are Sourcesense?
     - Open Source integrator, specialising in:
        - Search
        - Business Intelligence
        - Content Management
        - Application Lifecycle Management
     - Offices in London, Amsterdam, Milan and Rome
  4. Committers and Contributors
     - Search:
        - Lucene/Solr – contributor
        - Hibernate Search – committer
        - Lucene Infinispan integration – lead developer
        - Apache UIMA – committer
     - CMS:
        - Apache Chemistry – contributor
        - Apache Jackrabbit – contributor
        - JBoss GateIn Portal – committer
        - OpenSSO-Alfresco – contributor
  5. Who is the customer?
     - News search provider
     - Industry leader
     - Has 100s of servers crawling 1.7m sites each day
     - 2.5m documents (news and social media) each day
     - Keeping 2m/day, 1 month = 60m, 2 months = 120m
     - Existing tech old and fragile
  6. Their story?
     - Aim: fast and timely search across a broad range of content
     - Refreshing of their infrastructure: maintainable
     - Features:
        - Integrated search across news/social media content
        - Faceting
        - Geospatial search
        - Deduplication
        - Clustering
  7-8. The Solution: Apache Solr
     - Fast inverted index
     - Flexible configurable schema
     - Faceted search
     - Sharding for scalability
     - Missing:
        - Clustering (Google news style)
        - Geolocation
        - Deduplication
        - Relevance
  9-12. How Solr Works
     - A RESTful web service
        - http://solr:8983/solr/select?q=notebook

          <result name="response" numFound="649" start="0">
            <doc>
              <str name="name">Thinkpad</str>
              <str name="make">Lenovo</str>
            </doc>
            …
          </result>

        - http://solr:8983/solr/select?q=notebook&facet=true&facet.field=make

          <lst name="facet_counts">
            <lst name="facet_fields">
              <lst name="make">
                <int name="Lenovo">24</int>
                <int name="Apple">12</int>
                …
              </lst>
            </lst>
          </lst>
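The two requests above can be reproduced from a script. The following is a minimal sketch, assuming the host name `solr` from the slide is only a placeholder and that the facet fragment arrives wrapped in Solr's usual `<response>` root element. The `SAMPLE` document is hand-written from the slide's fragment, so no server is contacted.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Host, port and path are copied from the slide; they are placeholders, not a live server.
SOLR_SELECT = "http://solr:8983/solr/select"


def facet_query_url(query, facet_field):
    """Build the faceted select URL shown on the slide."""
    params = [("q", query), ("facet", "true"), ("facet.field", facet_field)]
    return SOLR_SELECT + "?" + urlencode(params)


def parse_facet_counts(xml_text, facet_field):
    """Return {value: count} for one facet field from a Solr XML response."""
    root = ET.fromstring(xml_text)
    path = (
        "./lst[@name='facet_counts']"
        "/lst[@name='facet_fields']"
        f"/lst[@name='{facet_field}']/int"
    )
    return {node.get("name"): int(node.text) for node in root.findall(path)}


# Hand-written sample: the slide's fragment wrapped in a <response> root element.
SAMPLE = """
<response>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="make">
        <int name="Lenovo">24</int>
        <int name="Apple">12</int>
      </lst>
    </lst>
  </lst>
</response>
"""

print(facet_query_url("notebook", "make"))
print(parse_facet_counts(SAMPLE, "make"))
```

Facet counts come back alongside the normal result list, so a single request can drive both the hit list and the "refine by make" sidebar.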
  13-19. How Solr Works
     [Diagram built up over seven slides. Labels: Index, Index Snapshot, Index Reader, Active Index Reader, Active Index Writer, Searches, New Content, commit. Searches go to the Active Index Reader on an Index Snapshot; New Content goes to the Active Index Writer; after a commit a second Index Snapshot with its own Index Reader appears, and the final build shows a single snapshot and reader again.]
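The commit cycle in these diagram slides can be illustrated with a toy model. This is a sketch of the idea only, not how Lucene or Solr implement it: the class name `ToyIndex` and its methods are hypothetical, and a tuple stands in for an index snapshot.

```python
class ToyIndex:
    """Toy model: searches see an immutable snapshot; commit publishes a new one."""

    def __init__(self):
        self._pending = []    # documents added since the last commit
        self._snapshot = ()   # immutable snapshot served to searches

    def add(self, doc):
        # The writer accepts new content without disturbing searches.
        self._pending.append(doc)

    def commit(self):
        # Build a new snapshot, then make it the active one in a single step.
        self._snapshot = self._snapshot + tuple(self._pending)
        self._pending = []

    def search(self, term):
        # Searches only ever consult the active snapshot.
        return [doc for doc in self._snapshot if term in doc]


index = ToyIndex()
index.add("solr notebook")
before_commit = index.search("notebook")   # new content is not visible yet
index.commit()
after_commit = index.search("notebook")    # visible once the snapshot is swapped
print(before_commit, after_commit)
```

The point of the model is that new content stays invisible until a commit, and that a commit means building and switching to a new reader, which is where the memory pressure described later in the talk comes from.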
  20. How Solr Distributes
     - Too many:
        - Documents for one index
        - Requests for one server
        - Chances of failure
     - Shards: splitting each index into parts
     - Rows: duplicating each index
  21-25. Solr Host Configuration
     [Diagram built up over five slides. Labels: searches, load balancer, co-ordinator, shard 1, shard 2, shard 3. The final builds show two sets of shards 1 to 3, each set with its own co-ordinator, behind one load balancer.]
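In Solr's distributed search, the co-ordinator fans a query out to the shards named in the `shards` request parameter and merges their results. The sketch below builds such a request for each row in the diagram. The host names (`coord-a`, `shard1-a`, and so on) are invented for illustration; only the port and path come from the slides.

```python
from urllib.parse import urlencode

# Hypothetical host names: two rows, each a full copy of the index split into three shards.
ROWS = {
    "row-a": {"coordinator": "coord-a", "shards": ["shard1-a", "shard2-a", "shard3-a"]},
    "row-b": {"coordinator": "coord-b", "shards": ["shard1-b", "shard2-b", "shard3-b"]},
}


def distributed_query_url(row, query):
    """Build a select URL that asks one row's co-ordinator to query all of its shards."""
    shards = ",".join(f"{host}:8983/solr" for host in row["shards"])
    params = [("q", query), ("shards", shards)]
    # Keep ':', '/' and ',' readable in the shards list instead of percent-encoding them.
    return f"http://{row['coordinator']}:8983/solr/select?" + urlencode(params, safe=":/,")


for name, row in ROWS.items():
    print(name, distributed_query_url(row, "notebook"))
```

The load balancer only has to choose a row; everything behind the co-ordinator is invisible to the client.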
  26. Solr at The Customer
     - 10 shards
     - 1 co-ordinator
     - 60m documents
     - 350Gb indexes (35Gb each shard)
     - Indexer code written
     - Custom components implemented
     - Initial query time 20s
     - But soon:
  27. Oops. OutOfMemoryError
  28. Solr: a Java web application
     - Runs in a Java VM
     - JVM manages memory: garbage collection
     - JVM allocates memory into buckets
     - On JVM startup: specify memory allocations
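To make the "buckets" concrete, the sketch below works out the approximate size of each heap generation from the flag values quoted on the garbage collection slide (`-Xmx5300m -XX:NewRatio=2 -XX:SurvivorRatio=5`). It assumes the usual HotSpot meanings: `NewRatio` is old generation to young generation, and `SurvivorRatio` is eden to one survivor space. A real JVM rounds these sizes to its own alignment, so treat the figures as estimates.

```python
def heap_layout(heap_mb, new_ratio, survivor_ratio):
    """Estimate generation sizes in MB from HotSpot-style ratio flags."""
    young = heap_mb / (new_ratio + 1)          # old : young = new_ratio : 1
    old = heap_mb - young
    survivor = young / (survivor_ratio + 2)    # young = eden + 2 survivor spaces
    eden = young - 2 * survivor
    return {"old": old, "young": young, "eden": eden, "survivor_each": survivor}


# Values taken from the flags on the garbage collection slide.
layout = heap_layout(5300, new_ratio=2, survivor_ratio=5)
for name, size_mb in layout.items():
    print(f"{name}: about {size_mb:.0f} MB")
```

The permanent generation (`-XX:MaxPermSize=256m` on the slide) is sized separately and is not part of this heap split.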
  29-33. How Solr Works
     [The earlier index diagram repeated with caches, built up over five slides. Labels: Index, Index Snapshot, Index Reader, Active Index Reader, Active Index Writer, Searches, New Content, cache, commit. One cache is shown with the Active Index Reader; after a commit there are two snapshots, two readers and two caches; the final build shows one reader and one cache again.]
  34. Optimisation #1: autowarm

      <listener event="newSearcher" class="solr.QuerySenderListener">
        <arr name="queries">
          <lst>
            <str name="q">solr</str>
            <str name="relf">4</str>
            <str name="facet.field">sourceCountryCS</str>
            <str name="facet.field">entityCSPerson</str>
            <str name="facet.field">entityCSCompany</str>
            <str name="facet.field">entityCSProduct</str>
            <str name="facet.field">sourceCS</str>
            <str name="facet.field">authorCS</str>
            <str name="facet.field">stockTickerCS</str>
            <str name="facet.field">feedClassCS</str>
            <str name="facet.field">entityCSOrganization</str>
            <str name="facet.field">platformCS</str>
            <str name="facet.field">eventOrFactCS</str>
            <str name="facet.field">sourceRank</str>
            <str name="facet">true</str>
            <str name="facet.date">harvestDate</str>
            <str name="facet.date.start">NOW-1MONTH</str>
            <str name="facet.date.end">NOW</str>
            <str name="facet.date.gap">+24HOURS</str>
            <str name="qt">/duplicate</str>
            <str name="duplicateOrder">latest</str>
            <str name="collapseFields">duplicateGroup titleForDuplicates</str>
          </lst>
        </arr>
      </listener>
  35. #2: Garbage collection
     - jstat -gcutil

         S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
         0.00  91.16   8.77  86.49  59.59  13075 3832.216  2598 1594.978 5427.193
         0.00  91.16   9.05  86.49  59.59  13075 3832.216  2598 1594.978 5427.193
         0.00  91.16   9.29  86.49  59.59  13075 3832.216  2598 1594.978 5427.193
         0.00  91.16   9.29  86.49  59.59  13075 3832.216  2598 1594.978 5427.193
         0.00  91.16   9.29  86.49  59.59  13075 3832.216  2598 1594.978 5427.193
         0.00  91.16   9.29  86.49  59.59  13075 3832.216  2598 1594.978 5427.193
         0.00  91.16   9.29  86.49  59.59  13075 3832.216  2598 1594.978 5427.193
         0.00  91.16   9.61  86.49  59.59  13075 3832.216  2598 1594.978 5427.193
         0.00  91.16   9.61  86.49  59.59  13075 3832.216  2598 1594.978 5427.193

     - MaxSearchers:

         <maxWarmingSearchers>1</maxWarmingSearchers>

     - Rebalanced the memory allocation for JVM:

         -XX:SurvivorRatio=5 -XX:NewRatio=2 -Xmx5300m -Xms5300m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:MaxPermSize=256m
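The `jstat -gcutil` columns are easier to reason about once parsed. The sketch below reads two of the rows from the slide and derives a few summary figures: the average time per young and per full collection, and the share of total GC time spent in full collections. The function names are illustrative, and the column meanings assumed are the standard ones (YGC/FGC are collection counts, YGCT/FGCT/GCT are cumulative seconds).

```python
# First and last rows of the jstat -gcutil output quoted on the slide.
JSTAT_SAMPLE = """\
  S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
  0.00  91.16   8.77  86.49  59.59  13075 3832.216  2598 1594.978 5427.193
  0.00  91.16   9.61  86.49  59.59  13075 3832.216  2598 1594.978 5427.193
"""


def parse_gcutil(text):
    """Turn jstat -gcutil output into a list of {column: float} dictionaries."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, map(float, row))) for row in rows]


def summarise(sample):
    """Derive per-collection averages and the full-GC share from one sample."""
    return {
        "avg_young_gc_s": sample["YGCT"] / sample["YGC"],
        "avg_full_gc_s": sample["FGCT"] / sample["FGC"],
        "full_gc_share": sample["FGCT"] / sample["GCT"],
    }


samples = parse_gcutil(JSTAT_SAMPLE)
summary = summarise(samples[-1])
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

An old generation sitting at 86% with thousands of full collections already behind it is the pattern that motivated rebalancing the generations and capping warming searchers.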
  36. #3: Profiling
     - Only needed because of custom components
     - Discovered 1Gb of cache reallocated every commit
     - Mostly a key, stored as a string
     - Converted to a 'long' number
     - Reduced to 100Mb
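The effect of storing a numeric key as a string rather than as a fixed-width integer can be shown in a few lines. The customer's component was Java and the slides do not say what the key was, so this Python sketch is only an analogy: the key values are invented, and the exact byte counts differ between runtimes.

```python
import sys
from array import array


def footprint_as_strings(keys):
    """Bytes used by a list of string keys: the list plus every string object."""
    strings = [str(key) for key in keys]
    return sys.getsizeof(strings) + sum(sys.getsizeof(s) for s in strings)


def footprint_as_longs(keys):
    """Bytes used by the same keys packed as signed 64-bit integers."""
    return sys.getsizeof(array("q", keys))


# Hypothetical keys: 100,000 thirteen-digit identifiers.
keys = range(10**12, 10**12 + 100_000)
as_strings = footprint_as_strings(keys)
as_longs = footprint_as_longs(keys)
print(f"strings: {as_strings} bytes, longs: {as_longs} bytes")
```

Per-object overhead dominates when values are short, which is why a packed numeric representation can shrink a cache by an order of magnitude, in line with the 1Gb to 100Mb reduction reported on the slide.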
  37. Managing So Many Hosts
     - With eleven (or 22) hosts, manual building prohibitive
     - Hosted at Amazon EC2
     - Scripted instantiation and configuration:
        - installing java, creating/mounting partition, etc
     - Run concurrently
        - Result: 11 hosts available in six minutes
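Running the per-host setup concurrently is what keeps the total time close to that of a single host. A minimal sketch of the pattern follows. The `provision` function is a stand-in that only sleeps, since the talk does not show the actual scripts, and the host names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical names for one row: ten shards plus one co-ordinator.
HOSTS = [f"shard{n}" for n in range(1, 11)] + ["co-ordinator"]


def provision(host):
    """Stand-in for launching one instance and running its setup script."""
    time.sleep(0.05)  # placeholder for installing Java, mounting the partition, etc.
    return host, "ready"


def provision_all(hosts):
    """Provision every host at once and return {host: status}."""
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        return dict(pool.map(provision, hosts))


statuses = provision_all(HOSTS)
print(statuses)
```

Threads are sufficient here because the work is waiting on remote machines rather than using local CPU.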
  38-42. Solr Host Configuration
     [Diagram built up over five slides. Labels: load balancer, co-ordinator, shard 1, shard 2, shard 3, shown first as one set and then as two. Annotations: "35Gb 35Gb 35Gb" against the shards, and "Entire row: 40 minutes".]
  43. Content Archiving
     - Important to have the ability to re-index
     - Kept a copy of all content pre-ingestion
     - Built a tool to ingest this archive
     - 30 minutes for 1 day, 30 hours for 2 months
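The two timings on this slide are consistent with each other, as a quick calculation shows. The sketch assumes two months means 60 days, matching the customer slide's "2 months = 120m" at 2m documents kept per day; the throughput figure is derived here and is not quoted in the talk.

```python
MINUTES_PER_ARCHIVED_DAY = 30   # from this slide
DOCS_KEPT_PER_DAY = 2_000_000   # from the customer slide


def reindex_hours(days):
    """Hours needed to re-ingest the given number of archived days."""
    return days * MINUTES_PER_ARCHIVED_DAY / 60


def docs_per_second():
    """Implied ingestion rate if one day's content takes 30 minutes."""
    return DOCS_KEPT_PER_DAY / (MINUTES_PER_ARCHIVED_DAY * 60)


print(f"2 months: {reindex_hours(60):.0f} hours")
print(f"implied rate: about {docs_per_second():.0f} documents per second")
```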
  44. Being Dynamic
     - Schema changes require re-indexing
     - Resharding requires re-indexing
     - Automation gave major benefit:
        - Can deploy an additional row and reindex
  45-49. Solr Host Configuration
     [Diagram built up over five slides. Labels: load balancer, co-ordinator, shard 1, shard 2, shard 3. Starts with two sets of shards and co-ordinators, adds a third set, marks "archive ingestion", and ends with two sets again.]
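The sequence in these slides (add a row, fill it from the archive, end up with two rows again) resembles a rolling replacement. The sketch below models it with a toy round-robin load balancer. All names are hypothetical, and the slides do not say which row is removed at the end, so retiring an old row is an assumption.

```python
class LoadBalancer:
    """Toy round-robin balancer over a list of row names."""

    def __init__(self, rows):
        self.rows = list(rows)
        self._next = 0

    def route(self):
        """Return the row that should serve the next search."""
        row = self.rows[self._next % len(self.rows)]
        self._next += 1
        return row


def roll_out_new_row(balancer, new_row, reindex, rows_to_retire=()):
    """Fill a new row from the archive, put it in service, then retire old rows."""
    reindex(new_row)                  # the new row takes no traffic until it is full
    balancer.rows.append(new_row)
    for row in rows_to_retire:        # assumption: an old row is the one removed
        balancer.rows.remove(row)


balancer = LoadBalancer(["row-a", "row-b"])
reindexed = []
roll_out_new_row(balancer, "row-c", reindexed.append, rows_to_retire=["row-a"])
print(balancer.rows, reindexed)
```

Because searches keep flowing to the existing rows while the new one is filled, a schema change or resharding does not require downtime.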
  50. Conclusion
     - Service live, and increasing towards 120m docs
     - Faceted queries between 1s and 2s.
     - Term queries 500ms.
     - GC time down from >5000s to 700s/day
     - To do still:
        - Clustering
  51. Thank you [email_address]
