Faceted Search – the 120 Million Documents Story

Upayavira's presentation at Online Information 2010 in London: the case study of an Enterprise-critical migration from custom Lucene indexes to Apache Solr, with a significant focus on scalability.

The solution needed to provide search against rapidly changing data sets and multi-million document indexes, enabling complex queries with sub-second responses while maintaining high availability.

Usage Rights

CC Attribution License

Faceted Search – the 120 Million Documents Story Presentation Transcript

  • 1. Faceted Search – the 120 Million Documents Story
  • 2. Who am I?
    • My (Buddhist) name is Upayavira
    • Consultant with Sourcesense, specialising in search and operational technologies
    • A member of the Apache Software Foundation
  • 3. Who are Sourcesense?
    • Open Source integrator, specialising in:
        • Search
        • Business Intelligence
        • Content Management
        • Application Lifecycle Management
    • Offices in London, Amsterdam, Milan and Rome
  • 4. Committers and Contributors
    • Search:
        • Lucene/Solr – contributor
        • Hibernate Search – committer
        • Lucene Infinispan integration – lead developer
        • Apache UIMA – committer
    • CMS:
        • Apache Chemistry – contributor
        • Apache Jackrabbit – contributor
        • JBoss GateIn Portal – committer
        • OpenSSO-Alfresco - contributor
  • 5. Who is the customer?
    • News search provider
    • Industry leader
    • Has 100s of servers crawling 1.7m sites each day
    • 2.5m documents (news and social media) each day
    • Keeping 2m/day, 1 month = 60m, 2 months = 120m
    • Existing tech old and fragile
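The index-size figures on this slide follow from simple retention arithmetic. A minimal sketch, assuming the slide's own rounding of a month to 30 days (the function name is illustrative):

```python
# Figures taken from the slide; the 30-day month is the slide's own rounding.
DOCS_KEPT_PER_DAY = 2_000_000
DAYS_PER_MONTH = 30


def index_size(months: int) -> int:
    """Documents held in the index after retaining `months` of content."""
    return DOCS_KEPT_PER_DAY * DAYS_PER_MONTH * months


for months in (1, 2):
    print(f"{months} month(s): {index_size(months):,} documents")
```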
  • 6. Their story?
    • Aim: fast and timely search across broad range of content
    • Refreshing of their infrastructure: maintainable
    • Features:
      • Integrated search across news/social media content
      • Faceting
      • Geospatial search
      • Deduplication
      • Clustering
  • 7. The Solution:
  • 8. The Solution: Apache Solr
    • Fast inverted index
    • Flexible configurable schema
    • Faceted search
    • Sharding for scalability
    • Missing:
      • Clustering (Google news style)
      • Geolocation
      • Deduplication
      • Relevance
  • 9. How Solr Works
    • A RESTful web service
      • http://solr:8983/solr/select?q=notebook
  • 10. How Solr Works
    • A RESTful web service
      • http://solr:8983/solr/select?q=notebook
    • <result name="response" numFound="649" start="0"> <doc> <str name="name">Thinkpad</str> <str name="make">Lenovo</str> </doc> … </result>
  • 11. How Solr Works
    • A RESTful web service
      • http://solr:8983/solr/select?q=notebook
    • <result name="response" numFound="649" start="0"> <doc> <str name="name">Thinkpad</str> <str name="make">Lenovo</str> </doc> … </result>
      • http://solr:8983/solr/select?q=notebook&facet=true&facet.field=make
  • 12. How Solr Works
    • A RESTful web service
      • http://solr:8983/solr/select?q=notebook
    • <result name="response" numFound="649" start="0"> <doc> <str name="name">Thinkpad</str> <str name="make">Lenovo</str> </doc> … </result>
      • http://solr:8983/solr/select?q=notebook&facet=true&facet.field=make
    • <lst name="facet_counts"> <lst name="facet_fields"> <lst name="make"> <int name="Lenovo">24</int> <int name="Apple">12</int> … </lst> </lst> </lst>
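The request and response on these slides can be exercised from a small client. The following is a sketch only: it builds the faceted query URL shown above and reads the facet counts out of a response fragment copied from the slide (with the elided parts dropped). The function names are illustrative, and no live Solr server is contacted.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "http://solr:8983/solr/select"  # host name as shown on the slide


def facet_query_url(query: str, facet_field: str) -> str:
    """Build a select URL that asks Solr to facet on one field."""
    params = [("q", query), ("facet", "true"), ("facet.field", facet_field)]
    return f"{BASE_URL}?{urlencode(params)}"


# Response fragment copied from the slide, without the elided entries.
SAMPLE = """
<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="make">
      <int name="Lenovo">24</int>
      <int name="Apple">12</int>
    </lst>
  </lst>
</lst>
"""


def facet_counts(xml_text: str, facet_field: str) -> dict[str, int]:
    """Return {facet value: document count} for one facet field."""
    root = ET.fromstring(xml_text)
    path = f".//lst[@name='facet_fields']/lst[@name='{facet_field}']/int"
    return {node.get("name"): int(node.text) for node in root.findall(path)}


print(facet_query_url("notebook", "make"))
print(facet_counts(SAMPLE, "make"))
```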
  • 13–19. How Solr Works (diagram, built up step by step)
    • Labels: Index, Index Snapshot, Active Index Reader, Searches, New Content, Active Index Writer, commit
    • Slide 16 adds the commit label; slides 17–18 show a second Index Snapshot and Index Reader next to the active ones; slide 19 shows a single snapshot and reader again
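The reader/writer diagram can be approximated with a toy model. This is a deliberately simplified sketch, not Solr's or Lucene's implementation: searches only ever see a frozen snapshot, new content is buffered by the writer, and a commit swaps in a fresh snapshot. The class and method names are made up for illustration.

```python
class Index:
    """Toy model: readers see a frozen snapshot; the writer buffers until commit."""

    def __init__(self) -> None:
        self._snapshot: tuple[str, ...] = ()  # what the active reader sees
        self._pending: list[str] = []         # written but not yet committed

    def add(self, doc: str) -> None:
        """New content goes to the writer and is invisible to searches."""
        self._pending.append(doc)

    def commit(self) -> None:
        """Build a new snapshot and make it the active one."""
        self._snapshot = self._snapshot + tuple(self._pending)
        self._pending.clear()

    def search(self, term: str) -> list[str]:
        """Search the active snapshot only."""
        return [doc for doc in self._snapshot if term in doc]


index = Index()
index.add("thinkpad notebook")
before_commit = index.search("notebook")  # nothing visible yet
index.commit()
after_commit = index.search("notebook")   # visible after the swap
```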
  • 20. How Solr Distributes
    • Too many:
      • Documents for one index
      • Requests for one server
      • Chances of failure
    • Shards: splitting each index into parts
    • Rows: duplicating each index
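Solr's distributed search takes a `shards` request parameter listing the shards to query. The sketch below builds such a request and shows, in simplified form, how per-shard facet counts add up to a total. The host names are hypothetical (the deck does not name the customer's hosts), and the merge function only illustrates the idea; it is not Solr's own merging code.

```python
from collections import Counter
from urllib.parse import urlencode

# Hypothetical host names, one per shard.
SHARDS = [f"shard{n}:8983/solr" for n in (1, 2, 3)]


def distributed_query_url(coordinator: str, query: str) -> str:
    """URL for a query sent to one host that fans out to every shard."""
    params = {"q": query, "shards": ",".join(SHARDS)}
    # Keep ':', '/' and ',' unescaped so the shard list stays readable.
    return f"http://{coordinator}/solr/select?{urlencode(params, safe=':/,')}"


def merge_facets(per_shard: list[dict[str, int]]) -> dict[str, int]:
    """Sum facet counts returned by each shard."""
    total = Counter()
    for counts in per_shard:
        total.update(counts)
    return dict(total)


url = distributed_query_url("coordinator:8983", "notebook")
merged = merge_facets([{"Lenovo": 10, "Apple": 5}, {"Lenovo": 14, "Apple": 7}])
```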
  • 21–25. Solr Host Configuration (diagram, built up step by step)
    • Slide 21 labels: shard 1, shard 2, shard 3, searches
    • Slide 22 adds: co-ordinator
    • Slide 23 adds: load balancer
    • Slides 24–25 add a second row: shard 1, shard 2, shard 3, co-ordinator
  • 26. Solr at The Customer
    • 10 shards
    • 1 co-ordinator
    • 60m documents
    • 350Gb indexes (35Gb each shard)
    • Indexer code written
    • Custom components implemented
    • Initial query time 20s
    • But soon:
  • 27. Oops. OutOfMemoryError
  • 28. Solr: a Java web application
    • Runs in a Java VM
    • JVM manages memory: garbage collection
    • JVM allocates memory into buckets
    • On JVM startup: specify memory allocations
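As a rough worked example of these memory buckets, the sketch below applies the flag values that appear later in the deck (slide 35: `-Xmx5300m`, `-XX:NewRatio=2`, `-XX:SurvivorRatio=5`). It assumes the usual HotSpot meaning of the two ratios (old:young and eden:one survivor space). A real JVM aligns and rounds these sizes, so treat the output as approximate.

```python
def generation_sizes(heap_mb: float, new_ratio: int, survivor_ratio: int) -> dict[str, float]:
    """Approximate generation sizes in MB for a fixed-size heap."""
    young = heap_mb / (new_ratio + 1)        # NewRatio = old / young
    old = heap_mb - young
    survivor = young / (survivor_ratio + 2)  # SurvivorRatio = eden / one survivor space
    eden = young - 2 * survivor              # young = eden + two survivor spaces
    return {"young": young, "old": old, "eden": eden, "survivor_each": survivor}


sizes = generation_sizes(heap_mb=5300, new_ratio=2, survivor_ratio=5)
for name, mb in sizes.items():
    print(f"{name:>13}: {mb:7.1f} MB")
```

The permanent generation (`-XX:MaxPermSize=256m` on slide 35) is sized separately and is not part of this calculation.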
  • 29–33. How Solr Works (the reader/writer diagram again, now with caches)
    • Labels: Index, Index Snapshot, Index Reader, Active Index Reader, Searches, New Content, Active Index Writer, cache, commit
    • Slide 30 adds a cache; slides 31–32 show two caches while two snapshots and readers exist; slide 33 shows one snapshot, one reader and one cache
  • 34. Optimisation #1: autowarm
    • <listener event="newSearcher" class="solr.QuerySenderListener">
        <arr name="queries">
          <lst>
            <str name="q">solr</str>
            <str name="relf">4</str>
            <str name="facet.field">sourceCountryCS</str>
            <str name="facet.field">entityCSPerson</str>
            <str name="facet.field">entityCSCompany</str>
            <str name="facet.field">entityCSProduct</str>
            <str name="facet.field">sourceCS</str>
            <str name="facet.field">authorCS</str>
            <str name="facet.field">stockTickerCS</str>
            <str name="facet.field">feedClassCS</str>
            <str name="facet.field">entityCSOrganization</str>
            <str name="facet.field">platformCS</str>
            <str name="facet.field">eventOrFactCS</str>
            <str name="facet.field">sourceRank</str>
            <str name="facet">true</str>
            <str name="facet.date">harvestDate</str>
            <str name="facet.date.start">NOW-1MONTH</str>
            <str name="facet.date.end">NOW</str>
            <str name="facet.date.gap">+24HOURS</str>
            <str name="qt">/duplicate</str>
            <str name="duplicateOrder">latest</str>
            <str name="collapseFields">duplicateGroup titleForDuplicates</str>
          </lst>
        </arr>
      </listener>
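The listener replays this query inside Solr each time a new searcher is opened. To try an equivalent query by hand, the same parameters can be sent over HTTP. The sketch below only builds the URL, uses a subset of the facet fields from the slide, and leaves out the parameters that belong to the customer's custom components (`relf`, `qt=/duplicate`, `duplicateOrder`, `collapseFields`). The base URL is the one shown on the earlier slides; the function name is illustrative.

```python
from urllib.parse import urlencode

# A subset of the facet fields listed on the slide.
FACET_FIELDS = ["sourceCountryCS", "entityCSPerson", "sourceCS", "authorCS"]


def warming_query_url(base_url: str, query: str) -> str:
    """Build a faceted query URL resembling the autowarm query."""
    params = [("q", query), ("facet", "true")]
    params += [("facet.field", field) for field in FACET_FIELDS]
    params += [
        ("facet.date", "harvestDate"),
        ("facet.date.start", "NOW-1MONTH"),
        ("facet.date.end", "NOW"),
        ("facet.date.gap", "+24HOURS"),  # urlencode escapes '+' as %2B
    ]
    return f"{base_url}?{urlencode(params)}"


warm_url = warming_query_url("http://solr:8983/solr/select", "solr")
print(warm_url)
```

Escaping the `+` matters here: an unescaped `+` in a query string is decoded as a space.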
  • 35. #2: Garbage collection
    • jstat -gcutil
    • S0     S1     E      O      P      YGC    YGCT      FGC   FGCT      GCT
      0.00   91.16  8.77   86.49  59.59  13075  3832.216  2598  1594.978  5427.193
      0.00   91.16  9.05   86.49  59.59  13075  3832.216  2598  1594.978  5427.193
      0.00   91.16  9.29   86.49  59.59  13075  3832.216  2598  1594.978  5427.193
      0.00   91.16  9.29   86.49  59.59  13075  3832.216  2598  1594.978  5427.193
      0.00   91.16  9.29   86.49  59.59  13075  3832.216  2598  1594.978  5427.193
      0.00   91.16  9.29   86.49  59.59  13075  3832.216  2598  1594.978  5427.193
      0.00   91.16  9.29   86.49  59.59  13075  3832.216  2598  1594.978  5427.193
      0.00   91.16  9.61   86.49  59.59  13075  3832.216  2598  1594.978  5427.193
      0.00   91.16  9.61   86.49  59.59  13075  3832.216  2598  1594.978  5427.193
    • MaxSearchers:
    • <maxWarmingSearchers>1</maxWarmingSearchers>
    • Rebalanced the memory allocation for JVM:
    • -XX:SurvivorRatio=5 -XX:NewRatio=2 -Xmx5300m -Xms5300m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:MaxPermSize=256m
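The `jstat -gcutil` sample on this slide is easier to interpret once the columns are named. A small sketch, using the first sample row copied from the slide: the count columns (`YGC`, `FGC`) and cumulative time columns (`YGCT`, `FGCT`, `GCT`, in seconds) give the average pause per collection, and `O` is the percentage of the old generation in use. The function name is illustrative.

```python
HEADER = "S0 S1 E O P YGC YGCT FGC FGCT GCT"
SAMPLE = "0.00 91.16 8.77 86.49 59.59 13075 3832.216 2598 1594.978 5427.193"


def parse_jstat(header: str, line: str) -> dict[str, float]:
    """Pair each jstat column name with its value."""
    return dict(zip(header.split(), map(float, line.split())))


row = parse_jstat(HEADER, SAMPLE)
avg_young = row["YGCT"] / row["YGC"]  # seconds per young-generation collection
avg_full = row["FGCT"] / row["FGC"]   # seconds per full collection

print(f"old generation {row['O']:.2f}% full")
print(f"average young collection {avg_young:.3f}s, average full collection {avg_full:.3f}s")
```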
  • 36. #3: Profiling
    • Only needed because of custom components
    • Discovered 1Gb of cache reallocated every commit
    • Mostly a key, stored as a string
    • Converted to a 'long' number
    • Reduced to 100Mb
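The saving comes from replacing one heap object per key with a fixed-width number. The customer's component was Java, so the following Python sketch only illustrates the direction of the effect, not the actual figures. The 18-digit key format is an assumption, since the deck does not show the real keys.

```python
import sys
from array import array

# Hypothetical 18-digit document keys.
keys = [100_000_000_000_000_000 + n for n in range(10_000)]

as_strings = [str(key) for key in keys]
as_longs = array("q", keys)  # packed signed 64-bit integers

string_bytes = sum(sys.getsizeof(s) for s in as_strings)
long_bytes = as_longs.itemsize * len(as_longs)

print(f"strings: {string_bytes:,} bytes, longs: {long_bytes:,} bytes")
```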
  • 37. Managing So Many Hosts
    • With eleven (or 22) hosts, manual building prohibitive
    • Hosted at Amazon EC2
    • Scripted instantiation and configuration:
      • installing java, creating/mounting partition, etc
    • Run concurrently
      • Result: 11 hosts available in six minutes
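Running the per-host script concurrently is what keeps the total time close to that of a single host. A minimal sketch of the pattern, assuming one worker per host: the host names and the `provision` stand-in are hypothetical, and the real steps (EC2 calls, package installation, partitioning) are not shown in the deck.

```python
from concurrent.futures import ThreadPoolExecutor

# Ten shards plus one co-ordinator; the names are made up.
HOSTS = [f"solr-{n:02d}" for n in range(1, 12)]


def provision(host: str) -> str:
    """Stand-in for the real per-host script (install Java, create and mount a partition, ...)."""
    steps = ["install java", "create partition", "mount partition"]
    return f"{host}: " + ", ".join(steps) + " done"


def provision_all(hosts: list[str]) -> list[str]:
    """Provision every host in parallel and return results in host order."""
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        return list(pool.map(provision, hosts))


results = provision_all(HOSTS)
```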
  • 38–42. Solr Host Configuration (diagram, built up step by step)
    • Slides 38–39: load balancer with one row, then two rows, of shard 1, shard 2, shard 3 and a co-ordinator
    • Slide 40 adds the labels: 35Gb, 35Gb, 35Gb
    • Slide 42 adds the label: Entire row: 40 minutes
  • 43. Content Archiving
    • Important to have the ability to re-index
    • Kept a copy of all content pre-ingestion
    • Built a tool to ingest this archive
    • 30 minutes for 1 day, 30 hours for 2 months
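The two timings on this slide are consistent with linear scaling: 60 archived days at 30 minutes each is 30 hours. The sketch below makes that explicit and derives an approximate ingestion rate, assuming the 2m documents kept per day quoted on slide 5.

```python
MINUTES_PER_ARCHIVED_DAY = 30  # from this slide
DOCS_KEPT_PER_DAY = 2_000_000  # from slide 5


def reindex_hours(days: int) -> float:
    """Hours needed to re-ingest `days` of archived content, assuming linear scaling."""
    return days * MINUTES_PER_ARCHIVED_DAY / 60


def docs_per_second() -> float:
    """Approximate ingestion rate implied by the figures above."""
    return DOCS_KEPT_PER_DAY / (MINUTES_PER_ARCHIVED_DAY * 60)


print(f"2 months: {reindex_hours(60):.0f} hours at about {docs_per_second():.0f} docs/s")
```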
  • 44. Being Dynamic
    • Schema changes require re-indexing
    • Resharding requires re-indexing
    • Automation gave major benefit:
      • Can deploy an additional row and reindex
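One way to picture "deploy an additional row and reindex" is as a rolling replacement: build and fill the new row first, then add it to the pool, then retire an old row. This is a toy sketch under that assumption. The `LoadBalancer` class, the row names and the `reindex` callback are all hypothetical, and a real load balancer would be driven through its own API.

```python
from typing import Callable


class LoadBalancer:
    """Toy pool of rows that receive searches."""

    def __init__(self, rows: list[str]) -> None:
        self.rows = list(rows)


def replace_row(lb: LoadBalancer, old: str, new: str,
                reindex: Callable[[str], None]) -> None:
    """Swap one row for a freshly indexed one without dropping capacity."""
    reindex(new)         # fill the new row from the content archive first
    lb.rows.append(new)  # only then start sending searches to it
    lb.rows.remove(old)  # finally retire the row built with the old schema


lb = LoadBalancer(["row-a", "row-b"])
reindexed: list[str] = []
replace_row(lb, old="row-a", new="row-c", reindex=reindexed.append)
```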
  • 45–49. Solr Host Configuration (diagram, built up step by step)
    • Slide 45: load balancer with two rows of shard 1, shard 2, shard 3 and a co-ordinator
    • Slide 46: a third row appears
    • Slide 47 adds the label: archive ingestion
    • Slide 48: three rows; slide 49: two rows
  • 50. Conclusion
    • Service live, and increasing towards 120m docs
    • Faceted queries between 1s and 2s.
    • Term queries 500ms.
    • GC time down from >5000s to 700s/day
    • To do still:
      • Clustering
  • 51. thank you [email_address]