Upayavira's presentation at Online Information 2010 in London: the case study of an Enterprise-critical migration from custom Lucene indexes to Apache Solr, with a significant focus on scalability.
The solution needed to provide search against rapidly changing data sets and multi-million-document indexes, enabling complex queries with sub-second responses while maintaining high availability.
1.
Faceted Search – the 120 Million Documents Story
2.
Who am I? <ul><li>My (Buddhist) name is Upayavira </li></ul><ul><li>Consultant with Sourcesense, specialising in search and operational technologies </li></ul><ul><li>A member of the Apache Software Foundation </li></ul>
3.
Who are Sourcesense? <ul><li>Open Source integrator, specialising in: </li></ul><ul><ul><ul><li>Search </li></ul></ul></ul><ul><ul><ul><li>Business Intelligence </li></ul></ul></ul><ul><ul><ul><li>Content Management </li></ul></ul></ul><ul><ul><ul><li>Application Lifecycle Management </li></ul></ul></ul><ul><li>Offices in London, Amsterdam, Milan and Rome </li></ul>
5.
Who is the customer? <ul><li>News search provider </li></ul><ul><li>Industry leader </li></ul><ul><li>Has 100s of servers crawling 1.7m sites each day </li></ul><ul><li>2.5m documents (news and social media) each day </li></ul><ul><li>Keeping 2m/day, 1 month = 60m, 2 months = 120m </li></ul><ul><li>Existing tech old and fragile </li></ul>
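The retention figures on this slide follow from simple multiplication. A minimal sketch of that arithmetic, assuming a 30-day month (which is what the 60m figure implies); the constant and function names are illustrative only:

```python
DOCS_KEPT_PER_DAY = 2_000_000  # "Keeping 2m/day" from the slide
DAYS_PER_MONTH = 30            # assumption: implied by "1 month = 60m"


def index_size(months: int) -> int:
    """Documents held in the index after retaining `months` of content."""
    return DOCS_KEPT_PER_DAY * DAYS_PER_MONTH * months


for months in (1, 2):
    print(f"{months} month(s): {index_size(months):,} documents")
```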
6.
Their story? <ul><li>Aim: fast and timely search across a broad range of content </li></ul><ul><li>Refreshing of their infrastructure: maintainable </li></ul><ul><li>Features: </li></ul><ul><ul><li>Integrated search across news/social media content </li></ul></ul><ul><ul><li>Faceting </li></ul></ul><ul><ul><li>Geospatial search </li></ul></ul><ul><ul><li>Deduplication </li></ul></ul><ul><ul><li>Clustering </li></ul></ul>
14.
How Solr Works [diagram: Index; Index Snapshot; Active Index Reader; Searches]
15.
How Solr Works [diagram: Index; Index Snapshot; Active Index Reader; Searches; New Content; Active Index Writer]
16.
How Solr Works [diagram: Index; Index Snapshot; Active Index Reader; Searches; New Content; Active Index Writer; commit]
17.
How Solr Works [diagram: Index; Index Snapshot; Index Snapshot; Index Reader; Active Index Reader; Searches; New Content; Active Index Writer]
18.
How Solr Works [diagram: Index; Index Snapshot; Index Snapshot; Index Reader; Active Index Reader; Searches; New Content; Active Index Writer]
19.
How Solr Works [diagram: Index; Index Snapshot; Index Reader; Searches; New Content; Active Index Writer]
20.
How Solr Distributes <ul><li>Too many: </li></ul><ul><ul><li>Documents for one index </li></ul></ul><ul><ul><li>Requests for one server </li></ul></ul><ul><ul><li>Chances of failure </li></ul></ul><ul><li>Shards: splitting each index into parts </li></ul><ul><li>Rows: duplicating each index </li></ul>
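The two ideas on this slide can be sketched in a few lines: a shard holds one part of the index, and a row is a full duplicate set of shards. The code below is an illustration only, not Solr's own routing. The shard and row counts, the hash-based placement, and both function names are assumptions made for the example.

```python
import hashlib
import random
from typing import List, Tuple

NUM_SHARDS = 11  # hypothetical: number of parts the index is split into
NUM_ROWS = 2     # hypothetical: number of full copies of those parts


def shard_for(doc_id: str) -> int:
    """Place a document on one shard using a stable hash of its id."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def hosts_for_query(rng=random) -> List[Tuple[int, int]]:
    """A query must visit every shard, but only one row (copy) of each."""
    return [(rng.randrange(NUM_ROWS), shard) for shard in range(NUM_SHARDS)]


print(shard_for("news-article-42"))
print(hosts_for_query())
```

Splitting into shards addresses too many documents for one index, while extra rows address too many requests for one server and reduce the impact of a single host failing.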
28.
Solr: a Java web application <ul><li>Runs in a Java VM </li></ul><ul><li>JVM manages memory: garbage collection </li></ul><ul><li>JVM allocates memory into buckets </li></ul><ul><li>On JVM startup: specify memory allocations </li></ul>
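As a rough illustration of specifying memory allocations at JVM startup, the sketch below assembles a launch command using the standard HotSpot options `-Xms` (initial heap), `-Xmx` (maximum heap) and `-Xmn` (young generation size). The sizes, the `start.jar` file name and the `jvm_command` helper are assumptions for the example, not values from the presentation.

```python
def jvm_command(heap_min="4g", heap_max="4g", young_gen="1g", jar="start.jar"):
    """Build a java command line with explicit heap sizing (hypothetical values)."""
    return [
        "java",
        f"-Xms{heap_min}",   # initial heap size
        f"-Xmx{heap_max}",   # maximum heap size
        f"-Xmn{young_gen}",  # size of the young generation "bucket"
        "-jar",
        jar,
    ]


print(" ".join(jvm_command()))
```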
29.
How Solr Works [diagram: Index; Index Snapshot; Searches; New Content; Active Index Writer; Active Index Reader]
30.
How Solr Works [diagram: Index; Index Snapshot; Searches; New Content; Active Index Writer; cache; Active Index Reader]
31.
How Solr Works [diagram: Index; Index Snapshot; Index Snapshot; Index Reader; Searches; New Content; cache; Active Index Reader; cache; commit; Active Index Writer]
32.
How Solr Works [diagram: Index; Index Snapshot; Index Snapshot; Index Reader; Searches; New Content; Active Index Writer; cache; Active Index Reader; cache]
33.
How Solr Works [diagram: Index; Index Snapshot; Index Reader; Searches; New Content; Active Index Writer; cache]
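The diagram labels in the slides above suggest a cycle: searches go to a reader bound to a snapshot, new content goes to the writer, and a commit produces a new snapshot with a new reader and a fresh cache. The toy model below is one reading of those labels, not Solr's internal code. The class names and the in-memory "index" are assumptions for illustration.

```python
class IndexReader:
    """Reads one immutable snapshot and owns its own cache."""

    def __init__(self, snapshot):
        self.snapshot = snapshot  # frozen view of the committed documents
        self.cache = {}           # every new reader starts with an empty cache

    def search(self, term):
        if term not in self.cache:
            self.cache[term] = [d for d in self.snapshot if term in d.split()]
        return self.cache[term]


class Index:
    def __init__(self):
        self._pending = []  # written but not yet committed
        self._docs = []     # committed documents
        self.reader = IndexReader(tuple(self._docs))

    def add(self, doc):
        """New content goes to the writer; searches cannot see it yet."""
        self._pending.append(doc)

    def commit(self):
        """Publish pending documents as a new snapshot with a new reader."""
        self._docs.extend(self._pending)
        self._pending.clear()
        # The old reader, and everything it cached, is dropped here.
        self.reader = IndexReader(tuple(self._docs))


index = Index()
index.add("solr scales")
print(index.reader.search("solr"))  # not visible before the commit
index.commit()
print(index.reader.search("solr"))  # visible after the commit
```

In this model every commit discards the previous reader's cache, so anything cached has to be rebuilt against the new snapshot.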
36.
#3: Profiling <ul><li>Only needed because of custom components </li></ul><ul><li>Discovered 1 GB of cache reallocated on every commit </li></ul><ul><li>Mostly a key, stored as a string </li></ul><ul><li>Converted to a 'long' number </li></ul><ul><li>Reduced to 100 MB </li></ul>
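The saving from storing a numeric key as a number rather than as a string can be illustrated outside Java as well. The sketch below compares Python string objects with a packed array of 8-byte signed integers (the same width as a Java `long`). The key values and count are made up, and the exact byte counts will differ from what the JVM would report.

```python
import sys
from array import array

NUM_KEYS = 100_000
# Hypothetical 18-digit numeric keys held as strings
keys_as_strings = [str(10**17 + i) for i in range(NUM_KEYS)]
# The same keys packed as 8-byte signed integers
keys_as_longs = array("q", (int(k) for k in keys_as_strings))

string_bytes = sys.getsizeof(keys_as_strings) + sum(
    sys.getsizeof(k) for k in keys_as_strings
)
long_bytes = sys.getsizeof(keys_as_longs)

print(f"as strings: {string_bytes:,} bytes")
print(f"as longs:   {long_bytes:,} bytes")
```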
37.
Managing So Many Hosts <ul><li>With eleven (or 22) hosts, building them manually is prohibitive </li></ul><ul><li>Hosted at Amazon EC2 </li></ul><ul><li>Scripted instantiation and configuration: </li></ul><ul><ul><li>installing Java, creating/mounting partition, etc. </li></ul></ul><ul><li>Run concurrently </li></ul><ul><ul><li>Result: 11 hosts available in six minutes </li></ul></ul>
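Running the per-host setup concurrently is what keeps the total time close to the time for one host. A minimal sketch of that pattern follows. The host names, the step names and the `provision` placeholder are hypothetical, and no real EC2 or SSH calls are made.

```python
from concurrent.futures import ThreadPoolExecutor

HOSTS = [f"solr-{n:02d}" for n in range(1, 12)]  # 11 hypothetical host names


def provision(host: str) -> str:
    """Placeholder for the real work on one host."""
    steps = ["install-java", "create-partition", "mount-partition"]
    # A real script would run each step remotely; here we only record it.
    return f"{host}: " + ", ".join(steps)


def provision_all(hosts):
    """Provision every host at the same time, one worker thread per host."""
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        return list(pool.map(provision, hosts))


results = provision_all(HOSTS)
for line in results:
    print(line)
```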
43.
Content Archiving <ul><li>Important to have the ability to re-index </li></ul><ul><li>Kept a copy of all content pre-ingestion </li></ul><ul><li>Built a tool to ingest this archive </li></ul><ul><li>30 minutes for 1 day, 30 hours for 2 months </li></ul>
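The two re-indexing times on this slide are consistent with each other if ingestion time scales linearly with the number of days of content. A quick check, assuming two months means 60 days:

```python
MINUTES_PER_DAY_OF_CONTENT = 30  # "30 minutes for 1 day" from the slide
DAYS_IN_TWO_MONTHS = 60          # assumption: two 30-day months


def reindex_hours(days_of_content: int) -> float:
    """Hours needed to re-ingest the archive, assuming linear scaling."""
    return days_of_content * MINUTES_PER_DAY_OF_CONTENT / 60


print(reindex_hours(1))
print(reindex_hours(DAYS_IN_TWO_MONTHS))
```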
44.
Being Dynamic <ul><li>Schema changes require re-indexing </li></ul><ul><li>Resharding requires re-indexing </li></ul><ul><li>Automation gave major benefit: </li></ul><ul><ul><li>Can deploy an additional row and reindex </li></ul></ul>
50.
Conclusion <ul><li>Service live, and increasing towards 120m docs </li></ul><ul><li>Faceted queries between 1s and 2s. </li></ul><ul><li>Term queries 500ms. </li></ul><ul><li>GC time down from >5000s to 700s/day </li></ul><ul><li>To do still: </li></ul><ul><ul><li>Clustering </li></ul></ul>