HBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket


Solbase is an exciting new open-source, real-time search engine being developed at Photobucket to serve the more than 30 million daily search requests Photobucket handles. Solbase replaces Lucene’s file system-based index with HBase. This allows the system to update in real time and scale linearly to serve millions of daily search requests on a large dataset. This session will explore the architecture of Solbase as well as some of Lucene/Solr’s inherent issues we overcame. Finally, we’ll go over performance metrics of Solbase against production traffic.


  1. Kyungseog Oh, May 22, 2012, HBaseCon
  2. What is Solbase?
     Solbase is an open-source, real-time search platform based on Lucene, Solr and HBase, built at Photobucket.
  3. Search at Photobucket
     • 40% of total page views
     • 500 million ‘docs’ or images
     • 30 million search requests per day
     • 120 Gigabyte size
     • Previous infrastructure built on Solr/Lucene
  4. Why Solbase?
     • Memory issues
     • Indexing time
     • Speed
     • Capacity and Scalability
  5. Lucene Memory Issues
     • Field Cache – Sortable and filterable fields stored in a Java array the size of the maximum document number
     • Example – Every doc is sorted by an integer field; for 500 million documents the array is 2 GB in size
  6. Indexing Time
     • Solr indexing took 15-16 hours to rebuild the indices
     • We wanted to provide near real-time updates
  7. Speed
     • Every 100 ms improvement in response time equates to approximately 1 extra page view per visit.
     • Can end up being hundreds of millions of extra page views per month
  8. Capacity & Scalability
     • Impractical to add significant number of new docs and data (Geo, Exif, etc)
     • Difficult to divide data set to create brand new shard
     • Fault tolerance is not built in
  9. The Concept
     Modify Lucene and Solr to use HBase as the source of index and document data
  11. Query Methodology
     Term Queries are HBase range scans
     • Start key: <field><delimiter><term><delimiter><begin doc id> 0x00000000
     • End key: <field><delimiter><term><delimiter><end doc id> 0xffffffff
  12. Solbase – Distributed Processing
     [Diagram] Solr Sharding: Master over four Shards, each with its own Index File. Solbase Sharding: Master over four Shards, all backed by HBase.
  13. Solbase – Sorts & Filters
     • Extra bits in Encoded Metadata
     • Solved Lucene’s sort/filter field cache issue
  14. Solbase – Indexing Process
     • Initial Indexing – Leveraging Map/Reduce Framework
     • Real-Time Indexing – Using Solr’s update API
  15. Results
     • Term ‘me’ takes 13 seconds to load from HBase, 500 ms from cache – ‘me’ has ~14M docs, the largest term in our indices
     • Most terms not in cache take < 200 ms
     • Most cached terms take < 20 ms
     • Average query time for native Solr/Lucene: 169 ms
     • Average query time for Solbase: 109 ms or 35% decrease
     • ~300 real-time updates per second
  16. HBase configuration/Limitation
     • Compatibility issue with latest Solr
     • CDH3 latest build
     • HBase/Solbase clusters per data center
  17. Repos
     • ase
     • ase-Lucene
     • ase-Solr
  18. Q&A
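
The start/end key layout on slide 11 can be illustrated with a small sketch. This is not Solbase's actual code: the delimiter byte, the big-endian 4-byte document ID encoding, and the names `DELIMITER`, `make_row_key`, `term_scan_range` and `scan` are all assumptions made for illustration, and the "table" is just a list of row keys held in memory rather than a real HBase table.

```python
import struct

# Hypothetical delimiter byte; the slides do not say which one Solbase uses.
DELIMITER = b"\x1f"


def make_row_key(field: str, term: str, doc_id: int) -> bytes:
    """Build <field><delimiter><term><delimiter><doc id> as raw bytes.

    The doc id is packed as a big-endian unsigned 32-bit integer (an
    assumption) so that byte-wise key order matches numeric doc id order.
    """
    return (
        field.encode("utf-8") + DELIMITER
        + term.encode("utf-8") + DELIMITER
        + struct.pack(">I", doc_id)
    )


def term_scan_range(field: str, term: str):
    """Return (start_key, end_key) covering every doc id for one term."""
    return (
        make_row_key(field, term, 0x00000000),
        make_row_key(field, term, 0xFFFFFFFF),
    )


def scan(row_keys, start: bytes, end: bytes):
    """Simulate a range scan over lexicographically sorted row keys.

    The end key is treated as inclusive here for simplicity.
    """
    return [key for key in sorted(row_keys) if start <= key <= end]


rows = [
    make_row_key("title", "cat", 5),
    make_row_key("title", "cat", 300),
    make_row_key("title", "catalog", 1),
    make_row_key("tags", "cat", 2),
]
start, end = term_scan_range("title", "cat")
matches = scan(rows, start, end)
print(len(matches))
```

Because the delimiter follows the term, keys for a longer term such as "catalog" sort outside the range for "cat", so the scan returns only postings for the exact term. Note that HBase stop rows are exclusive by default, so a real scan would need to account for a document stored at the very last key.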
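
The 2 GB figure on slide 5 follows from simple arithmetic, sketched below. The only assumptions are that each array entry is a 4-byte Java `int` and that one such array exists per sortable field; the function name `field_cache_bytes` is made up for this example.

```python
JAVA_INT_BYTES = 4  # a Java int occupies 4 bytes


def field_cache_bytes(max_doc: int, bytes_per_entry: int = JAVA_INT_BYTES) -> int:
    """Approximate size of one array holding an entry per document number."""
    return max_doc * bytes_per_entry


size = field_cache_bytes(500_000_000)
print(f"{size / 1e9:.1f} GB per sortable integer field")  # prints "2.0 GB per sortable integer field"
```

Under these assumptions the cost is paid once per sortable or filterable field, so several such fields would multiply the memory footprint accordingly.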