What is Solbase? An Overview of Solbase Architecture and Implementation

• What and who is Photobucket
• A brief overview of some relevant basic
search concepts
• A brief overview of Lucene, Solr and HBase
• Solbase architecture and implementation
• Where we’re at, how we’re doing

Discussion Overview

• Photobucket is the most-visited photo site
with 23.4 Million UVs
• Over 9 Billion photos stored!
• Users upload 4 Million images per day!
• Photobucket users spend more time than
any other photo site with 3.8 Avg mins/visit
• 2.0 Million avg daily visitors - more daily visits
than Flickr and Picasa combined

Photobucket Overview
Sources: 1comScore May 2011, 2Internal data

23.4M UVs

19.7M UVs

9.9M UVs

9.5M UVs

7.9M UVs

6.0M UVs

1.6M UVs

• 40% of total page views
• 500 million ‘docs’ or image metadata
• 120 Gigabyte size
• 35 million requests per day

Search at Photobucket

• Term
– A searchable unit of language used as
part of a search query
– e.g. ‘cosmonaut’ or ‘space traveler’
• Doc or Document
– The unit of searchable content
– e.g. For Google it’s a web page
– For Photobucket it’s an image or video

Terminology

‘blue’ : page 1, page 8, page 54, page 207
‘pants’ : page 3, page 9, page 54, page 222
‘red’ : page 8, page 7, page 22, page 54
‘sweater’ : page 8, page 7, page 22, page 54

Search Concepts – Inverted Index

• Term-frequency inverse-doc or tf-idf
scoring
– tf (term frequency): sqrt(number of times a term
appears in a document)
– Idf (inverse doc frequency): log(total number of
documents/number of occurrences of a term in
all documents)
• Length normalization: 1/sqrt(total
terms in a given document)

Search Concepts - Scoring

Lucene is an open source, high-
performance, full-featured text search
engine library written in Java

What is Lucene?

Solr is an open source, high
performance enterprise search server

What is Solr?

Web
Servers

nginx nginx nginx

Squid Squid Squid Squid Squid Squid
Cache Cache Cache Cache Cache Cache

Solr Cluster Solr Cluster
Shard Cluster
Solr Shard Shard Shard Shard Cluster
Solr Shard Shard Shard
Shard
1 Shard
2 Shard
3 Shard
4 Shard
1 Shard
2 Shard
3 Shard
4
Shard
1 Shard
2 Shard
3 Shard
4 Shard
1 Shard
2 Shard
3 Shard
4
Shard
1 Shard
2 Shard
3 Shard
4 Shard
1 Shard
2 Shard
3 Shard
4
1 2 3 4 1 2 3 4

Previous Solr/Lucene Architecture

HBase is a distributed, pure-java big-
table like database built upon
Hadoop components

Why we choose HBase?

• Scan
– Range query between start and end keys
– Keys already ordered lexicographically
– Efficient to fetch index data given term

Why we choose HBase

• Memory issues
• Indexing time
• Speed
• Capacity and Scalability

Why Solbase?

• Field Cache
– Sortable and filterable fields stored in a
java array the size of the maximum
document number
• Example
– Every doc is sorted by an integer field, for
500 million documents the array is 2 GB in
size

Lucene Memory Issues

• Previous architecture took 15-16 hours
to rebuild the indices (full build, 500
million documents, 120GB total size)
• We wanted to provide near real-time
updates

Indexing Time

• Every 100 ms improvement in response
time equates to approximately 1 extra
page view per visit.
• Can end up being hundreds of millions
of extra page views per month

Speed

• Impractical to add significant number
of new docs and data (Geo, Exif, etc)
• Difficult to divide data set to create
brand new shard
• Fault tolerance is not built in

Capacity & Scalability

Modify Lucene and Solr to use HBase as
the source of index and document
data

The Concept

Term/Document Table

<field><delimiter><term 1><delimiter>document id 1><encoded metadata>

Example:

<contents>0xffffffff<beach>0xffffffff<document id 2><encoded metadata>
<contents>0xffffffff<beach>0xffffffff<document id 3><encoded metadata>
<author>0xffffffff<luke>0xffffffff<document id 1><encoded metadata>
…

Data Layout

Term/Document Encoded Metadata in
Term/Document Table

<existence flag byte>
(<normalization byte>)
(<positions vector>)
(<offsets vector>)
(<sort/filter field 1>)

Data Layout

Document table
– Key : document ID
– Column family: field : column qualifier: term

Data Layout

Term Queries are HBase range scans
Start key
<field><delimiter><term><delimiter><begin doc id>0x00000000

End key
<field><delimiter><term><delimiter><end doc id>0xffffffff

Query Methodology

Solr Sharding
Master

Solbase Sharding
Master

Shard Shard Shard Shard

HBase

Solbase – Distributed Processing

• Extra bits in Encoded Metadata
• Solved Lucene’s sort/filter field
cache issue

Solbase – Sorts & Filters

• Initial Indexing
– Leveraging Map/Reduce Framework
• Real-Time Indexing
– Using Solr’s update API

Solbase – Indexing Process

Simple HashMap based LRU cache
– Dynamically updated
– Back ground reloaded
– Time-out settings
– Pluggable

Solbase - Caching

Web
Servers

Solbase Cluster Solbase Cluster
Solbase Cluster Solbase Cluster

Shard Shard
Shard Shard Shard Shard Shard Shard
13~1
Shard 13~1
Shard
1~4
Shard 5~8
Shard 9~12
Shard 1~4
Shard 5~8
Shard 9~12
Shard
6
13~1 6
13~1
1~4 5~8 9~12 1~4 5~8 9~12
6 6

HBase cluster
HBase cluster

Region Region Region Region Region Region
server server server server server server
Regio Regio Regio Regio Regio Regio
n n n n n n
server server server server server server

Solr to Solbase

• Replaced Lucene index file with
distributed database, Hbase
• Overcame Lucene’s inherent
limitations (memory issues) with
embedded sort/filter fields
• Moved indexing process to
map/reduce framework for faster
processing time

Summary of what we did

• Provided Real time indexing
capability
• Built more flexible/performant
caching layer
• Fixed Solr document fetching issue
• Fixed HBase client socket

Summary of what we did

• Term ‘me’ takes 13 seconds to load from
HBase, 500 ms from cache
– ‘me’ has ~14M docs, the largest term in our
indices
• Most terms not in cache take < 200 ms
• Most cached terms take < 20 ms
• Average query time for native Solr/Lucene:
169 ms
• Average query time for Solbase: 109 ms or
35% decrease
• ~300 real-time updates per second

Results

• Geo-search
• Other data products within
Photobucket, outside of search, as a
general query engine for large data
sets

Next Steps

• Activity Stream

• Reporting Tool

Other project based on HBase

• https://github.com/Photobucket/Solbase
• https://github.com/Photobucket/Solbase-
• https://github.com/Photobucket/Solbase-

Repo

What is Solbase? An Overview of Solbase Architecture and Implementation

Recommended

Recommended

More Related Content

Recently uploaded

Recently uploaded (20)

Featured

Featured (20)

What is Solbase? An Overview of Solbase Architecture and Implementation

Editor's Notes