Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Apache Solr!
Ramzi Alqrainy!
Search Guy!
Part 1!

Apache Solr!
!
is!
“ a standalone full-text search server
with Apache Lucene at the backend. “!
!
!

Cont.!
Apache Lucene is a high-performance, full-
featured text search engine library written
entirely in Java. !
!
In brief Apache Solr exposes Lucene's JAVA
API as REST like API's which can be called
over HTTP from any programming
language/platform.!

Features!
l  Full Text Search!
l  Faceted navigation!
l  More items like this(Recommendation)/
Related searches !
l  Spell Suggest/Auto-Complete!
l  Custom document ranking/ordering!
l  Snippet generation/highlighting!
And a lot More....!

Why Solr ?!
Also, Solr is only provides :!
1. Result Grouping / Field Collapsing!
2. Query Elevation!
3. Pivot Facet!
4. Pluggable Search/update Workflow!
5. Hash-Based Duplication!

Field Collapsing

“ Collapses a group of results with the same
field value down to a single (or fixed
number) of entries.”!
For example, most search engines such as
Google collapse on site so only one or two
entries are shown, along with a link to click
to see more results from that site. Field
collapsing can also be used to suppress
duplicate documents.!

Result Grouping

“ groups documents with a common field
value into groups, returning the top
documents per group, and the top groups
based on what documents are in the groups”!
One example is a search at Best Buy for a
common term such as DVD, that shows the
top 3 results for each category ("TVs &
Video","Movies","Computers", etc)!

Query Elevation

enables you to configure the top results for a
given query regardless of the normal lucene
scoring. This is sometimes called "sponsored
search", "editorial boosting" or "best bets".!

Pivot Facet

You can think of it as "Decision Tree
Faceting" which tells you in advance what
the "next" set of facet results would be for a
field if you apply a constraint from the
current facet results!

Pluggable Search/update Workflow

You can modify the workflow of existing API
endpoints / document instert or updates!

Hash-Based Duplication

Determining the uniqueness of a document
not based on ad ID-Field, but the hash
signature of a field.!
!
Useful for web pages for example, where the
URL may be different but the content the
same.!

Boost documents by age!
•  Just do a descending
sort by age = done?!
•  Boost more recent
d o c u m e n t s a n d
p e n a l i z e o l d e r
documents just for
being old!
•  U s e f u l f o r n e w s ,
business docs, and
local search !

Solr: Indexing!
In schema.xml:
<fieldType name="tdate"
class="solr.TrieDateField"
omitNorms="true"
precisionStep="6"
positionIncrementGap="0"/>
<field name="pubdate"
type="tdate"
indexed="true"
stored="true"
required="true" />
Date published =
DateUtils.round(item.getPublishedOnDate(),Calendar.HOUR);

FunctionQuery Basics!
•  FunctionQuery: Computes a value for each
document!
– Ranking!
– Sorting!
constant
literal
fieldvalue
ord
rord
sum
sub
product
pow
abs
log
sqrt
map
scale
query
linear
recip
max
min
ms
sqedist - Squared Euclidean Dist
hsin, ghhsin - Haversine Formula
geohash - Convert to geohash
strdist

Solr: Query Time Boost!
•  Use the recip function with the ms function:!
q={!boost b=$recency v=$qq}&
recency=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)&
qq=wine
•  Use edismax vs. dismax if possible:!
q=wine&
boost=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)
•  Recip is a highly tunable function!
–  recip(x,m,a,b) implementing a / (m*x + b)
–  m = 3.16E-11 a= 0.08 b=0.05 x = Document Age
17

Tips and Tricks!
•  Boost should be a multiplier on the relevancy score !
•  {!boost b=} syntax confuses the spell checker so you
need to use spellcheck.q to be explicit!
q={!boost b=$recency v=$qq}&spellcheck.q=wine
•  Bottom out the old age penalty using min:!
–  min(recip(…), 0.20)
•  Not a one-size fits all solution – academic research
focused on when to apply it !
19

•  Score based on number of unique views!
•  Not known at indexing time!
•  View count should be broken into time slots!
20
Boost by Popularity!

Solr: ExternalFileField!
In schema.xml:
<fieldType name="externalPopularityScore"
keyField="id"
defVal="1"
stored="false" indexed="false"
class=”solr.ExternalFileField"
valType="pfloat"/>
<field name="popularity"
type="externalPopularityScore" />
22

Popularity Boost: Nuts & Bolts!
23
Logs

Solr
Server

User activity
logged
View
Coun1ng

Job

solr-home/data/
external_popularity
a=1.114
b=1.05
c=1.111
…
commit

Popularity Tips & Tricks
•  For big, high traffic sites, use log analysis!
–  Perfect problem for MapReduce!
–  Take a look at Hive for analyzing large volumes
of log data!
•  Minimum popularity score is 1 (not zero) …
up to 2 or more!
–  1 + (0.4*recent + 0.3*lastWeek + 0.2*lastMonth
…)!
•  Watch out for spell checker “buildOnCommit”!
24

Filtering By User Preferences
•  Easy approach is to build basic preference
fields in to the index:!
–  Content types of interest – content_type!
–  High-level categories of interest - category!
–  Source of interest – source!
!
•  We had too many categories and sources that
a user could enable / disable to use basic
filtering!
–  Custom SearchComponent with a connection to a
JDBC DataSource!
25

Preferences Component!
•  Connects to a database!
•  Caches DocIdSet in a Solr FastLRUCache!
•  Cached values marked as dirty using a
simple timestamp passed in the request!
!
Declared in solrconfig.xml:!
<searchComponent !
class=“demo.solr.PreferencesComponent" !
name=”pref">!
<str name="jdbcJndi">jdbc/solr</str> !
</searchComponent>!
26

YOU HAVE QUESTIONS
…. WE HAVE ANSWERS!
ramzi.alqrainy@gmail.com!

References!
•  h5p://wiki.apache.org/solr/

•  h5p://www.lucidworks.com/

•  Apache
Solr
4
Cookbook

Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

More Related Content

What's hot

Similar to Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

More from Ramzi Alqrainy

Recently uploaded

Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking