Apache Solr
Ramzi Alqrainy
Search Guy
Part 1
What is Apache Solr?
Apache Solr is “a standalone full-text search server with Apache Lucene at the backend.”
Cont.
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.

In brief, Apache Solr exposes Lucene's Java API as REST-like APIs that can be called over HTTP from any programming language or platform.
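Because the API is plain HTTP, a search request is just a URL. The sketch below builds one in Python. The host, port, and the core name "articles" are assumptions for illustration and do not come from the slides; q, rows, and wt are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Assumed for illustration: a local Solr instance with a core named "articles".
SOLR_BASE = "http://localhost:8983/solr/articles"

def build_select_url(query, rows=10, base=SOLR_BASE):
    """Return the URL of a Solr /select request for a full-text query."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base}/select?{urlencode(params)}"

print(build_select_url("wine"))
```

Any HTTP client, in any language, could then fetch this URL and parse the JSON response.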
Why use Apache Solr?
Features
•  Full-text search
•  Faceted navigation
•  More items like this (recommendation) / related searches
•  Spell suggest / auto-complete
•  Custom document ranking/ordering
•  Snippet generation/highlighting
And a lot more...
Why Solr?
Also, only Solr provides:
1. Result Grouping / Field Collapsing
2. Query Elevation
3. Pivot Facet
4. Pluggable Search/Update Workflow
5. Hash-Based Deduplication
Field Collapsing
“Collapses a group of results with the same field value down to a single (or fixed number) of entries.”
For example, most search engines such as Google collapse on site, so only one or two entries are shown, along with a link to click to see more results from that site. Field collapsing can also be used to suppress duplicate documents.
Result Grouping
“Groups documents with a common field value into groups, returning the top documents per group, and the top groups based on what documents are in the groups.”
One example is a search at Best Buy for a common term such as DVD, which shows the top 3 results for each category ("TVs & Video", "Movies", "Computers", etc.).
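The grouping behaviour described above can be sketched in plain Python. This is only an illustration of the idea, not Solr's implementation; the field names "category" and "score" and the sample documents are made up. In Solr itself the equivalent request uses the grouping parameters (group=true, group.field, group.limit). Setting limit=1 gives the field-collapsing case.

```python
from collections import defaultdict

def group_results(docs, field, limit=3):
    """Group scored docs by `field`, keeping the top `limit` docs per group.

    Groups are ordered by the score of their best document.
    """
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[field]].append(doc)
    ranked = []
    for value, members in groups.items():
        members.sort(key=lambda d: d["score"], reverse=True)
        ranked.append((value, members[:limit]))
    ranked.sort(key=lambda g: g[1][0]["score"], reverse=True)
    return ranked

# Hypothetical search results for the query "DVD".
docs = [
    {"id": "1", "category": "Movies", "score": 2.0},
    {"id": "2", "category": "TVs & Video", "score": 3.5},
    {"id": "3", "category": "Movies", "score": 2.7},
    {"id": "4", "category": "Computers", "score": 1.2},
]
collapsed = group_results(docs, "category", limit=1)
for category, top_docs in collapsed:
    print(category, [d["id"] for d in top_docs])
```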
Query Elevation
Enables you to configure the top results for a given query regardless of the normal Lucene scoring. This is sometimes called "sponsored search", "editorial boosting" or "best bets".
Pivot Facet
You can think of it as "Decision Tree Faceting", which tells you in advance what the "next" set of facet results would be for a field if you apply a constraint from the current facet results.
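A rough way to picture pivot faceting is a nested count: for each value of one field, the facet counts of a second field. The sketch below shows that in plain Python with made-up documents and field names ("category", "format"). It illustrates the idea only; in Solr the corresponding request parameter is facet.pivot with a comma-separated field list.

```python
from collections import Counter, defaultdict

def pivot_counts(docs, outer, inner):
    """Count values of `inner` within each value of `outer`."""
    pivot = defaultdict(Counter)
    for doc in docs:
        pivot[doc[outer]][doc[inner]] += 1
    return {value: dict(counts) for value, counts in pivot.items()}

# Hypothetical documents.
docs = [
    {"category": "Movies", "format": "DVD"},
    {"category": "Movies", "format": "Blu-ray"},
    {"category": "Movies", "format": "DVD"},
    {"category": "Computers", "format": "DVD"},
]
pivot = pivot_counts(docs, "category", "format")
print(pivot)
```

Reading pivot["Movies"] shows which format facets would remain, and with what counts, if the user then filtered on the Movies category.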
Pluggable Search/Update Workflow
You can modify the workflow of existing API endpoints and of document inserts or updates.
Hash-Based Deduplication
Determines the uniqueness of a document based not on an ID field, but on the hash signature of a field.

Useful for web pages, for example, where the URL may be different but the content is the same.
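The idea can be sketched in a few lines of Python: hash the content field and treat documents with equal hashes as duplicates. This is an illustration, not Solr's own implementation. The choice of MD5, the whitespace and case normalisation, and the field names are all assumptions.

```python
import hashlib

def content_signature(text):
    """Hash the whitespace-normalised, lower-cased text of a field."""
    normalised = " ".join(text.lower().split())
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()

def deduplicate(docs, field="content"):
    """Keep the first document seen for each content signature."""
    seen = set()
    unique = []
    for doc in docs:
        signature = content_signature(doc[field])
        if signature not in seen:
            seen.add(signature)
            unique.append(doc)
    return unique

# Hypothetical pages: two URLs serving the same content.
pages = [
    {"url": "http://example.com/a", "content": "Solr  is a search server."},
    {"url": "http://example.com/a?ref=home", "content": "solr is a search server."},
    {"url": "http://example.com/b", "content": "Lucene is a library."},
]
unique_pages = deduplicate(pages)
print([p["url"] for p in unique_pages])
```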
Boost documents by age
•  Just do a descending sort by publication date = done?
•  Boost more recent documents and penalize older documents just for being old
•  Useful for news, business docs, and local search
Solr: Indexing
In schema.xml:
<fieldType name="tdate"
class="solr.TrieDateField"
omitNorms="true"
precisionStep="6"
positionIncrementGap="0"/>
<field name="pubdate"
type="tdate"
indexed="true"
stored="true"
required="true" />
Date published =
DateUtils.round(item.getPublishedOnDate(),Calendar.HOUR);
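Rounding the publication date to the hour before indexing reduces the number of distinct date values in the index. The Python sketch below mirrors the rounding step above under one assumption: that values 30 minutes or more past the hour round up. The function name is hypothetical.

```python
from datetime import datetime, timedelta

def round_to_hour(moment):
    """Round a datetime to the nearest hour (30 minutes and above round up)."""
    floored = moment.replace(minute=0, second=0, microsecond=0)
    if moment - floored >= timedelta(minutes=30):
        floored += timedelta(hours=1)
    return floored

print(round_to_hour(datetime(2013, 5, 1, 10, 29, 59)))
print(round_to_hour(datetime(2013, 5, 1, 23, 45, 0)))
```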
FunctionQuery Basics
•  FunctionQuery: computes a value for each document
–  Ranking
–  Sorting
•  Functions: constant, literal, fieldvalue, ord, rord, sum, sub, product, pow, abs, log, sqrt, map, scale, query, linear, recip, max, min, ms, sqedist (squared Euclidean distance), hsin and ghhsin (haversine formula), geohash (convert to geohash), strdist
Solr: Query Time Boost
•  Use the recip function with the ms function:
q={!boost b=$recency v=$qq}&
recency=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)&
qq=wine
•  Use edismax rather than dismax if possible:
q=wine&
boost=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)
•  recip is a highly tunable function
–  recip(x,m,a,b) implements a / (m*x + b)
–  m = 3.16E-11, a = 0.08, b = 0.05, x = document age in milliseconds
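To get a feel for these constants, the recip formula can be evaluated outside Solr. The Python sketch below uses the m, a, and b values from the slide. Note that m = 3.16e-11 is roughly one divided by the number of milliseconds in a year, so m*x is about 1 for a one-year-old document. The helper names and the sample ages are made up for illustration.

```python
MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000  # roughly 3.16e10 milliseconds

def recip(x, m, a, b):
    """Reciprocal function a / (m*x + b), as defined on the slide."""
    return a / (m * x + b)

def recency_boost(age_ms, m=3.16e-11, a=0.08, b=0.05):
    """Boost multiplier for a document that is `age_ms` milliseconds old."""
    return recip(age_ms, m, a, b)

# Sample ages chosen for illustration only.
for label, years in [("new", 0), ("1 month", 1 / 12), ("1 year", 1), ("5 years", 5)]:
    print(f"{label:>8}: {recency_boost(years * MS_PER_YEAR):.3f}")
```

A brand-new document gets a / b = 1.6, and the multiplier falls off quickly as the document ages. Changing a and b changes the starting value and how steep the drop is.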
Tune Solr recip function
Tips and Tricks
•  Boost should be a multiplier on the relevancy score
•  {!boost b=} syntax confuses the spell checker, so you need to use spellcheck.q to be explicit:
q={!boost b=$recency v=$qq}&spellcheck.q=wine
•  Bottom out the old age penalty using max:
–  max(recip(…), 0.20)
•  Not a one-size-fits-all solution; academic research has focused on when to apply it
Boost by Popularity
•  Score based on number of unique views
•  Not known at indexing time
•  View count should be broken into time slots
Popularity Illustrated
Solr: ExternalFileField
In schema.xml:
<fieldType name="externalPopularityScore"
keyField="id"
defVal="1"
stored="false" indexed="false"
class="solr.ExternalFileField"
valType="pfloat"/>
<field name="popularity"
type="externalPopularityScore" />
Popularity Boost: Nuts & Bolts
[Diagram: user activity on the Solr server is logged → logs → view counting job → solr-home/data/external_popularity → commit]
Example contents of external_popularity:
a=1.114
b=1.05
c=1.111
…
Popularity Tips & Tricks
•  For big, high-traffic sites, use log analysis
–  Perfect problem for MapReduce
–  Take a look at Hive for analyzing large volumes of log data
•  Minimum popularity score is 1 (not zero) … up to 2 or more
–  1 + (0.4*recent + 0.3*lastWeek + 0.2*lastMonth …)
•  Watch out for spell checker “buildOnCommit”
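The scoring formula and the key=value file layout can be tied together in a short Python sketch. The three weights come from the slide. The slide's formula is elided after lastMonth, so no further terms are guessed here. The assumption that each time-slot value is already normalised to the range 0 to 1, the three-decimal formatting, and the sample numbers are illustrative only.

```python
# Weights from the slide; any further time slots are elided there and left out here.
WEIGHTS = {"recent": 0.4, "lastWeek": 0.3, "lastMonth": 0.2}

def popularity_score(slot_values, weights=WEIGHTS):
    """Return 1 + the weighted sum of per-slot popularity values.

    Assumes each slot value is already normalised to the range [0, 1].
    """
    return 1.0 + sum(weights[slot] * slot_values.get(slot, 0.0) for slot in weights)

def external_file_lines(scores):
    """Format scores as key=value lines, one per document ID."""
    return [f"{doc_id}={score:.3f}" for doc_id, score in sorted(scores.items())]

# Hypothetical normalised view counts per time slot.
views = {
    "a": {"recent": 0.2, "lastWeek": 0.1, "lastMonth": 0.02},
    "b": {"recent": 0.0, "lastWeek": 0.0, "lastMonth": 0.0},
}
scores = {doc_id: popularity_score(slots) for doc_id, slots in views.items()}
lines = external_file_lines(scores)
print("\n".join(lines))
```

A document with no views still scores 1, so multiplying the relevancy score by its popularity leaves that document's ranking unchanged instead of zeroing it out.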
Filtering By User Preferences
•  Easy approach is to build basic preference fields into the index:
–  Content types of interest – content_type
–  High-level categories of interest – category
–  Source of interest – source

•  We had too many categories and sources that a user could enable/disable to use basic filtering
–  Custom SearchComponent with a connection to a JDBC DataSource
Preferences Component
•  Connects to a database
•  Caches DocIdSet in a Solr FastLRUCache
•  Cached values marked as dirty using a simple timestamp passed in the request

Declared in solrconfig.xml:
<searchComponent
class="demo.solr.PreferencesComponent"
name="pref">
<str name="jdbcJndi">jdbc/solr</str>
</searchComponent>
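The timestamp-based invalidation can be sketched independently of Solr. The Python class below is a hypothetical stand-in, not the PreferencesComponent itself, which is a Java SearchComponent using JDBC and FastLRUCache. It uses an unbounded dict with no LRU eviction, and the loader function simulates the database lookup.

```python
class PreferenceCache:
    """Cache per-user preference data, reloaded when a newer timestamp arrives."""

    def __init__(self, loader):
        self._loader = loader   # callable: user_id -> preference data
        self._entries = {}      # user_id -> (timestamp, cached value)

    def get(self, user_id, prefs_timestamp):
        entry = self._entries.get(user_id)
        # A missing entry, or one older than the request's timestamp, is dirty.
        if entry is None or entry[0] < prefs_timestamp:
            value = self._loader(user_id)
            self._entries[user_id] = (prefs_timestamp, value)
            return value
        return entry[1]

load_count = {"n": 0}

def load_from_db(user_id):
    """Stand-in for the database query; counts how often it runs."""
    load_count["n"] += 1
    return {"news", "blogs"}

cache = PreferenceCache(load_from_db)
cache.get("u1", prefs_timestamp=100)  # first request: loads from the "database"
cache.get("u1", prefs_timestamp=100)  # same timestamp: served from the cache
cache.get("u1", prefs_timestamp=200)  # newer timestamp: entry is dirty, reloads
print(load_count["n"])
```

Passing the timestamp in the request means the client that changed the preferences decides when the cache is stale, so the server needs no separate invalidation channel.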
YOU HAVE QUESTIONS … WE HAVE ANSWERS
ramzi.alqrainy@gmail.com

Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking
