Your SlideShare is downloading. ×
Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

2,072

Published on

Apache Solr 4 Part 1 - Why use Solr ? How to build Recency Ranking and Popularity Ranking ?

Apache Solr 4 Part 1 - Why use Solr ? How to build Recency Ranking and Popularity Ranking ?

Published in: Technology
0 Comments
5 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
2,072
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
28
Comments
0
Likes
5
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Apache Solr!Ramzi Alqrainy!Search Guy!Part 1!
  • 2. What !is Apache Solr ?!
  • 3. Apache Solr!!is!“ a standalone full-text search serverwith Apache Lucene at the backend. “!!!
  • 4. Cont.!Apache Lucene is a high-performance, full-featured text search engine library writtenentirely in Java. !!In brief Apache Solr exposes Lucenes JAVAAPI as REST like APIs which can be calledover HTTP from any programminglanguage/platform.!
  • 5. Why!use Apache Solr ?!
  • 6. Features!l  Full Text Search!l  Faceted navigation!l  More items like this(Recommendation)/Related searches !l  Spell Suggest/Auto-Complete!l  Custom document ranking/ordering!l  Snippet generation/highlighting!And a lot More....!
  • 7. Why Solr ?!Also, Solr is only provides :!1. Result Grouping / Field Collapsing!2. Query Elevation!3. Pivot Facet!4. Pluggable Search/update Workflow!5. Hash-Based Duplication!
  • 8. Field Collapsing  “ Collapses a group of results with the samefield value down to a single (or fixednumber) of entries.”!For example, most search engines such asGoogle collapse on site so only one or twoentries are shown, along with a link to clickto see more results from that site. Fieldcollapsing can also be used to suppressduplicate documents.!
  • 9. Result Grouping  “ groups documents with a common fieldvalue into groups, returning the topdocuments per group, and the top groupsbased on what documents are in the groups”!One example is a search at Best Buy for acommon term such as DVD, that shows thetop 3 results for each category ("TVs &Video","Movies","Computers", etc)!
  • 10. Query Elevation  enables you to configure the top results for agiven query regardless of the normal lucenescoring. This is sometimes called "sponsoredsearch", "editorial boosting" or "best bets".!
  • 11. Pivot Facet  You can think of it as "Decision TreeFaceting" which tells you in advance whatthe "next" set of facet results would be for afield if you apply a constraint from thecurrent facet results!
  • 12. Pluggable Search/update Workflow  You can modify the workflow of existing APIendpoints / document instert or updates!
  • 13. Hash-Based Duplication  Determining the uniqueness of a documentnot based on ad ID-Field, but the hashsignature of a field.!!Useful for web pages for example, where theURL may be different but the content thesame.!
  • 14. Boost documents by age!•  Just do a descendingsort by age = done?!•  Boost more recentd o c u m e n t s a n dp e n a l i z e o l d e rdocuments just forbeing old!•  U s e f u l f o r n e w s ,business docs, andlocal search !
  • 15. Solr: Indexing!In schema.xml:<fieldType name="tdate"class="solr.TrieDateField"omitNorms="true"precisionStep="6"positionIncrementGap="0"/><field name="pubdate"type="tdate"indexed="true"stored="true"required="true" />Date published =DateUtils.round(item.getPublishedOnDate(),Calendar.HOUR);
  • 16. FunctionQuery Basics!•  FunctionQuery: Computes a value for eachdocument!– Ranking!– Sorting!constantliteralfieldvalueordrordsumsubproductpowabslogsqrtmapscalequerylinearrecipmaxminmssqedist - Squared Euclidean Disthsin, ghhsin - Haversine Formulageohash - Convert to geohashstrdist
  • 17. Solr: Query Time Boost!•  Use the recip function with the ms function:!q={!boost b=$recency v=$qq}&recency=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)&qq=wine•  Use edismax vs. dismax if possible:!q=wine&boost=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)•  Recip is a highly tunable function!–  recip(x,m,a,b) implementing a / (m*x + b)–  m = 3.16E-11 a= 0.08 b=0.05 x = Document Age17
  • 18. Tune Solr recip function!18
  • 19. Tips and Tricks!•  Boost should be a multiplier on the relevancy score !•  {!boost b=} syntax confuses the spell checker so youneed to use spellcheck.q to be explicit!q={!boost b=$recency v=$qq}&spellcheck.q=wine•  Bottom out the old age penalty using min:!–  min(recip(…), 0.20)•  Not a one-size fits all solution – academic researchfocused on when to apply it !19
  • 20. •  Score based on number of unique views!•  Not known at indexing time!•  View count should be broken into time slots!20Boost by Popularity!
  • 21. Popularity Illustrated!21
  • 22. Solr: ExternalFileField!In schema.xml:<fieldType name="externalPopularityScore"keyField="id"defVal="1"stored="false" indexed="false"class=”solr.ExternalFileField"valType="pfloat"/><field name="popularity"type="externalPopularityScore" />22
  • 23. Popularity Boost: Nuts & Bolts!23Logs  Solr  Server  User activityloggedView  Coun1ng  Job  solr-home/data/external_popularitya=1.114b=1.05c=1.111…commit
  • 24. Popularity Tips & Tricks•  For big, high traffic sites, use log analysis!–  Perfect problem for MapReduce!–  Take a look at Hive for analyzing large volumesof log data!•  Minimum popularity score is 1 (not zero) …up to 2 or more!–  1 + (0.4*recent + 0.3*lastWeek + 0.2*lastMonth…)!•  Watch out for spell checker “buildOnCommit”!24
  • 25. Filtering By User Preferences•  Easy approach is to build basic preferencefields in to the index:!–  Content types of interest – content_type!–  High-level categories of interest - category!–  Source of interest – source!!•  We had too many categories and sources thata user could enable / disable to use basicfiltering!–  Custom SearchComponent with a connection to aJDBC DataSource!25
  • 26. Preferences Component!•  Connects to a database!•  Caches DocIdSet in a Solr FastLRUCache!•  Cached values marked as dirty using asimple timestamp passed in the request!!Declared in solrconfig.xml:!<searchComponent !class=“demo.solr.PreferencesComponent" !name=”pref">!<str name="jdbcJndi">jdbc/solr</str> !</searchComponent>!26
  • 27. YOU HAVE QUESTIONS…. WE HAVE ANSWERS!ramzi.alqrainy@gmail.com!
  • 28. References!•  h5p://wiki.apache.org/solr/  •  h5p://www.lucidworks.com/  •  Apache  Solr  4  Cookbook  

×