Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking


Apache Solr 4 Part 1 - Why use Solr? How to build Recency Ranking and Popularity Ranking?

Published in: Technology

  1. Apache Solr. Ramzi Alqrainy, Search Guy. Part 1
  2. What is Apache Solr?
  3. Apache Solr is "a standalone full-text search server with Apache Lucene at the backend."
  4. Cont. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. In brief, Apache Solr exposes Lucene's Java API as REST-like APIs which can be called over HTTP from any programming language or platform.
  5. Why use Apache Solr?
  6. Features
     • Full-text search
     • Faceted navigation
     • More items like this (recommendation) / related searches
     • Spell suggest / auto-complete
     • Custom document ranking/ordering
     • Snippet generation/highlighting
     And a lot more...
  7. Why Solr? Also, only Solr provides:
     1. Result Grouping / Field Collapsing
     2. Query Elevation
     3. Pivot Facet
     4. Pluggable Search/Update Workflow
     5. Hash-Based Deduplication
  8. Field Collapsing: "Collapses a group of results with the same field value down to a single (or fixed number) of entries." For example, most search engines such as Google collapse on site so only one or two entries are shown, along with a link to click to see more results from that site. Field collapsing can also be used to suppress duplicate documents.
  9. Result Grouping: "groups documents with a common field value into groups, returning the top documents per group, and the top groups based on what documents are in the groups". One example is a search at Best Buy for a common term such as DVD, which shows the top 3 results for each category ("TVs & Video", "Movies", "Computers", etc.).
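The grouping and collapsing behaviour described on these two slides can be sketched in plain Java. This is an in-memory illustration only, not Solr's implementation: the `Doc` class, its fields, and the `topPerGroup` helper are hypothetical names made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory sketch of result grouping; all names here are illustrative. */
final class GroupingSketch {
    /** Minimal stand-in for a scored search hit. */
    static final class Doc {
        final String id;
        final String category;
        final double score;

        Doc(String id, String category, double score) {
            this.id = id;
            this.category = category;
            this.score = score;
        }
    }

    /**
     * Keeps the best `limit` docs per category. Groups come back ordered by
     * the score of their best doc, because docs are visited best-first.
     */
    static Map<String, List<Doc>> topPerGroup(List<Doc> docs, int limit) {
        List<Doc> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparingDouble((Doc d) -> d.score).reversed());
        Map<String, List<Doc>> groups = new LinkedHashMap<>();
        for (Doc d : sorted) {
            List<Doc> group = groups.computeIfAbsent(d.category, k -> new ArrayList<>());
            if (group.size() < limit) {
                group.add(d);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Doc> hits = Arrays.asList(
                new Doc("1", "Movies", 3.0),
                new Doc("2", "Movies", 2.0),
                new Doc("3", "TVs & Video", 5.0),
                new Doc("4", "Movies", 1.0));
        // limit = 2 behaves like grouping; limit = 1 behaves like field collapsing.
        topPerGroup(hits, 2).forEach((category, group) ->
                System.out.println(category + " -> " + group.size() + " doc(s)"));
    }
}
```

Calling `topPerGroup(hits, 1)` keeps a single entry per field value, which is the collapsing case from slide 8. In Solr itself the equivalent behaviour is requested with query parameters such as `group=true`, `group.field` and `group.limit` rather than with client code.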
  10. Query Elevation: enables you to configure the top results for a given query regardless of the normal Lucene scoring. This is sometimes called "sponsored search", "editorial boosting" or "best bets".
  11. Pivot Facet: You can think of it as "Decision Tree Faceting", which tells you in advance what the "next" set of facet results would be for a field if you apply a constraint from the current facet results.
  12. Pluggable Search/Update Workflow: You can modify the workflow of existing API endpoints and of document inserts or updates.
  13. Hash-Based Deduplication: Determining the uniqueness of a document based not on an ID field, but on the hash signature of a field. Useful for web pages, for example, where the URL may be different but the content the same.
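The idea of a hash signature can be illustrated with the JDK alone. The sketch below is an assumption-laden example, not Solr's code: it hashes a content string with MD5 so that two pages with different URLs but identical bodies get the same signature. The class and method names are hypothetical. In Solr this kind of signature is configured through an update processor (`SignatureUpdateProcessorFactory`), which the sketch does not use.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch: derive a duplicate-detection signature from a content field. */
final class ContentSignatureSketch {
    /** Returns the MD5 digest of the content as 32 lowercase hex characters. */
    static String signature(String content) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to ship an MD5 implementation.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Two hypothetical pages: different URLs, same body.
        String bodyAtUrlA = "Solr is a search server.";
        String bodyAtUrlB = "Solr is a search server.";
        boolean duplicate = signature(bodyAtUrlA).equals(signature(bodyAtUrlB));
        System.out.println("duplicate content: " + duplicate);
    }
}
```

An exact hash only catches byte-identical content; near-duplicates need a fuzzier signature, which is outside the scope of this sketch.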
  14. Boost documents by age
      • Just do a descending sort by age = done?
      • Boost more recent documents and penalize older documents just for being old
      • Useful for news, business docs, and local search
  15. Solr: Indexing
      In schema.xml:
        <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true"
                   precisionStep="6" positionIncrementGap="0"/>
        <field name="pubdate" type="tdate" indexed="true" stored="true" required="true"/>
      Date published = DateUtils.round(item.getPublishedOnDate(), Calendar.HOUR);
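The slide rounds the publication date to the hour with `DateUtils.round` from Apache Commons Lang. For readers without that library, the following is a rough JDK-only equivalent using `java.time`. It is a sketch under the assumption that rounding on UTC hour boundaries is acceptable; `roundToHour` is a hypothetical helper name, not part of the slides.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

/** Sketch: round a publication timestamp to the nearest hour before indexing. */
final class HourRoundingSketch {
    /** Shifts by 30 minutes, then truncates, so 10:29 -> 10:00 and 10:30 -> 11:00. */
    static Instant roundToHour(Instant published) {
        return published.plus(Duration.ofMinutes(30)).truncatedTo(ChronoUnit.HOURS);
    }

    public static void main(String[] args) {
        Instant published = Instant.parse("2013-05-01T10:42:17Z");
        System.out.println(published + " -> " + roundToHour(published));
    }
}
```

Indexing at hour granularity pairs naturally with the `NOW/HOUR` rounding used in the query-time boost on slide 17.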
  16. FunctionQuery Basics
      • FunctionQuery: computes a value for each document
        – Ranking
        – Sorting
      Available functions: constant, literal, fieldvalue, ord, rord, sum, sub, product, pow, abs, log, sqrt, map, scale, query, linear, recip, max, min, ms, sqedist (squared Euclidean distance), hsin and ghhsin (Haversine formula), geohash (convert to geohash), strdist
  17. Solr: Query Time Boost
      • Use the recip function with the ms function:
          q={!boost b=$recency v=$qq}&recency=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)&qq=wine
      • Use edismax rather than dismax if possible:
          q=wine&boost=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)
      • recip is a highly tunable function
        – recip(x,m,a,b) implements a / (m*x + b)
        – m = 3.16E-11, a = 0.08, b = 0.05, x = document age (in milliseconds)
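To get a feel for the numbers, the recip formula from the slide can be evaluated outside Solr. The sketch below simply reimplements a / (m*x + b) in Java with the slide's constants; the class name and the `boostForAgeDays` helper are made up for illustration. Note that m = 3.16e-11 is roughly one over the number of milliseconds in a year, so m*x is approximately the document age in years.

```java
import java.util.concurrent.TimeUnit;

/** Sketch of the arithmetic behind recip(x, m, a, b) = a / (m*x + b). */
final class RecipBoostSketch {
    // Constants taken from the slide.
    static final double M = 3.16e-11;
    static final double A = 0.08;
    static final double B = 0.05;

    /** Same formula as the recip function query. */
    static double recip(double x, double m, double a, double b) {
        return a / (m * x + b);
    }

    /** Boost for a document of the given age in days (illustrative helper). */
    static double boostForAgeDays(long days) {
        double ageMs = (double) TimeUnit.DAYS.toMillis(days);
        return recip(ageMs, M, A, B);
    }

    public static void main(String[] args) {
        for (long days : new long[] {0, 1, 30, 365}) {
            System.out.printf("age=%4d days -> boost=%.4f%n", days, boostForAgeDays(days));
        }
    }
}
```

With these constants a brand-new document gets a boost of about 1.6 (0.08 / 0.05), and a one-year-old document about 0.076. Raising b flattens the curve for fresh documents; raising a scales the whole curve up.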
  18. Tune Solr recip function
  19. Tips and Tricks
      • Boost should be a multiplier on the relevancy score
      • The {!boost b=} syntax confuses the spell checker, so you need to use spellcheck.q to be explicit:
          q={!boost b=$recency v=$qq}&spellcheck.q=wine
      • Bottom out the old-age penalty by putting a floor under the boost:
        – max(recip(…), 0.20)
      • Not a one-size-fits-all solution; academic research has focused on when to apply it
  20. Boost by Popularity
      • Score based on number of unique views
      • Not known at indexing time
      • View count should be broken into time slots
  21. Popularity Illustrated
  22. Solr: ExternalFileField
      In schema.xml:
        <fieldType name="externalPopularityScore" keyField="id" defVal="1"
                   stored="false" indexed="false"
                   class="solr.ExternalFileField" valType="pfloat"/>
        <field name="popularity" type="externalPopularityScore"/>
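  23. Popularity Boost: Nuts & Bolts
      (Diagram) User activity on the Solr server is logged. A view counting job processes the logs and writes solr-home/data/external_popularity, one key=value pair per line (a=1.114, b=1.05, c=1.111, …), followed by a commit.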
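The last step of the view counting job, turning per-document scores into the key=value lines shown in the diagram, might look roughly like this. It is a sketch only: the `render` helper, the three-decimal formatting and the sorted output are assumptions made for the example, and writing the result to solr-home/data/external_popularity is left out.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

/** Sketch: render popularity scores as key=value lines for an external file. */
final class ExternalFileSketch {
    /** One "docId=score" line per document, ordered by document id. */
    static String render(Map<String, Double> scores) {
        StringJoiner lines = new StringJoiner("\n");
        // TreeMap is used only to make the output order deterministic.
        for (Map.Entry<String, Double> e : new TreeMap<>(scores).entrySet()) {
            lines.add(e.getKey() + "=" + String.format(Locale.ROOT, "%.3f", e.getValue()));
        }
        return lines.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("a", 1.114);
        scores.put("b", 1.05);
        scores.put("c", 1.111);
        System.out.println(render(scores));
    }
}
```

Documents missing from the file fall back to the `defVal="1"` declared on the field type in slide 22, so the job only needs to list documents that actually received views.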
  24. Popularity Tips & Tricks
      • For big, high-traffic sites, use log analysis
        – Perfect problem for MapReduce
        – Take a look at Hive for analyzing large volumes of log data
      • Minimum popularity score is 1 (not zero) … up to 2 or more
        – 1 + (0.4*recent + 0.3*lastWeek + 0.2*lastMonth…)
      • Watch out for spell checker "buildOnCommit"
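The scoring formula on this slide can be written out as a small Java sketch. The weights 0.4, 0.3 and 0.2 come from the slide; the terms the slide elides with "…" are not reproduced. How the raw view counts are turned into the `recent`, `lastWeek` and `lastMonth` inputs is not stated in the slides, so the `normalize` helper (scaling against the largest count in the slot) is purely an assumption.

```java
/** Sketch of the time-slotted popularity score: 1 + weighted sum of view signals. */
final class PopularityScoreSketch {
    /** Weights are from the slide; further time slots are elided there and omitted here. */
    static double score(double recent, double lastWeek, double lastMonth) {
        return 1.0 + (0.4 * recent + 0.3 * lastWeek + 0.2 * lastMonth);
    }

    /** Assumption: scale a raw view count into [0, 1] against the slot's largest count. */
    static double normalize(long views, long maxViews) {
        if (maxViews <= 0) {
            return 0.0;
        }
        double ratio = (double) views / maxViews;
        return Math.max(0.0, Math.min(1.0, ratio));
    }

    public static void main(String[] args) {
        // Hypothetical view counts for one document in three time slots.
        double recent = normalize(40, 200);
        double lastWeek = normalize(300, 1000);
        double lastMonth = normalize(900, 3000);
        System.out.printf("popularity=%.3f%n", score(recent, lastWeek, lastMonth));
    }
}
```

Starting at 1 matters because the score is applied as a multiplier: a document with no views keeps its relevancy score unchanged instead of being multiplied by zero.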
  25. Filtering By User Preferences
      • Easy approach is to build basic preference fields into the index:
        – Content types of interest: content_type
        – High-level categories of interest: category
        – Source of interest: source
      • We had too many categories and sources that a user could enable / disable to use basic filtering
        – Custom SearchComponent with a connection to a JDBC DataSource
  26. Preferences Component
      • Connects to a database
      • Caches DocIdSet in a Solr FastLRUCache
      • Cached values marked as dirty using a simple timestamp passed in the request
      Declared in solrconfig.xml:
        <searchComponent class="demo.solr.PreferencesComponent" name="pref">
          <str name="jdbcJndi">jdbc/solr</str>
        </searchComponent>
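The slides do not show the component's source, so the following is only a guess at the caching idea: an LRU cache whose entries are reloaded when the request carries a newer timestamp than the one stored with the entry. It uses a plain `LinkedHashMap` rather than Solr's FastLRUCache, a generic value type instead of DocIdSet, and hypothetical names throughout.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Sketch: LRU cache with timestamp-based invalidation (all names illustrative). */
final class PreferenceCacheSketch<V> {
    /** A cached value plus the preference timestamp it was loaded for. */
    private static final class Cached<T> {
        final long loadedFor;
        final T value;

        Cached(long loadedFor, T value) {
            this.loadedFor = loadedFor;
            this.value = value;
        }
    }

    private final Map<String, Cached<V>> lru;

    PreferenceCacheSketch(final int maxEntries) {
        // Access-ordered LinkedHashMap that evicts the least recently used entry.
        this.lru = new LinkedHashMap<String, Cached<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Cached<V>> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /**
     * Returns the cached value for the user, reloading it when the request's
     * timestamp is newer than the one the cached entry was loaded for.
     */
    synchronized V get(String userId, long prefsChangedAt, Function<String, V> loader) {
        Cached<V> hit = lru.get(userId);
        if (hit == null || hit.loadedFor < prefsChangedAt) {
            hit = new Cached<>(prefsChangedAt, loader.apply(userId));
            lru.put(userId, hit);
        }
        return hit.value;
    }

    public static void main(String[] args) {
        PreferenceCacheSketch<String> cache = new PreferenceCacheSketch<>(100);
        // The loader stands in for the database lookup.
        Function<String, String> loader = user -> "filter-for-" + user;
        System.out.println(cache.get("u1", 100L, loader)); // loads
        System.out.println(cache.get("u1", 100L, loader)); // served from cache
        System.out.println(cache.get("u1", 200L, loader)); // newer timestamp: reloads
    }
}
```

Passing the timestamp with each request keeps the cache free of any background invalidation logic: the client that changed the preferences is also the one that tells the server the cached entry is stale.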
  28. References
      • http://
      • http://
      • Apache Solr 4 Cookbook