Key topics when migrating from FAST to Solr, EuroCon 2010


Presented during Lucene EuroCon 2010 in Prague. This presentation assumes no prior experience with FAST ESP, but some idea of what Solr/Lucene is. It gives you some hints on what to expect when migrating.



  1. Key topics when Migrating from FAST to Solr. By Jan Høydahl, Cominvent AS. Apache Lucene EuroCon, 05/21/10
  2. Agenda
      ● About Cominvent & Jan Høydahl
      ● Quick overview of FAST ESP
      ● The migration step by step
      ● Pain points
      ● Q&A
  3. Jan Høydahl: BIO
      ● Enterprise search consultant since 2000
      ● Background in Telecom, Mobile services & software development
      ● Second FAST Global Services engineer
      ● Founder of Cominvent AS
      ● Lucid Imagination certified instructor & partner
      ● FAST Certified instructor
      Logos represent projects I've been involved in, and ™ are © of respective companies
  4. Cominvent AS: Consulting
      ● Vendor independent search consulting
  5. Cominvent AS: Training
      ● Certified Solr Training Partner with Lucid Imagination
      ● Certified FAST ESP Training Partner
  6. Solr training Oslo June 1-3
  7. Assumptions
      ● The decision to migrate to Solr has already been made
      ● This is not a "sales talk" for any particular technology
      ● Basic knowledge of Solr
      ● No or limited knowledge of FAST ESP
      ● Migration to plain Solr or LucidWorks (LucidWorks Enterprise edition not considered)
  8. Introduction to... ...for Solr people
  9. Security Connectors
  11. FAST ESP architecture
  12. Very strong & scalable document processing framework
      [Figure: document processing pipeline with the stages Format Conversion, Language Detection, Linguistic Normalization, Entities, Taxonomy, Sentiment, Ontology and Custom Plug-in, feeding Search and Alert]
      Sample input document: "PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes."
  13. FAST Document Processors (DP)
      ● DPs transform documents prior to indexing
      ● This is different from Solr's field centric analysis
      ● Examples of stages:
          ● Encoding normalization, language identification
          ● Text extraction (HTML, PDF, MS Office, etc.)
          ● Tokenization, lemmatization, entity extraction
      ● DPs are chained in pipelines
      ● ESP ships with lots of useful DPs and pipelines
      ● Written in Python, very easy to script new ones
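The idea of document-centric stages chained into a pipeline can be sketched in a few lines of Python. This is only an illustration of the concept: the stage names and the plain-dict document are hypothetical and are not the FAST ESP document processor API.

```python
# Hypothetical stages: each takes a document (a plain dict) and returns it
# after modifying it. None of this is FAST's real API.

def normalize_whitespace(doc):
    """Collapse runs of whitespace in the body to single spaces."""
    doc["body"] = " ".join(doc.get("body", "").split())
    return doc


def add_word_count(doc):
    """Add a derived field computed from the whole document."""
    doc["word_count"] = len(doc.get("body", "").split())
    return doc


def run_pipeline(doc, stages):
    """Pass the document through every stage in order, prior to indexing."""
    for stage in stages:
        doc = stage(doc)
    return doc
```

The point of the sketch is that every stage sees the whole document, whereas a Solr analyzer only sees the value of one field.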
  14. Terminology
      Lucene/Solr | FAST
      Replica | Search row
      Shard | Column
      Facet | Navigator
      Spellcheck | Did you mean
      Update processor | Document processor
      Request Handler | Query Transformer (QT)
      Response Writer | Result Processor (RP)/TWM
  15. Terminology
      Lucene/Solr | FAST
      Schema | Index profile
      Index segment | Index partition
      Lucene IndexWriter/Rdr | indexer/fsearch (RTS)
      ~Multi core | ~Multi cluster
      (Documents receiving same processing) | Collection
  16. Important differences
      Lucene/Solr | FAST
      Most features query-time | Most features index-time
      Field centric analysis | Document centric analysis
      One language per field | Multi lingual fields
      One Update handler per input type (XML, CSV) | Format conversion in document pipeline
      Slim disk & memory footprint | Quite fat disk & memory footprint
      One Java Web app | 15-20 processes
  17. Solr Architecture
      Thanks to Christian Moen/ATILIKA for graphics
  18. The migration...
  19. Steps of the migration
      ● Review current features & architecture
          ● Keep all features? Add new?
      ● Install Solr and do a quick iteration (1-2 days):
          ● Draft schema.xml & solrconfig.xml
          ● Dump & index some real data
          ● Play around with queries; Solritas is nice here
      ● Design spec covering all migration areas:
          ● Schema, Content, Feeding & Analysis
          ● Frontends, Querying & API
          ● Admin & Operational
      ● Implement :)
  20. Spreadsheet for planning the schema
  21. Migrating index-profile -> Solr schema
      ● ESP index profile -> Solr schema.xml
      ● FAST fields example:
      ● Solr equivalent:
      ● Example: A field with "tokenize=auto" in FAST → type="text"
      ● Create new <fieldType>s as needed
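When an index profile has many fields, the translation to schema.xml can be scripted. The sketch below renders one Solr `<field>` declaration per FAST field. Only the rule that "tokenize=auto" maps to type "text" comes from the slide; the fallback type "string" and the `indexed`/`stored` defaults are assumptions to adjust for your own schema.

```python
from xml.sax.saxutils import quoteattr

# Only "auto" -> "text" is taken from the slide.
TOKENIZE_TO_SOLR_TYPE = {"auto": "text"}
# Assumption: fields without a known tokenize setting become plain strings.
DEFAULT_SOLR_TYPE = "string"


def solr_field(name, tokenize=None, indexed=True, stored=True):
    """Render one schema.xml <field> declaration as a string."""
    solr_type = TOKENIZE_TO_SOLR_TYPE.get(tokenize, DEFAULT_SOLR_TYPE)
    return "<field name=%s type=%s indexed=%s stored=%s/>" % (
        quoteattr(name),
        quoteattr(solr_type),
        quoteattr(str(indexed).lower()),
        quoteattr(str(stored).lower()),
    )
```

Calling `solr_field("title", tokenize="auto")` yields a line that can be pasted into the `<fields>` section of schema.xml and then reviewed by hand.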
  22. Product facets & generic fields
      ● With FAST you often use «generic1», «generic2» etc. to model product facets which may vary between product groups. Front ends need logic to convert.
  23. Product facets & generic fields
      ● With Solr, using dynamic fields, each document can have as many facets as you like.
      ● This makes it easy to, for example, introduce a new «color» facet for cars or a «MegaPixels» facet for digital cameras
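On the feeding side, dynamic fields mean the field name can be derived from the product attribute itself. A minimal sketch, assuming your schema.xml declares `<dynamicField>` patterns such as `*_s` (string), `*_i` (int) and `*_f` (float); these suffixes are an assumption and must match your own schema.

```python
# Assumed suffixes; they must match <dynamicField> patterns in schema.xml.
SUFFIX_BY_TYPE = {str: "_s", int: "_i", float: "_f"}


def to_dynamic_fields(attributes):
    """Map product attributes to dynamic field names based on value type."""
    doc = {}
    for name, value in attributes.items():
        suffix = SUFFIX_BY_TYPE.get(type(value))
        if suffix is None:
            raise TypeError("no dynamic field suffix for %r" % (value,))
        doc[name.lower() + suffix] = value
    return doc
```

A camera with `{"MegaPixels": 12.1}` then gets a `megapixels_f` field without any schema change, and the front end can facet on it directly.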
  24. Composite fields -> DisMax ReqHandler
      ● FAST uses composite fields to search across multiple fields, with weighting defined in Rank Profiles
      ● FAST's composite fields & rank profiles can be modelled as Solr «DisMax» queries
      ● Set suitable defaults in solrconfig.xml using named request handler instances.
      ● In case of many fields & performance issues, use <copyField> to group similarly ranked fields!
      ● Freshness boost, GEO boost etc. handled through Function Queries
  25. Composite fields -> DisMax ReqHandler
      ● Given a FAST composite field / Rank Profile
  26. Composite fields -> DisMax ReqHandler
      ● This Solr query will do the same, configurable per query:
          ● qt=dismax
          ● q=oslo
          ● qf=title^5.0 teaser^1.5 body^0.1
          ● bf=recip(rord(last_modified),1,1000,1000)
      ...
      DisjunctionMaxQuery((teaser:foo^1.5 | title:foo^5.0 | body:foo^0.1)~0.01)
      DisjunctionMaxQuery((teaser:bar^1.5 | title:bar^5.0 | body:bar^0.1)~0.01)
      FunctionQuery(1000.0/(1.0*float(top(rord(last_modified)))
      ...
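From a front end, the DisMax parameters on this slide are just request parameters. A minimal sketch using only the Python standard library; the base URL passed in is an assumption (a local Solr example install), while the parameter names and values are the ones on the slide.

```python
from urllib.parse import urlencode


def dismax_params(user_query):
    """Return the per-query DisMax parameters shown on the slide."""
    return [
        ("qt", "dismax"),
        ("q", user_query),
        ("qf", "title^5.0 teaser^1.5 body^0.1"),
        ("bf", "recip(rord(last_modified),1,1000,1000)"),
    ]


def select_url(base_url, params):
    """Build a URL-encoded select request for the given Solr base URL."""
    return base_url + "/select?" + urlencode(params)
```

In practice most of these values would live as defaults in a named request handler in solrconfig.xml, so the front end only sends `q`.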
  27. Static document boosts
      ● FAST uses the «hwboost» field to add a static quality boost to each document.
      ● In Solr, you have more flexibility:
          ● Add a boost to each document: <doc boost="10.0">
          ● Add a boost to each field: <field name="title" boost="10.0">
          ● Include any numeric document field in a BoostFunction:
            bf=sum(sqrt(popularity)^100.0, staticboost^20.0)
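The index-time boosts from the slide are attributes in the XML update message. A minimal sketch that builds such a message with the standard library; the field names `id` and `title` are assumptions about your schema, and index-time boosts reflect the Solr releases current when this talk was given, so check that your version still supports them.

```python
import xml.etree.ElementTree as ET


def add_message(doc_id, title, doc_boost=None, title_boost=None):
    """Build a Solr <add> message with optional document and field boosts."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    if doc_boost is not None:
        doc.set("boost", str(doc_boost))
    # Assumed field names: "id" as unique key, "title" as a text field.
    id_field = ET.SubElement(doc, "field", name="id")
    id_field.text = doc_id
    title_field = ET.SubElement(doc, "field", name="title")
    if title_boost is not None:
        title_field.set("boost", str(title_boost))
    title_field.text = title
    return ET.tostring(add, encoding="unicode")
```

A value migrated from FAST's «hwboost» could be passed as `doc_boost`, or stored in a numeric field and used query-time through `bf`, which is easier to tune without reindexing.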
  28. Navigator statistics
      ● FAST navigators provide statistics metadata (min/max/avg/sum)
      ● Solution: Use the StatsComponent
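The StatsComponent is switched on per request with `stats=true` and one `stats.field` parameter per numeric field. A minimal sketch of the query string; `rows=0` is an assumption for the case where only the statistics are wanted, and the field names in the usage are hypothetical.

```python
from urllib.parse import urlencode


def stats_query(query, numeric_fields):
    """Build a query string that asks the StatsComponent for field statistics."""
    params = [("q", query), ("rows", "0"), ("stats", "true")]
    # One stats.field parameter per field to compute statistics for.
    params += [("stats.field", field) for field in numeric_fields]
    return urlencode(params)
```

For example, `stats_query("camera", ["price", "weight"])` requests statistics for both fields over the documents matching "camera".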
  29. Navigator auto-buckets
      ● FAST numeric navigators give auto-bucketing based on equal-frequency, equal-width, manual
      ● Solution:
          ● Create a new field which is pre-computed
          ● Example: Document A has price=200.000, add pricerange="150.000 - 1.299.999"
          ● Or use facet queries (expensive)
          ● Or implement auto-bucketing and contribute the patch :-)
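The pre-computed field can be filled in at feeding time. A minimal sketch of manual bucketing; the bucket boundaries are hypothetical, and the label format drops the thousands separators used on the slide.

```python
from bisect import bisect_right

# Hypothetical manual bucket boundaries (inclusive lower bounds, ascending).
PRICE_BOUNDS = [0, 50000, 150000, 1300000]


def price_range(price, bounds=PRICE_BOUNDS):
    """Return the facet label of the bucket that contains the price."""
    i = bisect_right(bounds, price) - 1
    if i < 0:
        raise ValueError("price is below the lowest bucket")
    low = bounds[i]
    if i + 1 == len(bounds):
        return "%d+" % low  # open-ended top bucket
    return "%d - %d" % (low, bounds[i + 1] - 1)
```

The returned label goes into a string field such as `pricerange`, which is then faceted on like any other field. Changing the boundaries requires reindexing, which is the trade-off against facet queries.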
  30. XRANK
      ● FAST has a feature to boost documents satisfying an "XRANK" sub-query with a certain static boost
      ● In Solr, you can solve most XRANK use cases using FunctionQueries
  31. Scope search
      ● FAST offers a field type which holds arbitrary XML
      ● Search in XPath-style: xml:companies:company:and(revenue:>1000, employees:>=100)
      ● I have not found a similar field type in Lucene.
      ● Anyone?
  32. Migrating Connectors
      ● FAST's connectors are many and mature
      ● For simple use cases, consider Solr's DIH:
          ● Supports DB, RSS, Web-services, Local filesystem
      ● Additionally through Lucene Connectors Framework:
          ● EMC Documentum, FileNet, JDBC, LiveLink, Patriarch (Memex), Meridio, SharePoint, RSS
      ● New connectors should be written for LCF - and be submitted back to the community :)
  33. Migrating Web Crawler
      ● FAST's crawler is mature, performant & scalable
      ● Solr has no built-in web crawler
      ● Prepare for a lot of extra work migrating the crawler
      ● Alternatives:
          ● The Apache Nutch crawler (steep learning curve)
          ● Apache Droids
          ● Heritrix + Solr (example in Solr 1.4 book)
          ● OpenPipeline has a (very) simple crawler
  34. Migrating document processing
      ● Solr lacks a sophisticated processing pipeline.
      ● Alternatives:
          ● Solr's UpdateProcessorChain for simple pipelines:
              ● Write a Solr UpdateProcessor (in Java, Jython etc., see SOLR-1725)
          ● OpenPipeline for more advanced requirements:
              ● Check out FindWise's talk
              ● Integrated with Solr
              ● LingPipe NamedEntityExtractor plugin
  35. Document processing examples
      ● Binary documents with metadata
          ● Actual customer request: Enrich library records with PDF content
          ● Use OpenPipeline with Apache Tika processor
          ● Implement Tika as an UpdateRequestProcessor (SOLR-1763)
      ● Custom XML using FAST's XMLMapper
          ● DIH's built-in XPath support
          ● XSLT to Solr input XML
          ● Write a new XMLMapper Update Request Handler?
  36. Multilingual
      ● FAST is state of the art on linguistics
      ● FAST is language aware, e.g. the title field is "analyzed" depending on detected language
      ● Solr is not language aware
          ● Each field type has one and only one language
      ● Most common solution:
          ● One field type per language: text_no, text_en, text_de
          ● Dynamic fields: <dynamicField name="*_en" type="text_en"..../>
          ● Implement language awareness in application layer (feeding + querying)
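Language awareness in the feeding layer can be as simple as choosing the field name from the document's language. A minimal sketch, assuming the language has already been detected upstream and that the schema declares the `*_no`, `*_en` and `*_de` dynamic fields; the supported set, the fallback language and the `language` field are all assumptions.

```python
# Assumed: schema.xml declares dynamic fields *_no, *_en and *_de.
SUPPORTED_LANGUAGES = {"no", "en", "de"}
# Assumed fallback for documents in any other language.
FALLBACK_LANGUAGE = "en"


def localized_doc(doc_id, language, text_fields):
    """Route each text field to the dynamic field for the document's language."""
    lang = language if language in SUPPORTED_LANGUAGES else FALLBACK_LANGUAGE
    # Keep the detected language so the query side can filter on it.
    doc = {"id": doc_id, "language": language}
    for name, value in text_fields.items():
        doc["%s_%s" % (name, lang)] = value
    return doc
```

The query side then needs the mirror image of this logic: it has to know, or guess, which language-specific fields to search.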
  37. Multilingual: advanced
      ● FAST ships with lemmatization for most languages
      ● Solr ships with stemming, which has limitations
      ● Solutions for multilingual needs:
          ● KStem is tighter. Free with
          ● License 3rd party linguistics
          ● Example: BasisTech Rosette Linguistic Platform: lemmatization, POS etc.
  38. Multilingual: very advanced
      ● FAST allows lemmatization by index expansion
      ● This can be useful if your frontend does not know what languages are being queried, as all the word inflections are stored in the index.
      ● There is no solution for this in Solr today.
      ● Workaround: DisMax query spanning all languages: q=eurocon&qf=text_en^2.0 text_no text_de text_it
      ● Downside: This gets ugly and slow with an increasing number of languages
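The `qf` value for the workaround can be generated from the list of indexed languages, so that adding a language is a configuration change rather than a code change. A minimal sketch; the `text_` prefix and the boost of 2.0 follow the slide, while the notion of a "preferred" language (for instance the user's UI language) is an assumption.

```python
def multilingual_qf(languages, preferred=None, boost=2.0, prefix="text_"):
    """Build a DisMax qf value spanning one text field per language."""
    parts = []
    for lang in languages:
        field = prefix + lang
        if lang == preferred:
            # Boost the field for the language we believe the user wants.
            field += "^%s" % boost
        parts.append(field)
    return " ".join(parts)
```

`multilingual_qf(["en", "no", "de", "it"], preferred="en")` reproduces the `qf` string from the slide.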
  39. Migrating Front ends / Query
      ● Using a search middleware with Solr support? Lucky you!
      ● If not, consider introducing one now:
      ● Using FAST Java/.NET APIs?
          ● Choose SolrJ or SolrNET/SolrSharp
      ● Query language differences: &fq= instead of filter()
      ● Solr facets do not require session/state, unlike FAST's
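Because Solr facets are stateless, the front end resends the user's current facet selections as `fq` parameters on every request. A minimal sketch that turns selections into `fq` pairs; the quoting shown (phrase-quoting the value and escaping backslashes and double quotes) is one conservative choice and an assumption about your field types.

```python
def filter_queries(selected):
    """Turn {field: value} facet selections into a list of fq parameters."""
    fqs = []
    for field, value in sorted(selected.items()):
        # Escape backslashes and double quotes, then phrase-quote the value.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        fqs.append(("fq", '%s:"%s"' % (field, escaped)))
    return fqs
```

The resulting pairs can be appended to the other request parameters before URL encoding; each `fq` is cached independently by Solr, which is why one parameter per facet is usually preferable to a single combined filter.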
  40. Result views
      ● FAST uses "result-view" and "search profile" to specify what fields to return.
      ● Migrate FAST's «views» into named RequestHandler configs with all default presets
      ● No need to define fields to return up-front! Use fl=a,b,c...
  41. Operations
      ● Solr has no central admin-server (until "SolrCloud")
      ● For GUI installer, use
      ● Multiple cores: allows smooth schema upgrade etc.
      ● No built-in query reporting, log analysis or monitoring. But have a look at:
  42. Summary
      ● Many migrations are (quite) straight-forward!
      ● Warning flags
          ● Multilingual and advanced linguistics
          ● Heavy use of Document Processing, including Entity Extraction
          ● Scope search
          ● Other enterprise complexities (security, connectors etc.)
      ● Follow a structured process
          ● Quick prototyping
          ● Design spec for each area
      ● Don't forget to analyze logs and measure user satisfaction!
  43. Thank You
      This presentation is licensed under the CC-by-sa license. You must attribute Cominvent with name and link.