Key topics when migrating from FAST to Solr, EuroCon 2010

Presented during Lucene EuroCon 2010 in Prague. This presentation assumes no prior experience with FAST ESP, but some idea of what Solr/Lucene is. It gives you some hints on what to expect when migrating.

Transcript

  • 1. Key topics when migrating from FAST to Solr. By Jan Høydahl, Cominvent AS. Apache Lucene EuroCon, 05/21/10
  • 2. Agenda
    • About Cominvent & Jan Høydahl
    • Quick overview of FAST ESP
    • The migration step by step
    • Pain points
    • Q&A
  • 3. Jan Høydahl: BIO
    • Enterprise search consultant since 2000
    • Background in Telecom, Mobile services & software development
    • Second FAST Global Services engineer
    • Founder of Cominvent AS
    • Lucid Imagination certified instructor & partner
    • FAST Certified instructor
    Logos represent projects I've been involved in, and ™ are © of respective companies
  • 4. Cominvent AS: Consulting
    • Vendor independent search consulting
  • 5. Cominvent AS: Training
    • Certified Solr Training Partner with Lucid Imagination
    • Certified FAST ESP Training Partner
    Photo: fluidpowerzone.com
  • 6. Solr training Oslo June 1-3
  • 7. Assumptions
    • The decision to migrate to Solr has already been made
    • This is not a "sales talk" for any particular technology
    • Basic knowledge of Solr
    • No or limited knowledge of FAST ESP
    • Migration to plain Solr or LucidWorks (LucidWorks Enterprise edition not considered)
  • 8. Introduction to... ...for Solr people
  • 9. Security Connectors
  • 10.
  • 11. FAST ESP architecture. Source: www.microsoft.com
  • 12. Very strong & scalable document processing framework
    [Figure: processing stages labelled Format Conversion, Language Detection, Linguistic Normalization, Entities, Taxonomy, Sentiment, Ontology, Custom Plug-in, Search, Alert]
    Sample text: "PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes."
  • 13. FAST Document Processors (DP)
    • DPs transform documents prior to indexing
    • This is different from Solr's field-centric analysis
    • Examples of stages:
      • Encoding normalization, language identification
      • Text extraction (HTML, PDF, MS Office, etc.)
      • Tokenization, lemmatization, entity extraction
    • DPs are chained in pipelines
    • ESP ships with lots of useful DPs and pipelines
    • Written in Python, very easy to script new ones
    [Figure labels: Taxonomy, Sentiment, Ontology, Custom Plug-in]
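To make "DPs are chained in pipelines" concrete, here is a minimal sketch of the idea in plain Python. It is not the FAST ESP processor API: the stage functions, the `run_pipeline` helper, and the toy language guess are all hypothetical names and logic chosen for illustration.

```python
from typing import Callable, Dict, List

Document = Dict[str, str]
Stage = Callable[[Document], Document]


def normalize_whitespace(doc: Document) -> Document:
    """Collapse runs of whitespace in the body field."""
    out = dict(doc)
    out["body"] = " ".join(doc.get("body", "").split())
    return out


def detect_language(doc: Document) -> Document:
    """Toy language guess; a real stage would use a proper detector."""
    out = dict(doc)
    words = set(doc.get("body", "").lower().split())
    # Hypothetical rule: a few Norwegian function words mark the text as "no".
    out["language"] = "no" if words & {"og", "ikke", "det"} else "en"
    return out


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Pass the document through each stage in order, before indexing."""
    for stage in stages:
        doc = stage(doc)
    return doc


pipeline = [normalize_whitespace, detect_language]
result = run_pipeline({"id": "1", "body": "Dette  er ikke   engelsk"}, pipeline)
```

Each stage sees the whole document, which is the document-centric model the slide contrasts with Solr's per-field analysis.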
  • 14. Terminology
    Lucene/Solr         FAST
    Replica             Search row
    Shard               Column
    Facet               Navigator
    Spellcheck          Did you mean
    Update processor    Document processor
    Request Handler     Query Transformer (QT)
    Response Writer     Result Processor (RP)/TWM
  • 15. Terminology
    Lucene/Solr                              FAST
    Schema                                   Index profile
    Index segment                            Index partition
    Lucene IndexWriter/IndexReader           indexer/fsearch (RTS)
    ~Multi core                              ~Multi cluster
    (Documents receiving same processing)    Collection
  • 16. Important differences
    Lucene/Solr                                      FAST
    Most features query-time                         Most features index-time
    Field centric analysis                           Document centric analysis
    One language per field                           Multi lingual fields
    One Update handler per input type (XML, CSV)     Format conversion in document pipeline
    Slim disk & memory footprint                     Quite fat disk & memory footprint
    One Java Web app                                 15-20 processes
  • 17. Solr Architecture. Thanks to Christian Moen/ATILIKA for graphics
  • 18. The migration...
  • 19. Steps of the migration
    • Review current features & architecture
      • Keep all features? Add new?
    • Install Solr and do a quick iteration (1-2 days):
      • Draft schema.xml & solrconfig.xml
      • Dump & index some real data
      • Play around with queries – Solritas is nice here
    • Design spec covering all migration areas:
      • Schema, Content, Feeding & Analysis
      • Frontends, Querying & API
      • Admin & Operational
    • Implement :)
  • 20. Spreadsheet for planning the schema
  • 21. Migrating index-profile -> Solr schema
    • ESP index profile -> Solr schema.xml
    • FAST fields example:
    • Solr equivalent:
    • Example: A field with "tokenize=auto" in FAST → type="text"
    • Create new <fieldType>s as needed
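The one mapping the slide states (a FAST field with "tokenize=auto" becomes type="text") can be turned into a small generator for schema.xml field entries. This is a sketch under assumptions: the input is a plain Python description of each field rather than a parsed index profile, the fallback to type="string" is my own choice, and `solr_field` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Only the "auto" -> "text" entry comes from the slide; extend as needed.
TOKENIZE_TO_SOLR_TYPE = {"auto": "text"}


def solr_field(name, tokenize=None, indexed=True, stored=True):
    """Return a schema.xml <field> element for one migrated FAST field."""
    # Assumption: fields without a known tokenize setting become "string".
    solr_type = TOKENIZE_TO_SOLR_TYPE.get(tokenize, "string")
    field = ET.Element("field", {
        "name": name,
        "type": solr_type,
        "indexed": str(indexed).lower(),
        "stored": str(stored).lower(),
    })
    return ET.tostring(field, encoding="unicode")


title_xml = solr_field("title", tokenize="auto")
sku_xml = solr_field("sku")
```

Driving this from the planning spreadsheet on slide 20 would keep the spreadsheet and the schema in step, though that wiring is not shown here.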
  • 22. Product facets & generic fields
    • With FAST you often use «generic1», «generic2» etc. to model product facets which may vary between product groups. Front ends need logic to convert.
  • 23. Product facets & generic fields
    • With Solr, using dynamic fields, each document can have as many facets as you like.
    • Makes it easy to e.g. introduce a new «color» facet for cars or a «MegaPixels» facet for digital cameras
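A sketch of how per-product facets could be fed as dynamic fields. It assumes the schema declares a dynamic field pattern such as `*_facet`; that suffix, the field names, and the `product_doc` helper are hypothetical.

```python
import xml.etree.ElementTree as ET


def product_doc(doc_id, name, facets):
    """Build a Solr <add> message where each facet becomes a dynamic field."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for field_name, value in [("id", doc_id), ("name", name)]:
        ET.SubElement(doc, "field", {"name": field_name}).text = value
    # Assumption: schema.xml has <dynamicField name="*_facet" type="string" .../>
    for facet, value in sorted(facets.items()):
        ET.SubElement(doc, "field", {"name": facet + "_facet"}).text = str(value)
    return ET.tostring(add, encoding="unicode")


camera_xml = product_doc("cam-1", "Example camera", {"megapixels": 12})
car_xml = product_doc("car-1", "Example car", {"color": "red", "doors": 5})
```

Cameras and cars carry different facet fields without any schema change, so the front end no longer has to translate «generic1»-style names.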
  • 24. Composite fields -> DisMax ReqHandler
    • FAST uses composite fields to search across multiple fields, with weighting defined in Rank Profiles
    • FAST's composite fields & rank profiles can be modelled as Solr «DisMax» queries
    • Set suitable defaults in solrconfig.xml using named request handler instances.
    • In case of many fields & performance issues, use <copyField> to group similarly ranked fields!
    • Freshness boost, GEO boost etc. handled through Function Queries
  • 25. Composite fields -> DisMax ReqHandler
    • Given a FAST composite field / Rank Profile
  • 26. Composite fields -> DisMax ReqHandler
    • This Solr query will do the same, configurable per query:
      • qt=dismax
      • q=oslo
      • qf=title^5.0 teaser^1.5 body^0.1
      • bf=recip(rord(last_modified),1,1000,1000)
    ...
    DisjunctionMaxQuery((teaser:foo^1.5 | title:foo^5.0 | body:foo^0.1)~0.01)
    DisjunctionMaxQuery((teaser:bar^1.5 | title:bar^5.0 | body:bar^0.1)~0.01)
    FunctionQuery(1000.0/(1.0*float(top(rord(last_modified)))
    ...
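The parameters on slide 26 can be assembled into a request URL with the Python standard library alone. The host, port, and `/solr/select` path below are assumptions for a default local install, and `dismax_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, parse_qs


def dismax_url(base_url, user_query):
    """Build a DisMax request using the field weights and boost from the slide."""
    params = [
        ("qt", "dismax"),
        ("q", user_query),
        # Per-field weights, playing the role of a FAST rank profile.
        ("qf", "title^5.0 teaser^1.5 body^0.1"),
        # Freshness boost via a function query on last_modified.
        ("bf", "recip(rord(last_modified),1,1000,1000)"),
    ]
    return base_url + "?" + urlencode(params)


url = dismax_url("http://localhost:8983/solr/select", "oslo")
```

In practice `qf` and `bf` would more likely live as defaults in a named request handler, as slide 24 suggests, leaving only `q` to the client.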
  • 27. Static document boosts
    • FAST uses the «hwboost» field to add a static quality boost to each document.
    • In Solr, you have more flexibility:
      • Add a boost to each document: <doc boost="10.0">
      • Add a boost to each field: <field name="title" boost="10.0">
      • Include any numeric document field in a BoostFunction: bf=sum(sqrt(popularity)^100.0, staticboost^20.0)
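A sketch of emitting the index-time boosts from the slide while feeding. It targets the update XML of the Solr versions current at the time of the talk; `boosted_doc` and the sample field values are hypothetical.

```python
import xml.etree.ElementTree as ET


def boosted_doc(fields, doc_boost=None, field_boosts=None):
    """Build an <add> message with optional document and per-field boosts."""
    field_boosts = field_boosts or {}
    doc_attrs = {} if doc_boost is None else {"boost": str(doc_boost)}
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc", doc_attrs)
    for name, value in fields.items():
        attrs = {"name": name}
        if name in field_boosts:
            attrs["boost"] = str(field_boosts[name])
        ET.SubElement(doc, "field", attrs).text = value
    return ET.tostring(add, encoding="unicode")


boosted_xml = boosted_doc(
    {"id": "42", "title": "Oslo travel guide"},
    doc_boost=10.0,
    field_boosts={"title": 10.0},
)
```

A migrated «hwboost» value could be passed as `doc_boost`, or stored as an ordinary numeric field and applied at query time through `bf`, which avoids re-indexing when the weighting changes.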
  • 28. Navigator statistics
    • FAST navigators provide statistics metadata (min/max/avg/sum)
    • Solution: Use the StatsComponent
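A sketch of the request parameters that switch on the StatsComponent for one or more numeric fields. The field names `price` and `weight` and the `stats_params` helper are hypothetical; `stats` and `stats.field` are the component's parameters.

```python
from urllib.parse import urlencode, parse_qs


def stats_params(query, stats_fields):
    """Return a query string asking Solr for statistics on the given fields."""
    params = [("q", query), ("rows", "0"), ("stats", "true")]
    # One stats.field parameter per numeric field of interest.
    params += [("stats.field", name) for name in stats_fields]
    return urlencode(params)


stats_qs = stats_params("camera", ["price", "weight"])
```

Setting `rows=0` here is a design choice for when only the statistics are wanted and no result documents.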
  • 29. Navigator auto-buckets
    • FAST numeric navigators give auto-bucketing based on equal-frequency, equal-width, manual
    • Solution:
      • Create a new field which is pre-computed
        • Example: Document A has price=200.000, add pricerange="150.000 – 1.299.999"
      • Or use facet queries (expensive)
      • Or implement auto-bucketing and contribute the patch :-)
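The pre-computed field approach can be sketched as a small function run at feeding time. The bucket boundaries below are hypothetical, chosen so that the slide's example (price 200.000 falling in the 150.000 to 1.299.999 range) comes out the same; labels are written without thousands separators.

```python
import bisect

# Hypothetical manual bucket boundaries; each is the start of a new bucket.
BOUNDARIES = [50_000, 150_000, 1_300_000]


def price_range(price):
    """Return the label to store in a pre-computed pricerange field."""
    i = bisect.bisect_right(BOUNDARIES, price)
    if i == 0:
        return "0 - {}".format(BOUNDARIES[0] - 1)
    if i == len(BOUNDARIES):
        return "{}+".format(BOUNDARIES[-1])
    return "{} - {}".format(BOUNDARIES[i - 1], BOUNDARIES[i] - 1)


label = price_range(200_000)
```

This corresponds to FAST's "manual" bucketing only; equal-frequency buckets would need knowledge of the value distribution and are not covered by this sketch.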
  • 30. XRANK
    • FAST has a feature to boost documents satisfying an "XRANK" sub-query with a certain static boost
    • In Solr, you can solve most XRANK use cases using FunctionQueries
  • 31. Scope search
    • FAST offers a field type which holds arbitrary XML
    • Search in XPath-style: xml:companies:company:and(revenue:>1000, employees:>=100)
    • Have not found similar field type in Lucene.
    • Anyone?
  • 32. Migrating Connectors
    • FAST's connectors are many and mature
    • For simple use cases, consider Solr's DIH:
      • Supports DB, RSS, Web-services, Local filesystem
    • Additionally through Lucene Connectors Framework:
      • EMC Documentum, FileNet, JDBC, LiveLink, Patriarch (Memex), Meridio, SharePoint, RSS
    • New connectors should be written for LCF, and be submitted back to the community :)
  • 33. Migrating Web Crawler
    • FAST's crawler is mature, performing & scalable
    • Solr has no built-in web crawler
    • Prepare for a lot of extra work migrating the crawler
    • Alternatives:
      • The Apache Nutch crawler (steep learning curve)
      • Apache Droids
      • Heritrix + Solr (example in Solr 1.4 book)
      • OpenPipeline has a (very) simple crawler
  • 34. Migrating document processing
    • Solr lacks a sophisticated processing pipeline.
    • Alternatives:
      • Solr's UpdateProcessorChain for simple pipelines:
        • Write a Solr UpdateProcessor (in Java, Jython etc., see SOLR-1725)
      • OpenPipeline for more advanced requirements:
        • Check out FindWise's talk
        • Integrated with Solr
        • LingPipe NamedEntityExtractor plugin
  • 35. Document processing examples
    • Binary documents with metadata
      • Actual customer request: Enrich library records with PDF content
      • Use OpenPipeline with Apache Tika processor
      • Implement Tika as an UpdateRequestProcessor (SOLR-1763)
    • Custom XML using FAST's XMLMapper
      • DIH's built-in XPath support
      • XSLT to Solr input XML
      • Write a new XMLMapper Update Request Handler?
  • 36. Multi lingual
    • FAST is state of the art on linguistics
    • FAST is language aware, e.g. the title field is "analyzed" depending on detected language
    • Solr is not language aware
      • Each field type has one and only one language
    • Most common solution:
      • One field type per language: text_no, text_en, text_de
      • Dynamic fields: <dynamicField name="*_en" type="text_en"..../>
      • Implement language awareness in application layer (feeding + querying)
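A sketch of what "language awareness in the application layer" might look like on the feeding side: text fields are renamed with a language suffix so they match dynamic fields such as `*_en`. The supported language set, the fallback, the list of text fields, and `localized_fields` are all hypothetical.

```python
# Assumption: schema.xml has dynamic fields *_no, *_en and *_de.
SUPPORTED_LANGUAGES = {"no", "en", "de"}


def localized_fields(doc, language, text_fields=("title", "body"), fallback="en"):
    """Rename text fields to <name>_<lang> so Solr picks the right analysis."""
    lang = language if language in SUPPORTED_LANGUAGES else fallback
    out = {}
    for name, value in doc.items():
        target = name + "_" + lang if name in text_fields else name
        out[target] = value
    # Keep the language on the document so the query side can filter on it.
    out["language"] = lang
    return out


norwegian = localized_fields({"id": "1", "title": "Hei verden"}, "no")
```

The query side then needs the matching step: pick `title_no` or `title_en` in `qf` based on the user's language.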
  • 37. Multi lingual – advanced
    • FAST ships with lemmatization for most languages
    • Solr ships with stemming, which has limitations
    • Solutions for multi lingual needs:
      • KStem is tighter. Free with [logo on slide]
      • License 3rd party linguistics
        • Example: BasisTech Rosette Linguistic Platform (lemmatization, POS etc.)
  • 38. Multi lingual – very advanced
    • FAST allows lemmatization by index expansion
      • This can be useful if your frontend does not know what languages are being queried, as all the word inflections are stored in the index.
    • There is no solution for this in Solr today.
    • Workaround: DisMax query spanning all languages: q=eurocon&qf=text_en^2.0 text_no text_de text_it
    • Downside: This gets ugly and slow with an increasing number of languages
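The workaround's `qf` value can be generated from the list of indexed languages. A minimal sketch; `multilingual_qf` is a hypothetical helper and the `text_<lang>` field naming follows the slide.

```python
def multilingual_qf(languages, preferred=None, boost=2.0):
    """Build a DisMax qf value spanning one text field per language."""
    parts = []
    for lang in languages:
        field = "text_" + lang
        # Only the preferred language gets an explicit boost.
        parts.append("{}^{}".format(field, boost) if lang == preferred else field)
    return " ".join(parts)


qf = multilingual_qf(["en", "no", "de", "it"], preferred="en")
```

The string grows by one field per language, which is exactly the downside the slide warns about.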
  • 39. Migrating Front ends / Query
    • Using a search middleware with Solr support? Lucky you!
      • If not, consider introducing one now
    • Using FAST Java/.NET APIs?
      • Choose SolrJ or SolrNET/SolrSharp
    • Query language differences: &fq= instead of filter()
    • Solr facets do not require session/state, as FAST's do
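Where a FAST front end wrapped restrictions in filter(), a Solr client sends them as separate `fq` parameters. A sketch of building such a request; the field names, the quoting helper, and `filtered_query` are hypothetical.

```python
from urllib.parse import urlencode, parse_qs


def quote_term(value):
    """Quote a filter value, escaping backslashes and double quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def filtered_query(user_query, filters):
    """Return a query string with one fq parameter per (field, value) filter."""
    params = [("q", user_query)]
    params += [("fq", field + ":" + quote_term(value)) for field, value in filters]
    return urlencode(params)


fq_qs = filtered_query("oslo", [("category", "cars"), ("color", "red")])
```

Because the selected facet values travel in the request itself, drilling down needs no server-side session, which is the point of the last bullet on the slide.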
  • 40. Result views
    • FAST uses "result-view" and "search profile" to specify what fields to return.
    • Migrate FAST's «views» into named RequestHandler configs with all default presets
    • No need to define fields to return up-front! Use fl=a,b,c...
  • 41. Operations
    • Solr has no central admin-server (until "SolrCloud")
    • For GUI installer, use [logo on slide]
    • Multiple cores – allows smooth schema upgrade etc.
    • No built-in query reporting, log analysis or monitoring. But have a look at: [logos on slide]
  • 42. Summary
    • Many migrations are (quite) straight-forward!
    • Warning flags
      • Multi-lingual and advanced linguistics
      • Heavy use of Document Processing, including Entity Extraction
      • Scope search
      • Other enterprise complexities (security, connectors etc.)
    • Follow a structured process
      • Quick prototyping
      • Design spec for each area
    • Don't forget to analyze logs and measure user satisfaction!
  • 43. Thank You
    www.cominvent.com, jh@cominvent.com, www.twitter.com/cominvent, linkedin.com/in/janhoy
    This presentation is licensed under a CC-by-sa license. You must attribute Cominvent with name and link.