Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Presentation held at Oslo Enterprise MeetUp in May, pitched towards an audience who come from the FAST ESP side and have some existing FAST knowledge. Check out one of my other presentations if you're most familiar with Lucene/Solr.

    Presentation Transcript

    • cominvent as Enterprise Search Specialists Migrating FAST to Solr By Jan Høydahl Oslo Enterprise Search MeetUp May 2010 cominvent as
    • Jan Høydahl
      ● IT architect - search, telecom, mobile
      ● Helped build FAST's Global Services as first engineer
      ● Founder of Cominvent AS
      ● Search consultant for 10 years
    • Consulting
      – Cominvent delivers independent search consulting
      – Focus on Apache Lucene/Solr & Microsoft FAST ESP
      – Idea -> architecture -> implementation
    • Commercial Support (Solr/Lucene)
      – When community & mailing list support is not enough
      – Paid support agreement for Apache Solr/Lucene
      – In cooperation with Lucid Imagination
      – Read more: http://www.cominvent.com/support/
    • Training
      – Cominvent AS delivers training, public and on-site
      – Certified Solr Training Partner for Lucid Imagination
      – Certified FAST ESP Training Partner
      – Read more: http://www.cominvent.com/training/
      – Photo: fluidpowerzone.com
    • Solr course
    • FAST & Solr are very similar...
    • Areas of usage
    • Common features
    • Introduction to... ...for FAST people
    • Apache Solr - characteristics: Search server (Commercially friendly)
    • Apache Solr - characteristics: Modular Community Contributions & patches Light weight
    • Solr-user community growth
      [Chart: Solr-user growth. Y axis: messages (0 to 1600). X axis: month, January 2006 to February 2010.]
    • Lucene/Solr deployments
      – More: http://wiki.apache.org/solr/PublicServers
      – Thanks to Lucid Imagination for logo collection
    • XML/HTTP
    • Solr Architecture
    • The Apache Software Foundation
    • Other ASF Lucene sub-projects
      – Lucene Java library
      – Rich document extraction
      – Crawling web pages
      – Machine learning
        • Classification/clustering
        • Collaborative filtering...
    • Solr in media & newspapers
      – News search. Also exposes API
      – Danish news search
      – Swedish news search
      – Swedish news search
      – Faceted search through classifieds
      – Eastern European classifieds
    • Introduction to... ...for Solr people
    • FAST ESP – characteristics & key strengths: Security, Connectors
    • FAST ESP – characteristics & key strengths
      – Very strong document processing framework
      – Diagram labels (pipeline stages): Format Conversion, Language Detection, Linguistic Normalization, Entities, Taxonomy, Sentiment, Ontology, Custom Plug-in
      – Diagram labels (other): Search, Alert
      – Sample document shown in the diagram:
        PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes. The Wimbledon and U.S. Open champion, seeded second, breezed past the German on a blustery center court to become the first seed to advance at Roland Garros. "I love being here, I love the French Open and more than anything I'd love to do well here," the American said. A first round loser last year, Williams is hoping to progress beyond the quarter-finals for the first time in her career.
    • FAST ESP architecture
    • The migration...
    • Other successful FAST-Solr migrations
      – Human Rights search
        • hurisearch.org (blog)
      – FINN katalog (former Sesam)
        • katalog.finn.no (announce)
      – Mocality – African business search
        • mocality.co.ke (linkedin)
      – International library search
        • Large multi-lingual index
      – Norwegian media house
        • Multiple newspapers
    • Migration objectives
      – Possible objectives include:
        • Lower maintenance cost
        • Deeper in-house competency
        • Less dependent on external consultants
        • Ownership and visibility of source code
        • Shorter time to market for new features
        • Bugs fixed faster – or even fix ourselves
        • Larger community, mailing lists that work!
        • More choice in external consultants
        • Contribute back to Open Source
        • Lower HW footprint
    • Migration steps
      – Knowledge gathering & Training
      – Review current features & arch
        • Want to keep all features? Add new?
      – Migration areas:
        • Index profile
        • Content
        • Feeding
        • Document Processing
        • Querying
        • Search middleware?
        • Admin & Operational
      – What to do in Application space vs Search space?
    • Feature comparison ESP – Solr (similarities)
      Feature | ESP | Solr
      Full-text, boolean, range search, sorting, sub-second, facets, did-you-mean, synonyms, faceting | Yes | Yes
      Scaling for QPS | Add rows | Add rows
      Scaling for document volume | Add columns | Add shards
      Synonyms | Index/query side | Index/query side
      GEO search | Yes | Yes (1.5)
      Boolean query language | Yes (FQL) | Yes (Lucene or (e)DisMax)
      APIs | HTTP, Java, .NET, C++, PHP | HTTP, Java, .NET, Ruby, Python, PHP, Perl, JS
    • Feature comparison ESP – Solr (differences)
      Feature | ESP | Solr
      Admin server | Yes | No (coming 1.5)
      Processes | Many (C++, Java, Python) | One WAR in Java app-server, 100% Java
      Navigators / Facets | Index-time | Query-time
      Did-you-mean | Dictionary based | Dictionary or index based
      Feeding | API only | HTTP POST or API
      Document processing | Pipeline (py) | Simple pipeline (Java, JS, Groovy, Jython, JRuby..)
      Multi field querying | Composite fields | DisMax handler
    • Feature comparison ESP – Solr (differences)
      Feature | ESP | Solr
      Relevancy tuning | Rank profiles, term boosting | Dynamic function queries and boost functions
      XRANK | XRANK operator | Function Queries
      Freshness boost | Freshness in rank profile | Function Queries
      Boost GEO distance | Rank profile and special | Function Queries
      Major schema or software updates | Cold update, use stage environment | Stage new content into new Solr core
      Pluggability | Docprocs, QT/RP (limited), clients | Everything :) Request Handlers, Query Parsers, Docprocs, Rank, Spell, tokenizer++
    • Feature comparison ESP – Solr (differences)
      Feature | ESP | Solr
      Lemmatization | Can be licensed for many languages | Can be licensed from 3rd party
      Query syntax | and(a:foo, b:bar) | a:foo AND b:bar
      Query syntax | i:range(0, 100) | i:[0 TO 100]
      Query syntax | d:range(2000-01-01T00:00:00, 2010-03-03T12:00:00) | d:[2000-01-01T00:00:00Z TO NOW]
      Query params | query= | q=
      Query params | offset= | start=
      Query params | hits= | rows=
      Query params | spell=1 | spellcheck=true
      What fields to return | view=viewname | fl=title,price,body...
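The "Query params" rows can be sketched as a small translation helper. This is a minimal illustration, not part of the original deck: the helper name `esp_to_solr_params` is hypothetical, it covers only the four parameter names listed in the table, and it does not translate the query language itself (FQL to Lucene syntax).

```ruby
require 'uri'

# Parameter names taken from the "Query params" rows of the table.
ESP_TO_SOLR = {
  'query'  => 'q',
  'offset' => 'start',
  'hits'   => 'rows'
}.freeze

# Hypothetical helper: translate a hash of ESP-style request parameters
# into Solr-style ones. Unknown keys are passed through unchanged.
def esp_to_solr_params(esp_params)
  esp_params.each_with_object({}) do |(key, value), solr|
    key = key.to_s
    if key == 'spell'
      # ESP uses spell=1, Solr uses spellcheck=true
      solr['spellcheck'] = (value.to_s == '1').to_s
    else
      solr[ESP_TO_SOLR.fetch(key, key)] = value.to_s
    end
  end
end

solr_params = esp_to_solr_params('query' => 'car', 'offset' => 10, 'hits' => 20, 'spell' => 1)
query_string = URI.encode_www_form(solr_params)
puts query_string
```

A mapping like this is only a starting point; ESP views and FQL operators still need to be migrated by hand.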
    • Feature comparison ESP – Solr (differences)
      Feature | ESP | Solr
      Search XML hierarchy | Yes, scope search | No
      Reports | Built in analytics | Use 3rd party log analysis such as Splunk.com
    • Your existing FAST system - overview
      – Your web-app
      – Search middleware?
      – Graphics diagram: www.microsoft.com
    • Migrating index profile
      – ESP index profile -> Solr schema.xml
      – Set up field types, use defaults or create your own
      – Set up the static fields. ESP:
      – Solr equivalent:
      – No need for generic*, use dynamic fields:
    • Migrating index profile
      – Composite fields?
        • Solr can use <copyField> to copy multiple fields into one, e.g. as we did to map many attributes into one field
        • However, to rank with a different boost for each field, Solr does not need a composite field. Use the DisMax query handler instead. Very powerful!
      – No need to edit the schema to add new fields. Using dynamic fields, it is easy to e.g. introduce a color facet for cars or an Mpixels facet for digital cameras
    • DisMax query example
      – This Solr query can replace use of a composite field
        • qt=dismax
        • q=oslo
        • qf=title^0.7 highpriorityfields^1.5 mediumpriorityfields^0.6 lowpriorityfields^0.2 recallfields^0.0 body^0.0
        • bf=recip(rord(creationDate),1,1000,1000)
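As a rough sketch, the DisMax parameters above can be assembled into a request URL with Ruby's standard library. The field names and boosts come from the slide; the host, port and `/solr/select` path are assumptions borrowed from the other examples in this deck.

```ruby
require 'uri'

# Field boosts as listed on the slide.
boosts = {
  'title'                => 0.7,
  'highpriorityfields'   => 1.5,
  'mediumpriorityfields' => 0.6,
  'lowpriorityfields'    => 0.2,
  'recallfields'         => 0.0,
  'body'                 => 0.0
}

params = {
  'qt' => 'dismax',
  'q'  => 'oslo',
  # qf is a space-separated list of field^boost pairs
  'qf' => boosts.map { |field, boost| "#{field}^#{boost}" }.join(' '),
  # boost function from the slide, favouring recently created documents
  'bf' => 'recip(rord(creationDate),1,1000,1000)'
}

# Host, port and path are assumptions for a local test instance.
url = URI::HTTP.build(host: 'localhost', port: 8080, path: '/solr/select',
                      query: URI.encode_www_form(params))
puts url
```

Keeping the boosts in a hash makes it easy to tune them per request without touching the schema, which is the point of replacing composite fields with DisMax.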
    • Migrating content
      – If using FAST ContentAPI to push programmatically
        • Use Solr's clients (Java, .NET, Ruby, Python, PHP...)
      – If feeding FastXML using FileTraverser
        • Feed as Solr XML using HTTP POST or a POST client
      – If you feed custom XML with XMLMapper
        • Have a look at DIH's import and mapping features
    • Push Feeding example
      – Feed XML using HTTP POST:
        curl "http://localhost:8080/solr/update?commit=true" -H "Content-Type: text/xml" --data-binary @mydoc.xml
      – Ruby example:
        > gem sources -a http://gemcutter.org
        > sudo gem install rsolr

        require 'rsolr'
        solr = RSolr.connect :url=>'http://localhost:8080/solr'
        documents = [{:id=>1, :price=>1.00}, {:id=>2, :price=>10.50}]
        solr.add documents
        solr.commit
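For the curl variant, the posted file has to be in Solr's XML update format (`<add><doc><field name="...">`). A minimal sketch of producing that XML from Ruby hashes follows, using no gems; the helper names `xml_escape` and `solr_add_xml` and the sample documents are made up for illustration.

```ruby
# Escape the characters that are special in XML text and attribute values.
def xml_escape(value)
  value.to_s.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;').gsub('"', '&quot;')
end

# Hypothetical helper: turn an array of hashes into a Solr <add> message.
# Array values become repeated <field> elements (multi-valued fields).
def solr_add_xml(documents)
  docs = documents.map do |doc|
    fields = doc.flat_map do |name, value|
      Array(value).map { |v| %(<field name="#{xml_escape(name)}">#{xml_escape(v)}</field>) }
    end
    "<doc>#{fields.join}</doc>"
  end
  "<add>#{docs.join}</add>"
end

xml = solr_add_xml([{ id: 1, price: 1.00 }, { id: 2, title: 'Fish & chips' }])
puts xml
```

The resulting string could be saved as `mydoc.xml` and posted with the curl command above. In practice a client library such as rsolr does this serialization for you.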
    • Pull: DataImportHandler (DIH)
    • Querying examples
      – http://localhost:8080/solr/select?q=car&fl=id,title
      – Ruby
        res = solr.select :q=>'roses', :fq=>['red','white']
        res['response']['docs'].each do |doc|
          puts doc['title']
        end
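Without a client library, the same result structure can be read from Solr's JSON response writer (`wt=json`). The sketch below parses a hand-written sample body rather than a live response; the documents in it are invented for illustration, and only the `response`/`numFound`/`docs` layout is taken from Solr.

```ruby
require 'json'

# Hand-written sample of what /solr/select?q=car&fl=id,title&wt=json
# might return. The documents themselves are made up.
sample_body = <<~JSON
  {"responseHeader": {"status": 0, "QTime": 1},
   "response": {"numFound": 2, "start": 0, "docs": [
     {"id": "1", "title": "Red car"},
     {"id": "2", "title": "Blue car"}]}}
JSON

result = JSON.parse(sample_body)
titles = result['response']['docs'].map { |doc| doc['title'] }
puts "#{result['response']['numFound']} hits: #{titles.join(', ')}"
```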
    • Migrating document processing
      – Solr lacks a sophisticated pipeline with entity extraction etc. Alternatives:
        • Do extraction in Application space (Ruby)
        • Write own stage in Solr pipeline for simple cases
        • Integrate to do more advanced stuff
      – Matchers/extractors
        • LingPipe NamedEntityExtractor inside of OpenPipeline
      – Synonyms:
        • Use Solr's synonym handling index/query side
      – Custom stages:
        • Write a Solr UpdateProcessor (in Java, Jython etc)
      – Got a LOT of custom FAST docproc stages?
        • Have a look at SESAT's PY ProcServer for Solr (GPL)
    • Migrating linguistics (lemmatization)
      – Solr ships with Stemming instead of Lemmatization
      – Stemming has limitations
        • Biler, bilen, bilene -> bil BUT
        • Bøker, bøkene -> bøk; boka, bok -> bok
      – Kstem better. Free with LucidWorks for Solr
      – If you need singular/plural handling only
        • Free dictionaries? Check lucene-hunspell
      – Lemmatization can be licensed from 3rd party such as Basistech, who also has language identification & entity extraction
      – Language identification also from Sematext
    • Basistech Rosette for Lucene
      – High-end linguistics capabilities for 19 languages
      – Language Identification
      – Segmentation and tokenization
      – Lemmatization
      – Noun decompounding
      – Part-of-speech tagging
      – Entity extraction
      – Easily integrated with Lucene/Solr
      – More: http://www.basistech.com/lucene/
    • Migrating search middleware
      – Using FAST Unity?
        • Consider migrating middleware logic such as external source querying and federation to SESAT (AGPL)
      – Using Comperio Front?
        • Ask Comperio for Solr engine support
        • Or migrate custom Q&R formats
      – Or is plain Solr enough?
        • Solr has built-in support for shards
        • A shard query will query multiple shards and merge the results into one
        • Add custom processing as Query Components in Solr
        • Check contrib & patches!
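A shard query is an ordinary select request with a `shards` parameter listing the cores to fan out to. The sketch below only builds such a URL; the host names `solr1` and `solr2` are hypothetical, and the port and path are assumptions matching the other examples.

```ruby
require 'uri'

# Hypothetical shard locations, written as host:port/path without a scheme,
# which is the form Solr's shards parameter expects.
shard_hosts = ['solr1:8080/solr', 'solr2:8080/solr']

params = {
  'q'      => 'car',
  'shards' => shard_hosts.join(',')
}

# Any one node can receive the request; it queries all listed shards
# and merges the results into a single response.
url = "http://solr1:8080/solr/select?#{URI.encode_www_form(params)}"
puts url
```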
    • Migrating Front ends
      – Using a middleware with Solr support? Lucky you!
      – If not, consider introducing one now. Look at:
      – If you decide to migrate from FAST Java/.NET APIs
        • Choose SolrJ or SolrNET
        • Query language differences. &fq= instead of filter()
        • Solr facets do not require sessions/state as FAST's do
      – Migrate FAST's «views» into named ReqHandler configs
      – Multi lingual: Need to handle title_no, title_en etc... :(
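Two of the points above, `fq` filters and per-language fields such as `title_no` and `title_en`, can be sketched in a small front-end helper. The helper names, the list of supported languages and the fallback language are assumptions for illustration, and the user's text is inserted without escaping Lucene query syntax.

```ruby
require 'uri'

# Hypothetical helper: pick the language-specific variant of a field,
# falling back to a default language when there is no dedicated field.
def localized_field(base, lang, supported: %w[no en], fallback: 'en')
  suffix = supported.include?(lang) ? lang : fallback
  "#{base}_#{suffix}"
end

# Hypothetical helper: build Solr parameters as key/value pairs so that
# fq can be repeated, one pair per filter (replacing FAST's filter()).
def search_params(text, lang, filters = [])
  field = localized_field('title', lang)
  [['q', "#{field}:(#{text})"], ['fl', "id,#{field}"]] + filters.map { |f| ['fq', f] }
end

pairs = search_params('bil', 'no', ['color:red', 'price:[0 TO 100]'])
puts URI.encode_www_form(pairs)
```

Because each `fq` is sent as its own parameter, Solr can cache the filters independently of the main query.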
    • Migrating Web Crawler
      – Solr has no built-in web crawler
        • Instead you can choose from several integrations
      – The Apache Nutch crawler
        • Proven with hundreds of millions of pages
        • http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
      – Apache Droids
        • Still in the incubator, but aims at becoming a full crawler
        • http://incubator.apache.org/droids/
      – Heritrix + Solr (example in Solr 1.4 book)
      – OpenPipeline has a (very) simple crawler
      – Lucene Connectors Framework
        • Preparing crawler support
    • Migrating Connectors
      – Solr handles these sources internally through DIH:
        • Database, RSS, Web-services, Local filesystem
      – Additionally through Lucene Connectors Framework:
        • EMC Documentum, FileNet, JDBC, LiveLink, Patriarch (Memex), Meridio, SharePoint, RSS
        • New connectors should be written for LCF
      – Another option:
        • SharePoint, IMAP, Documentum, Vignette, Filesystem
    • Operations
      – Solr has no admin-server (coming in 1.5)
      – Possible to run multiple Tomcat on same server
      – Multiple cores in same Tomcat – easier migration
      – No built-in query reports, use 3rd party tools
      – No built-in monitoring, have a look at
      – Log analysis? Check out
    • More info
    • Thank You
      – www.cominvent.com
      – jh@cominvent.com
      – www.twitter.com/cominvent
      – linkedin.com/in/janhoy
      – This presentation is licensed under a CC-by-sa license. You must attribute Cominvent with name and link