Migration from FAST ESP to Lucene Solr - Apache Lucene Eurocon Barcelona 2011


I outline how to migrate from a commercial search engine solution, FAST ESP, to an open-source solution, Lucene Solr. I discuss how we use Heritrix for scalable web crawling and Pypes for scalable document processing, and I provide an example of how to convert an ESP processor into a Pypes processor.

Published in: Technology

  1. Migration from FAST ESP to Lucene Solr
     Presented by Michael McIntosh, michaelm@tnrglobal.com, Oct 19th, 2011
  2. What will we cover? Core Aspects of ESP to Solr Migration
     • Migration Overview
     • Crawling Content
     • Processing Content
     • Searching Content
     • Scaling for Growth
     • Questions?
     © 2011 TNR Global, LLC.
  3. Who am I?
     • 7+ Years FAST ESP
     • 10+ Years in Search
     • 15+ Years in Software
     • Early Lycos Developer
     • I also develop brain-computer interfaces :)
  4. Who are we?
     • 7+ Years in Search
     • 15+ Years in Web Dev
     • 30+ Years in Software
     • Focus on ESP, Solr, Lucene, and the Cloud
     • Scalable Web & Search Solution Experts
  5. Migration Overview
  6. Migration Challenges
     • Our clients depend on ESP 5.3
     • No future support for Linux ESP
     • We need a viable exit strategy
     • We want a fairly painless approach
     • How do we provide an alternative?
  7. Migration Use Case: Federated Product Search
     ...millions of parts and services...
     • XML documents (highly-structured)
     • PDF documents (semi-structured)
     • HTML documents (unstructured)
  8. Our Approach: Solr Search Platform (SolrSP)
     • Custom Scalable Crawler using Heritrix
     • Events & Queues managed with RabbitMQ
     • Caching & Persistence supported via Riak
     • Python pipeline replacement using Pypes
     • Advanced Linguistics via NLTK or Rosette
  9. Crawling Content
  10. Crawling for ESP
      • For XML content, our scripts query a service, download resources and feed
      • For PDF content, our scripts query a database, download PDF URLs and feed
      • For HTML, our scripts query a database, download seed URLs and launch ESP’s Enterprise Crawler
  11. Crawling for Solr
      • For XML & PDF content, the approach remains the same with a different writer
      • We tried the Nutch crawler, but found it challenging to make it do what we needed
      • We tried the Lucid Works bundled crawler, but found the exposed functionality did not offer the level of flexibility we needed
  12. Crawling with Heritrix
      • Heritrix, created by the Internet Archive, supports much of the same functionality that the ESP Enterprise Crawler provides
      • We wrapped Heritrix to provide a higher-level interface for service management
      • We made it scalable and added document caching via Riak to support refresh crawling
  13. Crawler Architecture
      [Diagram; labels only: Crawl Job Request, Crawler Manager, Queue Cluster (RabbitMQ), three Heritrix Messengers, three Heritrix Crawlers, Persistence Cluster (Riak)]
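The refresh-crawling idea from slides 12 and 13 (queue crawl requests, cache what was fetched, reprocess only what changed) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SolrSP implementation: `queue.Queue` stands in for RabbitMQ, an in-memory dictionary stands in for Riak, and `RefreshCache`, `crawl_worker`, the URLs and the page bodies are all hypothetical.

```python
import hashlib
import queue


class RefreshCache:
    """In-memory stand-in for a persistent document cache (the slides use Riak)."""

    def __init__(self):
        self._digests = {}

    def changed(self, url, body):
        """Record the digest of body; report whether it differs from the last crawl."""
        digest = hashlib.sha1(body).hexdigest()
        previous = self._digests.get(url)
        self._digests[url] = digest
        return previous != digest


def crawl_worker(requests, fetch, cache):
    """Drain queued crawl requests; return only pages that are new or changed."""
    fresh = []
    while True:
        try:
            url = requests.get_nowait()
        except queue.Empty:
            break
        body = fetch(url)
        if cache.changed(url, body):
            fresh.append((url, body))
    return fresh


# Hypothetical site content, used here instead of real HTTP fetches.
pages = {"http://example.com/a": b"part A", "http://example.com/b": b"part B"}


def enqueue_all(requests):
    for url in sorted(pages):
        requests.put(url)


cache = RefreshCache()
jobs = queue.Queue()  # stand-in for the RabbitMQ queue cluster

enqueue_all(jobs)
first_pass = crawl_worker(jobs, pages.__getitem__, cache)

pages["http://example.com/b"] = b"part B, revised"
enqueue_all(jobs)
second_pass = crawl_worker(jobs, pages.__getitem__, cache)
```

On the first pass both pages count as new; on the second pass only the revised page is handed on for processing, which is the point of caching crawled documents.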
  14. Processing Content
  15. Processing for ESP: ESP processing is document-centric
      • For XML, we transform, tag metadata, classify content before indexing
      • For PDF, we split pages, generate thumbnails, tag metadata and classify before indexing
      • For HTML, we normalize, clean content, tag metadata and classify before indexing
  16. Processing for Solr: Solr processing is field-centric
      • Solr analyzers work on a field-by-field basis and lack the flexible workflow ESP provides
      • We are using some Solr analyzers for now, but evaluating alternatives (Rosette, NLTK)
      • Hadoop + Cascading looks promising
      • We use Stackless Python with Pypes to make ESP stage migration less painful
  17. Processing with Pypes
      • Written in Python
      • Easy stage migration
      • Very flexible & robust
      • Branching & Merging
      • Single Input, Many Outputs
      • Trivial to embed and extend
  18. Processor Migration ...From ESP
  19. Processor Migration ...to Pypes
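The code shown on slides 18 and 19 did not survive in this transcript. As a rough illustration of the general shape of such a migration, the sketch below wraps a document-centric stage (called once per document) inside a port-driven stage that takes a single input and routes to several outputs. Every class, method and port name here is a hypothetical stand-in; none of it is the actual FAST ESP processor API or the actual Pypes component API.

```python
class EspStyleTitleCleaner:
    """Document-centric stage: called once per document, edits it in place.

    Hypothetical stand-in; not the real FAST ESP processor base class.
    """

    def process(self, docid, document):
        title = document.get("title", "")
        # Collapse runs of whitespace and trim the ends.
        document["title"] = " ".join(title.split())
        return True


class PypeStyleStage:
    """Port-driven stage: reads one input stream, writes to named outputs.

    Hypothetical stand-in; not the real Pypes component class.
    """

    def __init__(self, legacy_stage):
        self.legacy_stage = legacy_stage
        self.outputs = {"out": [], "error": []}

    def run(self, incoming):
        for document in incoming:
            ok = self.legacy_stage.process(document.get("id"), document)
            # Single input, many outputs: route on the legacy stage's result.
            self.outputs["out" if ok else "error"].append(document)


stage = PypeStyleStage(EspStyleTitleCleaner())
stage.run([{"id": "doc-1", "title": "  Hex   bolt \n M8 "}])
cleaned = stage.outputs["out"]
```

The design choice illustrated is that the per-document logic stays untouched while only the surrounding plumbing changes, which is one plausible reading of "easy stage migration" on slide 17.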
  20. Searching Content
  21. Feature Differences
      • ESP has robust faceting support but facets must be defined at index time, unlike Solr faceting
      • Solr does most of the heavy lifting at query time, which allows for more flexible approaches
      • Solr now directly supports taxonomy (hierarchical) faceting functionality (for drill-down categories)
      • Solr now supports field collapsing, which we use heavily in our ESP installation to collapse result sets
      • ESP to Solr schema mapping is fairly straightforward
  22. Search Interface
      • Solr has no direct equivalent to FAST Query Language (FQL), but function queries look like a possible option for complex queries
      • If you don’t have overly complex queries, the edismax query parser looks like a good option
      • Solr doesn’t have an easily extendable search-front component like ESP, but we like TwigKit for that
      • The default Solr stemmer isn’t as good as the ESP lemmatizer, so if you need good lemmatization consider Rosette Linguistics Platform or NLTK
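Because Solr does faceting and field collapsing at query time (slides 21 and 22), both are requested through query parameters rather than index configuration. The sketch below builds such a request URL with the edismax parser. The parameter names (`defType`, `qf`, `facet`, `facet.field`, `group`, `group.field`, `wt`) are standard Solr parameters; the host, handler path and field names (`title`, `description`, `manufacturer`, `category`, `part_number`) are assumptions made up for this example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(base_url, user_query):
    """Build a Solr select URL with edismax, facets and result grouping."""
    params = [
        ("q", user_query),
        ("defType", "edismax"),           # extended dismax query parser
        ("qf", "title^2 description"),    # hypothetical fields, title boosted
        ("facet", "true"),
        ("facet.field", "manufacturer"),  # facets are chosen per request
        ("facet.field", "category"),
        ("group", "true"),                # field collapsing (result grouping)
        ("group.field", "part_number"),
        ("wt", "json"),
    ]
    return base_url + "?" + urlencode(params)


# Hypothetical local Solr instance.
url = build_search_url("http://localhost:8983/solr/select", "hex bolt")

# Round-trip the query string to inspect what would be sent.
sent = parse_qs(urlsplit(url).query)
```

Changing which fields are faceted or collapsed only means changing this parameter list, with no reindexing, which is the flexibility the slides contrast with ESP's index-time facet definitions.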
  23. Scaling for Growth
  24. About the hardware...
      • Solr allows you to use the familiar rows / columns layout ESP uses
      • Add shards to scale content, add search slaves to scale queries
      • We’re currently using a master/slave indexer/search setup, but options are numerous
      • We’re developing a solution to support scaling at will, a pain point for ESP as well
  25. It’s not just hardware...
      • Use Fabric to automate cluster installs, data builds and deployment tasks
      • Use Jenkins to automate, manage and track Fabric tasks
      • Use Supervisor to manage multiple services running on each node
      • Use Lucid Works for better out-of-the-box stemming, alerts, services and support
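The rows / columns layout from slide 24 can be pictured as a grid: each column holds one shard of the content, and each row is a full set of search slaves that can answer a query on its own. The sketch below is a minimal illustration under assumptions: the host names, the CRC-based document routing and the round-robin row selection are all hypothetical choices, not the setup described in the talk. The comma-separated `shards` value follows the format Solr's distributed search expects.

```python
import zlib

# Hypothetical cluster: each row is a full copy of the index,
# each column is one shard (one slice of the documents).
GRID = [
    ["search-r0c0:8983/solr", "search-r0c1:8983/solr"],
    ["search-r1c0:8983/solr", "search-r1c1:8983/solr"],
]


def column_for(doc_id, columns):
    """Pick the shard (column) for a document with a stable hash of its id."""
    return zlib.crc32(doc_id.encode("utf-8")) % columns


def shards_param(grid, request_number):
    """Pick one row round-robin and list its shards for Solr's `shards` parameter."""
    row = grid[request_number % len(grid)]
    return ",".join(row)
```

Adding a column spreads content across more shards, while adding a row adds query capacity without changing how documents are routed.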
  26. Migration in a Nutshell
      • We now consider Solr robust enough to be a viable replacement for a FAST ESP solution
      • You supply the glue, or work with someone like us to tie the different components together
      • If you have many custom pipeline stages, consider using Pypes to ease your initial ESP migration
      • Fully supported versions of Solr with the latest cutting-edge features are available via Lucid Works
  27. Resources
      • Lucid Works: http://www.lucidimagination.com/
      • Rosette: http://www.basistech.com/lucene/
      • Heritrix: http://crawler.archive.org/
      • TwigKit: http://twigkit.com/
      • Pypes: https://bitbucket.org/diji/pypes/
      • Riak: http://basho.com/
      • NLTK: http://www.nltk.org/
      • RabbitMQ: http://www.rabbitmq.com/
      • Cascading: http://www.cascading.org/
      • Fabric: http://fabfile.org/
      • Jenkins: http://jenkins-ci.org/
      • Supervisor: http://supervisord.org/
  28. Questions? Contact Us!
      • Website: http://www.tnrglobal.com
      • E-Mail: fast2solr@tnrglobal.com
      • Phone: 001-413-425-1499
      Thank you for your time!