Migration from Fast ESP to Lucene Solr - Michael McIntosh

See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011

This presentation discusses migration from FAST ESP to a Lucene Solr search platform. Illustrated through actual case studies, it covers challenges and concerns, and presents solutions and workarounds to overcome migration issues. There are many reasons that an IT department with a large-scale search installation would want to move from a proprietary platform to Lucene Solr. In the case of FAST Search, the company's purchase by Microsoft and the discontinuation of the Linux platform have created urgency for FAST users.

Published in: Technology

  1. Migration from FAST ESP to Lucene Solr
     Presented by Michael McIntosh, michaelm@tnrglobal.com, Oct 19th, 2011
  2. What will we cover? Core Aspects of ESP to Solr Migration
     • Migration Overview
     • Crawling Content
     • Processing Content
     • Searching Content
     • Scaling for Growth
     • Questions?
     © 2011 TNR Global, LLC. Wednesday, October 19, 2011
  3. Who am I?
     • 7+ Years FAST ESP
     • 10+ Years in Search
     • 15+ Years in Software
     • Early Lycos Developer
     • I also develop brain-computer interfaces :)
  4. Who are we?
     • 7+ Years in Search
     • 15+ Years in Web Dev
     • 30+ Years in Software
     • Focus on ESP, Solr, Lucene, and the Cloud
     • Scalable Web & Search Solution Experts
  5. Migration Overview
  6. Migration Challenges
     • Our clients depend on ESP 5.3
     • No future support for Linux ESP
     • We need a viable exit strategy
     • We want a fairly painless approach
     • How do we provide an alternative?
  7. Migration Use Case: Federated Product Search ...millions of parts and services...
     • XML documents (highly-structured)
     • PDF documents (semi-structured)
     • HTML documents (unstructured)
  8. Our Approach: Solr Search Platform (SolrSP)
     • Custom Scalable Crawler using Heritrix
     • Events & Queues managed with RabbitMQ
     • Caching & Persistence supported via Riak
     • Python pipeline replacement using Pypes
     • Advanced Linguistics via NLTK or Rosette
  9. Crawling Content
  10. Crawling for ESP
     • For XML content, our scripts query a service, download resources and feed
     • For PDF content, our scripts query a database, download PDF urls and feed
     • For HTML, our scripts query a database, download seed URLs and launch ESP’s Enterprise Crawler
  11. Crawling for Solr
     • For XML & PDF content, the approach remains the same with a different writer
     • We tried Nutch crawler, but found it challenging to make it do what we needed
     • We tried Lucid Works bundled crawler, but found the exposed functionality did not offer the level of flexibility we needed
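The "different writer" mentioned above can be pictured as a small function that turns an already-parsed record into a Solr XML update message instead of an ESP feed. The following is a minimal sketch, not the presenters' code: the `<add><doc><field name="...">` envelope is Solr's standard XML update format, but the function name `to_solr_add_xml` and the record fields (`id`, `title`, `category`) are hypothetical.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(records):
    """Serialize dict records into a Solr XML <add> update message.

    Field names are taken from the records themselves; list values become
    repeated <field> elements, which is how multi-valued fields are sent.
    """
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc, "field", {"name": name})
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")


# Hypothetical record, standing in for output of the XML download scripts.
sample = [{"id": "part-001", "title": "Hex bolt", "category": ["fasteners", "bolts"]}]
solr_xml = to_solr_add_xml(sample)
```

The resulting string would then be posted to the Solr update handler; keeping the writer separate from the download scripts is what lets the rest of the crawl code stay unchanged.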
  12. Crawling with Heritrix
     • Heritrix, created by the Internet Archive, supports much of the same functionality that the ESP Enterprise Crawler provides
     • We wrapped Heritrix to provide a higher level interface for service management
     • Made it scalable and added document caching via Riak to support refresh crawling
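Document caching helps refresh crawling because a re-fetched page can be compared with what was stored last time and skipped if nothing changed. The slides do not say how changes are detected, so the sketch below makes two assumptions: a plain dictionary stands in for Riak, and a SHA-256 digest comparison is the change test. The class and method names are hypothetical.

```python
import hashlib


class DocumentCache:
    """In-memory stand-in for the persistence cluster (Riak in the slides)."""

    def __init__(self):
        self._digests = {}

    def has_changed(self, url, content):
        """Store the digest of content for url; return True if new or changed."""
        digest = hashlib.sha256(content).hexdigest()
        changed = self._digests.get(url) != digest
        self._digests[url] = digest
        return changed


cache = DocumentCache()
first = cache.has_changed("http://example.com/a", b"<html>v1</html>")
repeat = cache.has_changed("http://example.com/a", b"<html>v1</html>")
updated = cache.has_changed("http://example.com/a", b"<html>v2</html>")
```

Only documents for which `has_changed` returns `True` would need to be sent on for processing and indexing.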
  13. Crawler Architecture
     [Diagram. Components shown: Crawl Job Request; Crawler Manager; Queue Cluster (RabbitMQ); three Heritrix Messengers; three Heritrix Crawlers; Persistence Cluster (Riak)]
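One plausible reading of the architecture is that crawl requests are placed on a queue, several crawler workers pull from it, and results land in a shared store. The sketch below illustrates only that pattern under loud assumptions: `queue.Queue` stands in for the RabbitMQ cluster, a dictionary stands in for Riak, and `fake_fetch` replaces a real Heritrix crawl so the example needs no network. None of the names come from the slides.

```python
import queue
import threading


def fake_fetch(url):
    """Placeholder for a Heritrix crawl; returns canned content, no network."""
    return "content of " + url


def crawler_worker(jobs, store, lock):
    """Pull crawl requests until a None sentinel arrives, persisting results."""
    while True:
        url = jobs.get()
        if url is None:
            break
        body = fake_fetch(url)
        with lock:
            store[url] = body


def run_crawl(urls, worker_count=3):
    """Dispatch urls to worker threads through a shared queue."""
    jobs = queue.Queue()   # stand-in for the RabbitMQ queue cluster
    store = {}             # stand-in for the Riak persistence cluster
    lock = threading.Lock()
    workers = [
        threading.Thread(target=crawler_worker, args=(jobs, store, lock))
        for _ in range(worker_count)
    ]
    for worker in workers:
        worker.start()
    for url in urls:
        jobs.put(url)
    for _ in workers:
        jobs.put(None)     # one shutdown sentinel per worker
    for worker in workers:
        worker.join()
    return store


crawled = run_crawl(["http://example.com/%d" % i for i in range(5)])
```

Because workers only share the queue and the store, adding crawl capacity in this pattern means starting more workers rather than changing the dispatch code.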
  14. Processing Content
  15. Processing for ESP: ESP Processing is document-centric
     • For XML, we transform, tag metadata and classify content before indexing
     • For PDF, we split pages, generate thumbnails, tag metadata and classify before indexing
     • For HTML, we normalize, clean content, tag metadata and classify before indexing
  16. Processing for Solr: Solr Processing is field-centric
     • Solr analyzers work on a field by field basis and lack the flexible workflow ESP provides
     • We are using some Solr analyzers for now, but evaluating alternatives (Rosette, NLTK)
     • Hadoop + Cascading looks promising
     • We use Stackless Python with Pypes to make ESP stage migration less painful
  17. Processing with Pypes
     • Written in Python
     • Easy stage migration
     • Very flexible & robust
     • Branching & Merging
     • Single Input, Many Outputs
     • Trivial to embed and extend
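The ideas of document-centric stages, branching and merging, and single input with many outputs can be illustrated without any framework. The sketch below is plain Python and deliberately does not use or imitate the Pypes API; the stage names, document layout, and `process` helper are all hypothetical.

```python
def split_pages(doc):
    """One input, many outputs: emit one document per PDF page."""
    for number, text in enumerate(doc["pages"], start=1):
        yield {"id": "%s#p%d" % (doc["id"], number), "type": doc["type"], "body": text}


def tag_metadata(doc):
    """Attach a simple derived attribute to a copy of the document."""
    tagged = dict(doc)
    tagged["length"] = len(doc["body"])
    yield tagged


def run_pipeline(docs, stages):
    """Feed every document through each stage in order; stages are generators."""
    for stage in stages:
        docs = [out for doc in docs for out in stage(doc)]
    return docs


def process(doc):
    """Branch on content type, then merge back into a common tagging stage."""
    branch = [split_pages] if doc["type"] == "pdf" else []
    return run_pipeline([doc], branch + [tag_metadata])


pdf = {"id": "manual-7", "type": "pdf", "pages": ["intro", "parts list"]}
html = {"id": "page-1", "type": "html", "body": "hello"}
results = process(pdf) + process(html)
```

Each stage sees a whole document rather than a single field, which is the workflow style the ESP pipeline offers and Solr analyzers do not.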
  18. Processor Migration ...From ESP
  19. Processor Migration ...to Pypes
  20. Searching Content
  21. Feature Differences
     • ESP has robust faceting support, but facets must be defined at index time, unlike Solr faceting
     • Solr does most of the heavy lifting at query time, which allows for more flexible approaches
     • Solr now directly supports taxonomy (hierarchical) faceting functionality (for drill down categories)
     • Solr now supports field collapsing, which we use heavily in our ESP installation to collapse result sets
     • ESP to Solr schema mapping is fairly straightforward
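Because Solr computes facets and collapses results at query time, both are requested through query parameters rather than index configuration. The sketch below builds such a query string using Solr's `facet`, `facet.field`, `group`, and `group.field` parameters; the helper name and the field names (`manufacturer`, `category`, `product_family`) are hypothetical, and the fields are assumed to exist in the schema.

```python
from urllib.parse import parse_qs, urlencode


def build_search_params(query, facet_fields, collapse_field=None, rows=10):
    """Return a URL-encoded Solr query string with faceting and grouping."""
    params = [("q", query), ("rows", str(rows)), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may be repeated, once per field to facet on.
        params.extend(("facet.field", name) for name in facet_fields)
    if collapse_field:
        # Result grouping is how Solr exposes field collapsing.
        params.append(("group", "true"))
        params.append(("group.field", collapse_field))
    return urlencode(params)


query_string = build_search_params(
    "hex bolt", ["manufacturer", "category"], collapse_field="product_family"
)
decoded = parse_qs(query_string)
```

Changing which fields are faceted is then a matter of changing the request, with no reindexing.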
  22. Search Interface
     • Solr has no direct equivalent to FAST Query Language (FQL) but function queries look like a possible option for complex queries
     • If you don’t have overly complex queries, the edismax query parser looks like a good option
     • Solr doesn’t have an easily extendable search-front component like ESP, but we like TwigKit for that
     • Default Solr stemmer isn’t as good as the ESP lemmatizer, so if you need good lemmatization consider Rosette Linguistics Platform or NLTK
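For queries that are not overly complex, an edismax request mostly consists of the user's text plus a list of fields and boosts. The sketch below assembles one using Solr's `defType`, `qf`, and `mm` parameters; the helper name, the field names, and the boost values are hypothetical and would need tuning against a real schema.

```python
from urllib.parse import parse_qs, urlencode


def build_edismax_params(user_query, field_boosts, min_should_match="75%"):
    """Return a URL-encoded query string for Solr's edismax parser."""
    # qf lists the fields to search, each with a ^boost; sorted for stable output.
    qf = " ".join(
        "%s^%s" % (name, boost) for name, boost in sorted(field_boosts.items())
    )
    return urlencode({
        "defType": "edismax",
        "q": user_query,
        "qf": qf,
        "mm": min_should_match,
    })


edismax_query = build_edismax_params(
    "stainless hex bolt",
    {"title": 3, "part_number": 5, "description": 1},
)
edismax_decoded = parse_qs(edismax_query)
```

This keeps relevance tuning in the request parameters, which is convenient while comparing Solr results against an existing ESP ranking.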
  23. Scaling for Growth
  24. About the hardware...
     • Solr allows you to use the familiar rows / columns layout ESP uses
     • Add shards to scale content, add search slaves to scale queries
     • We’re currently using a master/slave indexer/search setup, but options are numerous
     • We’re developing a solution to support scaling at will, a pain point for ESP as well
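The rows / columns layout can be modelled as a grid in which each column holds one shard of the content and each row is a full set of search nodes. The sketch below assumes the indexing client decides which shard receives each document (here by CRC32 of the document id, which is an illustrative choice, not something the slides specify) and builds the value of Solr's `shards` request parameter for one row. Host names are hypothetical.

```python
import zlib


def shard_for(doc_id, shard_count):
    """Pick a column (shard) for a document with a stable CRC32 hash."""
    return zlib.crc32(doc_id.encode("utf-8")) % shard_count


def shards_param(layout, row):
    """Build the value of Solr's `shards` parameter from one row of the grid."""
    return ",".join(layout[row])


# Two rows (query capacity) by two columns (content partitions).
layout = [
    ["solr-r0c0:8983/solr", "solr-r0c1:8983/solr"],
    ["solr-r1c0:8983/solr", "solr-r1c1:8983/solr"],
]
column = shard_for("part-001", len(layout[0]))
row_one_shards = shards_param(layout, 1)
```

Adding a column grows the amount of content the cluster can hold, while adding a row grows the number of queries it can serve, which mirrors the bullet about shards and search slaves.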
  25. It's not just hardware...
     • Use Fabric to automate cluster installs, data builds and deployment tasks
     • Use Jenkins to automate, manage and track Fabric tasks
     • Use Supervisor to manage multiple services running on each node
     • Use Lucid Works for better out-of-the-box stemming, alerts, services and support
  26. Migration In a Nutshell
     • We now consider Solr robust enough to be a viable replacement for a FAST ESP solution
     • You supply the glue, or work with someone like us to tie the different components together
     • If you have many custom pipeline stages, consider using Pypes to ease your initial ESP migration
     • Fully supported versions of Solr with the latest cutting-edge features are available via Lucid Works
  27. Resources
     • Lucid Works: http://www.lucidimagination.com/
     • Rosette: http://www.basistech.com/lucene/
     • Heritrix: http://crawler.archive.org/
     • TwigKit: http://twigkit.com/
     • Pypes: https://bitbucket.org/diji/pypes/
     • Riak: http://basho.com/
     • NLTK: http://www.nltk.org/
     • RabbitMQ: http://www.rabbitmq.com/
     • Cascading: http://www.cascading.org/
     • Fabric: http://fabfile.org/
     • Jenkins: http://jenkins-ci.org/
     • Supervisor: http://supervisord.org/
  28. Questions?
     • Contact Us!
     • Website: http://www.tnrglobal.com
     • E-Mail: fast2solr@tnrglobal.com
     • Phone: 001-413-425-1499
     Thank you for your time!
