Migration from FAST ESP to Solr

Migration from a Commercial Search Platform (specifically FAST ESP) to Lucene/Solr

Presented by Michael McIntosh, VP, Enterprise Search Technologies, TNR Global

There are many reasons that an IT department with a large-scale search installation would want to move from a proprietary platform to Lucene/Solr. In the case of FAST Search, the company’s purchase by Microsoft and the discontinuation of the Linux platform have created an urgency for FAST users.

This presentation will compare Lucene/Solr to FAST ESP on a feature basis, and as applied to an enterprise search installation. We will further explore how various advanced features of commercial enterprise search platforms can be implemented as added functions for Lucene/Solr. Actual cases will be presented describing how to map the various functions between systems.

Published in: Technology

Transcript

  • 1. Revolution Conference October 7-8, 2010 Migrating from Fast ESP to Lucene Solr Search Platform Presented by Michael McIntosh 10/8/10 www.tnrglobal.com 1
  • 2. Introduction
      - Michael McIntosh
        » Search Architect & VP of Enterprise Search at TNR Global
        » 10+ Years in Search, 15+ Years in Software Development
        » Core Member of Lycos Search Engine Team (1997-2001)
      - TNR Global, LLC
        » Web Development and Search Integration Services
        » LAMP Stack (Linux, Apache, MySQL, PHP, Python, Perl)
        » Search Integrators using Java, Python, Ruby and C#
        » Search Engine (Fast ESP, Solr, OmniFind and More)
  • 3. Agenda
      - Define Our Challenges
      - Outline Potential Solution
      - Identify Core Components
      - Explore Specific Use Cases
      - Highlight What Was Learned
  • 4. The Problem
      - Largest Clients using Fast ESP for Linux
      - No Future in Fast ESP for Linux Platforms
  • 5. Um, ESP? I think our future together may be in serious jeopardy…
  • 6. The Problems (cont.)
      - Largest Clients using Fast ESP for Linux
      - No Future in Fast ESP for Linux Platforms
      - Lacking Dynamic Fields & Robust Facet Support
      - Limited Ability to Modify Result Ranking Algorithm
      - Proprietary Code & Limited Community Support
  • 7. The Problems (cont.)
      - Search Migration Path for ESP Clients
      - Both Structured & Unstructured Content
      - Scalable, Fault-Tolerant, Production Quality
      - Content Taxonomy and Drill-Down Navigation
      - Web Crawling, HTML & Multi-Page Documents
  • 8. The Solution
      - Apache Solr Search Platform
        » Robust and Powerful Search Feature Set
        » Active and Passionate Development Community
        » Good Lucene and Solr Development Documentation
        » Community Experts and Commercial Support Options
  • 9. The Solution (cont.)
      - Open-Source Tools for Missing Functionality
        » Pypes - Document-Centric Processing Pipeline
        » Heritrix - Highly Configurable Web Crawler
        » Supervisor - Cluster Node Services Controller
  • 10. The Solution (cont.)
      - ESP Specific Code Migration
        » Refactor to Decouple Tightly Integrated ESP Code
        » Utilize RESTful Service Oriented Architecture Solutions
        » Using CherryPy for Python Based Services
        » Using Jetty for Java Based Services
  • 11. The Solution (cont.)
      - Platform Agnostic Code Transition
        » Content Connectors - Database and XML Data Feeds
        » Content Transformers - ESP FastXML Readers Available
        » Content Feeding - Trivial to Import Structured Documents

      <?xml version="1.0" encoding="UTF-8"?>
      <documents>
        <document id="http://www.aperturescience.com/item?id=14602&amp;tsId=1931068">
          <element name="catalog_id"><value>1931068</value></element>
          <element name="catalog_name"><value>Aperture Science Catalog</value></element>
          <element name="item_id"><value>4096</value></element>
          <element name="item_category"><value>/Storage/Containers</value></element>
          <element name="item_name"><value>Companion Cube</value></element>
          ...
        </document>
        ...
      </documents>
  • 12. Key Migration Concerns
      - What are deal-breakers for our clients?
        » Solution MUST support highly structured content
        » Solution MUST support unstructured web content
        » Solution MUST support parametric search features
        » Solution MUST support hierarchical taxonomy faceting
        » Solution MUST support faceting on dynamic fields
        » Solution MUST support scalable search/indexing architecture
        » Solution MUST support fault-tolerance & partial fail-over
  • 13. Key Migration Challenges
      - Crawling Unstructured Web Content
        » Millions of documents from 3rd party websites
        » Mixture of dynamic and static website content
        » Mixture of very high and very low quality content
        » Need to Support HTML and PDF at a minimum
      - Feeding Highly Structured XML Content
        » Millions of products with domain-specific attributes
        » Mixture of manually and automatically classified content
        » Taxonomy and structure in nearly constant state of flux
  • 14. Crawling Web Content with Solr
      - Heritrix Web Crawler
        » Internet Archive's Open-Source Web Crawler
        » Very Powerful and Highly Configurable Features
        » Can be configured to mimic ESP crawler behaviors
        » Can cache documents for later content feeding
        » Already had experience working with this tool
  • 15. Feeding Web Content with Solr
      - YouSeer API
        » Open-source search engine framework
        » Built on top of other open source components
        » Part of SeerSuite framework
        » Utilizes Heritrix for crawling and Solr for indexing
        » Simple and convenient to use
  • 16. Crawling & Feeding Web Content
  • 17. Feeding Product Content with Solr
      - Solr supports XML, JSON, CSV out-of-the-box
      - We already transform content to ESP FastXML
      - Many options for data import, easily scriptable
      - ESP prefers denormalized content, Solr does too
  • 18. ESP FastXML Content Example

      <?xml version="1.0" encoding="UTF-8"?>
      <documents>
        <document id="http://www.aperturescience.com/item?id=14602&amp;tsId=1931068">
          <element name="catalog_id"><value>1931068</value></element>
          <element name="catalog_name"><value>Aperture Science Catalog</value></element>
          <element name="item_id"><value>4096</value></element>
          <element name="item_category"><value>/Storage/Containers</value></element>
          <element name="item_name"><value>Companion Cube</value></element>
          ...
        </document>
        <document id="http://www.aperturescience.com/item?id=14647&amp;tsId=193764">
          <element name="catalog_id"><value>193764</value></element>
          <element name="catalog_name"><value>Aperture Science Catalog</value></element>
          <element name="item_id"><value>2048</value></element>
          <element name="item_category"><value>/Supplies/Baking</value></element>
          <element name="item_name"><value>Cake Ingredient #42</value></element>
          ...
        </document>
        ...
      </documents>
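As a rough illustration of why importing this kind of structured content can be straightforward, the sketch below converts FastXML shaped like the example above into Solr's XML update format (`<add><doc><field name="...">`). It is not the presenters' code. The function name `fastxml_to_solr_add`, the unique key field name `id`, and the assumption that each FastXML element name maps directly onto a Solr schema field are all hypothetical choices made for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the FastXML sample from the slide (bytes, so the
# encoding declaration is handled by the parser).
FASTXML_SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<documents>
  <document id="http://www.aperturescience.com/item?id=14602&amp;tsId=1931068">
    <element name="catalog_id"><value>1931068</value></element>
    <element name="item_category"><value>/Storage/Containers</value></element>
    <element name="item_name"><value>Companion Cube</value></element>
  </document>
</documents>
"""


def fastxml_to_solr_add(fastxml: bytes, id_field: str = "id") -> bytes:
    """Convert FastXML documents into a Solr <add> update message.

    Assumptions (not from the slides): the document id becomes the
    Solr unique key named by `id_field`, and every <element name="x">
    is written to a Solr field also called "x".
    """
    source = ET.fromstring(fastxml)
    add = ET.Element("add")
    for document in source.iter("document"):
        doc = ET.SubElement(add, "doc")
        key = ET.SubElement(doc, "field", name=id_field)
        key.text = document.get("id")
        for element in document.findall("element"):
            # Emit one Solr field per <value> child of the element.
            for value in element.findall("value"):
                field = ET.SubElement(doc, "field", name=element.get("name"))
                field.text = value.text
    return ET.tostring(add, encoding="utf-8")


if __name__ == "__main__":
    print(fastxml_to_solr_add(FASTXML_SAMPLE).decode("utf-8"))
```

The resulting bytes could then be posted to a Solr update handler with any HTTP client; the target schema would need matching field definitions.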
  • 19. Solr Taxonomy Faceting Approach
      - At first pass, Solr does not appear to currently support taxonomy faceting
        » There are several ways around this including patches
        » It is relatively easy to resolve if taxonomy is shallow
        » Taxonomy Faceting Support is around the corner

      Electronics » Camera & Photo » Digital Cameras
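The slide does not say which workarounds were considered. One commonly described option for shallow taxonomies, shown here only as an illustration and not as the presenters' method, is to index every ancestor path as a depth-prefixed token and then use Solr's `facet.prefix` parameter to list only the children of the current node. The function names, the `taxonomy` field name, and the `/` separator are assumptions for this sketch.

```python
def taxonomy_tokens(path, sep="/"):
    """Return one depth-prefixed token per level of a taxonomy path.

    "/Storage/Containers" -> ["0/Storage", "1/Storage/Containers"]
    """
    parts = [part for part in path.split(sep) if part]
    return [
        f"{depth}{sep}{sep.join(parts[:depth + 1])}"
        for depth in range(len(parts))
    ]


def child_facet_params(selected="", field="taxonomy", sep="/"):
    """Build facet parameters that list the children of `selected`.

    `field` is a hypothetical multi-valued string field holding the
    tokens produced by taxonomy_tokens().
    """
    parts = [part for part in selected.split(sep) if part]
    # Children live one level below the selected node.
    prefix = f"{len(parts)}{sep}" + "".join(part + sep for part in parts)
    return {"facet": "true", "facet.field": field, "facet.prefix": prefix}


if __name__ == "__main__":
    print(taxonomy_tokens("Electronics/Camera & Photo/Digital Cameras"))
    print(child_facet_params("Electronics"))
```

With nothing selected, the prefix covers only top-level tokens; after a selection, it covers only tokens one level deeper under that node.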
  • 20. Our Taxonomy Faceting Approach
      - We used fields in schema for top-level and second-level taxonomy categories
        » Top Level Field Named “Family”
        » Second Level Field Named “Category”
        » The facet field is selected based upon user selection
        » If no family value selected, faceting occurs on family
        » If family is selected, faceting occurs on category
        » If family/category selected, no need for taxonomy faceting
  • 21. Product Attribute Faceting Approach
      - We used dynamic fields to store attributes
        » Attribute name is family_category_attribute=value
        » We do not facet on attributes until at least a family is selected
        » During feeding we capture family/category/attribute maps
        » The front-end leverages f/c/a map to know what to facet
        » Using this approach, we can have preferred attribute fields
        » Only most relevant fields faceted on for each Fam/Cat
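A minimal sketch of the naming scheme and the family/category/attribute map follows. It is an illustration under assumptions, not the presenters' implementation: the `_attr` suffix (imagined to match a dynamic field pattern such as `*_attr`), the lower-casing of names, and the `AttributeMap` class are all hypothetical. The ranking of "preferred" attributes mentioned on the slide is not modeled.

```python
import re


def _slug(text):
    """Lower-case a label and replace runs of other characters with '_'."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def attribute_field(family, category, attribute, suffix="_attr"):
    """Build a dynamic field name of the form family_category_attribute."""
    return "_".join(_slug(p) for p in (family, category, attribute)) + suffix


class AttributeMap:
    """Records which attribute fields were seen for each family/category."""

    def __init__(self):
        self._fields = {}

    def record(self, family, category, attribute):
        """Called during feeding; returns the field name to index under."""
        name = attribute_field(family, category, attribute)
        fields = self._fields.setdefault((family, category), [])
        if name not in fields:
            fields.append(name)
        return name

    def facet_fields(self, family=None, category=None):
        """Fields the front-end may facet on for the current selection.

        Nothing is returned until at least a family is selected.
        """
        if family is None:
            return []
        result = []
        for (fam, cat), fields in self._fields.items():
            if fam == family and (category is None or cat == category):
                result.extend(f for f in fields if f not in result)
        return result


if __name__ == "__main__":
    amap = AttributeMap()
    amap.record("Storage", "Containers", "Edge Length")
    print(amap.facet_fields("Storage", "Containers"))
```

The map built during feeding would typically be persisted somewhere the front-end can read it, so that facet requests list only fields that exist for the selected family and category.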
  • 22. Solr Migration: Pros / Cons
      - ESP Features That We Miss…
        » We miss the really nice administration interface
        » We miss the really nice monitoring interfaces
        » We miss the numerous content data connectors
        » We miss the processing pipeline & doc processors
      - Solr Features That We Love…
        » Open-Source, Completely Customizable
        » Dynamic Fields and Runtime Faceting Support
        » Active and Passionate Development Community
  • 23. What We Have Learned about Solr…
      - If you have mostly structured data…
        » With denormalization, it should be trivial to import
        » You have many ways to get content into Solr
        » Your overall development time could be short
        » There are a lot of people using Solr in this way
      - If you have mostly unstructured data…
        » You need to find a good crawling solution
        » You will not have all that you need out-of-the-box
        » Crawling 3rd party content can be a daunting task
  • 24. Questions?
      - Contact Us!
        » Website: http://www.tnrglobal.com
        » E-Mail: info@tnrglobal.com
        » Phone: 413.425.1499
      Thank you for your time!
      © 2010 TNR Global, LLC. All rights reserved.
