
From Lucene to Solr 4 Trunk

SF Bay Area Lucene / Solr Meetup 17 Jan 2013

Use Case - How the SolvNet team migrated from Apache Lucene to Apache Solr 4. This presentation highlights the major issues and challenges faced in the upgrade, including new implementations of auto-complete and auto-suggest, extending the Solr highlighter, etc. Presenter: Troy D. Thomas

Published in: Technology

  1. From Lucene to Solr 4 Trunk: We Made It! SF Bay Lucene / Solr Meetup, Troy Thomas, 17 Jan 2013. © Synopsys 2013
  2. Lucene to Solr 4 Trunk: Agenda
     • Company - Background
     • Project Inspiration
     • Why Solr 4 – Why Trunk?
     • Architecture (Front to Back)
     • Trunk to Beta
     • Future
     • Demo
     • Q and A
  3. Company - Background: Synopsys – What?
     • Synopsys
       – 25 year old company / 1.8B 2012 revenue
       – Electronic Design Automation (EDA)
       – Electrical engineers design computer chips using Synopsys
       – Verilog, VHDL - High level design
       – Simulation
       – Test
       – Power
       – Place and route
       – IP blocks
       – Nearly every semiconductor built uses Synopsys… microprocessors, RAM, etc.
  4. Company - Background: Synopsys – SolvNet ®
     • SolvNet ® - online knowledge base system used by customers and employees
       – Dedicated engineering team
     • 20 year history
       – 1993 Email
       – 1995 A “Patchy” NCSA Web server + Perl CGI
       – 1997 Verity Netscape Search
       – 2001 Java – Netscape iPlanet Portal + Verity
       – 2005 Apache Lucene
       – 2007 Pure Apache
       – 2012 Solr 4
  5. Lucene: It’s complicated…
     • Moved to Lucene in 2005
       – Custom tokenization helped results
       – Ex: +delay_mode_zero
     • Auto-complete function 2008
       – Yahoo UI Widget
     • Tomcat w/ RMI callback
     • PDF text extraction using PDFBox
     • HTML parser
     • Generate Lucene documents
       – Add to index
     • Separate collections
       – Articles, Docs
  6. Project Inspiration: ApacheCon - Solr
     • Advanced Full-Text Search Capabilities
     • Optimized for High Volume Web Traffic
     • Standards Based Open Interfaces - XML, JSON and HTTP
     • Comprehensive HTML Administration Interfaces
     • Server statistics exposed over JMX for monitoring
     • Scalability - Efficient Replication to other Solr Search Servers
     • Flexible and Adaptable with XML configuration
     • Extensible Plugin Architecture
     • Solr Uses the Lucene Search Library and Extends it!
     • A Real Data Schema, with Numeric Types, Dynamic Fields, Unique Keys
     • Powerful Extensions to the Lucene Query Language
     • Faceted Search and Filtering
     • Advanced, Configurable Text Analysis
     • Highly Configurable and User Extensible Caching
     • Performance Optimizations
     • External Configuration via XML
     • An Administration Interface
     • Monitorable Logging
     • Fast Incremental Updates and Index Replication
     • Highly Scalable Distributed search with sharded index across multiple hosts
     • XML, CSV/delimited-text, and binary update formats
     • Easy ways to pull in data from databases and XML files from local disk and HTTP sources
     • Rich Document Parsing and Indexing (PDF, Word, HTML, etc.) using Apache Tika
     • Multiple search indices
  7. Solr 4: Why?
     • Solr
       – Faceting
       – Modernize GUI
       – Deprecate custom code
       – Auto-complete using Yahoo UI
       – Did you mean?
       – Use Tika for more MIME types
       – ExtractingRequestHandler (Solr Cell)
     • Solr 4 (trunk)
       – DirectSolrSpellChecker
       – More like this
       – Synonym list
       – Save migration
  8. Front-End: Screenshot
  9. Front-End: Research
     • How should we build the new front end?
       – Classic
       – JSF
       – JSP / Servlet (MVC)
       – Leverage framework
       – Apache Velocity
       – SolrJ
       – SolrJS
       – MyFaces
       – Ajax Solr
  10. Front-End: Research
     • Ajax Solr versus SolrJS
       – SolrJS (deprecated)
       – Not fully IE 6, 7, 8 compatible
       – No highlight / sorting support
       – Ajax Solr
       – AbstractFacetWidget methods for faceting
       – AbstractTextWidget
       – PagerWidget for pagination
       – AutoComplete
       – Community weak
  11. Front-End: Ajax Solr
     • Ajax Solr
       – Advantage: Widgets
       – Save settings
       – Auto Complete
       – Query submit
       – Sort / display results
       – Pagination
       – Facet by product
       – Facet by doc type
       – jQuery / JSON friendly
       – Challenges:
       – Session management
       – Proxy solution
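The "facet by product" and "facet by doc type" widgets boil down to a faceted Solr query whose JSON response carries facet counts. A minimal sketch of that round trip follows, assuming hypothetical field names `product` and `doc_type` and a made-up response fragment; it is not the SolvNet or Ajax Solr code. It relies only on Solr's standard `facet`, `facet.field`, and `wt` request parameters and on the default flat `[value, count, ...]` layout of `facet_fields` in JSON responses.

```python
from urllib.parse import urlencode


def build_facet_query(user_query, facet_fields, rows=10):
    """Return the query string for a Solr /select request with field faceting."""
    params = [
        ("q", user_query),
        ("rows", rows),
        ("wt", "json"),      # ask Solr for a JSON response
        ("facet", "true"),   # switch faceting on
    ]
    # One facet.field parameter per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return urlencode(params)


def facet_counts(response, field):
    """Convert Solr's flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Hypothetical response fragment; the values are invented for illustration.
sample_response = {
    "facet_counts": {"facet_fields": {"product": ["ToolA", 12, "ToolB", 3]}}
}

query_string = build_facet_query("clock gating", ["product", "doc_type"])
product_counts = facet_counts(sample_response, "product")
```

A front-end widget would render `product_counts` as clickable links that add a filter to the next request.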
  12. Front-End: Screenshot
  13. Front-End: Ajax Solr – JSON object data in Firebug
  14. Front-End: DirectSolrSpellChecker – Auto Suggest
  15. Front-End: Extend Solr Highlighter
  16. Back-end: Tokenization
     • Carry custom tokenization work forward from Lucene
       – Change functionality
       – operator (ex: +delay_mode_zero)
     • Used the text_rev XML configuration to reverse tokens for the reverse index feature
       – Enables wildcard searching in front of a string
       – *lock* *lock clock*
       – The Apache Solr mailing list community was very helpful
  17. Back-end: Tokenizer – text_rev configuration

     <!-- Similar to fieldtype text except text_rev reverses the characters of
          each token, to enable more efficient leading wildcard queries. -->
     <fieldType name="text_rev" class="solr.TextField" sortMissingLast="true" omitNorms="true">
       <analyzer type="index">
         <tokenizer class="com.synopsys.ies.solr.backend.analysis.standard.SolvNetTokenizerFactory"/>
         <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
         <filter class="com.synopsys.ies.solr.backend.analysis.standard.SolvNetFilterFactory"/>
         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
         <filter class="com.synopsys.ies.solr.backend.analysis.standard.SpecialCharSynonymFilterFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <!-- Disable reverse indexing to save disk space and improve speed! -->
         <!-- filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/-->
       </analyzer>
       <analyzer type="query">
         <tokenizer class="com.synopsys.ies.solr.backend.analysis.standard.SolvNetTokenizerFactory"/>
         <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
         <filter class="com.synopsys.ies.solr.backend.analysis.standard.SolvNetFilterFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
       </analyzer>
     </fieldType>
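The idea behind reversing tokens can be shown in a few lines: a leading-wildcard query such as `*lock` becomes a cheap prefix lookup (`kcol*`) once each term is also stored reversed. The sketch below is a simplified illustration and not how `solr.ReversedWildcardFilterFactory` is implemented: the class name `ReversedTermIndex`, the two separate sorted lists, and the sample terms are all assumptions made for the example. Note too that the configuration on this slide shows the reversing filter commented out.

```python
import bisect


class ReversedTermIndex:
    """Toy term dictionary holding each term both forward and reversed."""

    def __init__(self, terms):
        unique = set(terms)
        self.forward = sorted(unique)
        self.backward = sorted(term[::-1] for term in unique)

    @staticmethod
    def _prefix_range(sorted_terms, prefix):
        # All terms starting with `prefix` sit in one contiguous sorted slice.
        lo = bisect.bisect_left(sorted_terms, prefix)
        hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")
        return sorted_terms[lo:hi]

    def trailing_wildcard(self, prefix):
        """Answer `prefix*` with a prefix scan over the forward terms."""
        return self._prefix_range(self.forward, prefix)

    def leading_wildcard(self, suffix):
        """Answer `*suffix` with a prefix scan over the reversed terms."""
        hits = self._prefix_range(self.backward, suffix[::-1])
        return sorted(term[::-1] for term in hits)


index = ReversedTermIndex(["clock", "lock", "block", "clocking", "delay"])
leading = index.leading_wildcard("lock")     # *lock
trailing = index.trailing_wildcard("clock")  # clock*
```

The trade-off is the one the configuration comment mentions: storing reversed terms costs extra disk space, which is why a deployment may choose to switch the feature off.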
  18. Back-end: Strip out the noise
     • Custom input stream filter – strip out the noise
  19. Back-end: Sharding
     • A different way to shard
       – Many shards mapped to one collection
       – Shards used for easy maintenance (not performance)
       – One shard per documentation version (12 total)
       – One shard for articles
       – One for release notes
       – One shard for internal only articles
       – Full re-index of Articles, Release Notes every few hours
       – Simpler implementation
       – Index Documentation as needed
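Querying many shards as one logical collection can be sketched with Solr's distributed-search `shards` request parameter, which takes a comma-separated list of shard addresses. Everything specific below is an assumption for illustration: the host names, ports, and core names are placeholders, and leaving the internal-only shard out of customer queries is one plausible use of this layout, not something the slides state.

```python
from urllib.parse import urlencode

# Hypothetical shard registry: hosts, ports, and core names are placeholders,
# not the SolvNet deployment.
SHARD_REGISTRY = {
    "articles": "search1:8983/solr/articles",
    "release_notes": "search1:8983/solr/release_notes",
    "internal": "search1:8983/solr/internal",
    "docs_v1": "search2:8983/solr/docs_v1",
    "docs_v2": "search2:8983/solr/docs_v2",
}


def select_shards(registry, include_internal):
    """Return shard addresses, leaving out the internal shard when asked."""
    return [
        address
        for name, address in sorted(registry.items())
        if include_internal or name != "internal"
    ]


def build_distributed_query(user_query, shard_addresses):
    """Build a /select query string that fans out over the given shards."""
    return urlencode(
        {"q": user_query, "wt": "json", "shards": ",".join(shard_addresses)}
    )


# Assumed policy: customer searches skip the internal-only shard.
customer_shards = select_shards(SHARD_REGISTRY, include_internal=False)
query_string = build_distributed_query("timing closure", customer_shards)
```

Because each shard is a separate index, one documentation version can be re-indexed or dropped without touching the others, which matches the "easy maintenance" motivation on this slide.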
  20. Trunk to Beta: Minor annoyance
     • After go live
       – Solr 4 beta shipped
       – Minor changes
       – Tika and ZooKeeper upgraded
       – ContentStreamUpdateRequest.addFile()
       – addFile(File file) became addFile(File file, String contentType)
       – New setLuceneMatchVersion
       – LUCENE_4
       – Added to make unit tests work properly
     • Production remains on Solr 4 beta
       – Will migrate to Solr 4.1 in production mid-year
  21. Future: What remains
     • More tuning
       – Human and machine learning approaches
     • NRT indexing
       – Use article hits to boost results (most popular sort)
       – Leverage article rating data
     • NoSQL-like features
       – Customer profile data
  22. Demo
  23. Special Thanks… Thank you Chris and Erik - ApacheCon 2010
  24. Final Thoughts: Thank you, Lucid Works, for hosting this Meetup and for your commitment to the Apache Community…
  25. Q and A / Contact Me: Questions?
