From Lucene to Solr 4 Trunk


SF Bay Area Lucene / Solr Meetup 17 Jan 2013

Use case: how the SolvNet team migrated from Apache Lucene to Apache Solr 4. This presentation highlighted the major issues and challenges faced in the upgrade, including new implementations of auto-complete and auto-suggest and an extension of the Solr highlighter. Presenter: Troy D. Thomas.

  1. From Lucene to Solr 4 Trunk: We Made It!
     SF Bay Lucene / Solr Meetup
     Troy Thomas, 17 Jan 2013
     © Synopsys 2013
  2. Lucene to Solr 4 Trunk: Agenda
     • Company - Background
     • Project Inspiration
     • Why Solr 4 – Why Trunk?
     • Architecture (Front to Back)
     • Trunk to Beta
     • Future
     • Demo
     • Q and A
  3. Company - Background: Synopsys – What?
     • Synopsys
       – 25 year old company / 1.8B 2012 revenue
       – Electronic Design Automation (EDA)
       – Electrical engineers design computer chips using Synopsys
       – Verilog, VHDL - High level design
       – Simulation
       – Test
       – Power
       – Place and route
       – IP blocks
       – Nearly every semiconductor built uses Synopsys… microprocessors, RAM, etc.
  4. Company Background: Synopsys – SolvNet ®
     • SolvNet ® - online knowledge base system used by customers and employees
       – Dedicated engineering team
     • 20 year history
       – 1993 Email
       – 1995 A “Patchy” NCSA Web server + PERL CGI
       – 1997 Verity Netscape Search
       – 2001 Java – Netscape Iplanet Portal + Verity
       – 2005 Apache Lucene
       – 2007 Pure Apache
       – 2012 Solr 4
  5. Lucene: It’s complicated…
     • Moved to Lucene in 2005
       – Custom tokenization helped results
       – Ex: +delay_mode_zero
     • Auto-complete function 2008
       – Yahoo UI Widget
     • Tomcat w/ RMI callback
     • PDF Text extraction using PDFBox
     • HTML parser
     • Generate Lucene documents
       – Add to index
     • Separate collections
       – Articles, Docs
  6. Project Inspiration: Apachecon - Solr
     • Advanced Full-Text Search Capabilities
     • Optimized for High Volume Web Traffic
     • Standards Based Open Interfaces - XML, JSON and HTTP
     • Comprehensive HTML Administration Interfaces
     • Server statistics exposed over JMX for monitoring
     • Scalability - Efficient Replication to other Solr Search Servers
     • Flexible and Adaptable with XML configuration
     • Extensible Plugin Architecture
     • Solr Uses the Lucene Search Library and Extends it!
     • A Real Data Schema, with Numeric Types, Dynamic Fields, Unique Keys
     • Powerful Extensions to the Lucene Query Language
     • Faceted Search and Filtering
     • Advanced, Configurable Text Analysis
     • Highly Configurable and User Extensible Caching
     • Performance Optimizations
     • External Configuration via XML
     • An Administration Interface
     • Monitorable Logging
     • Fast Incremental Updates and Index Replication
     • Highly Scalable Distributed search with sharded index across multiple hosts
     • XML, CSV/delimited-text, and binary update formats
     • Easy ways to pull in data from databases and XML files from local disk and HTTP sources
     • Rich Document Parsing and Indexing (PDF, Word, HTML, etc) using Apache Tika
     • Multiple search indices
  7. Solr 4: Why?
     • Solr
       – Faceting
       – Modernize GUI
       – Deprecate custom code
       – Auto-complete using Yahoo UI
       – Did you mean?
       – Use Tika for more mime types
       – ExtractingRequestHandler (Solr Cell)
     • Solr 4 (trunk)
       – DirectSolrSpellChecker
       – More like this
       – Synonym list
       – Save migration
  8. Front-End: Screenshot
  9. Front-End: Research
     • How should we build new front-end?
       – Classic
       – JSF
       – JSP / Servlet (MVC)
       – Leverage framework
       – Apache Velocity
       – SolrJ
       – SolrJS
       – Myfaces
       – Ajax Solr
  10. Front-End: Research
      • Ajax Solr versus SolrJS
        – SolrJS (deprecated)
        – not fully IE 6, 7, 8 compatible
        – No highlight / sorting support
        – Ajax Solr
        – AbstractFacetWidget methods for faceting
        – AbstractTextWidget
        – PagerWidget for pagination
        – AutoComplete
        – Community weak
  11. Front-End: Ajax Solr
      • Ajax Solr
        – Advantage: Widgets
        – Save settings
        – Auto Complete
        – Query submit
        – Sort / display results
        – Pagination
        – Facet by product
        – Facet by doc type
        – JQUERY / JSON friendly
        – Challenges:
        – Session management
        – Proxy solution
  12. Front-End: Screenshot
  13. Front-End: Ajax Solr – JSON Object data - Firebug
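Slide 13 shows the JSON that the front-end receives from Solr, as inspected in Firebug; the screenshot itself is not preserved. As a rough illustration of the shape of such a response, the sketch below parses a made-up body and pairs up facet values with their counts. By default Solr's JSON writer returns each facet field as a flat list of alternating values and counts. The field names `product` and `doc_type` and all sample data are assumptions chosen to match "Facet by product" and "Facet by doc type" on slide 11; they are not taken from the SolvNet schema.

```python
import json

# Hypothetical response body. Field names and documents are illustrative
# only; they do not come from the SolvNet index.
SAMPLE = """
{
  "response": {"numFound": 2, "start": 0, "docs": [
    {"id": "a1", "title": "Setting delay modes"},
    {"id": "a2", "title": "Clock gating basics"}
  ]},
  "facet_counts": {"facet_fields": {
    "product": ["ProductA", 2, "ProductB", 0],
    "doc_type": ["article", 1, "release_note", 1]
  }}
}
"""


def facet_pairs(flat):
    """Turn a flat [value, count, value, count, ...] list into (value, count) pairs."""
    return list(zip(flat[0::2], flat[1::2]))


def summarize(body):
    """Return the hit count and a dict of facet field -> list of (value, count)."""
    data = json.loads(body)
    facets = {
        field: facet_pairs(flat)
        for field, flat in data["facet_counts"]["facet_fields"].items()
    }
    return data["response"]["numFound"], facets


if __name__ == "__main__":
    found, facets = summarize(SAMPLE)
    print(found, facets)
```

A facet widget would typically render each pair as a clickable value with its count next to it.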
  14. Front-End: DirectSolrSpellChecker – Auto Suggest
  15. Front-End: Extend Solr Highlighter
  16. Back-end: Tokenization
      • Carry custom tokenization work forward from Lucene
        – Change functionality – operator (ex: +delay_mode_zero)
      • Used text_rev xml configuration to reverse tokens for reverse index feature
        – Enables wildcard searching in front of string
        – *lock* *lock clock*
        – Apache Solr Mailing list community was very helpful
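Slides 5 and 16 mention custom tokenization for terms such as `+delay_mode_zero` but do not show the tokenizer itself. The sketch below is only a guess at the kind of rule involved: it keeps a leading `+` and any underscores as part of the term, so the whole string survives as one searchable token instead of being split or read as a query operator. The regular expression and the `tokenize` function are assumptions for illustration and are not the SolvNetTokenizerFactory logic.

```python
import re

# Assumption: a term is an optional leading '+' followed by word characters
# (letters, digits, underscore). Illustrative rule only, not the real tokenizer.
TERM = re.compile(r"\+?\w+")


def tokenize(text):
    """Lower-case the text and return its terms, keeping '+name_with_underscores' whole."""
    return TERM.findall(text.lower())


if __name__ == "__main__":
    print(tokenize("Run the simulator with +delay_mode_zero"))
```

Under a rule like this, a search for `+delay_mode_zero` can match the literal option name in an article rather than being broken into `delay`, `mode`, and `zero`.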
  17. Back-end: Tokenizer – text_rev configuration

      <!-- Similar to fieldtype text except text_rev reverses the characters of each token, to enable more efficient leading wildcard queries. -->
      <fieldType name="text_rev" class="solr.TextField" sortMissingLast="true" omitNorms="true">
        <analyzer type="index">
          <tokenizer class="com.synopsys.ies.solr.backend.analysis.standard.SolvNetTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
          <filter class="com.synopsys.ies.solr.backend.analysis.standard.SolvNetFilterFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
          <filter class="com.synopsys.ies.solr.backend.analysis.standard.SpecialCharSynonymFilterFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <!-- Disable reverse indexing to save disk space and improve speed! -->
          <!-- filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/-->
        </analyzer>
        <analyzer type="query">
          <tokenizer class="com.synopsys.ies.solr.backend.analysis.standard.SolvNetTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
          <filter class="com.synopsys.ies.solr.backend.analysis.standard.SolvNetFilterFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        </analyzer>
      </fieldType>
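The text_rev configuration on slide 17 refers to reversing tokens so that leading wildcard queries become efficient (the slide also shows the reversing filter commented out to save disk space). The idea can be sketched in a few lines: if every term is also stored reversed, a query such as `*lock` turns into a cheap prefix scan for `kcol` over the sorted reversed terms. This is a simplified illustration of the concept, not Solr's ReversedWildcardFilterFactory implementation; the function names and the sample vocabulary are made up.

```python
from bisect import bisect_left


def build_reversed_index(terms):
    """Return the sorted list of reversed, lower-cased terms."""
    return sorted({t.lower()[::-1] for t in terms})


def leading_wildcard(index, pattern):
    """Resolve a '*suffix' pattern as a prefix scan over the reversed terms."""
    if not pattern.startswith("*") or "*" in pattern[1:]:
        raise ValueError("only a single leading '*' is handled in this sketch")
    prefix = pattern[1:].lower()[::-1]
    # Jump to the first reversed term that could start with the prefix.
    start = bisect_left(index, prefix)
    matches = []
    for rev in index[start:]:
        if not rev.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(rev[::-1])
    return sorted(matches)


if __name__ == "__main__":
    index = build_reversed_index(["clock", "lock", "block", "locked", "delay"])
    print(leading_wildcard(index, "*lock"))
```

The trade-off the slide hints at is visible here too: the reversed terms are a second copy of the vocabulary, which costs extra index space.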
  18. Back-end: Strip out the noise
      • Custom Input Stream Filter – strip out the noise
  19. Back-end: Sharding
      • A different way to shard
        – Many shards mapped to one collection
        – Shards used for easy maintenance (not performance)
        – One shard per documentation version (12 total)
        – One shard for articles
        – One for release notes
        – One shard for internal only articles
        – Full re-index Articles, Release Notes every few hours
        – Simpler implementation
        – Index Documentation – as needed
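Slide 19 describes many shards presented as one collection, with a shard per content type or documentation version. The slides do not say how the shards were wired together, so the following is only one possible way to query such a layout: Solr's distributed search accepts a `shards` request parameter listing the cores to fan the query out to. The host name, port, and core names below are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical host and core names; the slides do not give the real ones.
SHARDS = [
    "search.example.com:8983/solr/articles",
    "search.example.com:8983/solr/release_notes",
    "search.example.com:8983/solr/docs_v1",
]


def distributed_query_url(base_url, query, shards, rows=10):
    """Build a /select URL that asks one core to query every listed shard."""
    params = {
        "q": query,
        "rows": rows,
        "wt": "json",
        "shards": ",".join(shards),  # comma-separated host:port/path entries
    }
    return base_url + "/select?" + urlencode(params)


if __name__ == "__main__":
    print(distributed_query_url(
        "http://search.example.com:8983/solr/articles", "clock", SHARDS))
```

With a layout like this, re-indexing articles or dropping an old documentation version only touches one shard, and the query side changes by editing the shard list.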
  20. Trunk to Beta: Minor annoyance
      • After go live
        – Solr 4 beta shipped
        – Minor changes
        – Tika and Zookeeper upgraded
        – ContentStreamUpdateRequest.addFile()
        – addFile(File file) became addFile(File file, String contentType)
        – New setLuceneMatchVersion
        – LUCENE_4
        – Added to make unit tests work properly
      • Production remains on Solr 4 beta
        – Will migrate to Solr 4.1 production mid year
  21. Future: What remains
      • More tuning
        – Human and machine learning approaches
      • NRT indexing
        – Use article hits to boost results (Most popular sort)
        – Leverage article rating data
      • No SQL like features
        – Customer profile data
  22. Demo
  23. Special Thanks…
      Thank you Chris and Erik - Apachecon 2010
  24. Final Thoughts
      Thank you Lucid Works
      Thank you for hosting this Meetup and your commitment to the Apache Community…
  25. Q and A / Contact Me
      Questions?
