From Lucene to Solr 4 Trunk

                               We Made It!
                       SF Bay Lucene / Solr Meetup

                                Troy Thomas
                                 17 Jan 2013



© Synopsys 2013    1
Lucene to Solr 4 Trunk
 Agenda
 •    Company - Background
 •    Project Inspiration
 •    Why Solr 4 – Why Trunk?
 •    Architecture (Front to Back)
 •    Trunk to Beta
 •    Future
 •    Demo
 •    Q and A




Company - Background
 Synopsys – What?
 • Synopsys – 25-year-old company / $1.8B revenue in 2012
      – Electronic Design Automation (EDA)
      – Electrical engineers design computer chips using Synopsys
           – Verilog, VHDL - High level design
           – Simulation
           – Test
           – Power
           – Place and route
           – IP blocks


      – Nearly every semiconductor built uses Synopsys…
        microprocessors, RAM, etc.



Company - Background
 Synopsys – SolvNet ®
 • SolvNet ® - online knowledge base system used by
   customers and employees
      – Dedicated engineering team
 • 20-year history
      –    1993 Email
      –    1995 A “Patchy” NCSA Web server + Perl CGI
      –    1997 Verity Netscape Search
      –    2001 Java – Netscape iPlanet Portal + Verity
      –    2005 Apache Lucene
      –    2007 Pure Apache
      –    2012 Solr 4



Lucene
 It’s complicated…
 • Moved to Lucene in 2005
      – Custom tokenization helped results
           – Ex: +delay_mode_zero
 • Auto-complete function added in 2008
      – Yahoo UI Widget
 •    Tomcat w/ RMI callback
 •    PDF Text extraction using PDFBox
 •    HTML parser
 •    Generate Lucene documents
      – Add to index
 • Separate collections – Articles, Docs
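A minimal Java sketch of why custom tokenization matters for a term like +delay_mode_zero: in Lucene's classic query syntax a leading + is the "required" operator, so a literal option name has to survive as a single token to be searchable. The SolvNet tokenizer itself is not shown in this deck; the class name, the regular expression, and the lowercasing below are illustrative assumptions only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the SolvNet tokenizer.
final class OptionAwareTokenizerSketch {

    // A token is a run of word characters (letters, digits, '_'),
    // optionally preceded by '+' when the '+' does not follow a word character.
    private static final Pattern TOKEN = Pattern.compile("(?<!\\w)\\+?\\w+");

    /** Splits text into lowercase tokens, keeping a leading '+' attached. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Use +delay_mode_zero with the simulator"));
    }
}
```

With this rule, "+delay_mode_zero" is kept as one token, while the "+" in something like "a+b" is treated as a separator.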

Project Inspiration
 ApacheCon - Solr
 •    Advanced Full-Text Search Capabilities
 •    Optimized for High Volume Web Traffic
 •    Standards Based Open Interfaces - XML, JSON and HTTP
 •    Comprehensive HTML Administration Interfaces
 •    Server statistics exposed over JMX for monitoring
 •    Scalability - Efficient Replication to other Solr Search Servers
 •    Flexible and Adaptable with XML configuration
 •    Extensible Plugin Architecture
 •    Solr Uses the Lucene Search Library and Extends it!
 •    A Real Data Schema, with Numeric Types, Dynamic Fields, Unique Keys
 •    Powerful Extensions to the Lucene Query Language
 •    Faceted Search and Filtering
 •    Advanced, Configurable Text Analysis
 •    Highly Configurable and User Extensible Caching
 •    Performance Optimizations
 •    External Configuration via XML
 •    An Administration Interface
 •    Monitorable Logging
 •    Fast Incremental Updates and Index Replication
 •    Highly Scalable Distributed search with sharded index across multiple hosts
 •    XML, CSV/delimited-text, and binary update formats
 •    Easy ways to pull in data from databases and XML files from local disk and HTTP sources
 •    Rich Document Parsing and Indexing (PDF, Word, HTML, etc.) using Apache Tika
 •    Multiple search indices

Solr 4
 Why?
 • Solr
      – Faceting
      – Modernize GUI
      – Deprecate custom code
           – Auto-complete using Yahoo UI
           – Did you mean?
      – Use Tika for more MIME types
           – ExtractingRequestHandler (Solr Cell)
 • Solr 4 (trunk)
      –    DirectSolrSpellChecker
      –    More like this
      –    Synonym list
      –    Save migration

Front-End
 Screenshot




Front-End
 Research
 • How should we build new front-end?
      – Classic
           – JSF
           – JSP / Servlet (MVC)
      – Leverage framework
           – Apache Velocity
           – SolrJ
           – SolrJS
           – MyFaces
           – Ajax Solr




Front-End
 Research
 • Ajax Solr versus SolrJS
      – SolrJS (deprecated)
           – Not fully IE 6, 7, 8 compatible
           – No highlight / sorting support
      – Ajax Solr
           – AbstractFacetWidget methods for faceting
           – AbstractTextWidget
           – PagerWidget for pagination
           – AutoComplete
           – Community weak




Front-End
 Ajax Solr
 • Ajax Solr
      – Advantage: Widgets
           – Save settings
           – Auto Complete
           – Query submit
           – Sort / display results
           – Pagination
           – Facet by product
           – Facet by doc type
           – jQuery / JSON friendly
      – Challenges:
           – Session management
           – Proxy solution

Front-End
 Screenshot




Front-End
 Ajax Solr – JSON Object data - Firebug




Front-End
 DirectSolrSpellChecker – Auto Suggest




Front-End
 Extend Solr Highlighter




Back-end
 Tokenization
 • Carry custom tokenization work forward from Lucene
      – Change operator functionality (ex: +delay_mode_zero)
 • Used the text_rev XML configuration to reverse tokens for the
   reverse-index feature
      – Enables leading-wildcard searches (wildcard in front of the string)
      – *lock* *lock clock*
      – Apache Solr mailing list community was very helpful
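The idea behind reversed tokens can be sketched in plain Java: if each token is also stored reversed, a leading-wildcard query such as *lock becomes a cheap prefix scan over the reversed terms. This is only an illustration of the principle, not Solr's ReversedWildcardFilterFactory; the class and method names are hypothetical, and only a single leading * is handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical sketch of a reversed-token index; not Solr code.
final class ReversedTokenIndex {

    // Sorted set of reversed, lowercased tokens.
    private final TreeSet<String> reversed = new TreeSet<>();

    void add(String token) {
        reversed.add(reverse(token.toLowerCase(Locale.ROOT)));
    }

    /** Answers a query of the form "*suffix" with a prefix scan on reversed tokens. */
    List<String> matchLeadingWildcard(String query) {
        if (query.length() < 2 || query.charAt(0) != '*') {
            throw new IllegalArgumentException("expected a query like *lock");
        }
        String prefix = reverse(query.substring(1).toLowerCase(Locale.ROOT));
        List<String> hits = new ArrayList<>();
        // Walk the sorted set from the prefix until entries stop matching it.
        for (String candidate : reversed.tailSet(prefix, true)) {
            if (!candidate.startsWith(prefix)) {
                break;
            }
            hits.add(reverse(candidate));
        }
        return hits;
    }

    private static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public static void main(String[] args) {
        ReversedTokenIndex index = new ReversedTokenIndex();
        for (String token : new String[] {"clock", "lock", "block", "clocks", "delay"}) {
            index.add(token);
        }
        System.out.println(index.matchLeadingWildcard("*lock"));
    }
}
```

The trade-off is the one noted in the configuration comment on the next slide: storing reversed tokens costs extra index space.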




Back-end
 Tokenizer – text_rev configuration
 <!-- Similar to fieldtype text except text_rev reverses the characters of
      each token, to enable more efficient leading wildcard queries. -->
 <fieldType name="text_rev" class="solr.TextField" sortMissingLast="true" omitNorms="true">
   <analyzer type="index">
     <tokenizer class="com.synopsys.ies.solr.backend.analysis.standard.SolvNetTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
     <filter class="com.synopsys.ies.solr.backend.analysis.standard.SolvNetFilterFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
     <filter class="com.synopsys.ies.solr.backend.analysis.standard.SpecialCharSynonymFilterFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <!-- Reverse indexing disabled to save disk space and improve speed! -->
     <!-- filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
          maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/-->
   </analyzer>
   <analyzer type="query">
     <tokenizer class="com.synopsys.ies.solr.backend.analysis.standard.SolvNetTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
     <filter class="com.synopsys.ies.solr.backend.analysis.standard.SolvNetFilterFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
   </analyzer>
 </fieldType>
Back-end
 Strip out the noise
 Custom Input Stream Filter – strip out the noise




Back-end
 Sharding
 • A different way to shard
      – Many shards mapped to one collection
      – Shards used for easy maintenance (not performance)
           – One shard per documentation version (12 total)
           – One shard for articles
           – One shard for release notes
           – One shard for internal-only articles
      – Full re-index of Articles and Release Notes every few hours
           – Simpler implementation
      – Index Documentation – as needed
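One way to picture "many shards, one logical collection" is a distributed query that names every shard in Solr's shards request parameter. The sketch below only builds such a URL. The host, the core names (docs_v1, articles, and so on), and the choice to make the internal-only shard optional are assumptions for illustration; the deck does not say whether the production setup listed shards explicitly or relied on SolrCloud routing.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; host and core names are invented for illustration.
final class ShardedQuerySketch {

    /** One shard per documentation version, plus articles and release notes. */
    static List<String> shardList(String solrBase, int docVersions, boolean includeInternal) {
        List<String> shards = new ArrayList<>();
        for (int v = 1; v <= docVersions; v++) {
            shards.add(solrBase + "/docs_v" + v);
        }
        shards.add(solrBase + "/articles");
        shards.add(solrBase + "/release_notes");
        if (includeInternal) {
            shards.add(solrBase + "/internal_articles");
        }
        return shards;
    }

    /** Builds a select URL whose shards parameter fans the query out to every shard. */
    static String selectUrl(String solrBase, String query, List<String> shards) {
        return "http://" + solrBase + "/articles/select"
                + "?q=" + encode(query)
                + "&shards=" + encode(String.join(",", shards));
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String base = "solr.example.com:8983/solr"; // hypothetical host
        List<String> shards = shardList(base, 12, false);
        System.out.println(selectUrl(base, "delay_mode_zero", shards));
    }
}
```

Because each shard is an independent index, a single documentation version can be rebuilt or dropped without touching the others, which matches the maintenance motivation on the slide.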




Trunk to Beta
 Minor annoyance
 • After go-live – Solr 4 beta shipped
      – Minor changes
      – Tika and ZooKeeper upgraded
      – ContentStreamUpdateRequest.addFile()
           – addFile(File file) became addFile(File file, String contentType)
      – New setLuceneMatchVersion
           – LUCENE_4
           – Added to make unit tests work properly
 • Production remains on Solr 4 beta
      – Will migrate production to Solr 4.1 mid-year




Future
 What remains
 • More tuning
      – Human and machine learning approaches
 • NRT indexing
      – Use article hits to boost results (Most popular sort)
      – Leverage article rating data
 • NoSQL-like features
      – Customer profile data




Demo




Special Thanks…
 Thank you, Chris and Erik - ApacheCon 2010




Final Thoughts
 Thank you, LucidWorks
 Thank you for hosting this Meetup and your commitment to
 the Apache Community…




Q and A / Contact Me

  Questions?




