From Lucene to Solr 4 Trunk

We Made It!
SF Bay Lucene / Solr Meetup

Troy Thomas
17 Jan 2013

© Synopsys 2013 1

Lucene to Solr 4 Trunk
Agenda
• Company - Background
• Project Inspiration
• Why Solr 4 – Why Trunk?
• Architecture (Front to Back)
• Trunk to Beta
• Future
• Demo
• Q and A

Company - Background
Synopsys – What?
• Synopsys – 25-year-old company / $1.8B revenue in 2012
– Electronic Design Automation (EDA)
– Electrical engineers design computer chips using Synopsys
– Verilog, VHDL - High level design
– Simulation
– Test
– Power
– Place and route
– IP blocks

– Nearly every semiconductor built uses Synopsys: microprocessors, RAM, etc.

Company - Background
Synopsys – SolvNet ®
• SolvNet ® - online knowledge base system used by
customers and employees
– Dedicated engineering team
• 20 year history
– 1993 Email
– 1995 A “Patchy” NCSA Web server + Perl CGI
– 1997 Verity Netscape Search
– 2001 Java – Netscape iPlanet Portal + Verity
– 2005 Apache Lucene
– 2007 Pure Apache
– 2012 Solr 4

Lucene
It’s complicated…
• Moved to Lucene in 2005
– Custom tokenization helped results
– Ex: +delay_mode_zero
• Auto-complete function 2008
– Yahoo UI Widget
• Tomcat w/ RMI callback
• PDF Text extraction using PDFBox
• HTML parser
• Generate Lucene documents
– Add to index
• Separate collections – Articles, Docs
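
The custom tokenization point can be illustrated with a small sketch. It assumes the goal was to keep an EDA option such as +delay_mode_zero as a single searchable token, leading "+" included, instead of letting a general-purpose tokenizer strip the punctuation. The class name and the regular expression are hypothetical; this is not the SolvNet tokenizer itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the tokenizer used in production.
class PlusArgTokenizerSketch {
    // Assumed rule: an optional leading '+', then letters, digits, or underscores.
    private static final Pattern TOKEN = Pattern.compile("\\+?[A-Za-z0-9_]+");

    // Splits text into lowercase tokens, keeping "+name_with_underscores" intact.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [use, +delay_mode_zero, with, the, simulator]
        System.out.println(tokenize("Use +delay_mode_zero with the simulator"));
    }
}
```

Keeping the "+" inside the token means the same rule has to be applied at query time, otherwise the query parser would read it as the "required" operator.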

Project Inspiration
ApacheCon - Solr
• Advanced Full-Text Search Capabilities
• Optimized for High Volume Web Traffic
• Standards Based Open Interfaces - XML, JSON and HTTP
• Comprehensive HTML Administration Interfaces
• Server statistics exposed over JMX for monitoring
• Scalability - Efficient Replication to other Solr Search Servers
• Flexible and Adaptable with XML configuration
• Extensible Plugin Architecture
• Solr Uses the Lucene Search Library and Extends it!
• A Real Data Schema, with Numeric Types, Dynamic Fields, Unique Keys
• Powerful Extensions to the Lucene Query Language
• Faceted Search and Filtering
• Advanced, Configurable Text Analysis
• Highly Configurable and User Extensible Caching
• Performance Optimizations
• External Configuration via XML
• An Administration Interface
• Monitorable Logging
• Fast Incremental Updates and Index Replication
• Highly Scalable Distributed search with sharded index across multiple hosts
• XML, CSV/delimited-text, and binary update formats
• Easy ways to pull in data from databases and XML files from local disk and HTTP sources
• Rich Document Parsing and Indexing (PDF, Word, HTML, etc) using Apache Tika
• Multiple search indices

Solr 4
Why?
• Solr
– Faceting
– Modernize GUI
– Deprecate custom code
– Auto-complete using Yahoo UI
– Did you mean?
– Use Tika for more mime types
– ExtractingRequestHandler (Solr Cell)
• Solr 4 (trunk)
– DirectSolrSpellChecker
– More like this
– Synonym list
– Save migration

Front-End
Screenshot

Front-End
Research
• How should we build new front-end?
– Classic
– JSF
– JSP / Servlet (MVC)
– Leverage framework
– Apache Velocity
– SolrJ
– SolrJS
– MyFaces
– Ajax Solr

Front-End
Research
• Ajax Solr versus SolrJS
– SolrJS (deprecated)
  – Not fully IE 6, 7, 8 compatible
  – No highlight / sorting support
– Ajax Solr
  – AbstractFacetWidget methods for faceting
  – AbstractTextWidget
  – PagerWidget for pagination
  – AutoComplete
  – Community weak

Front-End
Ajax Solr
• Ajax Solr
– Advantage: Widgets
– Save settings
– Auto Complete
– Query submit
– Sort / display results
– Pagination
– Facet by product
– Facet by doc type
– jQuery / JSON friendly
– Challenges:
– Session management
– Proxy solution
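
The widgets listed above ultimately send plain HTTP requests to Solr and read back JSON. A minimal sketch of the query string behind "facet by product" and "facet by doc type" follows. The parameters q, start, rows, facet, facet.field and wt are standard Solr request parameters; the field names product and doctype and the class name are assumptions for illustration, not the actual SolvNet schema.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical illustration of the request a facet widget might produce.
class FacetQuerySketch {
    // URL-encodes one parameter value as UTF-8.
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Builds the query string for one result page with two facet fields.
    static String buildQuery(String userQuery, int page, int rowsPerPage) {
        StringBuilder sb = new StringBuilder();
        sb.append("q=").append(enc(userQuery));
        sb.append("&start=").append(page * rowsPerPage); // offset used for pagination
        sb.append("&rows=").append(rowsPerPage);
        sb.append("&facet=true");
        sb.append("&facet.field=").append(enc("product")); // assumed field name
        sb.append("&facet.field=").append(enc("doctype")); // assumed field name
        sb.append("&wt=json"); // JSON response for the JavaScript widgets
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints q=clock+gating&start=20&rows=10&facet=true&facet.field=product&facet.field=doctype&wt=json
        System.out.println(buildQuery("clock gating", 2, 10));
    }
}
```

With a proxy in front of Solr, the browser would send this query string to the proxy, which can check the user's session before forwarding the request.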

Front-End
Screenshot

Front-End
Ajax Solr – JSON Object data - Firebug

Front-End
DirectSolrSpellChecker – Auto Suggest

Front-End
Extend Solr Highlighter

Back-end
Tokenization
• Carry custom tokenization work forward from Lucene
– Change functionality – operator (ex: +delay_mode_zero)
• Used text_rev xml configuration to reverse tokens for
reverse index feature
– Enables wildcard searching in front of string
– *lock* *lock clock*
– Apache Solr Mailing list community was very helpful
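
Why reversing tokens helps with a wildcard at the front of a string: a query like *lock is expensive against a sorted term list, but against reversed terms it becomes the cheap prefix lookup kcol*. The sketch below shows only that idea with an in-memory sorted set. In stock Solr this job is normally done by solr.ReversedWildcardFilterFactory; the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration of a reversed-term index; not Solr's implementation.
class ReversedIndexSketch {
    // Terms are stored reversed and kept in sorted order.
    private final TreeSet<String> reversedTerms = new TreeSet<>();

    void add(String term) {
        reversedTerms.add(new StringBuilder(term).reverse().toString());
    }

    // Answers a leading-wildcard query such as "*lock" with a prefix scan.
    List<String> endsWith(String suffix) {
        String prefix = new StringBuilder(suffix).reverse().toString();
        List<String> hits = new ArrayList<>();
        for (String rev : reversedTerms.tailSet(prefix, true)) {
            if (!rev.startsWith(prefix)) {
                break; // sorted order: no later term can share the prefix
            }
            hits.add(new StringBuilder(rev).reverse().toString());
        }
        return hits;
    }

    public static void main(String[] args) {
        ReversedIndexSketch index = new ReversedIndexSketch();
        index.add("clock");
        index.add("block");
        index.add("lock");
        index.add("delay");
        // prints [lock, block, clock]
        System.out.println(index.endsWith("lock"));
    }
}
```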

Back-end
Tokenizer – text_rev configuration

<fieldType name="text_rev" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="com.synopsys.ies.solr.backend.analysis.standard.SolvNetTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="com.synopsys.ies.solr.backend.analysis.standard.SolvNetFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="com.synopsys.ies.solr.backend.analysis.standard.SpecialCharSynonymFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="com.synopsys.ies.solr.backend.analysis.standard.SolvNetTokenizerFactory"/>
    <filter class="com.synopsys.ies.solr.backend.analysis.standard.SolvNetFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Back-end
Sharding
• A different way to shard
– Many shards mapped to one collection
– Shards used for easy maintenance (not performance)
– One shard per documentation version (12 total)
– One shard for articles
– One for release notes
– One shard for internal only articles
– Full re-index Articles, Release Notes every few hours
– Simpler implementation
– Index Documentation – as needed
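
One way to search such a layout as a single logical index is Solr's distributed search, where the shards request parameter carries a comma-separated list of shard addresses in host:port/solr/core form. The slides do not say whether this parameter or SolrCloud routing was used, so the sketch below is an assumption. The host and core names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of building the "shards" parameter value.
class ShardParamSketch {
    // Joins one address per core, e.g. "localhost:8983/solr/articles,...".
    static String shardsParam(String hostAndPort, List<String> cores) {
        StringBuilder sb = new StringBuilder();
        for (String core : cores) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(hostAndPort).append("/solr/").append(core);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed core names: one per content type or documentation version.
        List<String> cores = Arrays.asList("articles", "release_notes", "docs_2012_12");
        // prints shards=localhost:8983/solr/articles,localhost:8983/solr/release_notes,localhost:8983/solr/docs_2012_12
        System.out.println("shards=" + shardsParam("localhost:8983", cores));
    }
}
```

Because each content type lives in its own shard, a full re-index of articles or release notes can replace one shard without touching the documentation shards.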

Trunk to Beta
Minor annoyance
• After go-live – Solr 4 beta shipped
– Minor changes
– Tika and ZooKeeper upgraded
– ContentStreamUpdateRequest.addFile()
– addFile(File file) became addFile(File file, String contentType)
– New setLuceneMatchVersion
– LUCENE_4
– Added to make unit tests work properly
• Production remains on Solr 4 beta
– Will migrate production to Solr 4.1 mid-year

Future
What remains
• More tuning
– Human and machine learning approaches
• NRT indexing
– Use article hits to boost results (Most popular sort)
– Leverage article rating data
• NoSQL-like features
– Customer profile data

Final Thoughts
Thank you Lucid Works
Thank you for hosting this Meetup and your commitment to
the Apache Community…
