Letting In the Light: Using Solr as an External Search Component
 


* Jay Luker, IT Specialist, ADS, jluker@cfa.harvard.edu
* Benoit Thiell, software developer, ADS, bthiell@cfa.harvard.edu

Code4Lib 2011, Tuesday 8 February, 14:30 - 14:50

It’s well-established that Solr provides an excellent foundation for building a faceted search engine. But what if your application’s foundation has already been constructed? How do you add Solr as a federated, fulltext search component to an existing system that already provides a full set of well-crafted scoring and ranking mechanisms?

This talk will describe a work-in-progress project at the Smithsonian/NASA Astrophysics Data System to migrate its aging search platform to Invenio, an open-source institutional repository and digital library system originally developed at CERN, while at the same time incorporating Solr as an external component for both faceting and fulltext search.

In this presentation we'll start with a short introduction of Invenio and then move on to the good stuff: an in-depth exploration of our use of Solr. We'll explain the challenges that we faced, what we learned about some particular Solr internals, interesting paths we chose not to follow, and the solutions we finally developed, including the creation of custom Solr request handlers and query parser classes.

This presentation will be quite technical and will show a measure of horrible Java code. Benoit will probably run away during that part.

Usage Rights

CC Attribution License

  • The SAO/NASA Astrophysics Data System (ADS) is a Digital Library portal for researchers in Astronomy and Physics, operated by the Smithsonian Astrophysical Observatory (SAO) under a NASA grant.
  • 1994 was the move to the web
  • Astronomy: 1.8M. Physics: 5.8M. arXiv e-prints: 650K. Citations: 40M (over 3.4M papers with citations). Curated links: 23M (fulltext, data products, citations). 4M scanned pages, 625K articles. 650K pages of historical material. Advanced search allows for searching by astronomical object (via SIMBAD) and attributes like “has dataset”. TWITA = The Website Is The API: via the data_type= param, also structured metadata within the pages.
  • INSPIRE: Invenio for SPIRES, the Physics database at Stanford.
  • Obviously, performance was also an objective. The Invenio team had been skeptical of the necessity of incorporating an external tool/service to do fulltext indexing and/or faceting, but once introduced to Solr they quickly came around. In spite of the fact that at least some of the fancypants sorting, ranking, and filtering functionality could most likely be reproduced using Solr, there was a strong reluctance to rewrite that code. Writing as little Java as possible doesn't just come from a Java-phobic frame of mind; it's also about limiting how much we rely on custom Solr components: rely as much as possible on what Solr affords. Loose integration in this case means the ability to swap in alternate services for retrieving fulltext search results and facets. More on how we succeeded in that towards the end.
  • When we talk about the ids being sent back and forth between Invenio & Solr we are talking about the schema ids.
  • So what's going on here? Our first thought was maybe it was the time needed to serialize/de-serialize the response, but that turned out not to be it.
  • queryResultMaxDocsCached, queryResultWindowSize, enableLazyFieldLoading (solrconfig.xml settings)
  • No need to specify number of rows or which fields to return
  • Post-processing = second-order searching, filtering. We can't retrieve facets with the initial query because the final list of search results will depend on Invenio post-processing. So how do you send a very large set of ids to get a set of facet results?
  • Satisfies most objectives. We get searching and faceting. We don't have to write a lot of Python or Java: Invenio needs the indexing piece. We're not duplicating anything that Invenio already does very well. It's loosely coupled: because communication is in a form that is native to Invenio, we could easily swap different services in or out for either piece.
  • Seems like a lot, but in total lines of code it's not that much, especially considering it's in Java. Plus I suck at Java and I was able to do it all in 2-3 weeks of trial and error hacking. Plus, it all very closely conforms to the affordances of the Solr API. Only one small thing that might be considered a “hack”.
  • Defining our custom query component and telling the default Solr search handler to use it. Also defining our custom response writer.
  • A query component class has two opportunities to interact with the incoming request: prepare & process. We only need process.
  • These times include decompressing and unmarshalling the bitset into an Invenio intbitset object in Python.
  • PyLucene is a Python wrapper around Java Lucene. It embeds a Java VM with Lucene into a Python process. The extension is machine-generated with JCC, a C++ code generator that makes it possible to call into Java classes from Python via the Java Native Interface (JNI).

Presentation Transcript

  • Letting In The Light Using Solr as an External Search Component Jay Luker Benoit Thiell SAO/NASA Astrophysics Data System http://adsabs.harvard.edu/
    • Here's what to expect...
    • Overview of ADS
    • Overview of Invenio
    • Our Solr-Invenio Integration Project
    • A few tips on Solr hacking along the way
    • The ADS Project
    • Established in 1989 (before the web!) as a portal for accessing astronomical data and bibliographic metadata
    • Was restructured in 1994 to become an A&I service for astronomers and astrophysicists, with fulltext archive
    • Has 100% penetration in astronomical community, with take-up in other areas of space sciences, engineering and physics
    • ADS Holdings
    • Almost 9M bibliographic metadata records
    • 625K fulltext articles
    • Painstakingly curated collection of citations and links to fulltext and data products
    • ADS Services
    • Free!
    • Search, Browse, Notifications, Personalization
    • API access to all content (TWITA)
    • Network of 12 mirror sites
    • ADS Labs: http://labs.adsabs.harvard.edu
  • Never heard of Invenio?
    • 1993: Started its life at CERN as a preprint server
    • 2000: Extension of the server to allow storing multimedia content (photos, posters, brochures, videos) and creation of the open-source CDSware project
    • Renamed CDS Invenio and then Invenio
    • Both an institutional repository and a digital library
    • Check it out! -> http://invenio-software.org/
  • Why choose Invenio?
    • ADS and Invenio share the same objectives: store and disseminate information to scientific communities
    • Growing penetration in the field of physics
    • Metadata curation tools (record editor, merger)
    • Support for citation graphs and citation-based searches
    • Support for second-order searches
  • Under the hood
    • Written in Python, mod_wsgi, some C and Lisp
    • Coupled with MySQL only (for now)
    • Scales to sets of 2M+ records
    • MARC storage of records
    • Modular architecture with:
      • OAI harvesting, OAI server
      • Format conversion (MARCXML, DC, NLM, etc)
      • References and citations handler
      • Plot and figure extraction
  • invenio.intbitset
    • Sets of Invenio record IDs (MARC controlfield 001)
    • In-house C implementation of Python sets
    • Fast dumping and loading marshalling functions
    • Stored marshalled in the database and used as such in the search engine
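
The idea of a record-ID set kept as bits and stored in marshalled form can be sketched in pure Python. This is a hypothetical stand-in, not Invenio's C implementation: the class name `RecordIdSet`, its `dump`/`load` methods, and the serialized layout (zlib over little-endian bits) are all assumptions made for illustration.

```python
import zlib


class RecordIdSet:
    """Hypothetical pure-Python stand-in for a bitset of record IDs."""

    def __init__(self, ids=()):
        self._bits = 0  # bit n is set when record ID n is a member
        for record_id in ids:
            self.add(record_id)

    def add(self, record_id):
        self._bits |= 1 << record_id

    def __contains__(self, record_id):
        return (self._bits >> record_id) & 1 == 1

    def __and__(self, other):
        # Set intersection is a single bitwise AND.
        result = RecordIdSet()
        result._bits = self._bits & other._bits
        return result

    def __iter__(self):
        bits, record_id = self._bits, 0
        while bits:
            if bits & 1:
                yield record_id
            bits >>= 1
            record_id += 1

    def dump(self):
        """Serialize to compressed bytes (assumed layout, see lead-in)."""
        raw = self._bits.to_bytes((self._bits.bit_length() + 7) // 8, "little")
        return zlib.compress(raw)

    @classmethod
    def load(cls, payload):
        result = cls()
        result._bits = int.from_bytes(zlib.decompress(payload), "little")
        return result


hits = RecordIdSet([1, 5, 16, 84])
restored = RecordIdSet.load(hits.dump())
```

Because the serialized form is just the bits, a set loaded from storage can be intersected with another set without rebuilding any per-element structure.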
  • Invenio sounds great! Why use Solr then?
    • Invenio's search engine has trouble with 9M+ records (work-in-progress)
    • Invenio's indexing is slow by design (it favors search speed), but it is too slow for such a large repository
    • Solr has a wide community of users/developers and lots of extensions.
  • Issues with the integration
    • Keeping the metadata on both systems in sync
    • Invenio's search engine requires full sets of results
    • Communicate over HTTP with very large payloads
  • Invenio + Solr
  •  
    • Objectives
    • Take advantage of Solr fulltext indexing & searching
    • Take advantage of Solr faceting
    • Not duplicate existing Invenio functionality
    • Write as little code as possible
    • Keep things loosely coupled
  • Problem #1 Retrieving very large result set of ids. Like, millions.
  • The WTH Approach
    • http://myhost:8983/solr/select?q={foo}&fl=id&rows={n}
    • q={foo}: query for foo
    • fl=id: only return the id field
    • rows={n}: return n rows of the result
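
A minimal sketch of building that request from Python, using only the standard library. The host and port are the placeholder from the slide; the function name `naive_id_query_url` is made up for this example.

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://myhost:8983/solr/select"  # placeholder host from the slide


def naive_id_query_url(query, rows):
    """Build the 'return every id' request: only the id field, `rows` results."""
    params = {"q": query, "fl": "id", "rows": rows}
    return SOLR_SELECT + "?" + urlencode(params)


url = naive_id_query_url("black holes", 1000000)
```

The point of the sketch is how little the client controls here: the only lever is `rows`, so retrieving millions of ids means asking for millions of rows.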
  • (A bit about “ids”)
    • Schema ids
      • Defined in your schema.xml
      • Can be integers, strings, etc.
      • Typically set as the <uniqueKey>
    • Lucene ids
      • Internal to Lucene
      • Always integers
      • Unique within an index segment
  • The WTH Approach [chart: response times in seconds; warmed cache, different servers, same LAN]
  • So what's going on here? [diagram: Query, QueryResult [1,5,16,84,...], document cache, Lucene Doc (id: 1234, bibcode: <lazy>, Title: <lazy>, ...), Response]
  • Solution: Custom Collector [diagram: Query, QueryResult [1,5,16,84,...], Response]
  • Solution: Custom Collector
    MyQueryComponent.java:
        ...
        InvenioIdCollector collector = new InvenioIdCollector();
        searcher.search(query, collector);
        ArrayList<Integer> ids = collector.getIds();
        rsp.add("ids", ids);
        ...
    MyCollector.java:
        ...
        ArrayList<Integer> ids = new ArrayList<Integer>();
        ...
        public void collect(int doc) {
            this.ids.add(this.idMap[doc]);
        }
        ...
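
Why a collector helps can be illustrated with a toy model in Python. Everything here is hypothetical (`FakeIndex`, `IdCollector`, the sample ids); it only mirrors the pattern the slides describe: the searcher calls `collect()` once per hit, and the collector reads the schema id from a pre-loaded array instead of fetching the stored document.

```python
class FakeIndex:
    """Toy index: stored documents plus a pre-loaded array of schema ids."""

    def __init__(self, schema_ids):
        self.stored_docs = [{"id": sid, "title": "..."} for sid in schema_ids]
        self.id_map = list(schema_ids)  # doc number -> schema id, like idMap
        self.doc_loads = 0

    def load_document(self, doc):
        self.doc_loads += 1  # stands in for an expensive stored-field fetch
        return self.stored_docs[doc]

    def search(self, matches, collector):
        for doc in matches:  # the searcher calls collect() once per hit
            collector.collect(doc)


class IdCollector:
    def __init__(self, id_map):
        self.id_map = id_map
        self.ids = []

    def collect(self, doc):
        self.ids.append(self.id_map[doc])  # array lookup, no document load


index = FakeIndex([1234, 5678, 9012, 3456])
matches = [0, 2, 3]

# Slow path: fetch every matching document just to read its id.
slow_ids = [index.load_document(doc)["id"] for doc in matches]
loads_after_slow_path = index.doc_loads

# Collector path: same ids, no additional document loads.
collector = IdCollector(index.id_map)
index.search(matches, collector)
```

Both paths produce the same ids; only the first one touches the stored documents.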
  • OK, Let's Try This Again
    • http://myhost:8983/solr/select?q={foo}&qt=my_querytype
    • q={foo}: query for foo
    • qt=my_querytype: use our custom query handler
  • Better. But ...
  • Problem #2 Facets.
  • What's Missing? [diagram: Invenio (Query Processing, Post-processing, Return/Render) and Solr, exchanging: Fulltext Search, Record ids]
  • Again, WTH? [diagram: Invenio (Query Processing, Post-processing, Return/Render) and Solr, exchanging: Fulltext Search, Record ids, Record ids?, Facets]
  • Current Solution [diagram: Invenio (Query Processing, Post-processing, Return/Render) and Solr, exchanging: Fulltext Search, Invenio BitSet, Invenio BitSet, Facets]
    • Parts Required
    • Custom QueryComponent for accepting fulltext search query and returning an Integer BitSet
    • Custom Collector to collect doc ids
    • Custom BitSet class (maybe)
    • Custom BinaryResponseWriter
    • Custom QueryComponent for accepting an Integer BitSet query and returning facets
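
The overall flow these parts support (fulltext query out, id set back, local post-processing, id set out, facets back) might be sketched as follows. This is an assumption-laden outline, not the project's code: the function names are invented, the services are stubs, and Python sets stand in for bitsets.

```python
def run_search(query, fulltext_search, post_process, fetch_facets):
    """Two round trips to the search service with local post-processing between."""
    candidate_ids = fulltext_search(query)   # round trip 1: query -> id set
    final_ids = post_process(candidate_ids)  # second-order search, filtering
    facets = fetch_facets(final_ids)         # round trip 2: id set -> facets
    return final_ids, facets


# Stub services; any backend exposing the same two calls could be swapped in.
def stub_fulltext_search(query):
    return {1, 5, 16, 84}


def stub_post_process(ids):
    return {i for i in ids if i > 4}  # pretend filter applied locally


def stub_fetch_facets(ids):
    return {"result_count": len(ids)}


final_ids, facets = run_search(
    "dark matter", stub_fulltext_search, stub_post_process, stub_fetch_facets
)
```

Passing the two services in as callables is one way to keep the coupling loose: the caller only depends on "query in, id set out" and "id set in, facets out".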
  • Invenio Query Component Config
    solrconfig.xml:
        <searchComponent name="invenio_query"
                         class="org.ads.solr.InvenioQueryComponent" />
        <requestHandler name="invenio_query" class="solr.SearchHandler">
          <lst name="defaults">
            <str name="wt">bitset_stream</str>
          </lst>
          <arr name="components">
            <str>invenio_query</str>
            <str>stats</str>
          </arr>
        </requestHandler>
        ...
        <queryResponseWriter name="bitset_stream"
                             class="org.ads.solr.InvenioBitsetStreamResponseWriter"/>
  • Invenio Query Component
    InvenioQueryComponent.java:
        public void process(ResponseBuilder rb) throws IOException {
            SolrQueryResponse rsp = rb.rsp;
            SolrIndexSearcher searcher = rb.req.getSearcher();
            InvenioIdCollector collector = new InvenioIdCollector();
            SolrIndexSearcher.QueryCommand cmd = rb.getQueryCommand();
            Query query = cmd.getQuery();
            // Run the query through our collector instead of the default result path
            searcher.search(query, collector);
            InvenioBitSet bitset = collector.getBitSet();
            rsp.add("bitset", bitset);
        }
  • Invenio Id Collector
    InvenioIdCollector.java:
        public void setNextReader(IndexReader reader, int docBase) throws IOException {
            this.reader = reader;
            this.docBase = docBase;
            try {
                // Load the "id" field values for this reader into an int array
                this.idMap = FieldCache.DEFAULT.getInts(this.reader, "id");
            } catch (IOException e) {
                SolrException.logOnce(SolrCore.log, "Exception during idMap init", e);
            }
        }
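
The per-segment bookkeeping in `setNextReader` can be modeled in a few lines of Python. This is a toy sketch under stated assumptions: the class `SegmentIdCollector`, the two made-up segments, and the use of a Python integer as the bitset are all illustrative, not the project's Java classes.

```python
class SegmentIdCollector:
    """Toy collector: gathers schema ids segment by segment into a bitmask."""

    def __init__(self):
        self.bits = 0          # bit n set means schema id n matched
        self.global_docs = []  # index-wide doc numbers, for illustration only
        self.id_map = None
        self.doc_base = 0

    def set_next_reader(self, segment_id_map, doc_base):
        # Called once per segment, before any collect() call for that segment.
        self.id_map = segment_id_map
        self.doc_base = doc_base

    def collect(self, doc):
        # `doc` is relative to the current segment.
        self.bits |= 1 << self.id_map[doc]
        self.global_docs.append(self.doc_base + doc)

    def ids(self):
        return [n for n in range(self.bits.bit_length()) if (self.bits >> n) & 1]


# Two made-up segments: (schema ids by local doc number, matching local docs).
segments = [([10, 11, 12], [0, 2]), ([20, 21], [1])]
collector = SegmentIdCollector()
doc_base = 0
for segment_id_map, hits in segments:
    collector.set_next_reader(segment_id_map, doc_base)
    for doc in hits:
        collector.collect(doc)
    doc_base += len(segment_id_map)
```

Local doc number 1 means a different record in each segment, which is why the id map has to be swapped every time the reader changes.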
  • Response Writer
    InvenioBitsetStreamResponseWriter.java:
        public void write(OutputStream out, SolrQueryRequest req, SolrQueryResponse rsp) {
            InvenioBitSet bitset = (InvenioBitSet) rsp.getValues().get("bitset");
            // Compress the raw bitset bytes, favoring speed over ratio
            ZOutputStream zOut = new ZOutputStream(out, JZlib.Z_BEST_SPEED);
            try {
                zOut.write(bitset.toByteArray());
                zOut.flush();
            } catch (IOException e) {
                SolrException.logOnce(SolrCore.log,
                    "Exception during compression/output of bitset", e);
            }
        }
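
A rough Python model of what travels over the wire: a byte array with one bit per id, compressed with zlib at its fastest setting, then decompressed and unpacked on the receiving side. The bit layout used here (bit `i % 8` of byte `i // 8`) is an assumption for illustration; the actual layout of `InvenioBitSet` is not shown in the slides.

```python
import zlib


def ids_to_bytes(ids):
    """Pack ids into a byte array, one bit per id (assumed layout)."""
    size = max(ids) // 8 + 1 if ids else 0
    buf = bytearray(size)
    for i in ids:
        buf[i // 8] |= 1 << (i % 8)
    return bytes(buf)


def bytes_to_ids(raw):
    """Inverse of ids_to_bytes: list the positions of all set bits."""
    return [
        byte_index * 8 + bit
        for byte_index, byte in enumerate(raw)
        for bit in range(8)
        if byte & (1 << bit)
    ]


raw = ids_to_bytes([1, 5, 16, 84, 1000000])
wire = zlib.compress(raw, zlib.Z_BEST_SPEED)    # server side
received = bytes_to_ids(zlib.decompress(wire))  # client side
```

A sparse bitset is mostly zero bytes, so even the fastest compression level shrinks it considerably before it is sent.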
  • Invenio Facet Component Config
    solrconfig.xml:
        <searchComponent name="invenio_facets"
                         class="org.ads.solr.InvenioFacetComponent" />
        <requestHandler name="/invenio_facets" class="solr.SearchHandler">
          <lst name="defaults">
            <str name="wt">json</str>
            <str name="q.op">OR</str>
            <str name="rows">0</str>
            <str name="facet">true</str>
            <str name="facet.field">author_facet</str>
            ...
          </lst>
          <arr name="components">
            <str>invenio_facets</str>
            <str>facet</str>
          </arr>
        </requestHandler>
  • A bit of python
        import mimetools, simplejson, urllib2  # Python 2

        r = urllib2.Request(facet_query_url)
        data = bitset.fastdump()
        # Build a one-file multipart/form-data body by hand
        boundary = mimetools.choose_boundary()
        contents = '--%s\r\n' % boundary
        contents += ('Content-Disposition: form-data; '
                     'name="bitset"; filename="bitset"\r\n')
        contents += 'Content-Type: application/octet-stream\r\n'
        contents += '\r\n' + data + '\r\n'
        contents += '--%s--\r\n\r\n' % boundary
        r.add_data(contents)
        r.add_unredirected_header('Content-Type',
                                  'multipart/form-data; boundary=%s' % boundary)
        u = urllib2.urlopen(r)
        facet_data = simplejson.load(u)
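
For readers on Python 3, where `urllib2` and `mimetools` no longer exist, the same multipart body can be built with bytes and the standard library only. This is a sketch, not the project's code: `encode_bitset_upload` is a made-up helper, and the random boundary from `uuid` is a substitute for `mimetools.choose_boundary()`.

```python
import uuid


def encode_bitset_upload(data, field="bitset"):
    """Return (body, content_type) for a one-file multipart/form-data POST."""
    boundary = uuid.uuid4().hex
    head = (
        "--%s\r\n"
        'Content-Disposition: form-data; name="%s"; filename="%s"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n" % (boundary, field, field)
    ).encode("ascii")
    tail = ("\r\n--%s--\r\n" % boundary).encode("ascii")
    # The payload is binary, so the body is assembled as bytes throughout.
    return head + data + tail, "multipart/form-data; boundary=%s" % boundary


body, content_type = encode_bitset_upload(b"\x00\x01\xff")
```

The result could then be sent with `urllib.request.Request(url, data=body, headers={"Content-Type": content_type})` and `urllib.request.urlopen`.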
  • Facet Query Component
    InvenioFacetComponent.java:
        ...
        Iterable<ContentStream> streams = req.getContentStreams();
        ...
        InputStream is = stream.getStream();
        ByteArrayOutputStream bOut = new ByteArrayOutputStream();
        // Decompress the posted bitset before rebuilding it
        ZInputStream zIn = new ZInputStream(is);
        IOUtils.copy(zIn, bOut);
        InvenioBitSet bitset = new InvenioBitSet(bOut.toByteArray());
        ...
  • Facet Query Component (cont.)
    InvenioFacetComponent.java:
        ...
        BitDocSet docSetFilter = new BitDocSet();
        int i = 0;
        // Walk the set bits (schema ids) and translate each to a Lucene id
        while (bitset.nextSetBit(i) != -1) {
            int nextBit = bitset.nextSetBit(i);
            int lucene_id = idMap.get(nextBit);
            docSetFilter.add(lucene_id);
            i = nextBit + 1;
        }
        ...
        SolrIndexSearcher.QueryCommand cmd = rb.getQueryCommand();
        cmd.setFilter(docSetFilter);
        SolrIndexSearcher.QueryResult result = new SolrIndexSearcher.QueryResult();
        searcher.search(result, cmd);
        rb.setResult(result);
        ...
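
The two steps in this component (translate incoming schema ids to internal doc numbers, then facet over only those docs) can be mimicked in Python. The data and names below are invented for illustration, and a `Counter` stands in for Solr's facet component.

```python
from collections import Counter


def facet_counts(selected_schema_ids, schema_to_doc, authors_by_doc):
    """Count facet values over the documents named by the incoming id set."""
    # Step 1: translate schema ids into internal doc numbers (the filter).
    doc_filter = {
        schema_to_doc[sid] for sid in selected_schema_ids if sid in schema_to_doc
    }
    # Step 2: count facet values over the filtered docs only.
    counts = Counter()
    for doc in doc_filter:
        counts.update(authors_by_doc[doc])
    return counts


schema_to_doc = {1234: 0, 5678: 1, 9012: 2}  # made-up reverse id map
authors_by_doc = [["Author A"], ["Author B", "Author A"], ["Author C"]]
counts = facet_counts([1234, 5678], schema_to_doc, authors_by_doc)
```

Document 2 is excluded by the filter, so its author never appears in the counts even though it is in the index.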
  • Alternative Approaches
    • PyLucene
    • Embedded Solr
    • CPython within Java
    • ...
    • Further Study
    • Can we make use of Solr's OpenBitSet?
    • Is there a way to bypass the Collector stage completely?
    • How can we return document scores?
    • Alternative approaches: pylucene, pylucene + solr, cpython within Java.
    • Thanks!
    Thanks also to:
    • The ADS Team, @adsabs
    • The Invenio Team, especially...
    • Roman Chyla
    • Jan Iwaszkiewicz
    https://github.com/lbjay/solr-invenio