Letting In the Light: Using Solr as an External Search Component

* Jay Luker, IT Specialist, ADS, jluker@cfa.harvard.edu
* Benoit Thiell, software developer, ADS, bthiell@cfa.harvard.edu

Code4Lib 2011, Tuesday 8 February, 14:30 - 14:50

It’s well-established that Solr provides an excellent foundation for building a faceted search engine. But what if your application’s foundation has already been constructed? How do you add Solr as a federated, fulltext search component to an existing system that already provides a full set of well-crafted scoring and ranking mechanisms?

This talk will describe a work-in-progress project at the Smithsonian/NASA Astrophysics Data System to migrate its aging search platform to Invenio, an open-source institutional repository and digital library system originally developed at CERN, while at the same time incorporating Solr as an external component for both faceting and fulltext search.

In this presentation we'll start with a short introduction of Invenio and then move on to the good stuff: an in-depth exploration of our use of Solr. We'll explain the challenges that we faced, what we learned about some particular Solr internals, interesting paths we chose not to follow, and the solutions we finally developed, including the creation of custom Solr request handlers and query parser classes.

This presentation will be quite technical and will show a measure of horrible Java code. Benoit will probably run away during that part.

Published in: Technology
  • The SAO/NASA Astrophysics Data System (ADS) is a Digital Library portal for researchers in Astronomy and Physics, operated by the Smithsonian Astrophysical Observatory (SAO) under a NASA grant.
  • 1994 was the move to the web
  • Astronomy: 1.8M; Physics: 5.8M; arXiv e-prints: 650K. Citations: 40M (over 3.4M papers with citations). Curated links: 23M (fulltext, data products, citations). 4M scanned pages, 625K articles, 650K pages of historical material. Advanced search allows searching by astronomical object (via SIMBAD) and by attributes like “has dataset”. TWITA = The Website Is The API: via the data_type=<foo> param, plus structured metadata within the pages.
  • INSPIRE: Invenio for SPIRES, the Physics database at Stanford.
  • Obviously, performance was also an objective. The Invenio team had been skeptical of the necessity of incorporating an external tool/service to do fulltext indexing and/or faceting, but once introduced to Solr they quickly came around. Although at least some of the fancypants sorting, ranking, and filtering functionality could most likely be reproduced using Solr, there was a strong reluctance to rewrite that code. Writing as little Java as possible doesn't just come from a Java-phobic frame of mind; it's also about limiting how much we rely on custom Solr components. Rely as much as possible on what Solr affords. Loose integration in this case means the ability to swap in alternate services for retrieving fulltext search results and facets. More on how we succeeded in that towards the end.
  • When we talk about the ids being sent back and forth between Invenio & Solr, we are talking about the schema ids.
  • So what's going on here? Our first thought was maybe it was the time needed to serialize/de-serialize the response, but that turned out not to be it.
  • Relevant solrconfig.xml settings: queryResultMaxDocsCached, queryResultWindowSize, enableLazyFieldLoading
  • No need to specify number of rows or which fields to return
  • Post-processing = 2nd-order searching, filtering. We can't retrieve facets with the initial query because the final list of search results will depend on Invenio post-processing. So how do you send a very large set of ids to get a set of facet results?
  • Satisfies most of the objectives. We get searching & faceting. We don't have to write a lot of Python or Java: Invenio needs the indexing piece. We are not duplicating anything that Invenio already does very well. It is loosely coupled: because communication is in a form that is native to Invenio, we could easily swap in/out different services for either piece.
  • Seems like a lot, but in total lines of code it's not that much, especially considering it's in Java. Plus I suck at Java and I was able to do it all in 2-3 weeks of trial and error hacking. Plus, it all very closely conforms to the affordances of the Solr API. Only one small thing that might be considered a “hack”.
  • Defining our custom query component and telling the default Solr search handler to use it. Also defining our custom response writer.
  • A query component class has two opportunities to interact with the incoming request: prepare & process. We only need process.
  • These times include decompressing and unmarshalling the bitset into an invenio intbitset object in python
  • PyLucene is a Python wrapper around Java Lucene. It embeds a Java VM with Lucene into a Python process. The extension is machine-generated with JCC, a C++ code generator that makes it possible to call into Java classes from Python via the Java Native Interface (JNI).
  • Transcript of "Letting In the Light: Using Solr as an External Search Component"

    1. Letting In The Light: Using Solr as an External Search Component. Jay Luker, Benoit Thiell, SAO/NASA Astrophysics Data System, http://adsabs.harvard.edu/
    2. Here's what to expect: Overview of ADS
    3. Overview of Invenio
    4. Our Solr-Invenio Integration Project
    5. A few tips on Solr hacking along the way
    6. The ADS Project: Established in 1989 (before the web!) as a portal for accessing astronomical data and bibliographic metadata
    7. Was restructured in 1994 to become an A&I service for astronomers and astrophysicists, with fulltext archive
    8. Has 100% penetration in the astronomical community, with take-up in other areas of space sciences, engineering and physics
    9. ADS Holdings: Almost 9M bibliographic metadata records
    10. 625K fulltext articles
    11. Painstakingly curated collection of citations and links to fulltext and data products. ADS Services: Free!
    12. Search, Browse, Notifications, Personalization
    13. API access to all content (TWITA)
    14. Network of 12 mirror sites
    15. ADS Labs: http://labs.adsabs.harvard.edu
    16. Never heard of Invenio? 1993: Started its life at CERN as a preprint server
    17. 2000: Extension of the server to allow storing multimedia content (photos, posters, brochures, videos) and creation of the open-source CDSware project
    18. Renamed CDS Invenio and then Invenio
    19. Both an institutional repository and a digital library
    20. Check it out! -> http://invenio-software.org/
    21. Why choose Invenio? ADS and Invenio share the same objectives: store and disseminate information to scientific communities
    22. Growing penetration in the field of physics
    23. Metadata curation tools (record editor, merger)
    24. Support for citation graphs and citation-based searches
    25. Support for second-order searches
    26. Under the hood: Written in Python, mod_wsgi, some C and Lisp
    27. Coupled with MySQL only (for now)
    28. Scales to sets of 2M+ records
    29. MARC storage of records
    30. Modular architecture with: OAI harvesting, OAI server
    31. Format conversion (MARCXML, DC, NLM, etc)
    32. References and citations handler
    33. Plot and figure extraction
    34. invenio.intbitset: Sets of Invenio record IDs (MARC controlfield 001)
    35. In-house C implementation of Python sets. Fast dumping and loading marshalling functions
    36. Stored marshalled in the database and used as such in the search engine
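To make the idea of a set of record ids stored as a bitset concrete, here is a minimal pure-Python sketch. It is only a stand-in: the real invenio.intbitset is written in C, and its actual marshalled format is not shown in the slides. The function names `ids_to_bitset` and `bitset_to_ids` are hypothetical.

```python
def ids_to_bitset(ids):
    """Pack non-negative integer ids into bytes, one bit per possible id."""
    ids = list(ids)
    if not ids:
        return b""
    buf = bytearray(max(ids) // 8 + 1)
    for i in ids:
        buf[i // 8] |= 1 << (i % 8)  # set bit number i
    return bytes(buf)


def bitset_to_ids(data):
    """Unpack bytes produced by ids_to_bitset into a sorted list of ids."""
    return [
        byte_index * 8 + bit
        for byte_index, byte in enumerate(data)
        for bit in range(8)
        if byte & (1 << bit)
    ]


record_ids = [1, 5, 16, 84]
packed = ids_to_bitset(record_ids)
print(len(packed), bitset_to_ids(packed))
```

The appeal of this representation is that its size depends on the largest id rather than on the number of hits, so a result set of millions of records stays compact and cheap to dump and load.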
    37. Invenio sounds great! Why use Solr then? Invenio's search engine has trouble with 9M+ records (work-in-progress)
    38. Invenio's indexing is slow by design (providing search speed) but it is too slow for such a large repository
    39. Solr has a wide community of users/developers and lots of extensions.
    40. Issues with the integration: Keeping the metadata on both systems in sync
    41. Invenio's search engine requires full sets of results
    42. Communicating over HTTP with very large payloads
    43. Invenio + Solr
    44. Objectives: Take advantage of Solr fulltext indexing & searching
    45. Take advantage of Solr faceting
    46. Not duplicate existing Invenio functionality
    47. Write as little code as possible
    48. Keep things loosely coupled
    49. Problem #1: Retrieving very large result sets of ids. Like, millions.
    50. The WTH Approach: http://myhost:8983/solr/select?q={foo}&fl=id&rows={n} (q={foo}: query for foo; fl=id: only return the id field; rows={n}: return n rows of the result)
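The naive request can be sketched from the client side as follows. This is an illustration only, written for Python 3 (the talk's own client code uses Python 2's urllib2); the helper name `naive_id_query_url` is hypothetical, and the host is the placeholder from the slide.

```python
from urllib.parse import urlencode


def naive_id_query_url(base_url, query, rows):
    """Build a select URL that asks Solr for only the id field of `rows` hits."""
    params = {"q": query, "fl": "id", "rows": rows}
    return base_url + "?" + urlencode(params)


url = naive_id_query_url("http://myhost:8983/solr/select", "foo", 1000000)
print(url)
```

The request itself is trivial to build; the cost the talk goes on to describe is on the server side, when `rows` runs into the millions.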
    51. (A bit about “ids”) Schema ids: Defined in your schema.xml
    52. Can be integers, strings, etc
    53. Typically set as the <uniqueKey>. Lucene ids: Internal to Lucene
    54. Always integers
    55. Unique within an index segment
    56. The WTH Approach [chart: response times in seconds; * warmed cache, different servers, same LAN]
    57. So what's going on here? [diagram labels: Query; QueryResult [1,5,16,84,...]; document cache; Lucene Doc (id: 1234, bibcode: <lazy>, Title: <lazy>, ...); Response]
    58. Solution: Custom Collector [diagram labels: Query; QueryResult [1,5,16,84,...]; Response]
    59. Solution: Custom Collector
        MyQueryComponent.java:
            ...
            InvenioIdCollector collector = new InvenioIdCollector();
            searcher.search(query, collector);
            ArrayList<Integer> ids = collector.getIds();
            rsp.add("ids", ids);
            ...
        MyCollector.java:
            ...
            ArrayList<Integer> ids = new ArrayList<Integer>();
            ...
            public void collect(int doc) {
                this.ids.add(this.idMap[doc]);
            }
            ...
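The collector idea can be simulated in a few lines of Python. This is a sketch of the concept, not the project's Java code: the class `IdCollector`, its method names, and the sample segments are all hypothetical. It assumes, as the slides suggest, that each index segment exposes an array mapping segment-local Lucene ids to schema ids.

```python
class IdCollector:
    """Collects schema ids for matching docs, one index segment at a time."""

    def __init__(self):
        self.ids = []
        self.id_map = []

    def set_next_reader(self, segment_id_map):
        # Called once per segment: segment_id_map[i] is the schema id of the
        # document whose segment-local Lucene id is i.
        self.id_map = segment_id_map

    def collect(self, doc):
        # doc is a segment-local Lucene id; no stored document is loaded.
        self.ids.append(self.id_map[doc])


# Two hypothetical segments: (schema ids per local doc id, matching local doc ids).
segments = [
    ([1001, 1002, 1003], [0, 2]),
    ([2001, 2002], [1]),
]
collector = IdCollector()
for id_map, hits in segments:
    collector.set_next_reader(id_map)
    for doc in hits:
        collector.collect(doc)
print(collector.ids)
```

Each hit costs one array lookup, which is why this path avoids the per-document loading that the naive approach pays for.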
    60. OK, Let's Try This Again: http://myhost:8983/solr/select?q={foo}&qt=my_querytype (q={foo}: query for foo; qt=my_querytype: use our custom query handler)
    61. Better. But ...
    62. Problem #2: Facets.
    63. What's Missing? [diagram labels: Invenio: Query Processing, Post-processing, Return/Render; Solr: Fulltext Search; exchanged: Record ids]
    64. Again, WTH? [diagram labels: Invenio: Query Processing, Post-processing, Return/Render; Solr: Fulltext Search, Facets; exchanged: Record ids, Record ids?]
    65. Current Solution [diagram labels: Invenio: Query Processing, Post-processing, Return/Render; Solr: Fulltext Search, Facets; exchanged: Invenio BitSet, Invenio BitSet]
    66. Parts Required: Custom QueryComponent for accepting fulltext search query and returning an Integer BitSet
    67. Custom Collector to collect doc ids
    68. Custom BitSet class (maybe)
    69. Custom BinaryResponseWriter
    70. Custom QueryComponent for accepting an Integer BitSet query and returning facets
    71. Invenio Query Component Config (solrconfig.xml)
            <searchComponent name="invenio_query" class="org.ads.solr.InvenioQueryComponent" />
            <requestHandler name="invenio_query" class="solr.SearchHandler">
              <lst name="defaults">
                <str name="wt">bitset_stream</str>
              </lst>
              <arr name="components">
                <str>invenio_query</str>
                <str>stats</str>
              </arr>
            </requestHandler>
            ...
            <queryResponseWriter name="bitset_stream" class="org.ads.solr.InvenioBitsetStreamResponseWriter"/>
    72. Invenio Query Component (InvenioQueryComponent.java)
            public void process(ResponseBuilder rb) throws IOException {
                SolrQueryResponse rsp = rb.rsp;
                SolrIndexSearcher searcher = rb.req.getSearcher();
                InvenioIdCollector collector = new InvenioIdCollector();
                SolrIndexSearcher.QueryCommand cmd = rb.getQueryCommand();
                Query query = cmd.getQuery();
                searcher.search(query, collector);
                InvenioBitSet bitset = collector.getBitSet();
                rsp.add("bitset", bitset);
            }
    73. Invenio Id Collector (InvenioIdCollector.java)
            public void setNextReader(IndexReader reader, int docBase) throws IOException {
                this.reader = reader;
                this.docBase = docBase;
                try {
                    this.idMap = FieldCache.DEFAULT.getInts(this.reader, "id");
                } catch (IOException e) {
                    SolrException.logOnce(SolrCore.log, "Exception during idMap init", e);
                }
            }
    74. Response Writer (InvenioBitsetStreamResponseWriter.java)
            public void write(OutputStream out, SolrQueryRequest req, SolrQueryResponse rsp) {
                InvenioBitSet bitset = (InvenioBitSet) rsp.getValues().get("bitset");
                ZOutputStream zOut = new ZOutputStream(out, JZlib.Z_BEST_SPEED);
                try {
                    zOut.write(bitset.toByteArray());
                    zOut.flush();
                } catch (IOException e) {
                    SolrException.logOnce(SolrCore.log, "Exception during compression/output of bitset", e);
                }
            }
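What the receiving side has to do with such a response can be sketched with Python's standard zlib module. This assumes the stream is in standard zlib format and that the payload is a plain bit array; the actual byte layout of InvenioBitSet is not shown in the slides, and the helper names are hypothetical.

```python
import zlib


def compress_bitset(raw):
    # Level 1 favors speed over ratio, in the spirit of JZlib.Z_BEST_SPEED.
    return zlib.compress(raw, 1)


def decompress_bitset(payload):
    return zlib.decompress(payload)


raw = bytearray(1_000_000 // 8 + 1)  # room for ids 0..1,000,000
for record_id in (1, 5, 16, 84, 999_999):
    raw[record_id // 8] |= 1 << (record_id % 8)
raw = bytes(raw)

payload = compress_bitset(raw)
restored = decompress_bitset(payload)
print(len(raw), len(payload), restored == raw)
```

A sparse bitset is mostly zero bytes, so even the fastest compression level shrinks it considerably before it crosses the network.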
    75. Invenio Facet Component Config (solrconfig.xml)
            <searchComponent name="invenio_facets" class="org.ads.solr.InvenioFacetComponent" />
            <requestHandler name="/invenio_facets" class="solr.SearchHandler">
              <lst name="defaults">
                <str name="wt">json</str>
                <str name="q.op">OR</str>
                <str name="rows">0</str>
                <str name="facet">true</str>
                <str name="facet.field">author_facet</str>
                ...
              </lst>
              <arr name="components">
                <str>invenio_facets</str>
                <str>facet</str>
              </arr>
            </requestHandler>
    76. A bit of python
            # Python 2
            import mimetools, urllib2, simplejson
            r = urllib2.Request(facet_query_url)
            data = bitset.fastdump()
            boundary = mimetools.choose_boundary()
            contents = '--%s\r\n' % boundary
            contents += 'Content-Disposition: form-data;' + 'name="bitset"; filename="bitset"\r\n'
            contents += 'Content-Type: application/octet-stream\r\n'
            contents += '\r\n' + data + '\r\n'
            contents += '--%s--\r\n\r\n' % boundary
            r.add_data(contents)
            r.add_unredirected_header('Content-Type', 'multipart/form-data; boundary=%s' % boundary)
            u = urllib2.urlopen(r)
            facet_data = simplejson.load(u)
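For readers on Python 3, where urllib2 and mimetools no longer exist, the same multipart body could be assembled roughly like this. It is a sketch under assumptions: the helper `build_multipart_body` is hypothetical, the boundary comes from uuid rather than mimetools, and the payload here is dummy bytes rather than a real intbitset dump.

```python
import uuid


def build_multipart_body(field_name, data, boundary=None):
    """Return (body, content_type) for a single binary form-data part."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        "--%s\r\n"
        'Content-Disposition: form-data; name="%s"; filename="%s"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ) % (boundary, field_name, field_name)
    tail = "\r\n--%s--\r\n" % boundary
    body = head.encode("ascii") + data + tail.encode("ascii")
    content_type = "multipart/form-data; boundary=%s" % boundary
    return body, content_type


body, content_type = build_multipart_body("bitset", b"\x00\x01\x02", boundary="XYZ")
print(content_type)
```

The pair could then be sent with `urllib.request.Request(url, data=body, headers={"Content-Type": content_type})`. Keeping the body as bytes throughout matters because the bitset dump is binary, not text.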
    77. Facet Query Component (InvenioFacetComponent.java)
            ...
            Iterable<ContentStream> streams = req.getContentStreams();
            ...
            InputStream is = stream.getStream();
            ByteArrayOutputStream bOut = new ByteArrayOutputStream();
            ZInputStream zIn = new ZInputStream(is);
            IOUtils.copy(zIn, bOut);
            InvenioBitSet bitset = new InvenioBitSet(bOut.toByteArray());
            ...
    78. Facet Query Component (cont.) (InvenioFacetComponent.java)
            ...
            BitDocSet docSetFilter = new BitDocSet();
            int i = 0;
            while (bitset.nextSetBit(i) != -1) {
                int nextBit = bitset.nextSetBit(i);
                int lucene_id = idMap.get(nextBit);
                docSetFilter.add(lucene_id);
                i = nextBit + 1;
            }
            ...
            SolrIndexSearcher.QueryCommand cmd = rb.getQueryCommand();
            cmd.setFilter(docSetFilter);
            SolrIndexSearcher.QueryResult result = new SolrIndexSearcher.QueryResult();
            searcher.search(result, cmd);
            rb.setResult(result);
            ...
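The translation step in that loop, from schema ids in the incoming bitset to Lucene doc ids in the filter, can be illustrated in Python. The function name and the sample mapping are hypothetical, and skipping ids that are missing from the index is an assumption of this sketch; the slide code does not show how that case is handled.

```python
def schema_ids_to_lucene_filter(schema_ids, id_map):
    """Translate schema ids to Lucene doc ids, skipping ids not in the index."""
    return sorted(id_map[s] for s in schema_ids if s in id_map)


# Hypothetical mapping: schema id -> Lucene doc id.
id_map = {1001: 0, 1002: 1, 1003: 2, 2002: 4}
doc_filter = schema_ids_to_lucene_filter([2002, 1001, 9999], id_map)
print(doc_filter)
```

Once the filter holds Lucene ids, the stock facet component can count facet values over exactly the records that survived Invenio's post-processing.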
    79. Alternative Approaches: PyLucene; embedded Solr; CPython within Java; ...
    80. Further Study: Can we make use of Solr's OpenBitSet?
    81. Is there a way to bypass the Collector stage completely?
    82. How can we return document scores?
    83. Alternative approaches: PyLucene, PyLucene + Solr, CPython within Java.
    84. Thanks! Thanks also to: The ADS Team, @adsabs
    85. The Invenio Team, especially...
    86. Roman Chyla
    87. Jan Iwaszkiewicz. https://github.com/lbjay/solr-invenio