Code4Lib 2007: MyResearch Portal
 

Presentation by Andrew Nagy at Code4Lib 2007 in Athens, GA.

Villanova University’s Falvey Memorial Library has longed for a beautiful pig; however, we determined in early 2006 that pigs were only good at searching for truffles, so we decided to build our own OPAC.

After developing our own custom Digital Library from a Native XML Database, we quickly appreciated the ease of development with XQuery and XSLT. We then launched full speed ahead into the development of a new OPAC from scratch using XML technologies and MARCXML.

This presentation will describe the process of choosing an NXDB and optimizing it for performance on large data sets, including how we took searches that needed about 2 minutes to process and optimized them down to about 2 seconds. I will also describe the development process of the OPAC interface, including the AJAX features we have implemented. I will share our success stories and our failures.


Presentation Transcript

  • MyResearch Portal: An XML-based, Catalog-independent OPAC
      • Andrew Nagy
      • Villanova University
  • Goal
    • Develop a completely ILS-agnostic web portal for students and faculty to perform research activities:
      • Search library catalog
      • Search article databases
        • And other local library catalogs
      • Search digital library
    • Create a single interface for all library resources to minimize the interface learning curve!
  • How do we develop this?
    • Develop an in-house “framework” to combine all of our resources
    • Most resources are in XML
      • Digital Library: METS
      • Metalib XServer: XML
      • Catalog: MARCXML
      • Library Web Site: XHTML
  • Application Layout
    • [Diagram: User Interface (HTML, XUL, WML, RSS, etc.), Application Controller, Data Controller, and the back ends: File System, eXist, Metalib XServer, SOLR, ILS Driver]
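The layout above names a data controller sitting in front of several back ends. As a rough illustration of that idea only, here is a minimal Python sketch in which each back end is wrapped in a driver with a common `search` method. The class names (`Driver`, `InMemoryDriver`, `DataController`) and the dictionary result shape are assumptions made for this example; the slides do not show the actual framework code.

```python
from abc import ABC, abstractmethod


class Driver(ABC):
    """One back end (catalog, article database, digital library, ...)."""

    name: str

    @abstractmethod
    def search(self, query: str) -> list[dict]:
        """Return matching records as plain dictionaries."""


class InMemoryDriver(Driver):
    """Hypothetical stand-in back end, used only for this illustration."""

    def __init__(self, name: str, titles: list[str]):
        self.name = name
        self._titles = titles

    def search(self, query: str) -> list[dict]:
        needle = query.lower()
        return [
            {"source": self.name, "title": title}
            for title in self._titles
            if needle in title.lower()
        ]


class DataController:
    """Sends a query to every registered driver and merges the results."""

    def __init__(self, drivers: list[Driver]):
        self._drivers = drivers

    def search(self, query: str) -> list[dict]:
        results: list[dict] = []
        for driver in self._drivers:
            results.extend(driver.search(query))
        return results


controller = DataController([
    InMemoryDriver("catalog", ["Management science", "Operations research"]),
    InMemoryDriver("digital_library", ["Augustine manuscripts"]),
])
hits = controller.search("science")
```

Because the user interface only talks to the controller, a back end could in principle be swapped (for example, one ILS driver for another) without touching the interface code.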
  • Data Store
    • Native XML stores allow for easy storage of complex data
    • No need to develop a complete relational database and convert data – too messy
    • No need to normalize data
    • Just import!
  • Native XML Databases
      • Could it be that simple?
  • Open Source!
    • eXist
      • Still in infancy-ish stages
      • Platform independent
        • Java Backend
        • API: REST, SOAP
      • Full-text extension
      • Inherent directory structure
      • LDAP support
      • Large user base
  • Open Source!
    • Berkeley DB XML
      • Proven capabilities
      • Supports a wide range of platforms
      • Good performance
      • Decent help support
      • Commercial Backing
      • No full-text extensions
      • No inherent directories
  • COTS
    • MarkLogic
      • Enticing Discounts for .edu & non-profits
      • Commercial Support
      • Much more complex to administer
      • Speed?
    • X-Hive DB
      • Too pricey!
      • We aren’t Princeton!
  • Scalability Testing
    • Round 1: 650 records
      • eXist responds quickly
      • Gave us the ability to develop the application
    • Round 2: 33,000 records
      • eXist: response within 15 minutes
    • Round 3: 100,000 records
      • eXist: no response
      • dbxml: 45 Minutes for response
    • Round 4: 492,000 records
      • eXist: Java Heap Space error (maxed mem)
      • dbxml: 2.5 Hours for response
  • Scalability Testing
    • Round 5: 492,000 records
      • Forget eXist
      • dbxml: Sleepycat/Oracle support modified indexes and XQuery statements
        • Results in 30 – 60 seconds
    • Round 6: 492,000 Modified MarcXML
      • Transformed MarcXML records to a custom format
        • Stripped content not necessary for searching
        • Gave element names meaning!
  • MarcXML
    • <?xml version="1.0" encoding="UTF-8"?>
    • <collection xmlns="http://www.loc.gov/MARC21/slim">
    • <record>
    • <leader>00745cam a22002531a 45 </leader>
    • <controlfield tag="001">10</controlfield>
    • <controlfield tag="008">800215s1979 caua b 000 0 eng d</controlfield>
    • <datafield tag="010" ind1=" " ind2=" ">
    • <subfield code="a"> 79065492 </subfield>
    • </datafield>
    • <datafield tag="020" ind1=" " ind2=" ">
    • <subfield code="a">0816200963</subfield>
    • </datafield>
    • <datafield tag="035" ind1=" " ind2=" ">
    • <subfield code="a">ocm05985801</subfield>
    • </datafield>
    • <datafield tag="035" ind1=" " ind2=" ">
    • <subfield code="a">(CaOTULAS)157315020</subfield>
    • </datafield>
    • <datafield tag="035" ind1=" " ind2=" ">
    • <subfield code="9">00000012</subfield>
    • </datafield>
    • <datafield tag="040" ind1=" " ind2=" ">
    • <subfield code="a">TOL</subfield>
    • <subfield code="c">TOL</subfield>
    • <subfield code="d">PVU</subfield>
    • </datafield>
    • <datafield tag="049" ind1=" " ind2=" ">
    • <subfield code="a">PVUM</subfield>
    • <subfield code="c">[0457425]</subfield>
    • </datafield>
    • <datafield tag="090" ind1=" " ind2=" ">
    • <subfield code="a">HD30.25</subfield>
    • <subfield code="b">.A35</subfield>
    • </datafield>
    • <datafield tag="099" ind1="1" ind2=" ">
    • <subfield code="a">HD30.25.A35</subfield>
    • </datafield>
    • <datafield tag="100" ind1="1" ind2=" ">
    • <subfield code="a">Aggarwal, Raj.</subfield>
    • </datafield>
    • <datafield tag="245" ind1="1" ind2="0">
    • <subfield code="a">Management science :</subfield>
    • <subfield code="b">cases and applications /</subfield>
    • <subfield code="c">Raj Aggarwal, Inder Khera.</subfield>
    • </datafield>
    • <datafield tag="260" ind1=" " ind2=" ">
    • <subfield code="a">San Francisco :</subfield>
    • <subfield code="b">Holden-Day,</subfield>
    • <subfield code="c">c1979.</subfield>
    • </datafield>
    • <datafield tag="300" ind1=" " ind2=" ">
    • <subfield code="a">xiii, 229 p. :</subfield>
    • <subfield code="b">ill. ;</subfield>
    • <subfield code="c">23 cm.</subfield>
    • </datafield>
    • <datafield tag="504" ind1=" " ind2=" ">
    • <subfield code="a">Bibliography: p. 21.</subfield>
    • </datafield>
    • <datafield tag="650" ind1=" " ind2="0">
    • <subfield code="a">Management</subfield>
    • <subfield code="x">Mathematical models.</subfield>
    • </datafield>
    • <datafield tag="650" ind1=" " ind2="0">
    • <subfield code="a">Operations research.</subfield>
    • </datafield>
    • <datafield tag="700" ind1="1" ind2=" ">
    • <subfield code="a">Khera, Inder Pal,</subfield>
    • <subfield code="d">1937-</subfield>
    • </datafield>
    • </record>
    • </collection>
  • Modified XML
    • <record xmlns="http://www.w3.org/1999/xhtml">
    • <leader>00745cam a22002531a 45 </leader>
    • <T001>10</T001>
    • <T008>800215s1979 caua b 000 0 eng d</T008>
    • <T020>0816200963</T020>
    • <T090>HD30.25.A35</T090>
    • <T100>Aggarwal, Raj.</T100>
    • <T245>Management science :cases and applications /</T245>
    • <T260>c1979.</T260>
    • <T650>Management</T650>
    • <T650>Mathematical models.</T650>
    • <T650>Operations research.</T650>
    • <T700>Khera, Inder Pal,</T700>
    • </record>
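The slides show the before and after records but not the transformation itself. The following Python sketch is one possible way to produce a similar flattened record with the standard library. The `RULES` table (which subfields to keep per tag, and whether to join them or emit one element each) is inferred from the sample output above and is an assumption, as is the function name `flatten_record`. The sketch also omits the namespace that the slide's output record carries.

```python
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"

# Assumed rules, inferred from the sample output:
# tag -> (subfield codes to keep, "join" into one element or "split" into many)
RULES = {
    "020": ("a", "join"),
    "090": ("ab", "join"),
    "100": ("a", "join"),
    "245": ("ab", "join"),
    "260": ("c", "join"),
    "650": ("ax", "split"),
    "700": ("a", "join"),
}


def flatten_record(record: ET.Element) -> ET.Element:
    """Turn one MARCXML <record> into a flat record with T### elements."""
    flat = ET.Element("record")
    leader = record.find(MARC_NS + "leader")
    if leader is not None:
        ET.SubElement(flat, "leader").text = leader.text
    for child in record:
        tag = child.get("tag")
        if child.tag == MARC_NS + "controlfield":
            ET.SubElement(flat, "T" + tag).text = child.text
        elif child.tag == MARC_NS + "datafield" and tag in RULES:
            codes, mode = RULES[tag]
            values = [
                sf.text or ""
                for sf in child.findall(MARC_NS + "subfield")
                if sf.get("code") in codes
            ]
            if not values:
                continue
            if mode == "split":
                for value in values:
                    ET.SubElement(flat, "T" + tag).text = value
            else:
                ET.SubElement(flat, "T" + tag).text = "".join(values)
    return flat


SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>
<leader>00745cam a22002531a 45 </leader>
<controlfield tag="001">10</controlfield>
<datafield tag="090" ind1=" " ind2=" ">
<subfield code="a">HD30.25</subfield>
<subfield code="b">.A35</subfield>
</datafield>
<datafield tag="245" ind1="1" ind2="0">
<subfield code="a">Management science :</subfield>
<subfield code="b">cases and applications /</subfield>
<subfield code="c">Raj Aggarwal, Inder Khera.</subfield>
</datafield>
<datafield tag="650" ind1=" " ind2="0">
<subfield code="a">Management</subfield>
<subfield code="x">Mathematical models.</subfield>
</datafield>
</record>
</collection>"""

root = ET.fromstring(SAMPLE)
flat = flatten_record(root.find(MARC_NS + "record"))
flat_xml = ET.tostring(flat, encoding="unicode")
```

Element names such as `T245` can be matched directly, whereas the original MARCXML needs attribute predicates on both `tag` and `code` for every lookup.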
  • Scalability Testing
    • eXist:
    • dbxml:
      • Initial basic searches: 1.4s – 1.6s
  • Query Optimization
    • This is an important step since we are dealing with infant technology
    • dbxml has a query plan generator
    • eXist will soon have a query plan generator and a new query optimizer
  • Manual XQuery Optimization
    • Initial Title Search:
    • for $record in $coll//marcxml:record[marcxml:datafield[@tag="245"]/marcxml:subfield[@code="a" and contains(., "Augustine")]]
    • return $record
      • 116 Minutes
    • Optimized Title Search:
    • for $record in $coll//marcxml:record[marcxml:datafield/marcxml:subfield[contains(., "Augustine")]]
    • where $record/marcxml:datafield[@tag="245"]/marcxml:subfield[@code="a"]
    • return $record
      • 2.42 Minutes
  • eXist in Production
    • In Production Since August 2006
    • top - 16:36:36 up 148 days, 3:11, 0 users, load average: 0.05, 0.01, 0.00
    • Tasks: 124 total, 1 running, 123 sleeping, 0 stopped, 0 zombie
    • Cpu(s): 0.0% us, 0.0% sy, 0.0% ni, 99.9% id, 0.0% wa, 0.0% hi, 0.0% si
    • Mem: 8162268k total, 7070344k used, 1091924k free, 272152k buffers
    • Swap: 16779852k total, 372296k used, 16407556k free, 4304256k cached
    • PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    • 27996 root 16 0 416m 162m 2928 S 0.0 2.0 0:04.49 java
    • 16478 daemon 16 0 164m 46m 11m S 0.0 0.6 0:22.67 httpd
    • 17203 daemon 16 0 155m 37m 11m S 0.0 0.5 0:09.72 httpd
    • 25279 root 19 0 362m 35m 2508 S 0.0 0.4 0:05.17 java
    • 17213 daemon 16 0 144m 26m 11m S 0.0 0.3 0:04.25 httpd
    • 17233 daemon 16 0 143m 25m 11m S 0.0 0.3 0:01.46 httpd
    • 17242 daemon 16 0 142m 24m 11m S 0.0 0.3 0:00.54 httpd
    • 27445 root 16 0 361m 22m 2496 S 0.0 0.3 0:04.55 java
    • 17235 daemon 16 0 139m 21m 11m S 0.0 0.3 0:00.54 httpd
    • 17252 daemon 16 0 138m 21m 11m S 0.0 0.3 0:00.28 httpd
    • 20314 root 20 0 364m 20m 1308 S 0.0 0.3 0:04.60 java
    • 17231 daemon 15 0 137m 19m 11m S 0.0 0.2 0:00.36 httpd
    • 17241 daemon 16 0 133m 15m 10m S 0.0 0.2 0:00.06 httpd
    • 17236 daemon 16 0 133m 14m 10m S 0.0 0.2 0:00.02 httpd
    • 22294 root 16 0 133m 14m 10m S 0.0 0.2 0:00.79 httpd
  • Implementation
      • Create a web portal using a Native XML Database
  • Performance
    • The Good:
      • Average searches: 0.9 seconds
    • The Bad:
      • More advanced queries can get as high as 12 - 15 seconds
    • The Ugly:
      • What happens when 10-50 simultaneous users search with advanced queries?
  • Implementation
    • Needed to develop lots of search query translation algorithms due to missing Full Text Extension
      • Convert author searches to Last Name, First Name
      • Convert Call number searches to Uppercase
      • Uppercase the first letter of subject terms and lowercase all remaining characters
      • Many more …
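The three translations listed above can be sketched in a few lines of Python. This is only an illustration of the kind of normalization described; the function names and the exact handling of edge cases (such as names that already contain a comma) are assumptions, not the portal's actual code.

```python
def author_to_last_first(name: str) -> str:
    """Rewrite 'First Last' as 'Last, First'.

    Names that already contain a comma, or that are a single word,
    are returned unchanged apart from whitespace cleanup.
    """
    name = " ".join(name.split())
    if "," in name or " " not in name:
        return name
    first, last = name.rsplit(" ", 1)
    return f"{last}, {first}"


def normalize_call_number(call_number: str) -> str:
    """Uppercase a call number so it matches the stored form."""
    return call_number.strip().upper()


def normalize_subject(term: str) -> str:
    """Uppercase the first letter and lowercase all remaining characters."""
    term = term.strip()
    return term[:1].upper() + term[1:].lower()
```

For example, `author_to_last_first("Raj Aggarwal")` gives `"Aggarwal, Raj"`, which is the form stored in the 100 field of the sample record (apart from its trailing period).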
  • Short Answer:
      • Not Yet!
  • It’s a Sun Shiny Day
    • Apache SOLR to the rescue!
    • SOLR implements a Lucene index on XML documents
    • SOLR is platform independent
      • Runs as a java web app
      • Interfaced via REST
    • Lots of full-text searching tools
    • No standards-compliant interface
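Since SOLR is queried over HTTP, a search is just a URL with parameters. The sketch below builds such a URL using Solr's `q`, `start`, and `rows` parameters. The host, port, path, and the `title` field name are assumptions for illustration and will differ per installation and schema.

```python
from urllib.parse import urlencode

# Hypothetical location of the Solr select handler; adjust for a real install.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_select_url(query: str, start: int = 0, rows: int = 10) -> str:
    """Build a Solr search URL; urlencode handles the query escaping."""
    params = {"q": query, "start": start, "rows": rows}
    return SOLR_SELECT_URL + "?" + urlencode(params)


# 'title' is an assumed field name, not one taken from the slides.
url = build_select_url('title:"Augustine"', rows=20)
```

Fetching the resulting URL (for instance with `urllib.request.urlopen`) would return the result set from a running Solr instance; that step is left out here because it needs a live server.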
  • SOLR Performance
    • Performance is astonishing
    • Average results in 0.1 seconds over 492,000+ records
    • Slower performance with built-in faceting
  • Easy Implementation
    • XSL Stylesheet to convert MARCXML to SOLR XML
      • Converted 492,107 records in 8973.6784570217 seconds
      • 2.5 Hours
    • SOLR Import
      • 3 Hours
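The conversion above was done with an XSL stylesheet, which the slides do not reproduce. As a stand-in, this Python sketch shows the same kind of mapping from a MARCXML record to Solr's XML update message (`<add><doc><field name="...">`). The Solr field names in `FIELD_MAP` are assumptions; a real schema would define its own.

```python
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"

# Assumed mapping from (MARC tag, subfield code) to a Solr field name.
FIELD_MAP = {
    ("100", "a"): "author",
    ("245", "a"): "title",
    ("650", "a"): "subject",
}


def marc_to_solr_doc(record: ET.Element) -> ET.Element:
    """Build one Solr <doc> element from one MARCXML <record>."""
    doc = ET.Element("doc")
    ident = record.find(MARC_NS + "controlfield[@tag='001']")
    if ident is not None:
        ET.SubElement(doc, "field", name="id").text = ident.text
    for datafield in record.findall(MARC_NS + "datafield"):
        for subfield in datafield.findall(MARC_NS + "subfield"):
            key = (datafield.get("tag"), subfield.get("code"))
            solr_name = FIELD_MAP.get(key)
            if solr_name is not None:
                ET.SubElement(doc, "field", name=solr_name).text = subfield.text
    return doc


def build_add_message(records: list[ET.Element]) -> str:
    """Wrap the converted records in a single <add> update message."""
    add = ET.Element("add")
    for record in records:
        add.append(marc_to_solr_doc(record))
    return ET.tostring(add, encoding="unicode")


SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>
<controlfield tag="001">10</controlfield>
<datafield tag="100" ind1="1" ind2=" ">
<subfield code="a">Aggarwal, Raj.</subfield>
</datafield>
<datafield tag="245" ind1="1" ind2="0">
<subfield code="a">Management science :</subfield>
<subfield code="b">cases and applications /</subfield>
</datafield>
</record>
</collection>"""

records = ET.fromstring(SAMPLE).findall(MARC_NS + "record")
message = build_add_message(records)
```

The resulting string is what would be posted to Solr's update handler, followed by a commit, to make the records searchable.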
  • SOLR vs NXDB
    • NXDB
      • Native XML storage system
      • Standards: XQuery, XPath, XUpdate, etc.
      • XQuery
        • Inherent XSL transformation
        • Logical language
    • SOLR
      • Plain out fast
      • Non-native data format
      • Very simple interface, though with a non-standard query format
      • No logical query language
  • Other untested options
    • Nux – Berkeley Lab
      • Implements an XQuery interface to Lucene
      • No REST interface
    • XTF - CDL
      • Indexes XML in Lucene
      • No XQuery? – Java only
    • Cheshire3 – UC Berkeley
      • Native XML indexer
      • Not as popular as Lucene
    • Sedna NXDB
    • Commercial NXDBs
  • Beta OPAC
    • SOLR Backend
    • AJAX functions for Non-Bib Data Lookup
    • Combine Catalog, Article Search, and Digital Library into a single search interface
    • Central web-accessible location for saving records
    • Tags and Comments for Web 2.0 fun
    • Beta Implementation
  • Questions?