Code4Lib 2007: MyResearch Portal

Presentation by Andrew Nagy at Code4Lib 2007 in Athens, GA.

Villanova University’s Falvey Memorial Library has longed for a beautiful pig; however, we determined in early 2006 that pigs were only good at searching for truffles, so we decided to build our own OPAC.

After developing our own custom Digital Library from a Native XML Database, we quickly appreciated the ease of development with XQuery and XSLT. We then launched full speed ahead into the development of a new OPAC from scratch using XML technologies and MARCXML.

This presentation will describe the process of choosing an NXDB and optimizing it for large data set performance, including how we took searches that needed about 2 minutes to process and optimized them down to about 2 seconds. I will also describe the development process of the OPAC interface, including the AJAX features we have implemented, and I will share our success stories and our failures.

    1. 1. MyResearch Portal An XML based Catalog-independent OPAC <ul><ul><li>Andrew Nagy </li></ul></ul><ul><ul><li>Villanova University </li></ul></ul>
    2. 2. Goal <ul><li>Develop a completely ILS agnostic web portal for students and faculty to perform research activities: </li></ul><ul><ul><li>Search library catalog </li></ul></ul><ul><ul><li>Search article databases </li></ul></ul><ul><ul><ul><li>And other local library catalogs </li></ul></ul></ul><ul><ul><li>Search digital library </li></ul></ul><ul><li>Create 1 single interface for all library resources to minimize interface learning curve! </li></ul>
    3. 3. How do we develop this? <ul><li>Develop in-house a “framework” to combine all of our resources </li></ul><ul><li>Most resources are in XML </li></ul><ul><ul><li>Digital Library: METS </li></ul></ul><ul><ul><li>Metalib XServer: XML </li></ul></ul><ul><ul><li>Catalog: MARCXML </li></ul></ul><ul><ul><li>Library Web Site: XHTML </li></ul></ul>
    4. 4. Application Layout
    5. 5. Application Layout Application Controller User Interface HTML, XUL, WML, RSS, etc File System eXist Metalib XServer SOLR ILS Driver Data Controller
    6. 6. Data Store <ul><li>Native XML stores allow for easy storage of complex data </li></ul><ul><li>No need to develop a complete relational database and convert data – too messy </li></ul><ul><li>No need to normalize data </li></ul><ul><li>Just import! </li></ul>
    7. 7. Native XML Databases <ul><ul><li>Could it be that simple? </li></ul></ul>
    8. 8. Open Source! <ul><li>eXist </li></ul><ul><ul><li>Still in infancy-ish stages </li></ul></ul><ul><ul><li>Platform independent </li></ul></ul><ul><ul><ul><li>Java Backend </li></ul></ul></ul><ul><ul><ul><li>API: REST, SOAP </li></ul></ul></ul><ul><ul><li>Full-text extension </li></ul></ul><ul><ul><li>Inherent directory structure </li></ul></ul><ul><ul><li>LDAP support </li></ul></ul><ul><ul><li>Large user base </li></ul></ul>
    9. 9. Open Source! <ul><li>Berkeley DB XML </li></ul><ul><ul><li>Proven capabilities </li></ul></ul><ul><ul><li>Supports a wide range of platforms </li></ul></ul><ul><ul><li>Good performance </li></ul></ul><ul><ul><li>Decent help support </li></ul></ul><ul><ul><li>Commercial Backing </li></ul></ul><ul><ul><li>No full-text extensions </li></ul></ul><ul><ul><li>No inherent directories </li></ul></ul>
    10. 10. COTS <ul><li>MarkLogic </li></ul><ul><ul><li>Enticing Discounts for .edu & non-profits </li></ul></ul><ul><ul><li>Commercial Support </li></ul></ul><ul><ul><li>Much more complex to administer </li></ul></ul><ul><ul><li>Speed? </li></ul></ul><ul><li>X-Hive DB </li></ul><ul><ul><li>Too pricey! </li></ul></ul><ul><ul><li>We aren’t Princeton! </li></ul></ul>
    11. 11. Scalability Testing <ul><li>Round 1: 650 records </li></ul><ul><ul><li>eXist responds quickly </li></ul></ul><ul><ul><li>Gave us the ability to develop the application </li></ul></ul><ul><li>Round 2: 33,000 records </li></ul><ul><ul><li>eXist: response within 15 minutes </li></ul></ul><ul><li>Round 3: 100,000 records </li></ul><ul><ul><li>eXist: no response </li></ul></ul><ul><ul><li>dbxml: 45 minutes for response </li></ul></ul><ul><li>Round 4: 492,000 records </li></ul><ul><ul><li>eXist: Java heap space error (memory maxed out) </li></ul></ul><ul><ul><li>dbxml: 2.5 hours for response </li></ul></ul>
    12. 12. Scalability Testing <ul><li>Round 5: 492,000 records </li></ul><ul><ul><li>Forget eXist </li></ul></ul><ul><ul><li>dbxml: Sleepycat/Oracle support modified indexes and XQuery statements </li></ul></ul><ul><ul><ul><li>Results in 30 – 60 seconds </li></ul></ul></ul><ul><li>Round 6: 492,000 Modified MarcXML </li></ul><ul><ul><li>Transformed MarcXML records to a custom format </li></ul></ul><ul><ul><ul><li>Stripped content not necessary for searching </li></ul></ul></ul><ul><ul><ul><li>Gave element names meaning! </li></ul></ul></ul>
    13. 13. MarcXML <ul><li><?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?> </li></ul><ul><li><collection xmlns=&quot;http://www.loc.gov/MARC21/slim&quot;> </li></ul><ul><li><record> </li></ul><ul><li><leader>00745cam a22002531a 45 </leader> </li></ul><ul><li><controlfield tag=&quot;001&quot;>10</controlfield> </li></ul><ul><li><controlfield tag=&quot;008&quot;>800215s1979 caua b 000 0 eng d</controlfield> </li></ul><ul><li><datafield tag=&quot;010&quot; ind1=&quot; &quot; ind2=&quot; &quot;> </li></ul><ul><li><subfield code=&quot;a&quot;> 79065492 </subfield> </li></ul><ul><li></datafield> </li></ul><ul><li><datafield tag=&quot;020&quot; ind1=&quot; &quot; ind2=&quot; &quot;> </li></ul><ul><li><subfield code=&quot;a&quot;>0816200963</subfield> </li></ul><ul><li></datafield> </li></ul><ul><li><datafield tag=&quot;035&quot; ind1=&quot; &quot; ind2=&quot; &quot;> </li></ul><ul><li><subfield code=&quot;a&quot;>ocm05985801</subfield> </li></ul><ul><li></datafield> </li></ul><ul><li><datafield tag=&quot;035&quot; ind1=&quot; &quot; ind2=&quot; &quot;> </li></ul><ul><li><subfield code=&quot;a&quot;>(CaOTULAS)157315020</subfield> </li></ul><ul><li></datafield> </li></ul><ul><li><datafield tag=&quot;035&quot; ind1=&quot; &quot; ind2=&quot; &quot;> </li></ul><ul><li><subfield code=&quot;9&quot;>00000012</subfield> </li></ul><ul><li></datafield> </li></ul><ul><li><datafield tag=&quot;040&quot; ind1=&quot; &quot; ind2=&quot; &quot;> </li></ul><ul><li><subfield code=&quot;a&quot;>TOL</subfield> </li></ul><ul><li><subfield code=&quot;c&quot;>TOL</subfield> </li></ul><ul><li><subfield code=&quot;d&quot;>PVU</subfield> </li></ul><ul><li></datafield> </li></ul><ul><li><datafield tag=&quot;049&quot; ind1=&quot; &quot; ind2=&quot; &quot;> </li></ul><ul><li><subfield code=&quot;a&quot;>PVUM</subfield> </li></ul><ul><li><subfield code=&quot;c&quot;>[0457425]</subfield> </li></ul><ul><li></datafield> </li></ul><ul><li><datafield tag=&quot;090&quot; ind1=&quot; &quot; 
ind2=&quot; &quot;> </li></ul><ul><li><subfield code=&quot;a&quot;>HD30.25</subfield> </li></ul><ul><li><subfield code=&quot;b&quot;>.A35</subfield> </li></ul><ul><li></datafield> </li></ul><ul><li><datafield tag=&quot;099&quot; ind1=&quot;1&quot; ind2=&quot; &quot;> </li></ul><ul><li><subfield code=&quot;a&quot;>HD30.25.A35</subfield> </li></ul><ul><li></datafield> </li></ul><ul><li><datafield tag=&quot;100&quot; ind1=&quot;1&quot; ind2=&quot; &quot;> </li></ul><ul><li><subfield code=&quot;a&quot;>Aggarwal, Raj.</subfield> </li></ul><ul><li></datafield> </li></ul><ul><li><datafield tag=&quot;245&quot; ind1=&quot;1&quot; ind2=&quot;0&quot;> </li></ul><ul><li><subfield code=&quot;a&quot;>Management science :</subfield> </li></ul><ul><li><subfield code=&quot;b&quot;>cases and applications /</subfield> </li></ul><ul><li><subfield code=&quot;c&quot;>Raj Aggarwal, Inder Khera.</subfield> </li></ul><ul><li></datafield> </li></ul><ul><li><datafield tag=&quot;260&quot; ind1=&quot; &quot; ind2=&quot; &quot;> </li></ul><ul><li><subfield code=&quot;a&quot;>San Francisco :</subfield> </li></ul><ul><li><subfield code=&quot;b&quot;>Holden-Day,</subfield> </li></ul><ul><li><subfield code=&quot;c&quot;>c1979.</subfield> </li></ul><ul><li></datafield> </li></ul><ul><li><datafield tag=&quot;300&quot; ind1=&quot; &quot; ind2=&quot; &quot;> </li></ul><ul><li><subfield code=&quot;a&quot;>xiii, 229 p. :</subfield> </li></ul><ul><li><subfield code=&quot;b&quot;>ill. ;</subfield> </li></ul><ul><li><subfield code=&quot;c&quot;>23 cm.</subfield> </li></ul><ul><li></datafield> </li></ul><ul><li><datafield tag=&quot;504&quot; ind1=&quot; &quot; ind2=&quot; &quot;> </li></ul><ul><li><subfield code=&quot;a&quot;>Bibliography: p. 
21.</subfield> </li></ul><ul><li></datafield> </li></ul><ul><li><datafield tag=&quot;650&quot; ind1=&quot; &quot; ind2=&quot;0&quot;> </li></ul><ul><li><subfield code=&quot;a&quot;>Management</subfield> </li></ul><ul><li><subfield code=&quot;x&quot;>Mathematical models.</subfield> </li></ul><ul><li></datafield> </li></ul><ul><li><datafield tag=&quot;650&quot; ind1=&quot; &quot; ind2=&quot;0&quot;> </li></ul><ul><li><subfield code=&quot;a&quot;>Operations research.</subfield> </li></ul><ul><li></datafield> </li></ul><ul><li><datafield tag=&quot;700&quot; ind1=&quot;1&quot; ind2=&quot; &quot;> </li></ul><ul><li><subfield code=&quot;a&quot;>Khera, Inder Pal,</subfield> </li></ul><ul><li><subfield code=&quot;d&quot;>1937-</subfield> </li></ul><ul><li></datafield> </li></ul><ul><li></record> </li></ul>
    14. 14. Modified XML <ul><li><record xmlns=&quot;http://www.w3.org/1999/xhtml&quot;> </li></ul><ul><li><leader>00745cam a22002531a 45 </leader> </li></ul><ul><li><T001>10</T001> </li></ul><ul><li><T008>800215s1979 caua b 000 0 eng d</T008> </li></ul><ul><li><T020>0816200963</T020> </li></ul><ul><li><T090>HD30.25.A35</T090> </li></ul><ul><li><T100>Aggarwal, Raj.</T100> </li></ul><ul><li><T245>Management science :cases and applications /</T245> </li></ul><ul><li><T260>c1979.</T260> </li></ul><ul><li><T650>Management</T650> </li></ul><ul><li><T650>Mathematical models.</T650> </li></ul><ul><li><T650>Operations research.</T650> </li></ul><ul><li><T700>Khera, Inder Pal,</T700> </li></ul><ul><li></record> </li></ul>
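The flattening shown on the Modified XML slide can be sketched in Python. This is only an illustration under stated assumptions, not the transformation used in the talk: the set of tags kept (`SEARCH_TAGS`), the function name `flatten_record`, and the choice to emit one element per subfield are all hypothetical. The slide's own output merges some subfields (for example 245 a and b), and those rules are not given.

```python
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"

# Assumption: tags worth keeping for search, guessed from the slide's output.
SEARCH_TAGS = {"020", "090", "100", "245", "260", "650", "700"}


def flatten_record(record):
    """Return a <record> element whose children are named T<tag>."""
    flat = ET.Element("record")
    for field in record:
        tag = field.get("tag")
        if field.tag == MARC_NS + "controlfield":
            # Control fields carry their value directly as text.
            ET.SubElement(flat, "T" + tag).text = field.text
        elif field.tag == MARC_NS + "datafield" and tag in SEARCH_TAGS:
            # Assumption: one output element per subfield, no merging.
            for sub in field.findall(MARC_NS + "subfield"):
                ET.SubElement(flat, "T" + tag).text = (sub.text or "").strip()
        # Anything else (leader, unlisted tags) is dropped in this sketch.
    return flat


SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim"><record>
<controlfield tag="001">10</controlfield>
<datafield tag="035" ind1=" " ind2=" "><subfield code="a">ocm05985801</subfield></datafield>
<datafield tag="100" ind1="1" ind2=" "><subfield code="a">Aggarwal, Raj.</subfield></datafield>
</record></collection>"""

record = ET.fromstring(SAMPLE).find(MARC_NS + "record")
flat = flatten_record(record)
print(ET.tostring(flat, encoding="unicode"))
```

Dropping the namespace, indicators, and unsearched fields is what shrinks each record and gives the element names meaning, which is the point the slide makes.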
    15. 15. Scalability Testing <ul><li>eXist: </li></ul><ul><li>dbxml: </li></ul><ul><ul><li>Initial basic searches: 1.4s – 1.6s </li></ul></ul>
    16. 16. Query Optimization <ul><li>This is an important step since we are dealing with infant technology </li></ul><ul><li>dbxml has a query plan generator </li></ul><ul><li>eXist will soon have a query plan generator and a new query optimizer </li></ul>
    17. 17. Manual XQuery Optimization <ul><li>Initial Title Search: </li></ul><ul><li>for $record in $coll//marcxml:record[marcxml:datafield[@tag=&quot;245&quot;]/marcxml:subfield[@code=&quot;a&quot; and contains(., &quot;Augustine&quot;)]] </li></ul><ul><li>return $record </li></ul><ul><ul><li>116 Minutes </li></ul></ul><ul><li>Optimized Title Search: </li></ul><ul><li>for $record in $coll//marcxml:record[marcxml:datafield/marcxml:subfield[contains(., &quot;Augustine&quot;)]] </li></ul><ul><li>where $record/marcxml:datafield[@tag=&quot;245&quot;]/marcxml:subfield[@code=&quot;a&quot;] </li></ul><ul><li>return $record </li></ul><ul><ul><li>2.42 Minutes </li></ul></ul>
    18. 18. eXist in Production
        In Production Since August 2006
        top - 16:36:36 up 148 days, 3:11, 0 users, load average: 0.05, 0.01, 0.00
        Tasks: 124 total, 1 running, 123 sleeping, 0 stopped, 0 zombie
        Cpu(s): 0.0% us, 0.0% sy, 0.0% ni, 99.9% id, 0.0% wa, 0.0% hi, 0.0% si
        Mem: 8162268k total, 7070344k used, 1091924k free, 272152k buffers
        Swap: 16779852k total, 372296k used, 16407556k free, 4304256k cached
          PID USER    PR NI VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
        27996 root    16  0 416m 162m 2928 S  0.0  2.0 0:04.49 java
        16478 daemon  16  0 164m  46m  11m S  0.0  0.6 0:22.67 httpd
        17203 daemon  16  0 155m  37m  11m S  0.0  0.5 0:09.72 httpd
        25279 root    19  0 362m  35m 2508 S  0.0  0.4 0:05.17 java
        17213 daemon  16  0 144m  26m  11m S  0.0  0.3 0:04.25 httpd
        17233 daemon  16  0 143m  25m  11m S  0.0  0.3 0:01.46 httpd
        17242 daemon  16  0 142m  24m  11m S  0.0  0.3 0:00.54 httpd
        27445 root    16  0 361m  22m 2496 S  0.0  0.3 0:04.55 java
        17235 daemon  16  0 139m  21m  11m S  0.0  0.3 0:00.54 httpd
        17252 daemon  16  0 138m  21m  11m S  0.0  0.3 0:00.28 httpd
        20314 root    20  0 364m  20m 1308 S  0.0  0.3 0:04.60 java
        17231 daemon  15  0 137m  19m  11m S  0.0  0.2 0:00.36 httpd
        17241 daemon  16  0 133m  15m  10m S  0.0  0.2 0:00.06 httpd
        17236 daemon  16  0 133m  14m  10m S  0.0  0.2 0:00.02 httpd
        22294 root    16  0 133m  14m  10m S  0.0  0.2 0:00.79 httpd
    19. 19. Implementation <ul><ul><li>Create a web portal using a Native XML Database </li></ul></ul>
    20. 20. Performance <ul><li>The Good: </li></ul><ul><ul><li>Average searches: .9 seconds </li></ul></ul><ul><li>The Bad: </li></ul><ul><ul><li>More advanced queries can get as high as 12 - 15 seconds </li></ul></ul><ul><li>The Ugly: </li></ul><ul><ul><li>What happens when 10-50 simultaneous users search with advanced queries </li></ul></ul>
    21. 21. Implementation <ul><li>Needed to develop lots of search query translation algorithms due to missing Full Text Extension </li></ul><ul><ul><li>Convert author searches to Last Name, First Name </li></ul></ul><ul><ul><li>Convert call number searches to uppercase </li></ul></ul><ul><ul><li>Uppercase 1st letter of subject terms and lowercase all remaining characters </li></ul></ul><ul><ul><li>Many more … </li></ul></ul>
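The three translations listed on this slide might look roughly like the following Python sketch. The function names are hypothetical and the talk does not show its own code. The sketch also assumes that an author query without a comma is in "First Last" order and that a subject query is a single term, matching headings such as "Operations research." in the sample record.

```python
def normalize_author(query):
    """Turn 'First Last' into 'Last, First'; leave 'Last, First' alone."""
    query = " ".join(query.split())  # collapse runs of whitespace
    if "," in query or " " not in query:
        return query
    first, last = query.rsplit(" ", 1)  # assumption: last word is the surname
    return f"{last}, {first}"


def normalize_call_number(query):
    """Call numbers are stored in uppercase, so match that."""
    return query.strip().upper()


def normalize_subject(query):
    """Uppercase the first letter and lowercase all remaining characters."""
    return query.strip().capitalize()


print(normalize_author("Raj Aggarwal"))
print(normalize_call_number("hd30.25.a35"))
print(normalize_subject("operations RESEARCH"))
```

Without a full-text extension the database can only match the stored strings, so each query has to be reshaped to look like the catalog data before it is sent.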
    22. 22. Short Answer: <ul><ul><li>Not Yet! </li></ul></ul>
    23. 23. It’s a Sun Shiny Day <ul><li>Apache SOLR to the rescue! </li></ul><ul><li>SOLR implements a Lucene index on XML documents </li></ul><ul><li>SOLR is platform independent </li></ul><ul><ul><li>Runs as a Java web app </li></ul></ul><ul><ul><li>Interfaced via REST </li></ul></ul><ul><li>Lots of full-text searching tools </li></ul><ul><li>No standards-compliant interface </li></ul>
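Because SOLR is interfaced over HTTP, a search is just a URL. The sketch below builds one with the standard `q`, `start`, and `rows` parameters of the select handler. The base URL and the `title` field name are assumptions for illustration only; they depend on the local install and schema and are not given in the talk.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query, rows=10, start=0):
    """Build a URL for SOLR's select handler."""
    params = {"q": query, "start": start, "rows": rows}
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical base URL and field name.
url = solr_select_url("http://localhost:8983/solr", "title:Augustine")
print(url)
```

Fetching that URL returns the matching documents as XML, which an application controller can then transform for display.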
    24. 24. SOLR Performance <ul><li>Performance is astonishing </li></ul><ul><li>Average results in .1 seconds over 492,000+ records </li></ul><ul><li>Slower performance with built-in faceting </li></ul>
    25. 25. Easy Implementation <ul><li>XSL Stylesheet to convert MARCXML to SOLR XML </li></ul><ul><ul><li>Converted 492,107 records in 8973.6784570217 seconds </li></ul></ul><ul><ul><li>2.5 Hours </li></ul></ul><ul><li>SOLR Import </li></ul><ul><ul><li>3 Hours </li></ul></ul>
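The talk did this conversion with an XSL stylesheet, which is not shown. As a stand-in, the Python sketch below produces the same kind of output, a SOLR XML update message of the form `<add><doc><field name="...">`. The field names (`id`, `title`, `subject`) and the function name `to_solr_add` are hypothetical; real names would come from the SOLR schema.

```python
import xml.etree.ElementTree as ET


def to_solr_add(records):
    """Serialize dicts of field name -> list of values as a SOLR <add> message."""
    add = ET.Element("add")
    for rec in records:
        doc = ET.SubElement(add, "doc")
        for name, values in rec.items():
            for value in values:
                # Repeated fields (e.g. several subjects) get one element each.
                ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")


# Hypothetical field names; values taken from the sample record in the slides.
message = to_solr_add([
    {
        "id": ["10"],
        "title": ["Management science"],
        "subject": ["Management", "Operations research."],
    }
])
print(message)
```

The resulting message is what gets posted to SOLR's update handler during the import step.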
    26. 26. SOLR vs NXDB <ul><li>NXDB </li></ul><ul><ul><li>Native XML storage system </li></ul></ul><ul><ul><li>Standards: XQuery, XPath, XUpdate, etc. </li></ul></ul><ul><ul><li>XQuery </li></ul></ul><ul><ul><ul><li>Inherent XSL transformation </li></ul></ul></ul><ul><ul><ul><li>Logical language </li></ul></ul></ul><ul><li>SOLR </li></ul><ul><ul><li>Plain out fast </li></ul></ul><ul><ul><li>Non-native data format </li></ul></ul><ul><ul><li>Very simple interface though non standard query format </li></ul></ul><ul><ul><li>No logical query language </li></ul></ul>
    27. 27. Other untested options <ul><li>Nux – Berkeley Lab </li></ul><ul><ul><li>Implements an XQuery interface to Lucene </li></ul></ul><ul><ul><li>No REST interface </li></ul></ul><ul><li>XTF – CDL </li></ul><ul><ul><li>Indexes XML in Lucene </li></ul></ul><ul><ul><li>No XQuery? – Java only </li></ul></ul><ul><li>Cheshire3 – UC Berkeley </li></ul><ul><ul><li>Native XML indexer </li></ul></ul><ul><ul><li>Not as popular as Lucene </li></ul></ul><ul><li>Sedna NXDB </li></ul><ul><li>Commercial NXDBs </li></ul>
    28. 28. Beta OPAC <ul><li>SOLR Backend </li></ul><ul><li>AJAX functions for Non-Bib Data Lookup </li></ul><ul><li>Combine Catalog, Article Search, Digital Library into 1 single search interface </li></ul><ul><li>Central web-accessible location for saving records </li></ul><ul><li>Tags and Comments for Web 2.0 fun </li></ul><ul><li>Beta Implementation </li></ul>
    29. 29. Questions?
