SlideShare a Scribd company logo
LuSql: (Quickly and easily)Getting your
  data from your DBMS into Lucene
                Glen Newton
         CISTI Research, CISTI NRC
                code4lib 2009
        Providence Rhode Island Feb 24 2009
Outline

•   What is LuSql?
•   Context
•   Examples
•   Performance & comparisons
•   Next version
Context

• CISTI == Canada National Science Library
• Digital Library Research Group
• Heavy text mining, knowledge discovery tools, information
  visualization, citation analysis, recommender systems
• Large local text collection:
   – 8.4M PDFs, full text & metadata (~700GB)
   – Full text on file system
   – Metadata in MySql
• Team of 4: Lucene expert; 3 needing to use Lucene
• Daily creation of some experiment/domain/foo – specific large
  scale Lucene index
LuSql Rationale

• Need for low barrier, high performance, flexible tool for Lucene
  index creation
• Choice
   – SOLR
   – DBSight
   – Lucen4DB.net
   – Hibernate Search
   – Compass
• All one or more of:
   – Overly complicated for non-Lucene / non-Java / non-XML / non-
      framework users
   – Performance/scalability issues
   – Not Open Source Software (OSS)
LuSql

• User knowledge:
   – Knowledge of SQL
   – Knowledge of their database and tables
   – Ability to set the Java CLASSPATH in a command-line shell
   – Ability to run a command line application
LuSql Command Line
                                  Arguments
–   Create or append
–   SQL
–   JDBC URL
–   # records to index
–   Lucene Analyzer class
–   JDBC driver class
–   Indexing properties, global or by field
–   Global term value (i.e. all documents have quot;source=catquot;)
–   Lucene RAM buffer size
–   Stop word file
–   Lucene index directory
–   Multithreading toggle, #threads
–   Pluggable Document filter
–   Subqueries
Examples

java jar lusql.jar -q quot;select * from Article where
  volumeYear > 2007quot; -c
  quot;jdbc:mysql://dbhost/db?user=ID&password=PASSquot; -n
  5 -l tutorial -I 211 -t
Index Term Properties

• Index: Default:TOKENIZED
   – 0:NO
   – 1:NO_NORMS
   – 2:TOKENIZED
   – 3:UN_TOKENIZED
• Store: Default:YES
   – 0:NO
   – 1:YES
   – 2:COMPRESS
• Term vector: Default:YES
   – 0:NO
   – 1:YES
Example 2: Complex
                         Join
java -jar lusql.jar -q quot;select Publisher.name as
  pub, Journal.title as jo,Article.rawUrl as text ,
  Journal.issn, Volume.number as
  vol,Volume.coverYear as year, Issue.number as
  iss, Article.id as id, Article.title as ti,
  Article.abstract as ab, Article.startPage as
  startPage, Article.endPage as endPage from
  Publisher, Journal, Volume, Issue, Article where
  Publisher.id = Journal.publisherId and Journal.id
  = Volume.journalId and Volume.id=Issue.volumeId
  and Issue.id = Article.issueIdquot; -c quot;jdbc:mysql
  ://dbhost/db?user=ID&password=PASSquot; -n 50000 -l
  tutorial_2
Example 2: Large Scale
                                    Index Propertries
•   6.4M articles (only metadata)
•   20 fields, including abstract
•   Indexing time: 1h 34m
•   Index size: 21GB
Example 3: Complex
                         Join
java -jar lusql.jar -q quot;select Publisher.name as
  pub, Journal.title as jo,Article.rawUrl as text ,
  Journal.issn, Volume.number as
  vol,Volume.coverYear as year, Issue.number as
  iss, Article.id as id, Article.title as ti,
  Article.abstract as ab, Article.startPage as
  startPage, Article.endPage as endPage from
  Publisher, Journal, Volume, Issue, Article where
  Publisher.id = Journal.publisherId and Journal.id
  = Volume.journalId and Volume.id=Issue.volumeId
  and Issue.id = Article.issueIdquot; -c quot;jdbc:mysql
  ://dbhost/db?user=ID&password=PASSquot; -n 50000 -l
  tutorial_2
Example 4: Out-of-band
                                 document manipulation
• Plugin architecture allowing arbitrary manipulations of Documents
  before they go into the index
• Implement DocFilter interface
• Add filter class at command line:
   – -f ca.nrc.cisti.lusql.example.FileFullTextFilter
   – Looks in metadata field for PDF location in file system; finds
     corresponding .txt file; reads file & adds to Document
Example 4: Large Scale
                                   Index Properties
•   6.4M articles (metadata & full-text), ~600GB PDFs
•   21 fields, including abstract & full-text
•   Indexing time: 13h 46m
•   Index size: 86GB
Comparison to SOLR

• SOLR 1.4 November build:
   – Using DataImportHandler, with all defaults
• Lucene 2.4
• LuSql 0.90
Hardware
• Indexing and database machines:
   – Dell PowerEdge 1955 Blade server56
   – CPU: 2 x dual-core Xeon 5050 processors with 2x2MB cache, 3.0
     Ghz 64bit
   – Memory: 8 GB 667MHz
   – Disk: 2 x 73GB internal 10K RPM SAS drives
• Both machines attached to:
   – Dell EMC AX150 storage arrays
   – 12 x 500 GB SATA II 7.2K RPM disks
   – via:
   – SilkWorm 200E57 Series 16-Port Capable 4Gb Fabric Switch
Software
• MySql: v5.0.45 compiled from source.
• gcc: gcc version 4.1.2 20061115 (prerelease) (SUSE Linux)
• Java: java version 1.6.0 07 SE Runtime Environment (build 1.6.0
  07-b06) Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23,
  mixed mode)
• Operating System
   – Linux openSUSE 10.2 (64-bit X86-64)
   – Linux kernel: 2.6.18.8-0.10-default #1 SMP
Comparison to SOLR
Comparison to SOLR
Acknowledgments

• Greg Kresko, Andre Vellino, Jeff Demaine, various LuSql users
LuSql 0.95 in
                                   development
• Re-architected to have pluggable drivers for both input & output
• Read drivers:
   – JDBC, Lucene, Minion, BDB. ehcache, SolrJ, RMI, Terrier
• Write drivers:
   – JDBC, Lucene, Minion, BDB. ehcache, SolrJ, text, XML, RMI,
     Terrier
• For large volumes, concurrent multiple indexes merged at end
Questions

• Glen Newton glen.newton@nrc-cnrc.gc.ca
LuSql: (Quickly and easily) Getting your data from your DBMS into Lucene

More Related Content

What's hot

Cassandra as Memcache
Cassandra as MemcacheCassandra as Memcache
Cassandra as Memcache
Edward Capriolo
 
Using memcache to improve php performance
Using memcache to improve php performanceUsing memcache to improve php performance
Using memcache to improve php performance
Sudar Muthu
 
Introduction to PostgreSQL for System Administrators
Introduction to PostgreSQL for System AdministratorsIntroduction to PostgreSQL for System Administrators
Introduction to PostgreSQL for System Administrators
Jignesh Shah
 
8. key value databases laboratory
8. key value databases laboratory 8. key value databases laboratory
8. key value databases laboratory
Fabio Fumarola
 
Leonid Vasilyev "Building, deploying and running production code at Dropbox"
Leonid Vasilyev  "Building, deploying and running production code at Dropbox"Leonid Vasilyev  "Building, deploying and running production code at Dropbox"
Leonid Vasilyev "Building, deploying and running production code at Dropbox"
IT Event
 
Implementing High Availability Caching with Memcached
Implementing High Availability Caching with MemcachedImplementing High Availability Caching with Memcached
Implementing High Availability Caching with Memcached
Gear6
 
캐시 분산처리 인프라
캐시 분산처리 인프라캐시 분산처리 인프라
캐시 분산처리 인프라
Park Chunduck
 
ProxySQL - High Performance and HA Proxy for MySQL
ProxySQL - High Performance and HA Proxy for MySQLProxySQL - High Performance and HA Proxy for MySQL
ProxySQL - High Performance and HA Proxy for MySQL
René Cannaò
 
Using advanced options in MariaDB Connector/J
Using advanced options in MariaDB Connector/JUsing advanced options in MariaDB Connector/J
Using advanced options in MariaDB Connector/J
MariaDB plc
 
Proxysql sharding
Proxysql shardingProxysql sharding
Proxysql sharding
Marco Tusa
 
MySQL High Availability Sprint: Launch the Pacemaker
MySQL High Availability Sprint: Launch the PacemakerMySQL High Availability Sprint: Launch the Pacemaker
MySQL High Availability Sprint: Launch the Pacemaker
hastexo
 
Level DB - Quick Cheat Sheet
Level DB - Quick Cheat SheetLevel DB - Quick Cheat Sheet
Level DB - Quick Cheat Sheet
Aniruddha Chakrabarti
 
Proxysql ha plam_2016_2_keynote
Proxysql ha plam_2016_2_keynoteProxysql ha plam_2016_2_keynote
Proxysql ha plam_2016_2_keynote
Marco Tusa
 
CUBRID Cluster Introduction
CUBRID Cluster IntroductionCUBRID Cluster Introduction
CUBRID Cluster Introduction
CUBRID
 
Varnish Configuration Step by Step
Varnish Configuration Step by StepVarnish Configuration Step by Step
Varnish Configuration Step by Step
Kim Stefan Lindholm
 
How to scale your web app
How to scale your web appHow to scale your web app
How to scale your web app
Georgio_1999
 
Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...
Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...
Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...
Mydbops
 
ProxySQL for MySQL
ProxySQL for MySQLProxySQL for MySQL
ProxySQL for MySQL
Mydbops
 
plProxy, pgBouncer, pgBalancer
plProxy, pgBouncer, pgBalancerplProxy, pgBouncer, pgBalancer
plProxy, pgBouncer, pgBalancer
elliando dias
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
Siddharth Mathur
 

What's hot (20)

Cassandra as Memcache
Cassandra as MemcacheCassandra as Memcache
Cassandra as Memcache
 
Using memcache to improve php performance
Using memcache to improve php performanceUsing memcache to improve php performance
Using memcache to improve php performance
 
Introduction to PostgreSQL for System Administrators
Introduction to PostgreSQL for System AdministratorsIntroduction to PostgreSQL for System Administrators
Introduction to PostgreSQL for System Administrators
 
8. key value databases laboratory
8. key value databases laboratory 8. key value databases laboratory
8. key value databases laboratory
 
Leonid Vasilyev "Building, deploying and running production code at Dropbox"
Leonid Vasilyev  "Building, deploying and running production code at Dropbox"Leonid Vasilyev  "Building, deploying and running production code at Dropbox"
Leonid Vasilyev "Building, deploying and running production code at Dropbox"
 
Implementing High Availability Caching with Memcached
Implementing High Availability Caching with MemcachedImplementing High Availability Caching with Memcached
Implementing High Availability Caching with Memcached
 
캐시 분산처리 인프라
캐시 분산처리 인프라캐시 분산처리 인프라
캐시 분산처리 인프라
 
ProxySQL - High Performance and HA Proxy for MySQL
ProxySQL - High Performance and HA Proxy for MySQLProxySQL - High Performance and HA Proxy for MySQL
ProxySQL - High Performance and HA Proxy for MySQL
 
Using advanced options in MariaDB Connector/J
Using advanced options in MariaDB Connector/JUsing advanced options in MariaDB Connector/J
Using advanced options in MariaDB Connector/J
 
Proxysql sharding
Proxysql shardingProxysql sharding
Proxysql sharding
 
MySQL High Availability Sprint: Launch the Pacemaker
MySQL High Availability Sprint: Launch the PacemakerMySQL High Availability Sprint: Launch the Pacemaker
MySQL High Availability Sprint: Launch the Pacemaker
 
Level DB - Quick Cheat Sheet
Level DB - Quick Cheat SheetLevel DB - Quick Cheat Sheet
Level DB - Quick Cheat Sheet
 
Proxysql ha plam_2016_2_keynote
Proxysql ha plam_2016_2_keynoteProxysql ha plam_2016_2_keynote
Proxysql ha plam_2016_2_keynote
 
CUBRID Cluster Introduction
CUBRID Cluster IntroductionCUBRID Cluster Introduction
CUBRID Cluster Introduction
 
Varnish Configuration Step by Step
Varnish Configuration Step by StepVarnish Configuration Step by Step
Varnish Configuration Step by Step
 
How to scale your web app
How to scale your web appHow to scale your web app
How to scale your web app
 
Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...
Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...
Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...
 
ProxySQL for MySQL
ProxySQL for MySQLProxySQL for MySQL
ProxySQL for MySQL
 
plProxy, pgBouncer, pgBalancer
plProxy, pgBouncer, pgBalancerplProxy, pgBouncer, pgBalancer
plProxy, pgBouncer, pgBalancer
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 

Viewers also liked

Why libraries should embrace Linked Data
Why libraries should embrace Linked DataWhy libraries should embrace Linked Data
Why libraries should embrace Linked Data
eby
 
DLF ILS Discovery Interface Task Force API recommendation
DLF ILS Discovery Interface Task Force API recommendationDLF ILS Discovery Interface Task Force API recommendation
DLF ILS Discovery Interface Task Force API recommendation
eby
 
Discovery Interfaces
Discovery InterfacesDiscovery Interfaces
Discovery Interfaces
Jonathan-Andornot
 
Nichols & Co. Photography Brochure
Nichols & Co. Photography BrochureNichols & Co. Photography Brochure
Nichols & Co. Photography Brochure
guest1b72b3
 
RESTafarian-ism at the NLA
RESTafarian-ism at the NLARESTafarian-ism at the NLA
RESTafarian-ism at the NLA
eby
 
Like a can opener for your data silo: simple access through AtomPub and Jangle
Like a can opener for your data silo: simple access through AtomPub and JangleLike a can opener for your data silo: simple access through AtomPub and Jangle
Like a can opener for your data silo: simple access through AtomPub and Jangle
eby
 
Como evitar enfermarse
Como evitar enfermarseComo evitar enfermarse
Como evitar enfermarse
correosencadena
 
CouchDB is sacrilege... mmm, delicious sacrilege
CouchDB is sacrilege... mmm, delicious sacrilegeCouchDB is sacrilege... mmm, delicious sacrilege
CouchDB is sacrilege... mmm, delicious sacrilege
eby
 
The Future of Vertical Search Engines
The Future of Vertical Search EnginesThe Future of Vertical Search Engines
The Future of Vertical Search Engines
Ted Drake
 
BABOK Study Group - meeting 4
BABOK Study Group - meeting 4BABOK Study Group - meeting 4
BABOK Study Group - meeting 4
Paweł Zubkiewicz
 

Viewers also liked (10)

Why libraries should embrace Linked Data
Why libraries should embrace Linked DataWhy libraries should embrace Linked Data
Why libraries should embrace Linked Data
 
DLF ILS Discovery Interface Task Force API recommendation
DLF ILS Discovery Interface Task Force API recommendationDLF ILS Discovery Interface Task Force API recommendation
DLF ILS Discovery Interface Task Force API recommendation
 
Discovery Interfaces
Discovery InterfacesDiscovery Interfaces
Discovery Interfaces
 
Nichols & Co. Photography Brochure
Nichols & Co. Photography BrochureNichols & Co. Photography Brochure
Nichols & Co. Photography Brochure
 
RESTafarian-ism at the NLA
RESTafarian-ism at the NLARESTafarian-ism at the NLA
RESTafarian-ism at the NLA
 
Like a can opener for your data silo: simple access through AtomPub and Jangle
Like a can opener for your data silo: simple access through AtomPub and JangleLike a can opener for your data silo: simple access through AtomPub and Jangle
Like a can opener for your data silo: simple access through AtomPub and Jangle
 
Como evitar enfermarse
Como evitar enfermarseComo evitar enfermarse
Como evitar enfermarse
 
CouchDB is sacrilege... mmm, delicious sacrilege
CouchDB is sacrilege... mmm, delicious sacrilegeCouchDB is sacrilege... mmm, delicious sacrilege
CouchDB is sacrilege... mmm, delicious sacrilege
 
The Future of Vertical Search Engines
The Future of Vertical Search EnginesThe Future of Vertical Search Engines
The Future of Vertical Search Engines
 
BABOK Study Group - meeting 4
BABOK Study Group - meeting 4BABOK Study Group - meeting 4
BABOK Study Group - meeting 4
 

Similar to LuSql: (Quickly and easily) Getting your data from your DBMS into Lucene

My Sql And Search At Craigslist
My Sql And Search At CraigslistMy Sql And Search At Craigslist
My Sql And Search At Craigslist
MySQLConference
 
Experience sql server on l inux and docker
Experience sql server on l inux and dockerExperience sql server on l inux and docker
Experience sql server on l inux and docker
Bob Ward
 
Brk2051 sql server on linux and docker
Brk2051 sql server on linux and dockerBrk2051 sql server on linux and docker
Brk2051 sql server on linux and docker
Bob Ward
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Data Con LA
 
Ceph
CephCeph
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Odinot Stanislas
 
Setting up repositories
Setting up repositoriesSetting up repositories
Setting up repositories
Iryna Kuchma
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
clairvoyantllc
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
Stavros Kontopoulos
 
Ml2
Ml2Ml2
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
Alex Zeltov
 
Evergreen Sysadmin Survival Skills
Evergreen Sysadmin Survival SkillsEvergreen Sysadmin Survival Skills
Evergreen Sysadmin Survival Skills
Evergreen ILS
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
Chris Purrington
 
Vijfhart thema-avond-oracle-12c-new-features
Vijfhart thema-avond-oracle-12c-new-featuresVijfhart thema-avond-oracle-12c-new-features
Vijfhart thema-avond-oracle-12c-new-features
mkorremans
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Databricks
 
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
Glenn K. Lockwood
 
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsUncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
DataWorks Summit
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
DataWorks Summit
 

Similar to LuSql: (Quickly and easily) Getting your data from your DBMS into Lucene (20)

My Sql And Search At Craigslist
My Sql And Search At CraigslistMy Sql And Search At Craigslist
My Sql And Search At Craigslist
 
Experience sql server on l inux and docker
Experience sql server on l inux and dockerExperience sql server on l inux and docker
Experience sql server on l inux and docker
 
Brk2051 sql server on linux and docker
Brk2051 sql server on linux and dockerBrk2051 sql server on linux and docker
Brk2051 sql server on linux and docker
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 
Ceph
CephCeph
Ceph
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
 
Setting up repositories
Setting up repositoriesSetting up repositories
Setting up repositories
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
Ml2
Ml2Ml2
Ml2
 
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
 
Evergreen Sysadmin Survival Skills
Evergreen Sysadmin Survival SkillsEvergreen Sysadmin Survival Skills
Evergreen Sysadmin Survival Skills
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Vijfhart thema-avond-oracle-12c-new-features
Vijfhart thema-avond-oracle-12c-new-featuresVijfhart thema-avond-oracle-12c-new-features
Vijfhart thema-avond-oracle-12c-new-features
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
 
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsUncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 

More from eby

Using a CSS Framework
Using a CSS FrameworkUsing a CSS Framework
Using a CSS Framework
eby
 
XForms for Metadata creation
XForms for Metadata creationXForms for Metadata creation
XForms for Metadata creation
eby
 
Karen Coyle Keynote - R&D: Can Resource Description become Rigorous Data?
Karen Coyle Keynote - R&D: Can Resource Description become Rigorous Data?Karen Coyle Keynote - R&D: Can Resource Description become Rigorous Data?
Karen Coyle Keynote - R&D: Can Resource Description become Rigorous Data?
eby
 
ÖpënÜRL
ÖpënÜRLÖpënÜRL
ÖpënÜRL
eby
 
Building Mountains Out of Molehills
Building Mountains Out of MolehillsBuilding Mountains Out of Molehills
Building Mountains Out of Molehills
eby
 
Zotero and You, or Bibliography on the Semantic Web
Zotero and You, or Bibliography on the Semantic WebZotero and You, or Bibliography on the Semantic Web
Zotero and You, or Bibliography on the Semantic Web
eby
 
Creating an Academic Image Collection with Flickr
Creating an Academic Image Collection with FlickrCreating an Academic Image Collection with Flickr
Creating an Academic Image Collection with Flickr
eby
 
From Idea to Open Source
From Idea to Open SourceFrom Idea to Open Source
From Idea to Open Source
eby
 
Obstacles to Agility
Obstacles to AgilityObstacles to Agility
Obstacles to Agility
eby
 
The Intellectual Property Disclosure Process: Releasing Open Source Software ...
The Intellectual Property Disclosure Process: Releasing Open Source Software ...The Intellectual Property Disclosure Process: Releasing Open Source Software ...
The Intellectual Property Disclosure Process: Releasing Open Source Software ...
eby
 
Code4Lib 2007: Erik Hatcher Keynote
Code4Lib 2007: Erik Hatcher KeynoteCode4Lib 2007: Erik Hatcher Keynote
Code4Lib 2007: Erik Hatcher Keynote
eby
 
Library Data APIs Abound!
Library Data APIs Abound!Library Data APIs Abound!
Library Data APIs Abound!
eby
 
Smart Subjects - Application Independent Subject Recommendations
Smart Subjects - Application Independent Subject RecommendationsSmart Subjects - Application Independent Subject Recommendations
Smart Subjects - Application Independent Subject Recommendations
eby
 
On The Herding of Cats
On The Herding of CatsOn The Herding of Cats
On The Herding of Cats
eby
 
Code4Lib 2007: MyResearch Portal
Code4Lib 2007: MyResearch PortalCode4Lib 2007: MyResearch Portal
Code4Lib 2007: MyResearch Portal
eby
 
Code4Lib 2007: Hurry up please, it's time
Code4Lib 2007: Hurry up please, it's timeCode4Lib 2007: Hurry up please, it's time
Code4Lib 2007: Hurry up please, it's time
eby
 
Evergreen - Future of the ILS
Evergreen - Future of the ILSEvergreen - Future of the ILS
Evergreen - Future of the ILS
eby
 
Lucene Summit - Beth's Presentation
Lucene Summit - Beth's PresentationLucene Summit - Beth's Presentation
Lucene Summit - Beth's Presentation
eby
 

More from eby (18)

Using a CSS Framework
Using a CSS FrameworkUsing a CSS Framework
Using a CSS Framework
 
XForms for Metadata creation
XForms for Metadata creationXForms for Metadata creation
XForms for Metadata creation
 
Karen Coyle Keynote - R&D: Can Resource Description become Rigorous Data?
Karen Coyle Keynote - R&D: Can Resource Description become Rigorous Data?Karen Coyle Keynote - R&D: Can Resource Description become Rigorous Data?
Karen Coyle Keynote - R&D: Can Resource Description become Rigorous Data?
 
ÖpënÜRL
ÖpënÜRLÖpënÜRL
ÖpënÜRL
 
Building Mountains Out of Molehills
Building Mountains Out of MolehillsBuilding Mountains Out of Molehills
Building Mountains Out of Molehills
 
Zotero and You, or Bibliography on the Semantic Web
Zotero and You, or Bibliography on the Semantic WebZotero and You, or Bibliography on the Semantic Web
Zotero and You, or Bibliography on the Semantic Web
 
Creating an Academic Image Collection with Flickr
Creating an Academic Image Collection with FlickrCreating an Academic Image Collection with Flickr
Creating an Academic Image Collection with Flickr
 
From Idea to Open Source
From Idea to Open SourceFrom Idea to Open Source
From Idea to Open Source
 
Obstacles to Agility
Obstacles to AgilityObstacles to Agility
Obstacles to Agility
 
The Intellectual Property Disclosure Process: Releasing Open Source Software ...
The Intellectual Property Disclosure Process: Releasing Open Source Software ...The Intellectual Property Disclosure Process: Releasing Open Source Software ...
The Intellectual Property Disclosure Process: Releasing Open Source Software ...
 
Code4Lib 2007: Erik Hatcher Keynote
Code4Lib 2007: Erik Hatcher KeynoteCode4Lib 2007: Erik Hatcher Keynote
Code4Lib 2007: Erik Hatcher Keynote
 
Library Data APIs Abound!
Library Data APIs Abound!Library Data APIs Abound!
Library Data APIs Abound!
 
Smart Subjects - Application Independent Subject Recommendations
Smart Subjects - Application Independent Subject RecommendationsSmart Subjects - Application Independent Subject Recommendations
Smart Subjects - Application Independent Subject Recommendations
 
On The Herding of Cats
On The Herding of CatsOn The Herding of Cats
On The Herding of Cats
 
Code4Lib 2007: MyResearch Portal
Code4Lib 2007: MyResearch PortalCode4Lib 2007: MyResearch Portal
Code4Lib 2007: MyResearch Portal
 
Code4Lib 2007: Hurry up please, it's time
Code4Lib 2007: Hurry up please, it's timeCode4Lib 2007: Hurry up please, it's time
Code4Lib 2007: Hurry up please, it's time
 
Evergreen - Future of the ILS
Evergreen - Future of the ILSEvergreen - Future of the ILS
Evergreen - Future of the ILS
 
Lucene Summit - Beth's Presentation
Lucene Summit - Beth's PresentationLucene Summit - Beth's Presentation
Lucene Summit - Beth's Presentation
 

Recently uploaded

Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
TIPNGVN2
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Zilliz
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Zilliz
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 

Recently uploaded (20)

Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 

LuSql: (Quickly and easily) Getting your data from your DBMS into Lucene

  • 1. LuSql: (Quickly and easily)Getting your data from your DBMS into Lucene Glen Newton CISTI Research, CISTI NRC code4lib 2009 Providence Rhode Island Feb 24 2009
  • 2. Outline • What is LuSql? • Context • Examples • Performance & comparisons • Next version
  • 3. Context • CISTI == Canada National Science Library • Digital Library Research Group • Heavy text mining, knowledge discovery tools, information visualization, citation analysis, recommender systems • Large local text collection: – 8.4M PDFs, full text & metadata (~700GB) – Full text on file system – Metadata in MySql • Team of 4: Lucene expert; 3 needing to use Lucene • Daily creation of some experiment/domain/foo – specific large scale Lucene index
  • 4. LuSql Rationale • Need for low barrier, high performance, flexible tool for Lucene index creation • Choice – SOLR – DBSight – Lucen4DB.net – Hibernate Search – Compass • All one or more of: – Overly complicated for non-Lucene / non-Java / non-XML / non- framework users – Performance/scalability issues – Not Open Source Software (OSS)
  • 5. LuSql • User knowledge: – Knowledge of SQL – Knowledge of their database and tables – Ability to set the Java CLASSPATH in a command-line shell – Ability to run a command line application
  • 6. LuSql Command Line Arguments – Create or append – SQL – JDBC URL – # records to index – Lucene Analyzer class – JDBC driver class – Indexing properties, global or by field – Global term value (i.e. all documents have quot;source=catquot;) – Lucene RAM buffer size – Stop word file – Lucene index directory – Multithreading toggle, #threads – Pluggable Document filter – Subqueries
  • 7.
  • 8. Examples java jar lusql.jar -q quot;select * from Article where volumeYear > 2007quot; -c quot;jdbc:mysql://dbhost/db?user=ID&password=PASSquot; -n 5 -l tutorial -I 211 -t
  • 9. Index Term Properties • Index: Default:TOKENIZED – 0:NO – 1:NO_NORMS – 2:TOKENIZED – 3:UN_TOKENIZED • Store: Default:YES – 0:NO – 1:YES – 2:COMPRESS • Term vector: Default:YES – 0:NO – 1:YES
  • 10.
  • 11. Example 2: Complex Join java -jar lusql.jar -q quot;select Publisher.name as pub, Journal.title as jo,Article.rawUrl as text , Journal.issn, Volume.number as vol,Volume.coverYear as year, Issue.number as iss, Article.id as id, Article.title as ti, Article.abstract as ab, Article.startPage as startPage, Article.endPage as endPage from Publisher, Journal, Volume, Issue, Article where Publisher.id = Journal.publisherId and Journal.id = Volume.journalId and Volume.id=Issue.volumeId and Issue.id = Article.issueIdquot; -c quot;jdbc:mysql ://dbhost/db?user=ID&password=PASSquot; -n 50000 -l tutorial_2
  • 12.
  • 13. Example 2: Large Scale Index Propertries • 6.4M articles (only metadata) • 20 fields, including abstract • Indexing time: 1h 34m • Index size: 21GB
  • 14.
  • 15.
  • 16. Example 3: Complex Join java -jar lusql.jar -q quot;select Publisher.name as pub, Journal.title as jo,Article.rawUrl as text , Journal.issn, Volume.number as vol,Volume.coverYear as year, Issue.number as iss, Article.id as id, Article.title as ti, Article.abstract as ab, Article.startPage as startPage, Article.endPage as endPage from Publisher, Journal, Volume, Issue, Article where Publisher.id = Journal.publisherId and Journal.id = Volume.journalId and Volume.id=Issue.volumeId and Issue.id = Article.issueIdquot; -c quot;jdbc:mysql ://dbhost/db?user=ID&password=PASSquot; -n 50000 -l tutorial_2
  • 17. Example 4: Out-of-band document manipulation • Plugin architecture allowing arbitrary manipulations of Documents before they go into the index • Implement DocFilter interface • Add filter class at command line: – -f ca.nrc.cisti.lusql.example.FileFullTextFilter – Looks in metadata field for PDF location in file system; finds corresponding .txt file; reads file & adds to Document
  • 18. Example 4: Large Scale Index Properties • 6.4M articles (metadata & full-text), ~600GB PDFs • 21 fields, including abstract & full-text • Indexing time: 13h 46m • Index size: 86GB
  • 19. Comparison to SOLR • SOLR 1.4 November build: – Using DataImportHandler, with all defaults • Lucene 2.4 • LuSql 0.90
  • 20. Hardware • Indexing and database machines: – Dell PowerEdge 1955 Blade server56 – CPU: 2 x dual-core Xeon 5050 processors with 2x2MB cache, 3.0 Ghz 64bit – Memory: 8 GB 667MHz – Disk: 2 x 73GB internal 10K RPM SAS drives • Both machines attached to: – Dell EMC AX150 storage arrays – 12 x 500 GB SATA II 7.2K RPM disks – via: – SilkWorm 200E57 Series 16-Port Capable 4Gb Fabric Switch
  • 21. Software • MySql: v5.0.45 compiled from source. • gcc: gcc version 4.1.2 20061115 (prerelease) (SUSE Linux) • Java: java version 1.6.0 07 SE Runtime Environment (build 1.6.0 07-b06) Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode) • Operating System – Linux openSUSE 10.2 (64-bit X86-64) – Linux kernel: 2.6.18.8-0.10-default #1 SMP
  • 24.
  • 25.
  • 26. Acknowledgments • Greg Kresko, Andre Vellino, Jeff Demaine, various LuSql users
  • 27. LuSql 0.95 in development • Re-architected to have pluggable drivers for both input & output • Read drivers: – JDBC, Lucene, Minion, BDB. ehcache, SolrJ, RMI, Terrier • Write drivers: – JDBC, Lucene, Minion, BDB. ehcache, SolrJ, text, XML, RMI, Terrier • For large volumes, concurrent multiple indexes merged at end
  • 28. Questions • Glen Newton glen.newton@nrc-cnrc.gc.ca