SlideShare a Scribd company logo
1 of 29
Download to read offline
LuSql: (Quickly and easily)Getting your
  data from your DBMS into Lucene
                Glen Newton
         CISTI Research, CISTI NRC
                code4lib 2009
        Providence Rhode Island Feb 24 2009
Outline

•   What is LuSql?
•   Context
•   Examples
•   Performance & comparisons
•   Next version
Context

• CISTI == Canada National Science Library
• Digital Library Research Group
• Heavy text mining, knowledge discovery tools, information
  visualization, citation analysis, recommender systems
• Large local text collection:
   – 8.4M PDFs, full text & metadata (~700GB)
   – Full text on file system
   – Metadata in MySql
• Team of 4: Lucene expert; 3 needing to use Lucene
• Daily creation of some experiment/domain/foo – specific large
  scale Lucene index
LuSql Rationale

• Need for low barrier, high performance, flexible tool for Lucene
  index creation
• Choice
   – SOLR
   – DBSight
   – Lucen4DB.net
   – Hibernate Search
   – Compass
• All one or more of:
   – Overly complicated for non-Lucene / non-Java / non-XML / non-
      framework users
   – Performance/scalability issues
   – Not Open Source Software (OSS)
LuSql

• User knowledge:
   – Knowledge of SQL
   – Knowledge of their database and tables
   – Ability to set the Java CLASSPATH in a command-line shell
   – Ability to run a command line application
LuSql Command Line
                                  Arguments
–   Create or append
–   SQL
–   JDBC URL
–   # records to index
–   Lucene Analyzer class
–   JDBC driver class
–   Indexing properties, global or by field
–   Global term value (i.e. all documents have quot;source=catquot;)
–   Lucene RAM buffer size
–   Stop word file
–   Lucene index directory
–   Multithreading toggle, #threads
–   Pluggable Document filter
–   Subqueries
Examples

java jar lusql.jar -q quot;select * from Article where
  volumeYear > 2007quot; -c
  quot;jdbc:mysql://dbhost/db?user=ID&password=PASSquot; -n
  5 -l tutorial -I 211 -t
Index Term Properties

• Index: Default:TOKENIZED
   – 0:NO
   – 1:NO_NORMS
   – 2:TOKENIZED
   – 3:UN_TOKENIZED
• Store: Default:YES
   – 0:NO
   – 1:YES
   – 2:COMPRESS
• Term vector: Default:YES
   – 0:NO
   – 1:YES
Example 2: Complex
                         Join
java -jar lusql.jar -q quot;select Publisher.name as
  pub, Journal.title as jo,Article.rawUrl as text ,
  Journal.issn, Volume.number as
  vol,Volume.coverYear as year, Issue.number as
  iss, Article.id as id, Article.title as ti,
  Article.abstract as ab, Article.startPage as
  startPage, Article.endPage as endPage from
  Publisher, Journal, Volume, Issue, Article where
  Publisher.id = Journal.publisherId and Journal.id
  = Volume.journalId and Volume.id=Issue.volumeId
  and Issue.id = Article.issueIdquot; -c quot;jdbc:mysql
  ://dbhost/db?user=ID&password=PASSquot; -n 50000 -l
  tutorial_2
Example 2: Large Scale
                                    Index Propertries
•   6.4M articles (only metadata)
•   20 fields, including abstract
•   Indexing time: 1h 34m
•   Index size: 21GB
Example 3: Complex
                         Join
java -jar lusql.jar -q quot;select Publisher.name as
  pub, Journal.title as jo,Article.rawUrl as text ,
  Journal.issn, Volume.number as
  vol,Volume.coverYear as year, Issue.number as
  iss, Article.id as id, Article.title as ti,
  Article.abstract as ab, Article.startPage as
  startPage, Article.endPage as endPage from
  Publisher, Journal, Volume, Issue, Article where
  Publisher.id = Journal.publisherId and Journal.id
  = Volume.journalId and Volume.id=Issue.volumeId
  and Issue.id = Article.issueIdquot; -c quot;jdbc:mysql
  ://dbhost/db?user=ID&password=PASSquot; -n 50000 -l
  tutorial_2
Example 4: Out-of-band
                                 document manipulation
• Plugin architecture allowing arbitrary manipulations of Documents
  before they go into the index
• Implement DocFilter interface
• Add filter class at command line:
   – -f ca.nrc.cisti.lusql.example.FileFullTextFilter
   – Looks in metadata field for PDF location in file system; finds
     corresponding .txt file; reads file & adds to Document
Example 4: Large Scale
                                   Index Properties
•   6.4M articles (metadata & full-text), ~600GB PDFs
•   21 fields, including abstract & full-text
•   Indexing time: 13h 46m
•   Index size: 86GB
Comparison to SOLR

• SOLR 1.4 November build:
   – Using DataImportHandler, with all defaults
• Lucene 2.4
• LuSql 0.90
Hardware
• Indexing and database machines:
   – Dell PowerEdge 1955 Blade server56
   – CPU: 2 x dual-core Xeon 5050 processors with 2x2MB cache, 3.0
     Ghz 64bit
   – Memory: 8 GB 667MHz
   – Disk: 2 x 73GB internal 10K RPM SAS drives
• Both machines attached to:
   – Dell EMC AX150 storage arrays
   – 12 x 500 GB SATA II 7.2K RPM disks
   – via:
   – SilkWorm 200E57 Series 16-Port Capable 4Gb Fabric Switch
Software
• MySql: v5.0.45 compiled from source.
• gcc: gcc version 4.1.2 20061115 (prerelease) (SUSE Linux)
• Java: java version 1.6.0 07 SE Runtime Environment (build 1.6.0
  07-b06) Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23,
  mixed mode)
• Operating System
   – Linux openSUSE 10.2 (64-bit X86-64)
   – Linux kernel: 2.6.18.8-0.10-default #1 SMP
Comparison to SOLR
Comparison to SOLR
Acknowledgments

• Greg Kresko, Andre Vellino, Jeff Demaine, various LuSql users
LuSql 0.95 in
                                   development
• Re-architected to have pluggable drivers for both input & output
• Read drivers:
   – JDBC, Lucene, Minion, BDB. ehcache, SolrJ, RMI, Terrier
• Write drivers:
   – JDBC, Lucene, Minion, BDB. ehcache, SolrJ, text, XML, RMI,
     Terrier
• For large volumes, concurrent multiple indexes merged at end
Questions

• Glen Newton glen.newton@nrc-cnrc.gc.ca
LuSql: (Quickly and easily) Getting your data from your DBMS into Lucene

More Related Content

What's hot

Introduction to PostgreSQL for System Administrators
Introduction to PostgreSQL for System AdministratorsIntroduction to PostgreSQL for System Administrators
Introduction to PostgreSQL for System Administrators
Jignesh Shah
 
CUBRID Cluster Introduction
CUBRID Cluster IntroductionCUBRID Cluster Introduction
CUBRID Cluster Introduction
CUBRID
 
plProxy, pgBouncer, pgBalancer
plProxy, pgBouncer, pgBalancerplProxy, pgBouncer, pgBalancer
plProxy, pgBouncer, pgBalancer
elliando dias
 

What's hot (20)

Cassandra as Memcache
Cassandra as MemcacheCassandra as Memcache
Cassandra as Memcache
 
Using memcache to improve php performance
Using memcache to improve php performanceUsing memcache to improve php performance
Using memcache to improve php performance
 
Introduction to PostgreSQL for System Administrators
Introduction to PostgreSQL for System AdministratorsIntroduction to PostgreSQL for System Administrators
Introduction to PostgreSQL for System Administrators
 
8. key value databases laboratory
8. key value databases laboratory 8. key value databases laboratory
8. key value databases laboratory
 
Leonid Vasilyev "Building, deploying and running production code at Dropbox"
Leonid Vasilyev  "Building, deploying and running production code at Dropbox"Leonid Vasilyev  "Building, deploying and running production code at Dropbox"
Leonid Vasilyev "Building, deploying and running production code at Dropbox"
 
Implementing High Availability Caching with Memcached
Implementing High Availability Caching with MemcachedImplementing High Availability Caching with Memcached
Implementing High Availability Caching with Memcached
 
캐시 분산처리 인프라
캐시 분산처리 인프라캐시 분산처리 인프라
캐시 분산처리 인프라
 
ProxySQL - High Performance and HA Proxy for MySQL
ProxySQL - High Performance and HA Proxy for MySQLProxySQL - High Performance and HA Proxy for MySQL
ProxySQL - High Performance and HA Proxy for MySQL
 
Using advanced options in MariaDB Connector/J
Using advanced options in MariaDB Connector/JUsing advanced options in MariaDB Connector/J
Using advanced options in MariaDB Connector/J
 
Proxysql sharding
Proxysql shardingProxysql sharding
Proxysql sharding
 
MySQL High Availability Sprint: Launch the Pacemaker
MySQL High Availability Sprint: Launch the PacemakerMySQL High Availability Sprint: Launch the Pacemaker
MySQL High Availability Sprint: Launch the Pacemaker
 
Level DB - Quick Cheat Sheet
Level DB - Quick Cheat SheetLevel DB - Quick Cheat Sheet
Level DB - Quick Cheat Sheet
 
Proxysql ha plam_2016_2_keynote
Proxysql ha plam_2016_2_keynoteProxysql ha plam_2016_2_keynote
Proxysql ha plam_2016_2_keynote
 
CUBRID Cluster Introduction
CUBRID Cluster IntroductionCUBRID Cluster Introduction
CUBRID Cluster Introduction
 
Varnish Configuration Step by Step
Varnish Configuration Step by StepVarnish Configuration Step by Step
Varnish Configuration Step by Step
 
How to scale your web app
How to scale your web appHow to scale your web app
How to scale your web app
 
Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...
Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...
Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...
 
ProxySQL for MySQL
ProxySQL for MySQLProxySQL for MySQL
ProxySQL for MySQL
 
plProxy, pgBouncer, pgBalancer
plProxy, pgBouncer, pgBalancerplProxy, pgBouncer, pgBalancer
plProxy, pgBouncer, pgBalancer
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 

Viewers also liked

Nichols & Co. Photography Brochure
Nichols & Co. Photography BrochureNichols & Co. Photography Brochure
Nichols & Co. Photography Brochure
guest1b72b3
 

Viewers also liked (10)

Why libraries should embrace Linked Data
Why libraries should embrace Linked DataWhy libraries should embrace Linked Data
Why libraries should embrace Linked Data
 
DLF ILS Discovery Interface Task Force API recommendation
DLF ILS Discovery Interface Task Force API recommendationDLF ILS Discovery Interface Task Force API recommendation
DLF ILS Discovery Interface Task Force API recommendation
 
Discovery Interfaces
Discovery InterfacesDiscovery Interfaces
Discovery Interfaces
 
Nichols & Co. Photography Brochure
Nichols & Co. Photography BrochureNichols & Co. Photography Brochure
Nichols & Co. Photography Brochure
 
RESTafarian-ism at the NLA
RESTafarian-ism at the NLARESTafarian-ism at the NLA
RESTafarian-ism at the NLA
 
Like a can opener for your data silo: simple access through AtomPub and Jangle
Like a can opener for your data silo: simple access through AtomPub and JangleLike a can opener for your data silo: simple access through AtomPub and Jangle
Like a can opener for your data silo: simple access through AtomPub and Jangle
 
Como evitar enfermarse
Como evitar enfermarseComo evitar enfermarse
Como evitar enfermarse
 
CouchDB is sacrilege... mmm, delicious sacrilege
CouchDB is sacrilege... mmm, delicious sacrilegeCouchDB is sacrilege... mmm, delicious sacrilege
CouchDB is sacrilege... mmm, delicious sacrilege
 
The Future of Vertical Search Engines
The Future of Vertical Search EnginesThe Future of Vertical Search Engines
The Future of Vertical Search Engines
 
BABOK Study Group - meeting 4
BABOK Study Group - meeting 4BABOK Study Group - meeting 4
BABOK Study Group - meeting 4
 

Similar to LuSql: (Quickly and easily) Getting your data from your DBMS into Lucene

My Sql And Search At Craigslist
My Sql And Search At CraigslistMy Sql And Search At Craigslist
My Sql And Search At Craigslist
MySQLConference
 
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
Alex Zeltov
 
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsUncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
DataWorks Summit
 

Similar to LuSql: (Quickly and easily) Getting your data from your DBMS into Lucene (20)

My Sql And Search At Craigslist
My Sql And Search At CraigslistMy Sql And Search At Craigslist
My Sql And Search At Craigslist
 
Experience sql server on l inux and docker
Experience sql server on l inux and dockerExperience sql server on l inux and docker
Experience sql server on l inux and docker
 
Brk2051 sql server on linux and docker
Brk2051 sql server on linux and dockerBrk2051 sql server on linux and docker
Brk2051 sql server on linux and docker
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 
Ceph
CephCeph
Ceph
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
 
Setting up repositories
Setting up repositoriesSetting up repositories
Setting up repositories
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
 
Ml2
Ml2Ml2
Ml2
 
Evergreen Sysadmin Survival Skills
Evergreen Sysadmin Survival SkillsEvergreen Sysadmin Survival Skills
Evergreen Sysadmin Survival Skills
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Vijfhart thema-avond-oracle-12c-new-features
Vijfhart thema-avond-oracle-12c-new-featuresVijfhart thema-avond-oracle-12c-new-features
Vijfhart thema-avond-oracle-12c-new-features
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
 
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsUncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 

More from eby

More from eby (18)

Using a CSS Framework
Using a CSS FrameworkUsing a CSS Framework
Using a CSS Framework
 
XForms for Metadata creation
XForms for Metadata creationXForms for Metadata creation
XForms for Metadata creation
 
Karen Coyle Keynote - R&D: Can Resource Description become Rigorous Data?
Karen Coyle Keynote - R&D: Can Resource Description become Rigorous Data?Karen Coyle Keynote - R&D: Can Resource Description become Rigorous Data?
Karen Coyle Keynote - R&D: Can Resource Description become Rigorous Data?
 
ÖpënÜRL
ÖpënÜRLÖpënÜRL
ÖpënÜRL
 
Building Mountains Out of Molehills
Building Mountains Out of MolehillsBuilding Mountains Out of Molehills
Building Mountains Out of Molehills
 
Zotero and You, or Bibliography on the Semantic Web
Zotero and You, or Bibliography on the Semantic WebZotero and You, or Bibliography on the Semantic Web
Zotero and You, or Bibliography on the Semantic Web
 
Creating an Academic Image Collection with Flickr
Creating an Academic Image Collection with FlickrCreating an Academic Image Collection with Flickr
Creating an Academic Image Collection with Flickr
 
From Idea to Open Source
From Idea to Open SourceFrom Idea to Open Source
From Idea to Open Source
 
Obstacles to Agility
Obstacles to AgilityObstacles to Agility
Obstacles to Agility
 
The Intellectual Property Disclosure Process: Releasing Open Source Software ...
The Intellectual Property Disclosure Process: Releasing Open Source Software ...The Intellectual Property Disclosure Process: Releasing Open Source Software ...
The Intellectual Property Disclosure Process: Releasing Open Source Software ...
 
Code4Lib 2007: Erik Hatcher Keynote
Code4Lib 2007: Erik Hatcher KeynoteCode4Lib 2007: Erik Hatcher Keynote
Code4Lib 2007: Erik Hatcher Keynote
 
Library Data APIs Abound!
Library Data APIs Abound!Library Data APIs Abound!
Library Data APIs Abound!
 
Smart Subjects - Application Independent Subject Recommendations
Smart Subjects - Application Independent Subject RecommendationsSmart Subjects - Application Independent Subject Recommendations
Smart Subjects - Application Independent Subject Recommendations
 
On The Herding of Cats
On The Herding of CatsOn The Herding of Cats
On The Herding of Cats
 
Code4Lib 2007: MyResearch Portal
Code4Lib 2007: MyResearch PortalCode4Lib 2007: MyResearch Portal
Code4Lib 2007: MyResearch Portal
 
Code4Lib 2007: Hurry up please, it's time
Code4Lib 2007: Hurry up please, it's timeCode4Lib 2007: Hurry up please, it's time
Code4Lib 2007: Hurry up please, it's time
 
Evergreen - Future of the ILS
Evergreen - Future of the ILSEvergreen - Future of the ILS
Evergreen - Future of the ILS
 
Lucene Summit - Beth's Presentation
Lucene Summit - Beth's PresentationLucene Summit - Beth's Presentation
Lucene Summit - Beth's Presentation
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Recently uploaded (20)

FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 

LuSql: (Quickly and easily) Getting your data from your DBMS into Lucene

  • 1. LuSql: (Quickly and easily)Getting your data from your DBMS into Lucene Glen Newton CISTI Research, CISTI NRC code4lib 2009 Providence Rhode Island Feb 24 2009
  • 2. Outline • What is LuSql? • Context • Examples • Performance & comparisons • Next version
  • 3. Context • CISTI == Canada National Science Library • Digital Library Research Group • Heavy text mining, knowledge discovery tools, information visualization, citation analysis, recommender systems • Large local text collection: – 8.4M PDFs, full text & metadata (~700GB) – Full text on file system – Metadata in MySql • Team of 4: Lucene expert; 3 needing to use Lucene • Daily creation of some experiment/domain/foo – specific large scale Lucene index
  • 4. LuSql Rationale • Need for low barrier, high performance, flexible tool for Lucene index creation • Choice – SOLR – DBSight – Lucen4DB.net – Hibernate Search – Compass • All one or more of: – Overly complicated for non-Lucene / non-Java / non-XML / non- framework users – Performance/scalability issues – Not Open Source Software (OSS)
  • 5. LuSql • User knowledge: – Knowledge of SQL – Knowledge of their database and tables – Ability to set the Java CLASSPATH in a command-line shell – Ability to run a command line application
  • 6. LuSql Command Line Arguments – Create or append – SQL – JDBC URL – # records to index – Lucene Analyzer class – JDBC driver class – Indexing properties, global or by field – Global term value (i.e. all documents have quot;source=catquot;) – Lucene RAM buffer size – Stop word file – Lucene index directory – Multithreading toggle, #threads – Pluggable Document filter – Subqueries
  • 7.
  • 8. Examples java jar lusql.jar -q quot;select * from Article where volumeYear > 2007quot; -c quot;jdbc:mysql://dbhost/db?user=ID&password=PASSquot; -n 5 -l tutorial -I 211 -t
  • 9. Index Term Properties • Index: Default:TOKENIZED – 0:NO – 1:NO_NORMS – 2:TOKENIZED – 3:UN_TOKENIZED • Store: Default:YES – 0:NO – 1:YES – 2:COMPRESS • Term vector: Default:YES – 0:NO – 1:YES
  • 10.
  • 11. Example 2: Complex Join java -jar lusql.jar -q quot;select Publisher.name as pub, Journal.title as jo,Article.rawUrl as text , Journal.issn, Volume.number as vol,Volume.coverYear as year, Issue.number as iss, Article.id as id, Article.title as ti, Article.abstract as ab, Article.startPage as startPage, Article.endPage as endPage from Publisher, Journal, Volume, Issue, Article where Publisher.id = Journal.publisherId and Journal.id = Volume.journalId and Volume.id=Issue.volumeId and Issue.id = Article.issueIdquot; -c quot;jdbc:mysql ://dbhost/db?user=ID&password=PASSquot; -n 50000 -l tutorial_2
  • 12.
  • 13. Example 2: Large Scale Index Propertries • 6.4M articles (only metadata) • 20 fields, including abstract • Indexing time: 1h 34m • Index size: 21GB
  • 14.
  • 15.
  • 16. Example 3: Complex Join java -jar lusql.jar -q quot;select Publisher.name as pub, Journal.title as jo,Article.rawUrl as text , Journal.issn, Volume.number as vol,Volume.coverYear as year, Issue.number as iss, Article.id as id, Article.title as ti, Article.abstract as ab, Article.startPage as startPage, Article.endPage as endPage from Publisher, Journal, Volume, Issue, Article where Publisher.id = Journal.publisherId and Journal.id = Volume.journalId and Volume.id=Issue.volumeId and Issue.id = Article.issueIdquot; -c quot;jdbc:mysql ://dbhost/db?user=ID&password=PASSquot; -n 50000 -l tutorial_2
  • 17. Example 4: Out-of-band document manipulation • Plugin architecture allowing arbitrary manipulations of Documents before they go into the index • Implement DocFilter interface • Add filter class at command line: – -f ca.nrc.cisti.lusql.example.FileFullTextFilter – Looks in metadata field for PDF location in file system; finds corresponding .txt file; reads file & adds to Document
  • 18. Example 4: Large Scale Index Properties • 6.4M articles (metadata & full-text), ~600GB PDFs • 21 fields, including abstract & full-text • Indexing time: 13h 46m • Index size: 86GB
  • 19. Comparison to SOLR • SOLR 1.4 November build: – Using DataImportHandler, with all defaults • Lucene 2.4 • LuSql 0.90
  • 20. Hardware • Indexing and database machines: – Dell PowerEdge 1955 Blade server56 – CPU: 2 x dual-core Xeon 5050 processors with 2x2MB cache, 3.0 Ghz 64bit – Memory: 8 GB 667MHz – Disk: 2 x 73GB internal 10K RPM SAS drives • Both machines attached to: – Dell EMC AX150 storage arrays – 12 x 500 GB SATA II 7.2K RPM disks – via: – SilkWorm 200E57 Series 16-Port Capable 4Gb Fabric Switch
  • 21. Software • MySql: v5.0.45 compiled from source. • gcc: gcc version 4.1.2 20061115 (prerelease) (SUSE Linux) • Java: java version 1.6.0 07 SE Runtime Environment (build 1.6.0 07-b06) Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode) • Operating System – Linux openSUSE 10.2 (64-bit X86-64) – Linux kernel: 2.6.18.8-0.10-default #1 SMP
  • 24.
  • 25.
  • 26. Acknowledgments • Greg Kresko, Andre Vellino, Jeff Demaine, various LuSql users
  • 27. LuSql 0.95 in development • Re-architected to have pluggable drivers for both input & output • Read drivers: – JDBC, Lucene, Minion, BDB. ehcache, SolrJ, RMI, Terrier • Write drivers: – JDBC, Lucene, Minion, BDB. ehcache, SolrJ, text, XML, RMI, Terrier • For large volumes, concurrent multiple indexes merged at end
  • 28. Questions • Glen Newton glen.newton@nrc-cnrc.gc.ca