Lucene with MySQL
Farhan “Frank” Mashraqi
DBA
Fotolog, Inc.
fmashraqi@fotolog.com
softwareengineer99@yahoo.com

Lucene and MySQL


Presented by Fotolog. Lucene is a powerful, high-performance, full-featured text search engine library that is written entirely in Java and provides a technology suitable for all size applications requiring full-text search in heterogeneous environments.

In this presentation, Frank Mash shows how you can use Lucene with MySQL to offer powerful searching capabilities to your stakeholders. The presentation covers installation, usage, and optimization of Lucene, and how to interface a Ruby on Rails application with Lucene using a custom Java server. This session is highly recommended for those looking to add full-text, cross-platform, database-independent search capability to their application.

Published in: Technology, News & Politics

  1. Lucene with MySQL
     Farhan “Frank” Mashraqi, DBA, Fotolog, Inc.
     fmashraqi@fotolog.com, softwareengineer99@yahoo.com
  2. Introduction
     - Farhan Mashraqi
     - Senior MySQL DBA at Fotolog, Inc.
     - Known on Planet MySQL as “Frank Mash”
  3. What is Lucene?
     - Started in 1997 as a “self-serving project”
     - 2001: Apache adopts Lucene
     - Open-source Information Retrieval (IR) library
       - Available from the Apache Software Foundation
       - Searches and indexes any textual data
       - Doesn’t care about the language, source, or format of the data
  4. Lucene?
     - Not a turnkey search engine
     - A standard
       - for building open-source-based, large-scale search applications
       - a high-performance, scalable, cross-platform search toolkit
       - today: ported to C++, C#, Perl, Python, Ruby
       - for embedded and customizable search
       - widely adopted by OEM software vendors and enterprise IT departments
  5. Lucene
     [Architecture diagram. Labels: DB, Web, File System, Aggregate Data, Index Documents, Index, Search Index, Search, Results, User Query, LUCENE, Application]
  6. What types of queries does it support?
     - Single- and multi-term queries
     - Phrase queries
     - Wildcards
     - Result ranking
     - +apple -computer +pie
     - country:USA
     - country:USA AND state:CA
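The `+apple -computer +pie` example on slide 6 marks terms as required (`+`) or prohibited (`-`). The sketch below illustrates only that rule on a set of already-tokenized document terms. It is a toy, not Lucene's query parser: the class name `PlusMinusMatcher` and its `matches` method are hypothetical, and fielded queries such as `country:USA` are not handled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Toy matcher for required (+) and prohibited (-) terms; not Lucene's QueryParser. */
final class PlusMinusMatcher {

    /** Returns true if docTerms contains every +term and none of the -terms. */
    static boolean matches(String query, Set<String> docTerms) {
        for (String clause : query.trim().split("\\s+")) {
            if (clause.isEmpty()) {
                continue; // empty query: nothing to check
            }
            if (clause.startsWith("+")) {
                if (!docTerms.contains(clause.substring(1))) {
                    return false; // a required term is missing
                }
            } else if (clause.startsWith("-")) {
                if (docTerms.contains(clause.substring(1))) {
                    return false; // a prohibited term is present
                }
            }
            // Unprefixed terms are treated as optional here, so they never exclude a document.
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> recipe = new HashSet<>(Arrays.asList("apple", "pie", "recipe"));
        Set<String> review = new HashSet<>(Arrays.asList("apple", "computer", "pie"));
        System.out.println(matches("+apple -computer +pie", recipe));
        System.out.println(matches("+apple -computer +pie", review));
    }
}
```

The first document matches because it has both required terms and lacks the prohibited one; the second is rejected because it contains `computer`.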
  7. Cons
     - Needs Java resources (programmers)
       - JSP experience a plus
     - Implementation and maintenance cost
     - By default
       - No installer or wizard for setup (it’s a toolkit)
       - No administration or command-line tools (demo available)
       - No spider
         - Coding one yourself is always an option
       - No complex-script language support by default
         - Third-party tools available
  8. Cons 2
     - No built-in support for (demos available showing how to implement)
       - HTML format
       - PDF format
       - Microsoft Office documents
       - Advanced XML queries
       - “How-tos” available
     - No database gateway
       - Integrates with MySQL with little work
     - Web interface
       - JSP sample available
     - Missing enterprise support
  9. Lucene Libraries
     - The Lucene libraries include core search components such as a document indexer, index searcher, query parser, and text analyzer.
  10. Who is behind Lucene?
      - Doug Cutting (author), previously at Excite
      - Apache Software Foundation
  11. Who uses Lucene?
      - IBM
        - IBM OmniFind Yahoo! Edition
      - CNET
        - http://reviews.cnet.com/
        - http://www.mail-archive.com/java-user@lucene.apache.org/msg02645.html
      - Wikipedia
      - FedEx
      - Akamai’s EdgeComputing platform
      - Technorati
      - FURL
      - Sun
        - OpenSolaris Source Browser
  12. When to use Lucene?
      - Search applications
      - Search functionality for existing applications
      - Search-enabling database applications
  13. When not to use?
      - Not ideal for
        - Adding generic search to a site
        - Enterprise systems needing support for proprietary formats
        - Extremely high-volume systems
          - This can be solved through a better architecture
      - Investigate carefully if
        - You need more than 100 QPS per system
        - Your data is highly volatile
          - Updates are actually deletes and additions
          - Additions are visible to new sessions only
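The two volatility caveats on slide 13 (an update is really a delete followed by an add, and additions are visible only to sessions opened afterwards) can be pictured with a small in-memory model. This is a sketch under stated assumptions, not Lucene code: `SnapshotIndex`, `openReader`, and the map-based storage are all hypothetical stand-ins for an index and its readers.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy index: update = delete + add, and each reader is pinned to a snapshot. */
final class SnapshotIndex {
    // Document id -> document text; stands in for the live index.
    private final Map<String, String> live = new HashMap<>();

    void add(String id, String text) {
        live.put(id, text);
    }

    void delete(String id) {
        live.remove(id);
    }

    /** No in-place edit: the old document is removed and the new one is added. */
    void update(String id, String newText) {
        delete(id);
        add(id, newText);
    }

    /** A reader sees the index as it was at the moment the reader was opened. */
    Map<String, String> openReader() {
        return new HashMap<>(live);
    }

    public static void main(String[] args) {
        SnapshotIndex index = new SnapshotIndex();
        index.add("1", "old text");
        Map<String, String> oldSession = index.openReader();
        index.update("1", "new text");
        Map<String, String> newSession = index.openReader();
        System.out.println("old session sees: " + oldSession.get("1"));
        System.out.println("new session sees: " + newSession.get("1"));
    }
}
```

In this model a session opened before the update keeps seeing the old text until it opens a fresh reader, which is why highly volatile data deserves careful investigation.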
  14. Why Lucene?
      - What problems does Lucene solve?
        - Full text with MySQL
        - Pros and cons
      - Powerful features
      - Simple API
      - Scalable, cost-effective, efficient indexing
      - Powerful searching through multiple query types
  15. Powerful features
      - Simple API
        - Sort by any field
        - Simultaneous updates and searching
  16. Core Index Classes
      - IndexWriter
      - Directory
      - Analyzer
      - Document
      - Field
  17. IndexWriter
      - Creates a new index
      - Adds documents to the index
      - Gives you “write” access but no “read” access
      - Not the only class used to modify an index
        - The Lucene API can be used as well
  18. Directory
      - Represents the location of the Lucene index
      - Abstract class
      - Allows its subclasses to store the index as they see fit
        - FSDirectory
        - RAMDirectory
          - Interface identical to FSDirectory
  19. Analyzer
      - Text is passed through the analyzer before indexing
      - Specified in the IndexWriter constructor
      - In charge of extracting tokens out of the text to be indexed
        - The rest is eliminated
      - Several implementations available (stop words, lower case, etc.)
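To make the analyzer's job on slide 19 concrete, here is a minimal sketch of the two behaviours the slide names, lower-casing and stop-word removal. It is not one of Lucene's analyzers: `ToyAnalyzer`, its tiny stop-word list, and the ASCII-only tokenization rule are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analyzer: lower-cases text, splits on non-alphanumerics, drops stop words. */
final class ToyAnalyzer {
    // Illustrative stop-word list; real analyzers ship their own.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    /** Returns the tokens that would be indexed; everything else is eliminated. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick Brown Fox is a friend of Lucene."));
    }
}
```

Because the same analysis decides which tokens reach the index, a query term such as `The` could never match in this sketch: it is removed before indexing.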
  20. Document
      - Collection of fields (virtual document)
      - Chunk of data
      - Fields of a document represent the document or metadata associated with that document
      - Original source of the document data (Word, PDF) is irrelevant
      - Metadata is indexed and stored separately as fields of a document
      - Text only: java.lang.String and java.io.Reader are the only things handled by the core
  21. Field 1
      - A document in an index contains one or more fields (in a class called Field)
      - Each field represents data that is either queried against or retrieved from the index during search
      - Four different types:
        - Keyword
          - Isn’t analyzed
          - But is indexed and stored in the index
          - Ideal for:
            - URLs
            - Paths
            - SSNs
            - Names
          - Original value is preserved in its entirety
  22. Field types
      - UnIndexed
        - Neither analyzed nor indexed
        - Value is stored in the index as is
        - For fields that need to be displayed with search results (URL etc.)
          - But you won’t search based on these fields
        - Because original values are stored
          - Don’t store fields with very large values
          - Especially if index size will be an issue
  23. Field types
      - UnStored
        - Opposite of UnIndexed
        - Analyzed and indexed but not stored in the index
        - Suitable for indexing a large amount of text that’s not going to be needed in its original form
          - E.g. the HTML of a web page
  24. Field types
      - Text
        - Analyzed and indexed
        - Fields of this type can be searched against
        - Be careful about the field size
        - If the data indexed is a String, it will be stored
        - If the data comes from a Reader, it will not be stored
  25. Note:
      - Field.Text(String, String) and Field.Text(String, Reader) are different.
        - (String, String) stores the field data
        - (String, Reader) does not
      - To index a String but not store it, use
        - Field.UnStored(String, String)
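Slides 21 to 25 describe four field types by three properties: analyzed, indexed, and stored. The sketch below collects those properties in one table so they can be compared side by side. It only restates the slides; the enum `LegacyFieldType` and its helper methods are hypothetical names, not part of Lucene.

```java
/** The four field types from the slides, described by three flags. */
enum LegacyFieldType {
    KEYWORD(false, true, true),    // indexed as a single value and stored
    UNINDEXED(false, false, true), // stored for display only
    UNSTORED(true, true, false),   // searchable, original text not kept
    TEXT(true, true, true);        // stored when built from a String; the Reader variant is not stored

    final boolean analyzed;
    final boolean indexed;
    final boolean stored;

    LegacyFieldType(boolean analyzed, boolean indexed, boolean stored) {
        this.analyzed = analyzed;
        this.indexed = indexed;
        this.stored = stored;
    }

    /** Only indexed fields can be searched against. */
    boolean searchable() {
        return indexed;
    }

    /** Only stored fields can be shown with search results. */
    boolean retrievable() {
        return stored;
    }
}

final class FieldTypeTable {
    public static void main(String[] args) {
        for (LegacyFieldType type : LegacyFieldType.values()) {
            System.out.println(type + " analyzed=" + type.analyzed
                    + " searchable=" + type.searchable()
                    + " retrievable=" + type.retrievable());
        }
    }
}
```

Read this way, the choice for each column of a database row comes down to two questions: will users search on it, and does the result page need to display it?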
  26. Classes for Basic Search Operations
      - IndexSearcher
        - Opens an index in read-only mode
        - Offers a number of search methods
          - Some of which are implemented in the Searcher class

          IndexSearcher is = new IndexSearcher(
              FSDirectory.getDirectory("/tmp/index", false));
          Query q = new TermQuery(new Term("contents", "lucene"));
          Hits hits = is.search(q);
  27. Classes for Basic Search Operations
      - Term
        - Basic unit for searching
        - Consists of a pair of string elements: the name of the field and the value of the field
        - Term objects are involved in the indexing process
        - Term objects can be constructed and used with TermQuery

          Query q = new TermQuery(new Term("contents", "lucene"));
          Hits hits = is.search(q);
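Slide 27 says a term is a (field name, value) pair that takes part in both indexing and searching. One way to picture that is an inverted index keyed by field and then by term text, where a term query is a plain lookup. The sketch below is a toy under that assumption; `ToyTermIndex`, `addTerm`, and `termQuery` are hypothetical names and do not reflect how Lucene stores its index.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy inverted index: (field, term text) -> ids of the documents containing it. */
final class ToyTermIndex {
    // field name -> term text -> sorted document ids
    private final Map<String, Map<String, SortedSet<Integer>>> postings = new HashMap<>();

    /** Indexing side: record that docId contains this term in this field. */
    void addTerm(int docId, String field, String text) {
        postings.computeIfAbsent(field, f -> new HashMap<>())
                .computeIfAbsent(text, t -> new TreeSet<>())
                .add(docId);
    }

    /** Search side: exact lookup of one (field, text) pair. */
    SortedSet<Integer> termQuery(String field, String text) {
        Map<String, SortedSet<Integer>> byText = postings.get(field);
        if (byText == null || !byText.containsKey(text)) {
            return Collections.emptySortedSet();
        }
        return Collections.unmodifiableSortedSet(byText.get(text));
    }

    public static void main(String[] args) {
        ToyTermIndex index = new ToyTermIndex();
        index.addTerm(1, "contents", "lucene");
        index.addTerm(2, "contents", "lucene");
        index.addTerm(3, "title", "lucene");
        System.out.println(index.termQuery("contents", "lucene"));
    }
}
```

The same text under a different field is a different term, which is why document 3 is not returned for the `contents` lookup.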
  28. Classes for Basic Search Operations
      - Query
        - A number of query subclasses
          - BooleanQuery
          - PhraseQuery
          - PrefixQuery
          - PhrasePrefixQuery
          - RangeQuery
          - FilteredQuery
          - SpanQuery
  29. Classes for Basic Search Operations
      - TermQuery
        - Most basic type of query supported by Lucene
        - Used for matching documents that contain fields with specific values
      - Hits
        - Simple container of pointers to ranked search results
        - For performance, a Hits instance doesn’t load all documents that match a query from the index, only a small portion of them
  30. Indexing
      - Multiple type indexing
        - Scalable
        - High performance
          - “over 20MB/minute on Pentium M 1.5GHz”
          - Incremental indexing and batch indexing have the same cost
        - Index size
          - Roughly 20-30% of the size of the text indexed
          - Compare to MySQL’s FULLTEXT index size
        - Cost-effective
          - 1 MB heap (small RAM needed)
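The figures on slide 30 lend themselves to a quick back-of-the-envelope estimate. The sketch below only applies the slide's own numbers (an index of roughly 20-30% of the text size, and a rate of over 20 MB/minute on the quoted hardware); the class `IndexSizeEstimate` and its methods are hypothetical, and real results depend on the data, the analyzer, and the machine.

```java
/** Back-of-the-envelope estimates using the figures quoted on the slide. */
final class IndexSizeEstimate {

    /** Returns {low, high} index size in MB, assuming 20-30% of the text size. */
    static double[] indexSizeRangeMb(double textMb) {
        return new double[] {textMb * 20 / 100, textMb * 30 / 100};
    }

    /** Upper bound on indexing time in minutes, assuming at least 20 MB/minute. */
    static double maxIndexingMinutes(double textMb) {
        return textMb / 20;
    }

    public static void main(String[] args) {
        double textMb = 1000; // assumed corpus size, for illustration only
        double[] range = indexSizeRangeMb(textMb);
        System.out.println("index size: " + range[0] + " to " + range[1] + " MB");
        System.out.println("indexing time: at most " + maxIndexingMinutes(textMb) + " minutes");
    }
}
```

For an assumed 1000 MB of text this gives an index of about 200 to 300 MB and at most 50 minutes of indexing at the quoted rate.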
  31. Powerful Searching & Sorting
      - Ranked searching
      - Multiple powerful query types
        - Phrase queries, wildcard queries, proximity queries, range queries, and more
      - Fielded searching (e.g., title, author, contents)
      - Date-range searching
      - Multiple-index searching with merged results
      - Sort by any field
  32. How to Integrate Your Application With Lucene
      - Install the JDK (5 or 6)
      - Test the Lucene demo
  33. Prerequisites: JDK
      - Installing the JDK
        - To download, visit the JDK 5 page, http://java.sun.com/javase/downloads/index_jdk5.jsp
        - or the JDK 6 download page, http://java.sun.com/javase/downloads/index.jsp
        - Once downloaded:
          - Change permissions
            [root@srv31 jdk-install]# chmod 755 jdk-1_5_0_09-linux-i586.bin
          - Install
            [root@srv31 jdk-install]# ./jdk-1_5_0_09-linux-i586.bin
  34. Testing Lucene Demo
      - Step 2: Testing the Lucene demo
        - Set up your environment
          - vi /root/.bashrc

            export PATH=/var/www/html/java/jdk1.5.0_09/bin:$PATH
            export CLASSPATH=.:/var/www/html/java/jdk1.5.0_09:/var/www/html/java/jdk1.5.0_09/lib:/var/www/html/java/jdk1.5.0_09/lib/lucene-2.1.0/lucene-core-2.1.0.jar:/var/www/html/java/jdk1.5.0_09/lib/lucene-2.1.0/lucene-demos-2.1.0.jar:/var/www/html/java/jdk1.5.0_09/lib/xmlrpc-3.0a1.jar

        - Now get the following and place them in /var/www/html/java/jdk1.5.0_09/lib/lucene-2.1.0/
          - Lucene Java
            - http://www.apache.org/dyn/closer.cgi/lucene/java/
          - XMLRPC library

            [root@srv31 lib]# wget http://mirror.candidhosting.com/pub/apache/lucene/java/lucene-2.1.0.zip
            [root@srv31 lib]# unzip lucene-2.1.0.zip
            [root@srv31 lib]# cp -p lucene-2.1.0/lucene-core-2.1.0.jar ../lib/
            [root@srv31 lib]# cp -p lucene-2.1.0/lucene-demos-2.1.0.jar ../lib/
            [root@srv31 lib]# cp -p /var/www/html/java/jdk1.5.0_06/lib/xmlrpc-3.0a1.jar /var/www/html/java/jdk1.5.0_09/lib/xmlrpc-3.0a1.jar

        - Now "dot" the above file:

            [root@srv31 lib]# . /root/.bashrc
  35. Testing Lucene Demo 2
      - Believe it or not, we are now ready to test the Lucene demo.
      - Indexing
        - I just let it loose on a randomly picked directory to give you an idea:

          [root@srv31 lib]# java org.apache.lucene.demo.IndexFiles /var/www/html/java/jdk1.5.0_09/
          adding /var/www/html/java/jdk1.5.0_09/include/jni.h
          adding /var/www/html/java/jdk1.5.0_09/include/linux/jawt_md.h
          adding /var/www/html/java/jdk1.5.0_09/include/linux/jni_md.h
          adding /var/www/html/java/jdk1.5.0_09/include/jvmti.h
          adding /var/www/html/java/jdk1.5.0_09/include/jvmdi.h
          Optimizing...
          157013 total milliseconds
  36. Testing Lucene Demo 3

          [root@srv31 lib]# java org.apache.lucene.demo.SearchFiles
          Query: java
          Searching for: java
          1159 total matching documents
          1. /var/www/html/java/jdk1.5.0_09/jre/lib/locale/zh.GBK/LC_MESSAGES/sunw_java_plugin.mo
          2. /var/www/html/java/jdk1.5.0_09/jre/lib/locale/zh/LC_MESSAGES/sunw_java_plugin.mo
          3. /var/www/html/java/jdk1.5.0_09/jre/lib/locale/ko/LC_MESSAGES/sunw_java_plugin.mo
          4. /var/www/html/java/jdk1.5.0_09/jre/lib/locale/zh_HK.BIG5HK/LC_MESSAGES/sunw_java_plugin.mo
          5. /var/www/html/java/jdk1.5.0_09/jre/lib/locale/zh_TW.BIG5/LC_MESSAGES/sunw_java_plugin.mo
          6. /var/www/html/java/jdk1.5.0_09/jre/lib/locale/zh_TW/LC_MESSAGES/sunw_java_plugin.mo
          7. /var/www/html/java/jdk1.5.0_09/demo/jfc/Stylepad/README.txt
          8. /var/www/html/java/jdk1.5.0_09/demo/jfc/Notepad/README.txt
          9. /var/www/html/java/jdk1.5.0_09/demo/plugin/jfc/Stylepad/README.txt
          10. /var/www/html/java/jdk1.5.0_09/demo/plugin/jfc/Notepad/README.txt
          more (y/n) ?
  37. Loading data from MySQL

          …
          String url = "jdbc:mysql://127.0.0.1/odp";
          Connection con = DriverManager.getConnection(url, "user", "pass");
          Statement Stmt = con.createStatement();
          ResultSet RS = Stmt.executeQuery("SELECT * FROM articles");
  38. Loading data from MySQL 2

          while (RS.next()) {
              // System.out.print("\"" + RS.getString(1) + "\"");
              try {
                  // create Document
                  final Document doc = new Document();
                  // Text fields are analyzed, indexed, and stored
                  doc.add(Field.Text("title", RS.getString("title")));
                  doc.add(Field.Text("type", "article"));
                  doc.add(Field.Text("author", RS.getString("author")));
                  doc.add(Field.Text("body", RS.getString("body")));
                  doc.add(Field.Text("extended", RS.getString("extended")));
                  …
  39. Loading data from MySQL 3

                  …
                  doc.add(Field.Text("tags", RS.getString("tags")));
                  // UnIndexed fields are stored for display but cannot be searched
                  doc.add(Field.UnIndexed("permalink", RS.getString("permalink")));
                  doc.add(Field.UnIndexed("id", RS.getString("id")));
                  doc.add(Field.UnIndexed("member_id", RS.getString("member_id")));
                  doc.add(Field.UnIndexed("portal_id", RS.getString("portal_id")));
                  //doc.add(Field.Text("id", RS.getString("id")));
                  writer.addDocument(doc);
              }
              catch (IOException e) { System.err.println("Unable to index student"); }
          }
          // close connection
  40. Searching Data using XML RPC

          public static void searchArticles(final String search, final int numberOfResults)
                  throws Exception
          {
              final Query query;
              Analyzer analyzer = new StandardAnalyzer();
              query = QueryParser.parse(search, "title", analyzer);
              final ArrayList ids = new ArrayList();
              try {
                  final IndexReader reader = IndexReader.open(INDEX_DIR);
                  final IndexSearcher searcher = new IndexSearcher(reader);
                  final Hits hits = searcher.search(query);
                  for (int i = 0; i != hits.length() && i != numberOfResults; ++i) {
                      final Document doc = hits.doc(i);
                      // id field needs to be added
                      //ids.add(new Integer(doc.getField("id").stringValue()));
                      …
  41. Searching Data using XML RPC 2

                      …
                      ids.add(new Integer(doc.getField("id").stringValue()));
                      System.out.println("Found + " + doc.getField("id").stringValue());
                      System.out.println("--Title = " + doc.getField("title").stringValue());
                      System.out.println("--Type = " + doc.getField("type").stringValue());
                      System.out.println("--Body = " + doc.getField("body").stringValue());
                      System.out.println("--Author = " + doc.getField("author").stringValue());
                      System.out.println("--Extended = " + doc.getField("extended").stringValue());
                      System.out.println("--Tags = " + doc.getField("tags").stringValue());
                      System.out.println("--Permalink = " + doc.getField("permalink").stringValue());
                      System.out.println("--Member Id = " + doc.getField("member_id").stringValue());
                      System.out.println("--Portal Id = " + doc.getField("portal_id").stringValue());
  42. Searching Data using XML RPC 3

                  }
                  searcher.close();
                  reader.close();
              }
              catch (IOException e) {
                  System.out.println("Error while reading student data from index");
              }
          }
  43. Future of Lucene
      - Advanced linguistics modules that integrate with Lucene
        - Support for complex-script languages
        - Basis Technology’s Rosette® Linguistics Platform
          - The same linguistic software that powers multilingual web search on Google, Live.com, Yahoo!, and leading enterprise search engines
          - “allows Lucene-based applications to index and search text in multiple languages concurrently, including complex script languages such as Arabic, Chinese, Farsi, Japanese and Korean.”
          - www.basistech.com/lucene
  44. What are the ports of Lucene?
      - Lucene4c - C
      - CLucene - C++
      - MUTIS - Delphi
      - Lucene.Net - a straight C#/.NET port of Lucene by the Apache Software Foundation, fully compatible with it
      - Plucene - Perl
      - KinoSearch - Perl
      - PyLucene - Lucene interfaced with a Python front end
      - Ferret and RubyLucene - Ruby
      - Zend Framework (Search) - PHP
      - Montezuma - Common Lisp
  45. Where to get help about Lucene?
      - http://lucene.apache.org/java/docs/mailinglists.html
      - IRC
  46. Books about Lucene
      - Lucene in Action, by Erik Hatcher and Otis Gospodnetic
  47. Questions?
