Lucene with MySQL
Farhan “Frank” Mashraqi
DBA
Fotolog, Inc.
fmashraqi@fotolog.com
softwareengineer99@yahoo.com
Introduction
 Farhan Mashraqi
 Senior MySQL DBA of Fotolog, Inc.
 Known on Planet MySQL as “Frank Mash”
What is Lucene?
 Started in 1997 as a “self-serving project”
 2001: Apache folks adopt Lucene
 Open Source Information Retrieval (IR) Library
- available from the Apache Software Foundation
- Search and Index any textual data
- Doesn’t care about language, source and format of data
Lucene?
 Not a turnkey search engine
 Standard
- for building open-source-based, large-scale search applications
- a high-performance, scalable, cross-platform search toolkit
- Today: ported to C++, C#, Perl, Python, Ruby
- for embedded and customizable search
- widely adopted by OEM software vendors and enterprise IT departments
Lucene
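[Diagram: content from a DB, the Web, a file system, and aggregate data is indexed as documents into the Lucene index; the application passes a user query to Lucene, which searches the index and returns results.]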
What types of queries does it support?
 Single and multi-term queries
 Phrase queries
 Wildcards
 Result ranking
 +apple -computer +pie
 country:USA
 country:USA AND state:CA
Cons
 Need Java resources (programmers)
- JSP experience a plus
 Implementation and Maintenance Cost
 By default
- No installer or wizard for setup (it’s a toolkit)
- No administration or command line tools (demo avail.)
- No spider
- Coding yourself is always an option
- No complex script language support by default
- 3rd-party tools available
Cons 2
- No built-in support for the following (demos show how to implement them)
- HTML format
- PDF format
- Microsoft Office Documents
- Advanced XML queries
- How-tos available
- No database gateway
- Integrates with MySQL with little work
- Web interface
- JSP sample available
- Missing enterprise support
Lucene Libraries
1. The Lucene libraries include core search components such
as a document indexer, index searcher, query parser, and
text analyzer.
Who is behind Lucene?
 Doug Cutting (Author)
Previously at Excite
 Apache Software Foundation
Who uses Lucene?
 IBM
- IBM OmniFind Yahoo! Edition
 CNET
- http://reviews.cnet.com/
- http://www.mail-archive.com/java-user@lucene.apache.org/msg02645.html
 Wikipedia
 Fedex
 Akamai’s EdgeComputing platform
 Technorati
 FURL
 Sun
- Open Solaris Source Browser
When to use Lucene?
 Search applications
 Search functionality for existing applications
 Search-enabling database applications
When not to use?
 Not ideal for
- Adding generic search to a site
- Enterprise systems needing support for proprietary formats
- Extremely high-volume systems
- This can be solved through a better architecture
- Investigate carefully if
- You need more than 100 QPS per system
- Your data is highly volatile
- Updates are actually deletes followed by additions
- Additions are visible to new sessions only
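The last two points (an update is really a delete plus an add, and additions are visible only to sessions opened afterwards) can be illustrated with a toy model in plain Java. This is only a sketch and not Lucene code: `ToyIndex`, `update`, and `openReader` are hypothetical names invented for the example.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only (not the Lucene API): shows update = delete + add,
// and that a reader keeps the snapshot it was opened on.
class ToyIndex {
    private final Map<String, String> docs = new LinkedHashMap<String, String>();

    void add(String id, String text) {
        docs.put(id, text);
    }

    void delete(String id) {
        docs.remove(id);
    }

    // There is no in-place update: remove the old document, then add the new one.
    void update(String id, String newText) {
        delete(id);
        add(id, newText);
    }

    // A "session": an immutable copy of the index as of the moment it is opened.
    Map<String, String> openReader() {
        return Collections.unmodifiableMap(new LinkedHashMap<String, String>(docs));
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("1", "apple pie");
        Map<String, String> oldSession = index.openReader();

        index.update("1", "apple computer");
        index.add("2", "cherry pie");
        Map<String, String> newSession = index.openReader();

        System.out.println("old session sees: " + oldSession);
        System.out.println("new session sees: " + newSession);
    }
}
```

The session opened before the changes keeps seeing the original document; only a session opened afterwards sees the update and the new document.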
Why Lucene?
 What problems does Lucene solve?
- Full text with MySQL
- Pros and Cons
 Powerful features
 Simple API
 Scalable, cost-effective, efficient Indexing
- Powerful Searching through multiple query types
Powerful features
 Simple API
- Sort by any field
- Simultaneous updates and searching
Core Index Classes
 IndexWriter
 Directory
 Analyzer
 Document
 Field
IndexWriter
 IndexWriter
- Creates new index
- Adds document to new index
- Gives you “write” access but no “read” access
- Not the only class that can modify an index
- Other Lucene API classes can as well (e.g., IndexReader can delete documents)
Directory
 Directory
- Represents location of the Lucene Index
- Abstract class
- Allows its subclasses to store the index as they see fit
- FSDirectory
- RAMDirectory
- Interface Identical to FSDirectory
Analyzer
 Analyzer
- Text passed through analyzer before indexing
- Specified in the IndexWriter constructor
- In charge of extracting tokens from the text to be indexed
- The rest is eliminated
- Several implementations available (stop words, lower case, etc.)
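The analyzer steps described above (extract tokens, lower-case them, drop stop words) can be sketched in plain Java. This is a toy illustration and not a Lucene analyzer: the class name `ToyAnalyzer` and its stop-word list are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer (not a Lucene class): lower-cases text, splits it into
// tokens, and drops stop words before anything would reach the index.
class ToyAnalyzer {
    // Hypothetical stop-word list chosen for this example.
    static final Set<String> STOP_WORDS =
        new HashSet<String>(Arrays.asList("a", "an", "and", "the", "of", "is"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        // Split on any run of characters that are not letters or digits.
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [quick, brown, fox, lazy, dog]
        System.out.println(analyze("The Quick Brown Fox and the Lazy Dog"));
    }
}
```

Only the surviving tokens would be indexed, which is why a query usually has to pass through the same analysis to match.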
Document
 Document
- Collection of fields (virtual document)
- Chunk of data
- Fields of a document represent the document or metadata associated with that document
- Original source of the document data (Word, PDF) is irrelevant
- Metadata is indexed and stored separately as fields of a document
- Text only: java.lang.String and java.io.Reader are the only types handled by the core
Field 1
 Field
- A document in an index contains one or more fields (instances of a class called Field)
- Each field represents data that is either queried against or retrieved from the index during search.
- Four different types:
- Keyword
- Isn’t analyzed
- But indexed and stored in the index
- Ideal for:
- URLs
- Paths
- SSN
- Names
- Original value is preserved in its entirety
Field types
- Unindexed
- Neither analyzed nor indexed
- Value stored in index as is
- Fields that need to be displayed with search results (URL
etc)
- But you won’t search based on these fields
- Because original values are stored
- Don’t store fields with very large values
- Especially if index size will be an issue
Field types
- Unstored
- Opposite of UnIndexed
- Field is analyzed and indexed but isn’t stored in the index
- Suitable for indexing a large amount of text that’s not going to be needed in its original form
- E.g., the HTML of a web page
Field types
- Text
- Analyzed and indexed
- Field of this type can be searched against
- Be careful about the field size
- If the indexed data is a String, it will be stored
- If the data comes from a Reader, it will not be stored
Note:
 Field.Text(String, String) and Field.Text(String, Reader) are
different.
- (String, String) stores the field data
- (String, Reader) does not
 To index a String, but not store it, use
- Field.UnStored(String, String)
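The four field types can be summarized as a small lookup table. The sketch below restates the slides in plain Java; `FieldTypeTable` is a hypothetical name, not a Lucene class, and the Text row assumes a String value (a Reader value would not be stored). The comments naming Lucene 2.x `Field.Store` / `Field.Index` combinations are given as an orientation aid and should be checked against the Javadoc of the version in use.

```java
// Sketch only: the analyzed / indexed / stored flags of the four field types.
class FieldTypeTable {
    enum Type {
        KEYWORD(false, true, true),    // 2.x: Field.Store.YES, Field.Index.UN_TOKENIZED
        UNINDEXED(false, false, true), // 2.x: Field.Store.YES, Field.Index.NO
        UNSTORED(true, true, false),   // 2.x: Field.Store.NO,  Field.Index.TOKENIZED
        TEXT(true, true, true);        // 2.x: Field.Store.YES, Field.Index.TOKENIZED

        final boolean analyzed;
        final boolean indexed;
        final boolean stored;

        Type(boolean analyzed, boolean indexed, boolean stored) {
            this.analyzed = analyzed;
            this.indexed = indexed;
            this.stored = stored;
        }

        // A field can be searched only if it is indexed.
        boolean searchable() {
            return indexed;
        }

        // A field can be shown with search results only if it is stored.
        boolean retrievable() {
            return stored;
        }
    }

    public static void main(String[] args) {
        for (Type t : Type.values()) {
            System.out.println(t + ": analyzed=" + t.analyzed
                + ", searchable=" + t.searchable()
                + ", retrievable=" + t.retrievable());
        }
    }
}
```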
Classes for Basic Search Operations
 IndexSearcher
- Opens an index in read-only mode
- Offers a number of search methods
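- Some of which are implemented in the Searcher class
// Open the index at /tmp/index for searching (false = do not create a new index)
IndexSearcher is = new IndexSearcher(
    FSDirectory.getDirectory("/tmp/index", false));
// Match documents whose "contents" field contains the term "lucene"
Query q = new TermQuery(new Term("contents", "lucene"));
Hits hits = is.search(q);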
Classes for Basic Search Operations
 Term
- Basic unit for searching
- Consists of a pair of string elements: name of field and value of field
- Term objects are involved in the indexing process
- Term objects can be constructed and used with TermQuery
Query q = new TermQuery(new Term("contents", "lucene"));
Hits hits = is.search(q);
Classes for Basic Search Operations
 Query
- A number of query subclasses
- BooleanQuery
- PhraseQuery
- PrefixQuery
- PhrasePrefixQuery
- RangeQuery
- FilteredQuery
- SpanQuery
Classes for Basic Search Operations
 TermQuery
- Most basic type of query supported by Lucene
- Used for matching documents that contain fields with
specific values
 Hits
- Simple container of pointers to ranked search results.
- Hits instances don’t load from index all documents that
match a query but only a small portion (performance)
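To make the Term and TermQuery idea concrete, here is a toy inverted index in plain Java. It is a sketch under stated assumptions, not Lucene code: `ToyTermIndex`, `addDocument`, and `termQuery` are hypothetical names, tokenization is a simple whitespace split, and ranking and the lazy loading done by Hits are not modeled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy inverted index (not Lucene): maps a (field, value) term to the
// ids of the documents that contain it.
class ToyTermIndex {
    private final Map<String, List<Integer>> postings = new HashMap<String, List<Integer>>();

    // A term is the pair (field name, field value), as on the Term slide.
    private static String key(String field, String value) {
        return field + ":" + value;
    }

    void addDocument(int docId, String field, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            String k = key(field, token);
            List<Integer> ids = postings.get(k);
            if (ids == null) {
                ids = new ArrayList<Integer>();
                postings.put(k, ids);
            }
            if (!ids.contains(docId)) {
                ids.add(docId);
            }
        }
    }

    // Exact match of one term in one field, in the spirit of a term query.
    List<Integer> termQuery(String field, String value) {
        List<Integer> ids = postings.get(key(field, value));
        if (ids == null) {
            return Collections.<Integer>emptyList();
        }
        return Collections.unmodifiableList(ids);
    }

    public static void main(String[] args) {
        ToyTermIndex index = new ToyTermIndex();
        index.addDocument(1, "contents", "Lucene in Action");
        index.addDocument(2, "contents", "MySQL full text search");
        index.addDocument(3, "title", "lucene");
        System.out.println(index.termQuery("contents", "lucene"));
    }
}
```

Document 3 is not returned for the `contents` query because the same word in a different field is a different term.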
Indexing
 Multiple type indexing
- Scalable
- High Performance
- “over 20MB/minute on Pentium M 1.5GHz”
- Incremental indexing and batch indexing have same cost
- Index Size
- index size roughly 20-30% the size of text indexed
- Compare to MySQL’s FULLTEXT index size
- Cost-effective
- 1 MB heap (small RAM needed)
Powerful Searching & Sorting
- Ranked Searching
- Multiple Powerful Query Types
- phrase queries, wildcard queries, proximity queries, range
queries and more
- Fielded Searching (e.g., title, author, contents)
- Date-Range Searching
- Multiple Index Searching with Merged Results
- Sort by any field
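Fielded searching combines with the required (`+`), prohibited (`-`), and `AND` operators shown earlier in these slides (`+apple -computer +pie`, `country:USA AND state:CA`). The plain-Java sketch below evaluates that small subset against a single document. It is a toy written for illustration and not Lucene's QueryParser: `ToyQueryMatcher` and its methods are hypothetical names, and phrases, wildcards, ranges, and ranking are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy matcher (not Lucene's QueryParser): evaluates a small subset of the
// query syntax against one document held as a field -> text map.
class ToyQueryMatcher {
    static final class Clause {
        final String field;
        final String term;
        final boolean prohibited;
        boolean required;

        Clause(String field, String term, boolean required, boolean prohibited) {
            this.field = field;
            this.term = term;
            this.required = required;
            this.prohibited = prohibited;
        }
    }

    static List<Clause> parse(String query, String defaultField) {
        List<Clause> clauses = new ArrayList<Clause>();
        boolean requireNext = false;
        for (String part : query.trim().split("\\s+")) {
            if (part.isEmpty()) {
                continue;
            }
            if (part.equals("AND")) {
                // AND makes the clauses on both sides required.
                if (!clauses.isEmpty()) {
                    Clause previous = clauses.get(clauses.size() - 1);
                    previous.required = !previous.prohibited;
                }
                requireNext = true;
                continue;
            }
            boolean prohibited = part.startsWith("-");
            boolean required = !prohibited && (part.startsWith("+") || requireNext);
            String body = (part.startsWith("+") || prohibited) ? part.substring(1) : part;
            int colon = body.indexOf(':');
            String field = colon > 0 ? body.substring(0, colon) : defaultField;
            String term = colon > 0 ? body.substring(colon + 1) : body;
            clauses.add(new Clause(field, term.toLowerCase(Locale.ROOT), required, prohibited));
            requireNext = false;
        }
        return clauses;
    }

    static boolean matches(String query, Map<String, String> doc, String defaultField) {
        boolean hasRequired = false;
        boolean optionalHit = false;
        for (Clause c : parse(query, defaultField)) {
            String text = doc.get(c.field);
            boolean hit = text != null
                && Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+")).contains(c.term);
            if (c.prohibited) {
                if (hit) {
                    return false;
                }
            } else if (c.required) {
                hasRequired = true;
                if (!hit) {
                    return false;
                }
            } else if (hit) {
                optionalHit = true;
            }
        }
        // With no required clause, at least one optional clause must match.
        return hasRequired || optionalHit;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<String, String>();
        doc.put("contents", "Grandma's apple pie recipe");
        doc.put("country", "USA");
        doc.put("state", "CA");
        System.out.println(matches("+apple -computer +pie", doc, "contents"));
        System.out.println(matches("country:USA AND state:CA", doc, "contents"));
        System.out.println(matches("+apple +computer", doc, "contents"));
    }
}
```

A term without a `field:` prefix is looked up in the default field, which is the role the second argument of the query parser plays in the search code later in these slides.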
How to Integrate Your Application With Lucene
 Install JDK (5 or 6)
 Testing Lucene Demo
Prerequisites: JDK
 Installing JDK
- To download, visit the JDK 5 page: http://java.sun.com/javase/downloads/index_jdk5.jsp
- or the JDK 6 download page: http://java.sun.com/javase/downloads/index.jsp
- Once downloaded:
- Change Permissions
- [root@srv31 jdk-install]# chmod 755 jdk-1_5_0_09-linux-i586.bin
- Install
- [root@srv31 jdk-install]# ./jdk-1_5_0_09-linux-i586.bin
Testing Lucene Demo
 Step 2: Testing Lucene Demo
- Set up your environment
- vi /root/.bashrc
- export PATH=/var/www/html/java/jdk1.5.0_09/bin:$PATH
export CLASSPATH=.:/var/www/html/java/jdk1.5.0_09:/var/www/html/java/jdk1.5.0_09/lib:/var/www/html/java/jdk1.5.0_09/lib/lucene-2.1.0/lucene-core-2.1.0.jar:/var/www/html/java/jdk1.5.0_09/lib/lucene-2.1.0/lucene-demos-2.1.0.jar:/var/www/html/java/jdk1.5.0_09/lib/xmlrpc-3.0a1.jar
- Now download the following and place them in /var/www/html/java/jdk1.5.0_09/lib/lucene-2.1.0/
- Lucene Java
- http://www.apache.org/dyn/closer.cgi/lucene/java/
- XMLRPC Library
- [root@srv31 lib]# wget http://mirror.candidhosting.com/pub/apache/lucene/java/lucene-2.1.0.zip
[root@srv31 lib]# unzip lucene-2.1.0.zip
[root@srv31 lib]# cp -p lucene-2.1.0/lucene-core-2.1.0.jar ../lib/
[root@srv31 lib]# cp -p lucene-2.1.0/lucene-demos-2.1.0.jar ../lib/
[root@srv31 lib]# cp -p /var/www/html/java/jdk1.5.0_06/lib/xmlrpc-3.0a1.jar /var/www/html/java/jdk1.5.0_09/lib/xmlrpc-3.0a1.jar
Now "dot" (source) the .bashrc file edited above:
[root@srv31 lib]# . /root/.bashrc
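A long CLASSPATH like the one above is easy to get wrong, so it can help to check that every entry exists before running the demo. The following plain-Java sketch does that; `ClasspathCheck` and `missingEntries` are hypothetical names for this example, and it relies only on the standard `java.class.path` system property, which reflects the CLASSPATH variable when no `-cp` option is given.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sanity check (not part of Lucene): lists classpath entries that do not exist on disk.
class ClasspathCheck {
    static List<String> missingEntries(String classpath) {
        List<String> missing = new ArrayList<String>();
        // Entries are separated by ':' on Linux and ';' on Windows.
        for (String entry : classpath.split(Pattern.quote(File.pathSeparator))) {
            if (!entry.isEmpty() && !new File(entry).exists()) {
                missing.add(entry);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String classpath = System.getProperty("java.class.path");
        List<String> missing = missingEntries(classpath);
        if (missing.isEmpty()) {
            System.out.println("All classpath entries exist.");
        } else {
            for (String entry : missing) {
                System.out.println("Missing: " + entry);
            }
        }
    }
}
```

Compile it with `javac ClasspathCheck.java` and run `java ClasspathCheck` in the same shell in which the demo will be run.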
Testing Lucene Demo 2
- Believe it or not, we are now ready to test the Lucene Demo.
- Indexing
- I just let it loose on a randomly picked directory to give you an idea:
[root@srv31 lib]# java org.apache.lucene.demo.IndexFiles /var/www/html/java/jdk1.5.0_09/
adding /var/www/html/java/jdk1.5.0_09/include/jni.h
adding /var/www/html/java/jdk1.5.0_09/include/linux/jawt_md.h
adding /var/www/html/java/jdk1.5.0_09/include/linux/jni_md.h
adding /var/www/html/java/jdk1.5.0_09/include/jvmti.h
adding /var/www/html/java/jdk1.5.0_09/include/jvmdi.h
Optimizing...
157013 total milliseconds
Testing Lucene Demo 3
 [root@srv31 lib]# java org.apache.lucene.demo.SearchFiles
 Query: java
 Searching for: java
 1159 total matching documents
 1. /var/www/html/java/jdk1.5.0_09/jre/lib/locale/zh.GBK/LC_MESSAGES/sunw_java_plugin.mo
 2. /var/www/html/java/jdk1.5.0_09/jre/lib/locale/zh/LC_MESSAGES/sunw_java_plugin.mo
 3. /var/www/html/java/jdk1.5.0_09/jre/lib/locale/ko/LC_MESSAGES/sunw_java_plugin.mo
 4. /var/www/html/java/jdk1.5.0_09/jre/lib/locale/zh_HK.BIG5HK/LC_MESSAGES/sunw_java_plugin.mo
 5. /var/www/html/java/jdk1.5.0_09/jre/lib/locale/zh_TW.BIG5/LC_MESSAGES/sunw_java_plugin.mo
 6. /var/www/html/java/jdk1.5.0_09/jre/lib/locale/zh_TW/LC_MESSAGES/sunw_java_plugin.mo
 7. /var/www/html/java/jdk1.5.0_09/demo/jfc/Stylepad/README.txt
 8. /var/www/html/java/jdk1.5.0_09/demo/jfc/Notepad/README.txt
 9. /var/www/html/java/jdk1.5.0_09/demo/plugin/jfc/Stylepad/README.txt
 10. /var/www/html/java/jdk1.5.0_09/demo/plugin/jfc/Notepad/README.txt
 more (y/n) ?
Loading data from MySQL
 …
 // Connect to the local MySQL database "odp" over JDBC
 String url = "jdbc:mysql://127.0.0.1/odp";
 Connection con = DriverManager.getConnection(url, "user", "pass");
 Statement Stmt = con.createStatement();
 // Fetch every article row to be indexed
 ResultSet RS = Stmt.executeQuery("SELECT * FROM articles");
Loading data from MySQL 2
 while (RS.next()) {
     // System.out.print("\"" + RS.getString(1) + "\"");
     try {
         // create Document
         final Document doc = new Document();
         // Analyzed, indexed, and stored fields. Field.Text is the Lucene 1.x factory method;
         // in Lucene 2.x the equivalent is new Field(name, value, Field.Store.YES, Field.Index.TOKENIZED).
         doc.add(Field.Text("title", RS.getString("title")));
         doc.add(Field.Text("type", "article"));
         doc.add(Field.Text("author", RS.getString("author")));
         doc.add(Field.Text("body", RS.getString("body")));
         doc.add(Field.Text("extended", RS.getString("extended")));
         …
Loading data from MySQL 3
         …
         doc.add(Field.Text("tags", RS.getString("tags")));
         // Stored so they can be returned with results, but not searchable
         doc.add(Field.UnIndexed("permalink", RS.getString("permalink")));
         doc.add(Field.UnIndexed("id", RS.getString("id")));
         doc.add(Field.UnIndexed("member_id", RS.getString("member_id")));
         doc.add(Field.UnIndexed("portal_id", RS.getString("portal_id")));
         //doc.add(Field.Text("id", RS.getString("id")));
         writer.addDocument(doc);
     }
     catch (IOException e) { System.err.println("Unable to index article"); }
 }
 // close connection
Searching Data using XML RPC
 public static void searchArticles(final String search, final int numberOfResults)
     throws Exception
 {
     final Query query;
     Analyzer analyzer = new StandardAnalyzer();
     // "title" is the default field for terms without a field: prefix.
     // The static parse(String, String, Analyzer) is the Lucene 1.x form;
     // in Lucene 2.x use new QueryParser("title", analyzer).parse(search).
     query = QueryParser.parse(search, "title", analyzer);
     final ArrayList<Integer> ids = new ArrayList<Integer>();
     try {
         final IndexReader reader = IndexReader.open(INDEX_DIR);
         final IndexSearcher searcher = new IndexSearcher(reader);
         final Hits hits = searcher.search(query);
         // Walk the ranked hits, stopping after numberOfResults
         for (int i = 0; i != hits.length() && i != numberOfResults; ++i) {
             final Document doc = hits.doc(i);
             // id field needs to be added //ids.add(new Integer(doc.getField("id").stringValue()));
             …
Searching Data using XML RPC 2
             …
             // Collect the MySQL primary key stored with each hit
             ids.add(Integer.valueOf(doc.getField("id").stringValue()));
             System.out.println("Found + " + doc.getField("id").stringValue());
             System.out.println("--Title = " + doc.getField("title").stringValue());
             System.out.println("--Type = " + doc.getField("type").stringValue());
             System.out.println("--Body = " + doc.getField("body").stringValue());
             System.out.println("--Author = " + doc.getField("author").stringValue());
             System.out.println("--Extended = " + doc.getField("extended").stringValue());
             System.out.println("--Tags = " + doc.getField("tags").stringValue());
             System.out.println("--Permalink = " + doc.getField("permalink").stringValue());
             System.out.println("--Member Id = " + doc.getField("member_id").stringValue());
             System.out.println("--Portal Id = " + doc.getField("portal_id").stringValue());
Searching Data using XML RPC 3
         }
         searcher.close();
         reader.close();
     }
     catch (IOException e) {
         System.out.println("Error while reading article data from index");
     }
 }
Future of Lucene
 Advanced Linguistics Modules that integrate with Lucene
- Support for complex script languages
- Basis Technology’s Rosette® Linguistics Platform
- The same linguistic software that powers multilingual web search on Google, Live.com, Yahoo! and leading enterprise search engines
- “allows Lucene-based applications to index and search text in multiple languages concurrently, including complex script languages such as Arabic, Chinese, Farsi, Japanese and Korean.”
- www.basistech.com/lucene
What are the ports of Lucene?
 Lucene4c - C
 CLucene - C++
 MUTIS - Delphi
 Lucene.Net - a straight C#/.NET port of Lucene by the Apache Software Foundation, fully compatible with it.
 Plucene - Perl
 KinoSearch - Perl
 PyLucene - Lucene interfaced with a Python front-end
 Ferret and RubyLucene - Ruby
 Zend Framework (Search) - PHP
 Montezuma - Common Lisp
Where to get help about Lucene?
 http://lucene.apache.org/java/docs/mailinglists.html
 IRC
Books about Lucene
 Lucene in Action
- Erik Hatcher and Otis Gospodnetic
Questions?

More Related Content

What's hot

Oracle Sql Tuning
Oracle Sql TuningOracle Sql Tuning
Oracle Sql Tuning
Chris Adkin
 
The MySQL SYS Schema
The MySQL SYS SchemaThe MySQL SYS Schema
The MySQL SYS Schema
Mark Leith
 
The Elastic ELK Stack
The Elastic ELK StackThe Elastic ELK Stack
The Elastic ELK Stack
enterprisesearchmeetup
 
Datastage ppt
Datastage pptDatastage ppt
Datastage ppt
Newyorksys.com
 
SQL Tuning 101
SQL Tuning 101SQL Tuning 101
SQL Tuning 101
Carlos Sierra
 
My sql 5.6 master slave and master-master replication.step by step configurat...
My sql 5.6 master slave and master-master replication.step by step configurat...My sql 5.6 master slave and master-master replication.step by step configurat...
My sql 5.6 master slave and master-master replication.step by step configurat...
Pawan Kumar
 
Splunk ES Asset & Identity
Splunk ES Asset & IdentitySplunk ES Asset & Identity
Splunk ES Asset & Identity
Vikram Kumar Yadav
 
What Is ELK Stack | ELK Tutorial For Beginners | Elasticsearch Kibana | ELK S...
What Is ELK Stack | ELK Tutorial For Beginners | Elasticsearch Kibana | ELK S...What Is ELK Stack | ELK Tutorial For Beginners | Elasticsearch Kibana | ELK S...
What Is ELK Stack | ELK Tutorial For Beginners | Elasticsearch Kibana | ELK S...
Edureka!
 
Csw2016 wheeler barksdale-gruskovnjak-execute_mypacket
Csw2016 wheeler barksdale-gruskovnjak-execute_mypacketCsw2016 wheeler barksdale-gruskovnjak-execute_mypacket
Csw2016 wheeler barksdale-gruskovnjak-execute_mypacket
CanSecWest
 
The Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaThe Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago Mola
Spark Summit
 
Elastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & KibanaElastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & Kibana
SpringPeople
 
Postgresql
PostgresqlPostgresql
Operation Analytics and Investigating Metric Spike.pptx
Operation Analytics and Investigating Metric Spike.pptxOperation Analytics and Investigating Metric Spike.pptx
Operation Analytics and Investigating Metric Spike.pptx
AkshayChavan879707
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
Yousun Jeong
 
Best Practices with ODI : Flexibility
Best Practices with ODI : FlexibilityBest Practices with ODI : Flexibility
Best Practices with ODI : Flexibility
Gurcan Orhan
 
Auditing and Monitoring PostgreSQL/EPAS
Auditing and Monitoring PostgreSQL/EPASAuditing and Monitoring PostgreSQL/EPAS
Auditing and Monitoring PostgreSQL/EPAS
EDB
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
Amazon Web Services
 
Reactive to Proactive: Intelligent Troubleshooting and Monitoring with Splunk
Reactive to Proactive: Intelligent Troubleshooting and Monitoring with SplunkReactive to Proactive: Intelligent Troubleshooting and Monitoring with Splunk
Reactive to Proactive: Intelligent Troubleshooting and Monitoring with Splunk
Splunk
 
Dask: Scaling Python
Dask: Scaling PythonDask: Scaling Python
Dask: Scaling Python
Matthew Rocklin
 
RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction
MIJIN AN
 

What's hot (20)

Oracle Sql Tuning
Oracle Sql TuningOracle Sql Tuning
Oracle Sql Tuning
 
The MySQL SYS Schema
The MySQL SYS SchemaThe MySQL SYS Schema
The MySQL SYS Schema
 
The Elastic ELK Stack
The Elastic ELK StackThe Elastic ELK Stack
The Elastic ELK Stack
 
Datastage ppt
Datastage pptDatastage ppt
Datastage ppt
 
SQL Tuning 101
SQL Tuning 101SQL Tuning 101
SQL Tuning 101
 
My sql 5.6 master slave and master-master replication.step by step configurat...
My sql 5.6 master slave and master-master replication.step by step configurat...My sql 5.6 master slave and master-master replication.step by step configurat...
My sql 5.6 master slave and master-master replication.step by step configurat...
 
Splunk ES Asset & Identity
Splunk ES Asset & IdentitySplunk ES Asset & Identity
Splunk ES Asset & Identity
 
What Is ELK Stack | ELK Tutorial For Beginners | Elasticsearch Kibana | ELK S...
What Is ELK Stack | ELK Tutorial For Beginners | Elasticsearch Kibana | ELK S...What Is ELK Stack | ELK Tutorial For Beginners | Elasticsearch Kibana | ELK S...
What Is ELK Stack | ELK Tutorial For Beginners | Elasticsearch Kibana | ELK S...
 
Csw2016 wheeler barksdale-gruskovnjak-execute_mypacket
Csw2016 wheeler barksdale-gruskovnjak-execute_mypacketCsw2016 wheeler barksdale-gruskovnjak-execute_mypacket
Csw2016 wheeler barksdale-gruskovnjak-execute_mypacket
 
The Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago MolaThe Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago Mola
 
Elastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & KibanaElastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & Kibana
 
Postgresql
PostgresqlPostgresql
Postgresql
 
Operation Analytics and Investigating Metric Spike.pptx
Operation Analytics and Investigating Metric Spike.pptxOperation Analytics and Investigating Metric Spike.pptx
Operation Analytics and Investigating Metric Spike.pptx
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Best Practices with ODI : Flexibility
Best Practices with ODI : FlexibilityBest Practices with ODI : Flexibility
Best Practices with ODI : Flexibility
 
Auditing and Monitoring PostgreSQL/EPAS
Auditing and Monitoring PostgreSQL/EPASAuditing and Monitoring PostgreSQL/EPAS
Auditing and Monitoring PostgreSQL/EPAS
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Reactive to Proactive: Intelligent Troubleshooting and Monitoring with Splunk
Reactive to Proactive: Intelligent Troubleshooting and Monitoring with SplunkReactive to Proactive: Intelligent Troubleshooting and Monitoring with Splunk
Reactive to Proactive: Intelligent Troubleshooting and Monitoring with Splunk
 
Dask: Scaling Python
Dask: Scaling PythonDask: Scaling Python
Dask: Scaling Python
 
RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction
 

Viewers also liked

Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
dnaber
 
Lucandra
LucandraLucandra
Lucandra
otisg
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
otisg
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
lucenerevolution
 
Lucene
LuceneLucene
Lucene
Matt Wood
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
Adrien Grand
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's Thesis
Josiane Gamgo
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
Rahul Jain
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1
YI-CHING WU
 
Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialecki
lucenerevolution
 
Lucene And Solr Intro
Lucene And Solr IntroLucene And Solr Intro
Lucene And Solr Intro
pascaldimassimo
 
Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Lucene
otisg
 
Apache lucene
Apache luceneApache lucene
Apache lucene
Dr. Abhiram Gandhe
 
Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadooplucenerevolution
 
Introduction To Apache Lucene
Introduction To Apache LuceneIntroduction To Apache Lucene
Introduction To Apache Lucene
Mindfire Solutions
 
Solr
SolrSolr
Solr
sortivo
 
Architecture and implementation of Apache Lucene
Architecture and implementation of Apache LuceneArchitecture and implementation of Apache Lucene
Architecture and implementation of Apache Lucene
Josiane Gamgo
 
Search Lucene
Search LuceneSearch Lucene
Search Lucene
Jeremy Coates
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and Solr
Bertrand Delacretaz
 

Viewers also liked (20)

Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Lucene basics
Lucene basicsLucene basics
Lucene basics
 
Lucandra
LucandraLucandra
Lucandra
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Lucene
LuceneLucene
Lucene
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's Thesis
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1
 
Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialecki
 
Lucene And Solr Intro
Lucene And Solr IntroLucene And Solr Intro
Lucene And Solr Intro
 
Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Lucene
 
Apache lucene
Apache luceneApache lucene
Apache lucene
 
Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoop
 
Introduction To Apache Lucene
Introduction To Apache LuceneIntroduction To Apache Lucene
Introduction To Apache Lucene
 
Solr
SolrSolr
Solr
 
Architecture and implementation of Apache Lucene
Architecture and implementation of Apache LuceneArchitecture and implementation of Apache Lucene
Architecture and implementation of Apache Lucene
 
Search Lucene
Search LuceneSearch Lucene
Search Lucene
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and Solr
 

Similar to Lucene and MySQL

Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Manish kumar
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneSwapnil & Patil
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
lucenerevolution
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
Asad Abbas
 
Search Me: Using Lucene.Net
Search Me: Using Lucene.NetSearch Me: Using Lucene.Net
Search Me: Using Lucene.Net
gramana
 
FOSSASIA 2015 - 10 Features your developers are missing when stuck with Propr...
FOSSASIA 2015 - 10 Features your developers are missing when stuck with Propr...FOSSASIA 2015 - 10 Features your developers are missing when stuck with Propr...
FOSSASIA 2015 - 10 Features your developers are missing when stuck with Propr...
Ashnikbiz
 
Apache Lucene Searching The Web
Apache Lucene Searching The WebApache Lucene Searching The Web
Apache Lucene Searching The Web
Francisco Gonçalves
 
Solr5
Solr5Solr5
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
pmanvi
 
Elasticsearch and Spark
Elasticsearch and SparkElasticsearch and Spark
Elasticsearch and Spark
Audible, Inc.
 
Philly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Philly PHP: April '17 Elastic Search Introduction by Aditya BhamidpatiPhilly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Philly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Robert Calcavecchia
 
Search Engines: Best Practice
Search Engines: Best PracticeSearch Engines: Best Practice
Search Engines: Best Practice
Yuliya_Prach
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrRahul Jain
 
Apache solr
Apache solrApache solr
Apache solr
Dipen Rangwani
 
New Persistence Features in Spring Roo 1.1
New Persistence Features in Spring Roo 1.1New Persistence Features in Spring Roo 1.1
New Persistence Features in Spring Roo 1.1
Stefan Schmidt
 
Apache Solr vs Oracle Endeca
Apache Solr vs Oracle EndecaApache Solr vs Oracle Endeca
Apache Solr vs Oracle Endeca
Pedro Melo Pereira
 

Similar to Lucene and MySQL (20)

Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
 
Search Me: Using Lucene.Net
Search Me: Using Lucene.NetSearch Me: Using Lucene.Net
Search Me: Using Lucene.Net
 
FOSSASIA 2015 - 10 Features your developers are missing when stuck with Propr...
FOSSASIA 2015 - 10 Features your developers are missing when stuck with Propr...FOSSASIA 2015 - 10 Features your developers are missing when stuck with Propr...
FOSSASIA 2015 - 10 Features your developers are missing when stuck with Propr...
 
Apache Lucene Searching The Web
Apache Lucene Searching The WebApache Lucene Searching The Web
Apache Lucene Searching The Web
 
Solr5
Solr5Solr5
Solr5
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
 
Elasticsearch and Spark
Elasticsearch and SparkElasticsearch and Spark
Elasticsearch and Spark
 
Philly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Philly PHP: April '17 Elastic Search Introduction by Aditya BhamidpatiPhilly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Philly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
 
Search Engines: Best Practice
Search Engines: Best PracticeSearch Engines: Best Practice
Search Engines: Best Practice
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Apache solr
Apache solrApache solr
Apache solr
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Oracle by Muhammad Iqbal
Oracle by Muhammad IqbalOracle by Muhammad Iqbal
Oracle by Muhammad Iqbal
 
New Persistence Features in Spring Roo 1.1
New Persistence Features in Spring Roo 1.1New Persistence Features in Spring Roo 1.1
New Persistence Features in Spring Roo 1.1
 
Apache Solr vs Oracle Endeca
Apache Solr vs Oracle EndecaApache Solr vs Oracle Endeca
Apache Solr vs Oracle Endeca
 
SphinxSE with MySQL
SphinxSE with MySQLSphinxSE with MySQL
SphinxSE with MySQL
 
Technologies for Websites
Technologies for WebsitesTechnologies for Websites
Technologies for Websites
 

Lucene and MySQL

  • 1. Lucene with MySQL
      Farhan “Frank” Mashraqi, DBA, Fotolog, Inc.
      fmashraqi@fotolog.com, softwareengineer99@yahoo.com
  • 2. Introduction
      - Farhan Mashraqi
      - Senior MySQL DBA at Fotolog, Inc.
      - Known on Planet MySQL as “Frank Mash”
  • 3. What is Lucene?
      - Started in 1997 as a “self-serving project”
      - 2001: adopted by the Apache folks
      - Open-source Information Retrieval (IR) library
          - Available from the Apache Software Foundation
          - Searches and indexes any textual data
          - Doesn’t care about the language, source, or format of the data
  • 4. Lucene?
      - Not a turnkey search engine
      - A standard
          - for building open-source-based, large-scale search applications
          - a high-performance, scalable, cross-platform search toolkit
          - today: ported to C++, C#, Perl, Python, Ruby
          - for embedded and customizable search
          - widely adopted by OEM software vendors and enterprise IT departments
  • 5. Lucene
      [Architecture diagram; labels: DB, Web, File System, Aggregate Data, Index Documents, Index, User Query, Search Index, Search Results, Lucene, Application]
  • 6. What types of queries does it support?
      - Single- and multi-term queries
      - Phrase queries
      - Wildcards
      - Result ranking
      - +apple -computer +pie
      - country:USA
      - country:USA AND state:CA
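
A query such as `+apple -computer +pie` requires the terms marked `+` and excludes the terms marked `-`. The sketch below illustrates only that matching rule in plain Java. It is not Lucene code: the class name `RequiredProhibitedDemo` and its crude tokenizer are assumptions made for the example, and terms without a prefix are simply ignored here, whereas in Lucene they would contribute to ranking.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy matcher for the "+required -prohibited" query shape; not Lucene code. */
class RequiredProhibitedDemo {

    /** Returns true if the text contains every +term and none of the -terms. */
    static boolean matches(String query, String text) {
        // Lowercase and split on non-alphanumerics: a crude stand-in for an analyzer.
        Set<String> words = new HashSet<>(
                Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")));
        for (String clause : query.trim().split("\\s+")) {
            if (clause.length() < 2) {
                continue; // empty clause or a lone prefix character
            }
            String term = clause.substring(1).toLowerCase(Locale.ROOT);
            if (clause.charAt(0) == '+' && !words.contains(term)) {
                return false; // a required term is missing
            }
            if (clause.charAt(0) == '-' && words.contains(term)) {
                return false; // a prohibited term is present
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String q = "+apple -computer +pie";
        System.out.println(matches(q, "Grandma's apple pie recipe")); // true
        System.out.println(matches(q, "Apple computer pie chart"));   // false
    }
}
```
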
  • 7. Cons
      - Needs Java resources (programmers)
          - JSP experience is a plus
      - Implementation and maintenance cost
      - By default
          - No installer or setup wizard (it’s a toolkit)
          - No administration or command-line tools (demo available)
          - No spider
              - Coding one yourself is always an option
          - No complex-script language support by default
              - Third-party tools available
  • 8. Cons 2
      - No built-in support for the following (demos show how to implement it)
          - HTML format
          - PDF format
          - Microsoft Office documents
          - Advanced XML queries
          - “How-tos” available
      - No database gateway
          - Integrates with MySQL with little work
      - Web interface
          - JSP sample available
      - Missing enterprise support
  • 9. Lucene Libraries
      - The Lucene libraries include core search components such as a document indexer, index searcher, query parser, and text analyzer.
  • 10. Who is behind Lucene?
      - Doug Cutting (author), previously at Excite
      - Apache Software Foundation
  • 11. Who uses Lucene?
      - IBM
          - IBM OmniFind Yahoo! Edition
      - CNET
          - http://reviews.cnet.com/
          - http://www.mail-archive.com/java-user@lucene.apache.org/msg02645.html
      - Wikipedia
      - FedEx
      - Akamai’s EdgeComputing platform
      - Technorati
      - FURL
      - Sun
          - OpenSolaris Source Browser
  • 12. When to use Lucene?
      - Search applications
      - Search functionality for existing applications
      - Search-enabling a database application
  • 13. When not to use it?
      - Not ideal for
          - Adding generic search to a site
          - Enterprise systems needing support for proprietary formats
          - Extremely high-volume systems
              - This can be solved through a better architecture
      - Investigate carefully if
          - You need more than 100 QPS per system
          - Your data is highly volatile
              - Updates are actually deletes followed by additions
              - Additions are visible to new sessions only
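
The two points about volatile data can be pictured with a toy model: an update is a delete plus an add, and a reader only sees what existed when it was opened. The sketch below is plain Java and not Lucene code. `ToySnapshotIndex` and its methods are hypothetical names, and copying the whole map per reader is only a stand-in for however Lucene really isolates readers.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Toy model of "update = delete + add" and of readers that see a snapshot. Not Lucene code. */
class ToySnapshotIndex {
    private final Map<Integer, String> docs = new HashMap<>();

    void add(int id, String text) {
        docs.put(id, text);
    }

    void delete(int id) {
        docs.remove(id);
    }

    /** There is no in-place update: the old document is removed and a new one is added. */
    void update(int id, String text) {
        delete(id);
        add(id, text);
    }

    /** A "session": a read-only copy of the index as it was at open time. */
    Map<Integer, String> openReader() {
        return Collections.unmodifiableMap(new HashMap<>(docs));
    }

    public static void main(String[] args) {
        ToySnapshotIndex index = new ToySnapshotIndex();
        index.add(1, "first");
        Map<Integer, String> oldSession = index.openReader();
        index.add(2, "second");
        System.out.println(oldSession.containsKey(2));         // false: opened before the add
        System.out.println(index.openReader().containsKey(2)); // true: a new session sees it
    }
}
```
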
  • 14. Why Lucene?
      - What problems does Lucene solve?
          - Full text with MySQL
          - Pros and cons
      - Powerful features
      - Simple API
      - Scalable, cost-effective, efficient indexing
          - Powerful searching through multiple query types
  • 15. Powerful features
      - Simple API
          - Sort by any field
          - Simultaneous updates and searching
  • 16. Core Index Classes
      - IndexWriter
      - Directory
      - Analyzer
      - Document
      - Field
  • 17. IndexWriter
      - Creates a new index
      - Adds documents to the index
      - Gives you “write” access but no “read” access
      - Not the only class used to modify an index
          - The Lucene API can be used as well
  • 18. Directory
      - Represents the location of the Lucene index
      - Abstract class
          - Allows its subclasses to store the index as they see fit
      - FSDirectory
      - RAMDirectory
          - Interface identical to FSDirectory
  • 19. Analyzer
      - Text is passed through an analyzer before indexing
      - Specified in the IndexWriter constructor
      - In charge of extracting tokens from the text to be indexed
          - The rest is eliminated
      - Several implementations available (stop words, lowercasing, etc.)
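
To make "extracting tokens, the rest is eliminated" concrete, here is a minimal sketch of what an analyzer does, in plain Java rather than Lucene's Analyzer classes. `ToyAnalyzer` and its stop-word list are assumptions for illustration only; real Lucene analyzers handle far more (Unicode, positions, stemming, and so on).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analyzer: lowercases, splits on non-alphanumerics, drops stop words. Not Lucene code. */
class ToyAnalyzer {
    // Hypothetical stop-word list chosen for this example only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    /** Returns the tokens that would be indexed; everything else is discarded. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                out.add(raw);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("The Future of Lucene and MySQL")); // [future, lucene, mysql]
    }
}
```
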
  • 20. Document
      - Collection of fields (a virtual document)
      - Chunk of data
      - The fields of a document represent the document itself or metadata associated with it
      - The original source of the document data (Word, PDF) is irrelevant
      - Metadata is indexed and stored separately as fields of a document
      - Text only: java.lang.String and java.io.Reader are the only things handled by the core
  • 21. Field 1
      - A document in an index contains one or more fields (represented by a class called Field)
      - Each field represents data that is either queried against or retrieved from the index during search
      - Four different types:
          - Keyword
              - Isn’t analyzed
              - But is indexed and stored in the index
              - Ideal for:
                  - URLs
                  - Paths
                  - SSNs
                  - Names
              - The original value is preserved in its entirety
  • 22. Field types
      - UnIndexed
          - Neither analyzed nor indexed
          - Value is stored in the index as is
          - For fields that need to be displayed with search results (URL, etc.)
              - But that you won’t search on
          - Because original values are stored
              - Don’t store fields with very large values
              - Especially if index size will be an issue
  • 23. Field types
      - UnStored
          - Opposite of UnIndexed
          - Analyzed and indexed, but not stored in the index
          - Suitable for indexing a large amount of text that won’t be needed in its original form
              - E.g., the HTML of a web page
  • 24. Field types
      - Text
          - Analyzed and indexed
          - Fields of this type can be searched against
          - Be careful about the field size
          - If the data indexed is a String, it will be stored
          - If the data comes from a Reader, it will not be stored
  • 25. Note:
      - Field.Text(String, String) and Field.Text(String, Reader) are different
          - (String, String) stores the field data
          - (String, Reader) does not
      - To index a String but not store it, use
          - Field.UnStored(String, String)
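
The four field types differ only in three yes/no properties: analyzed, indexed, stored. The enum below restates the slides as a lookup table in plain Java. `FieldKind` is a name invented for this summary and is not part of Lucene; the flag values come straight from the descriptions above.

```java
/** The four field types from the slides, restated as flags. An illustration, not Lucene's Field class. */
enum FieldKind {
    KEYWORD(false, true, true),    // not analyzed, but indexed and stored
    UNINDEXED(false, false, true), // stored only: displayable, not searchable
    UNSTORED(true, true, false),   // searchable, but the original text is not kept
    TEXT(true, true, true);        // stored when built from a String, not from a Reader

    final boolean analyzed;
    final boolean indexed;
    final boolean stored;

    FieldKind(boolean analyzed, boolean indexed, boolean stored) {
        this.analyzed = analyzed;
        this.indexed = indexed;
        this.stored = stored;
    }

    /** Can a query match on this field? */
    boolean searchable() {
        return indexed;
    }

    /** Can the original value be read back from a search result? */
    boolean retrievable() {
        return stored;
    }

    public static void main(String[] args) {
        for (FieldKind k : values()) {
            System.out.printf("%-9s analyzed=%b indexed=%b stored=%b%n",
                    k, k.analyzed, k.indexed, k.stored);
        }
    }
}
```

In the MySQL loading example later in the deck, `permalink` and `id` use the UnIndexed kind (display only), while `title` and `body` use Text.
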
  • 26. Classes for Basic Search Operations
      - IndexSearcher
          - Opens an index in read-only mode
          - Offers a number of search methods
              - Some of which are implemented in the Searcher class

        IndexSearcher is = new IndexSearcher(
            FSDirectory.getDirectory("/tmp/index", false));
        Query q = new TermQuery(new Term("contents", "lucene"));
        Hits hits = is.search(q);

  • 27. Classes for Basic Search Operations
      - Term
          - Basic unit for searching
          - Consists of a pair of strings: the name of the field and the value of the field
          - Term objects are involved in the indexing process
          - Term objects can be constructed and used with TermQuery

        Query q = new TermQuery(new Term("contents", "lucene"));
        Hits hits = is.search(q);

  • 28. Classes for Basic Search Operations
      - Query
          - A number of Query subclasses
              - BooleanQuery
              - PhraseQuery
              - PrefixQuery
              - PhrasePrefixQuery
              - RangeQuery
              - FilteredQuery
              - SpanQuery
  • 29. Classes for Basic Search Operations
      - TermQuery
          - Most basic type of query supported by Lucene
          - Used for matching documents that contain fields with specific values
      - Hits
          - Simple container of pointers to ranked search results
          - For performance, a Hits instance doesn’t load every matching document from the index, only a small portion
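
A term is a (field, value) pair, and a term query returns the documents whose field contains that value. The sketch below shows the idea with a toy inverted index in plain Java. It is not how Lucene stores its index, there is no ranking, and `ToyTermIndex` with its methods is a hypothetical name used only here.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy inverted index keyed by (field, value) terms. Not Lucene code; no ranking. */
class ToyTermIndex {
    // Maps "field:value" to the ids of the documents that contain that term.
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Splits the text into lowercase tokens and records each one under the given field. */
    void add(int docId, String field, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(field + ":" + token, k -> new TreeSet<Integer>())
                        .add(docId);
            }
        }
    }

    /** Returns the ids of documents whose field contains exactly this (lowercase) value. */
    SortedSet<Integer> termQuery(String field, String value) {
        SortedSet<Integer> hit = postings.get(field + ":" + value);
        if (hit == null) {
            return new TreeSet<Integer>();
        }
        return hit;
    }

    public static void main(String[] args) {
        ToyTermIndex index = new ToyTermIndex();
        index.add(1, "contents", "Lucene with MySQL");
        index.add(2, "contents", "MySQL full text");
        System.out.println(index.termQuery("contents", "lucene")); // [1]
        System.out.println(index.termQuery("contents", "mysql"));  // [1, 2]
    }
}
```
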
  • 30. Indexing
      - Multiple-type indexing
          - Scalable
          - High performance
              - “over 20MB/minute on Pentium M 1.5GHz”
              - Incremental indexing and batch indexing have the same cost
          - Index size
              - Roughly 20-30% of the size of the text indexed
              - Compare to MySQL’s FULLTEXT index size
          - Cost-effective
              - 1 MB heap (little RAM needed)
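
The two figures above lend themselves to a quick back-of-the-envelope estimate. The sketch below applies them to a hypothetical 1000 MB corpus; the corpus size and the class name `IndexEstimate` are assumptions, and because the slide says "over 20MB/minute" on that particular hardware, the time should be read as a rough upper bound rather than a prediction.

```java
/** Back-of-the-envelope estimates from the slide's figures; the inputs are rough by nature. */
class IndexEstimate {

    /** Minutes needed to index textMb megabytes at mbPerMinute. */
    static double minutesToIndex(double textMb, double mbPerMinute) {
        return textMb / mbPerMinute;
    }

    /** Low and high index-size estimates using the 20-30% rule of thumb. */
    static double[] indexSizeRangeMb(double textMb) {
        return new double[] {0.20 * textMb, 0.30 * textMb};
    }

    public static void main(String[] args) {
        double corpusMb = 1000; // hypothetical corpus size
        double[] size = indexSizeRangeMb(corpusMb);
        System.out.printf("about %.0f minutes at 20 MB/minute%n", minutesToIndex(corpusMb, 20));
        System.out.printf("index size roughly %.0f to %.0f MB%n", size[0], size[1]);
    }
}
```
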
  • 31. Powerful Searching & Sorting
      - Ranked searching
      - Multiple powerful query types
          - Phrase queries, wildcard queries, proximity queries, range queries, and more
      - Fielded searching (e.g., title, author, contents)
      - Date-range searching
      - Multiple-index searching with merged results
      - Sort by any field
  • 32. How to Integrate Your Application With Lucene
      - Install the JDK (5 or 6)
      - Test the Lucene demo
  • 33. Prerequisites: JDK
      - Installing the JDK
          - To download, visit the JDK 5 page: http://java.sun.com/javase/downloads/index_jdk5.jsp
          - or the JDK 6 download page: http://java.sun.com/javase/downloads/index.jsp
          - Once downloaded:
              - Change permissions
                  [root@srv31 jdk-install]# chmod 755 jdk-1_5_0_09-linux-i586.bin
              - Install
                  [root@srv31 jdk-install]# ./jdk-1_5_0_09-linux-i586.bin
  • 34. Testing Lucene Demo
      - Step 2: Testing the Lucene demo
          - Set up your environment
              - vi /root/.bashrc

                export PATH=/var/www/html/java/jdk1.5.0_09/bin:$PATH
                export CLASSPATH=.:/var/www/html/java/jdk1.5.0_09:/var/www/html/java/jdk1.5.0_09/lib:/var/www/html/java/jdk1.5.0_09/lib/lucene-2.1.0/lucene-core-2.1.0.jar:/var/www/html/java/jdk1.5.0_09/lib/lucene-2.1.0/lucene-demos-2.1.0.jar:/var/www/html/java/jdk1.5.0_09/lib/xmlrpc-3.0a1.jar

          - Now download the following and place them in /var/www/html/java/jdk1.5.0_09/lib/lucene-2.1.0/
              - Lucene Java: http://www.apache.org/dyn/closer.cgi/lucene/java/
              - XML-RPC library

                [root@srv31 lib]# wget http://mirror.candidhosting.com/pub/apache/lucene/java/lucene-2.1.0.zip
                [root@srv31 lib]# unzip lucene-2.1.0.zip
                [root@srv31 lib]# cp -p lucene-2.1.0/lucene-core-2.1.0.jar ../lib/
                [root@srv31 lib]# cp -p lucene-2.1.0/lucene-demos-2.1.0.jar ../lib/
                [root@srv31 lib]# cp -p /var/www/html/java/jdk1.5.0_06/lib/xmlrpc-3.0a1.jar /var/www/html/java/jdk1.5.0_09/lib/xmlrpc-3.0a1.jar

          - Now "dot" (source) the .bashrc file edited above:

                [root@srv31 lib]# . /root/.bashrc

  • 35. Testing Lucene Demo 2
      - Believe it or not, we are now ready to test the Lucene demo.
      - Indexing
          - I just let it loose on a randomly picked directory to give you an idea:

            [root@srv31 lib]# java org.apache.lucene.demo.IndexFiles /var/www/html/java/jdk1.5.0_09/
            adding /var/www/html/java/jdk1.5.0_09/include/jni.h
            adding /var/www/html/java/jdk1.5.0_09/include/linux/jawt_md.h
            adding /var/www/html/java/jdk1.5.0_09/include/linux/jni_md.h
            adding /var/www/html/java/jdk1.5.0_09/include/jvmti.h
            adding /var/www/html/java/jdk1.5.0_09/include/jvmdi.h
            Optimizing...
            157013 total milliseconds

  • 36. Testing Lucene Demo 3

        [root@srv31 lib]# java org.apache.lucene.demo.SearchFiles
        Query: java
        Searching for: java
        1159 total matching documents
        1. /var/www/html/java/jdk1.5.0_09/jre/lib/locale/zh.GBK/LC_MESSAGES/sunw_java_plugin.mo
        2. /var/www/html/java/jdk1.5.0_09/jre/lib/locale/zh/LC_MESSAGES/sunw_java_plugin.mo
        3. /var/www/html/java/jdk1.5.0_09/jre/lib/locale/ko/LC_MESSAGES/sunw_java_plugin.mo
        4. /var/www/html/java/jdk1.5.0_09/jre/lib/locale/zh_HK.BIG5HK/LC_MESSAGES/sunw_java_plugin.mo
        5. /var/www/html/java/jdk1.5.0_09/jre/lib/locale/zh_TW.BIG5/LC_MESSAGES/sunw_java_plugin.mo
        6. /var/www/html/java/jdk1.5.0_09/jre/lib/locale/zh_TW/LC_MESSAGES/sunw_java_plugin.mo
        7. /var/www/html/java/jdk1.5.0_09/demo/jfc/Stylepad/README.txt
        8. /var/www/html/java/jdk1.5.0_09/demo/jfc/Notepad/README.txt
        9. /var/www/html/java/jdk1.5.0_09/demo/plugin/jfc/Stylepad/README.txt
        10. /var/www/html/java/jdk1.5.0_09/demo/plugin/jfc/Notepad/README.txt
        more (y/n) ?

  • 37. Loading data from MySQL

        …
        String url = "jdbc:mysql://127.0.0.1/odp";
        Connection con = DriverManager.getConnection(url, "user", "pass");
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT * FROM articles");

  • 38. Loading data from MySQL 2

        while (rs.next()) {
            // System.out.print("\"" + rs.getString(1) + "\"");
            try {
                // create one Lucene Document per row
                final Document doc = new Document();
                // searchable, stored fields
                doc.add(Field.Text("title", rs.getString("title")));
                doc.add(Field.Text("type", "article"));
                doc.add(Field.Text("author", rs.getString("author")));
                doc.add(Field.Text("body", rs.getString("body")));
                doc.add(Field.Text("extended", rs.getString("extended")));
                …

  • 39. Loading data from MySQL 3

                …
                doc.add(Field.Text("tags", rs.getString("tags")));
                // stored for display only, not searchable
                doc.add(Field.UnIndexed("permalink", rs.getString("permalink")));
                doc.add(Field.UnIndexed("id", rs.getString("id")));
                doc.add(Field.UnIndexed("member_id", rs.getString("member_id")));
                doc.add(Field.UnIndexed("portal_id", rs.getString("portal_id")));
                // doc.add(Field.Text("id", rs.getString("id")));
                writer.addDocument(doc);
            }
            catch (IOException e) { System.err.println("Unable to index article"); }
        }
        // close connection

  • 40. Searching Data using XML RPC

        public static void searchArticles(final String search, final int numberOfResults)
                throws Exception
        {
            final Query query;
            Analyzer analyzer = new StandardAnalyzer();
            // parse the user's search string against the "title" field by default
            query = QueryParser.parse(search, "title", analyzer);
            final ArrayList ids = new ArrayList();
            try {
                final IndexReader reader = IndexReader.open(INDEX_DIR);
                final IndexSearcher searcher = new IndexSearcher(reader);
                final Hits hits = searcher.search(query);
                for (int i = 0; i != hits.length() && i != numberOfResults; ++i) {
                    final Document doc = hits.doc(i);
                    // id field needs to be added
                    // ids.add(new Integer(doc.getField("id").stringValue()));
                    …

  • 41. Searching Data using XML RPC 2

                    …
                    ids.add(new Integer(doc.getField("id").stringValue()));
                    System.out.println("Found + " + doc.getField("id").stringValue());
                    System.out.println("--Title = " + doc.getField("title").stringValue());
                    System.out.println("--Type = " + doc.getField("type").stringValue());
                    System.out.println("--Body = " + doc.getField("body").stringValue());
                    System.out.println("--Author = " + doc.getField("author").stringValue());
                    System.out.println("--Extended = " + doc.getField("extended").stringValue());
                    System.out.println("--Tags = " + doc.getField("tags").stringValue());
                    System.out.println("--Permalink = " + doc.getField("permalink").stringValue());
                    System.out.println("--Member Id = " + doc.getField("member_id").stringValue());
                    System.out.println("--Portal Id = " + doc.getField("portal_id").stringValue());

  • 42. Searching Data using XML RPC 3

                }
                searcher.close();
                reader.close();
            }
            catch (IOException e) {
                System.out.println("Error while reading article data from index");
            }
        }

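
The search method collects matching `id` values in rank order, but the slides do not show what happens to them next. One plausible follow-up, offered here purely as an assumption, is to fetch the full rows back from MySQL while keeping Lucene's ranking. The sketch builds such a parameterized statement using MySQL's `FIELD()` function for ordering. `RankedLookupSql` is a hypothetical helper, and the `articles` table and `id` column are taken from the loading example above.

```java
import java.util.Collections;

/** Builds a parameterized MySQL query that returns rows in the order of the given ids. */
class RankedLookupSql {

    /**
     * The table name must be a trusted constant, never user input.
     * Bind the ids twice, in rank order: once for IN (...) and once for FIELD(...).
     */
    static String build(String table, int idCount) {
        if (idCount < 1) {
            throw new IllegalArgumentException("need at least one id");
        }
        String marks = String.join(", ", Collections.nCopies(idCount, "?"));
        return "SELECT * FROM " + table
                + " WHERE id IN (" + marks + ")"
                + " ORDER BY FIELD(id, " + marks + ")";
    }

    public static void main(String[] args) {
        // prints: SELECT * FROM articles WHERE id IN (?, ?, ?) ORDER BY FIELD(id, ?, ?, ?)
        System.out.println(build("articles", 3));
    }
}
```

The resulting string would be passed to `Connection.prepareStatement`, with each id set through `setInt` in both placeholder groups.
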
  • 43. Future of Lucene
      - Advanced linguistics modules that integrate with Lucene
          - Support for complex-script languages
          - Basis Technology’s Rosette® Linguistics Platform
              - The same linguistic software that powers multilingual web search on Google, Live.com, Yahoo! and leading enterprise search engines
              - “allows Lucene-based applications to index and search text in multiple languages concurrently, including complex script languages such as Arabic, Chinese, Farsi, Japanese and Korean.”
              - www.basistech.com/lucene
  • 44. What are the ports of Lucene?
      - Lucene4c - C
      - CLucene - C++
      - MUTIS - Delphi
      - Lucene.Net - a straight C#/.NET port of Lucene by the Apache Software Foundation, fully compatible with it
      - Plucene - Perl
      - KinoSearch - Perl
      - PyLucene - Lucene interfaced with a Python front end
      - Ferret and RubyLucene - Ruby
      - Zend Framework (Search) - PHP
      - Montezuma - Common Lisp
  • 45. Where to get help about Lucene?
      - http://lucene.apache.org/java/docs/mailinglists.html
      - IRC
  • 46. Books about Lucene
      - Lucene in Action - Erik Hatcher and Otis Gospodnetic