3. Problem Definition
You have a farm of data, and you want it to be searchable.
Analogy: searching for a needle in a haystack while more hay keeps
being added to the stack!
- Drawbacks of SQL databases (at > 500,000,000 records …)
- Scalability
- Decentralization
4. A Basic Search Engine Pipeline
• Crawling: Grabbing the data
• Parsing [Optional]: Understanding the data
• Indexing: Building the holding structure
• Ranking: Sorting the data
• Searching: Reading that holding structure
Behind The Scenes: Analysis, Tokenization, Query Parsing, Boosting,
Calculating Term Vectors, Token Filtration,
Index Inversion, etc…
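The tokenization, index-inversion, and searching steps above can be sketched as a toy inverted index in plain Java (no Lucene involved; the crude tokenizer and the data are invented for illustration):

```java
import java.util.*;

// A toy inverted index: term -> ids of documents containing that term.
class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Tokenization: lowercase and split on non-letters (a crude analyzer).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Index inversion: map each term back to the documents containing it.
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Searching: just read the holding structure.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }
}
```

Real analyzers add token filtration (stop words, stemming) and record positions and frequencies for ranking, but the shape of the structure is the same.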
5. What is Lucene?
• Doug Cutting (Lucene 1999, Nutch 2003, Hadoop 2006)
• Free, Java information retrieval library
• Handles the indexing and searching parts of an application
• High performance, backed by a decade of research
• Heavily supported, easily customized
• No dependencies
6. What Lucene Ain’t
• A complete search engine
• An application
• A crawler
• A document filter/recognizer
7. Lucene Roles
[Diagram: the application gathers rich documents, parses them, and builds a Lucene Document from each; Lucene indexes those Documents. A search UI / search app (e.g. a webapp) then searches the index.]
8. Lucene Strength Points
• Simple API
• Speed
• Concurrency
• Smart indexing (Incremental)
• Near Real Time Search
• Vector Space Search
• Heavily Used, Supported
9. Lucene Query Types
• Single Term vs. Multi-Term “+name:camel +type:animal”
• Wildcard Queries “text:wonder*”
• Fuzzy Queries “room~0.8”
• Range Queries “date:[25/5/2000 TO *]”
• Grouped Queries “text:(animal AND small)”
• Proximity Queries “hamlet macbeth”~10
• Boosted Queries “hamlet^5.0 AND macbeth”
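As a sketch of what happens behind a wildcard query such as “text:wonder*”: the engine rewrites the pattern against its term dictionary and then searches the expanded terms. A minimal pure-Java illustration of that expansion step (the term dictionary and class name are invented; this is not Lucene's implementation):

```java
import java.util.*;
import java.util.regex.Pattern;

class WildcardExpander {
    // Rewrite a wildcard pattern ('*' = any run of chars, '?' = one char)
    // into a regex, then collect the matching terms from a term dictionary.
    static List<String> expand(String pattern, Collection<String> termDict) {
        String regex = ("\\Q" + pattern + "\\E")
                .replace("*", "\\E.*\\Q")   // '*' -> match any run
                .replace("?", "\\E.\\Q");   // '?' -> match one char
        Pattern p = Pattern.compile(regex);
        List<String> matches = new ArrayList<>();
        for (String term : termDict) {
            if (p.matcher(term).matches()) matches.add(term);
        }
        return matches;
    }
}
```

This is also why leading-wildcard queries are expensive: a pattern like “*onder” cannot use the sorted term dictionary as a prefix shortcut and must scan it.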
10. API Sample I (Indexing)
private IndexWriter writer;

public Indexer(String indexDir) throws IOException {
    // Open (or create) the index directory on disk.
    Directory dir = FSDirectory.open(new File(indexDir));
    // true: create a new index, overwriting any existing one.
    writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_CURRENT),
        true, IndexWriter.MaxFieldLength.UNLIMITED);
}

public void close() throws IOException {
    writer.close();
}

public void index(String dataDir, FileFilter filter) throws Exception {
    // Pass the filter to listFiles(); otherwise it is silently ignored.
    File[] files = new File(dataDir).listFiles(filter);
    for (File f : files) {
        Document doc = new Document();
        // Contents: tokenized from a Reader, not stored in the index.
        doc.add(new Field("contents", new FileReader(f)));
        // Filename: stored verbatim so it can be shown in search results.
        doc.add(new Field("filename", f.getName(),
            Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);
    }
}
14. API Sample II (Searching)
public void search(String indexDir, String q) throws IOException, ParseException {
    Directory dir = FSDirectory.open(new File(indexDir));
    // true: open the index read-only.
    IndexSearcher is = new IndexSearcher(dir, true);
    // Parse the query string against the "contents" field.
    QueryParser parser = new QueryParser("contents",
        new StandardAnalyzer(Version.LUCENE_CURRENT));
    Query query = parser.parse(q);
    // Fetch the top 10 hits.
    TopDocs hits = is.search(query, 10);
    System.err.println("Found " + hits.totalHits + " document(s)");
    for (ScoreDoc scoreDoc : hits.scoreDocs) {
        Document doc = is.doc(scoreDoc.doc);
        System.out.println(doc.get("filename"));
    }
    is.close();
}
15. Index Update
• Lucene has no in-place update mechanism. So?
• Incremental Indexing (Index Merging)
• Delete + Add = Update
• Index Optimization
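The “Delete + Add = Update” contract can be sketched on a toy index in plain Java (class and field names invented; real Lucene works on immutable segments, but the contract is the same: no in-place edit, only delete followed by add):

```java
import java.util.*;

class UpdatableIndex {
    // Key term (e.g. a filename field) -> stored contents.
    private final Map<String, String> docs = new HashMap<>();
    // Inverted postings: term -> keys of documents containing it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    void add(String key, String contents) {
        docs.put(key, contents);
        for (String t : contents.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(t, k -> new TreeSet<>()).add(key);
        }
    }

    void delete(String key) {
        docs.remove(key);
        // Remove the document from every posting list it appears in.
        for (Set<String> keys : postings.values()) keys.remove(key);
        postings.values().removeIf(Set::isEmpty);
    }

    // Lucene-style update: there is no in-place edit, only delete + add.
    void update(String key, String newContents) {
        delete(key);
        add(key, newContents);
    }

    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }
}
```

Lucene exposes this pattern directly: IndexWriter.updateDocument(Term, Document) atomically deletes the documents matching the term and adds the new one.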
16. API Sample III (Deleting)
Via IndexReader:
• void deleteDocument(int docNum): deletes the document numbered docNum
• int deleteDocuments(Term term): deletes all documents that have the given term indexed
Via IndexWriter:
• void deleteAll(): deletes all documents in the index
• void deleteDocuments(Query query): deletes the document(s) matching the provided query
• void deleteDocuments(Query[] queries): deletes the document(s) matching any of the provided queries
• void deleteDocuments(Term term): deletes the document(s) containing term
• void deleteDocuments(Term[] terms): deletes the document(s) containing any of the terms
17. Some Statistics
• Measured with Lucene.NET (a .NET port of Lucene)

Local testing (index and search on the same device):

Dataset   Indexing (min), index size   Retrieval (ms)   Opt. (min)
4.3 GB    ~32, 180 MB                  ~50 -> 300       0.2
40 GB     ~360, 2.6 GB                 ~100 -> 3000     3.2

Over-network testing (file server holds the index; standalone searching workstations):

Dataset   Indexing (min), index size   Retrieval (ms)   Opt. (min)
4.3 GB    X, 180 MB                    ~300 -> 700      X
40 GB     X, 2.6 GB                    ~400 -> 4500     X
18. Lucene Wrappers (Apache Solr)
• A Java wrapper over Lucene
• A web application that can be deployed on any
servlet container (Apache Tomcat, Jetty)
• A REST service
• It has an administration interface
• Built-in integration with Apache Tika (a toolkit of document parsers)
• Scalable
• Integration with Apache Hadoop, Apache Cassandra
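Because Solr is a REST service, a search is just an HTTP GET against its /select handler. A minimal pure-Java sketch that builds such a request URL (host, port, and core path are placeholders; wt=json asks Solr for a JSON response):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SolrUrlBuilder {
    // Build a GET URL for Solr's /select handler,
    // e.g. q = "*:*", sort = "firstName asc".
    static String selectUrl(String base, String q, String sort) {
        try {
            return base + "/select?q=" + URLEncoder.encode(q, "UTF-8")
                    + "&sort=" + URLEncoder.encode(sort, "UTF-8")
                    + "&wt=json";
        } catch (UnsupportedEncodingException e) {
            throw new AssertionError("UTF-8 is always available", e);
        }
    }
}
```

Client libraries such as SolrJ wrap exactly this kind of HTTP traffic, which is also the source of the overhead measured in the statistics below.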
21. Communication with Solr (Sending Docs)
• Direct connection OR through client APIs (SolrJ, SolrNet)
// make a connection to Solr server
SolrServer server = new HttpSolrServer("http://localhost:8080/solr/");
// prepare a doc
final SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", 1);
doc1.addField("firstName", "First Name");
doc1.addField("lastName", "Last Name");
final Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
docs.add(doc1);
// add docs to Solr
server.add(docs);
server.commit();
22. Communication with Solr (Searching)
final SolrQuery query = new SolrQuery();
query.setQuery("*:*");
query.addSortField("firstName", SolrQuery.ORDER.asc);
final QueryResponse rsp = server.query(query);
final SolrDocumentList solrDocumentList = rsp.getResults();
for (final SolrDocument doc : solrDocumentList) {
final String firstName = (String) doc.getFieldValue("firstName");
final String id = (String) doc.getFieldValue("id");
}
23. Some Statistics
Note 1: Documents are sent to the Solr server as HTTP POST requests, which adds
noticeable overhead compared with the pure Lucene.NET model.
Note 2: On a server receiving requests from everywhere, OS-level request queuing
can add delay, depending on the queuing strategy.

Dataset   Indexing (min), index size   Retrieval (ms)   Opt. (min)
4.3 GB    ~39.5, 169 MB                ~300 -> 3000     0.203
40 GB     ~400 (not accurate), 40 GB   ~300 -> 10000    ~7 (not accurate)