3. Problem Definition
You have a farm of data, and you want it to be searchable.
Analogy: searching for a needle in a haystack while more hay keeps
being added to the stack!
- Drawbacks of SQL databases (at > 500,000,000 records …)
- Scalability
- Decentralization
4. A Basic Search Engine Pipeline
• Crawling: Grabbing the data
• Parsing [Optional]: Understanding the data
• Indexing: Building the holding structure
• Ranking: Sorting the data
• Searching: Reading that holding structure
Behind The Scenes: Analysis, Tokenization, Query Parsing, Boosting,
Calculating Term Vectors, Token Filtration,
Index Inversion, etc…
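The tokenization, index-inversion, and searching steps above can be sketched as a toy inverted index in plain Java (no Lucene involved; the crude tokenizer and the data are invented for illustration):

```java
import java.util.*;

// A toy inverted index: term -> ids of documents containing that term.
class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Tokenization: lowercase and split on non-letters (a crude analyzer).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Index inversion: map each term back to the documents containing it.
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Searching: just read the holding structure.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }
}
```

Real analyzers add token filtration (stop words, stemming) and record positions and frequencies for ranking, but the shape of the structure is the same.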
5. What is Lucene?
• Doug Cutting (Lucene 1999, Nutch 2003, Hadoop 2006)
• Free, Java information retrieval library
• Handles the indexing and searching parts of an application
• High performance, backed by a decade of research
• Heavily supported, easily customized
• No dependencies
6. What Lucene Ain’t
• A complete search engine
• An application
• A crawler
• A document filter/recognizer
7. Lucene Roles
[Diagram: the application gathers rich documents, parses them, and builds a Lucene Document from each; Lucene indexes those Documents. A search UI / search app (e.g. a webapp) then searches the index.]
8. Lucene Strength Points
• Simple API
• Speed
• Concurrency
• Smart indexing (Incremental)
• Near Real Time Search
• Vector Space Search
• Heavily Used, Supported
9. Lucene Query Types
• Single Term vs. Multi-Term “+name:camel +type:animal”
• Wildcard Queries “text:wonder*”
• Fuzzy Queries “room~0.8”
• Range Queries “date:[25/5/2000 TO *]”
• Grouped Queries “text:(animal AND small)”
• Proximity Queries “hamlet macbeth”~10
• Boosted Queries “hamlet^5.0 AND macbeth”
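As a sketch of what happens behind a wildcard query such as “text:wonder*”: the engine rewrites the pattern against its term dictionary and then searches the expanded terms. A minimal pure-Java illustration of that expansion step (the term dictionary and class name are invented; this is not Lucene's implementation):

```java
import java.util.*;
import java.util.regex.Pattern;

class WildcardExpander {
    // Rewrite a wildcard pattern ('*' = any run of chars, '?' = one char)
    // into a regex, then collect the matching terms from a term dictionary.
    static List<String> expand(String pattern, Collection<String> termDict) {
        String regex = ("\\Q" + pattern + "\\E")
                .replace("*", "\\E.*\\Q")   // '*' -> match any run
                .replace("?", "\\E.\\Q");   // '?' -> match one char
        Pattern p = Pattern.compile(regex);
        List<String> matches = new ArrayList<>();
        for (String term : termDict) {
            if (p.matcher(term).matches()) matches.add(term);
        }
        return matches;
    }
}
```

This is also why leading-wildcard queries are expensive: a pattern like “*onder” cannot use the sorted term dictionary as a prefix shortcut and must scan it.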
10. API Sample I (Indexing)
private IndexWriter writer;

public Indexer(String indexDir) throws IOException {
    // Open (or create) the index directory on disk.
    Directory dir = FSDirectory.open(new File(indexDir));
    // true: create a new index, overwriting any existing one.
    writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_CURRENT),
        true, IndexWriter.MaxFieldLength.UNLIMITED);
}

public void close() throws IOException {
    writer.close();
}

public void index(String dataDir, FileFilter filter) throws Exception {
    // Pass the filter to listFiles(); otherwise it is silently ignored.
    File[] files = new File(dataDir).listFiles(filter);
    for (File f : files) {
        Document doc = new Document();
        // Contents: tokenized from a Reader, not stored in the index.
        doc.add(new Field("contents", new FileReader(f)));
        // Filename: stored verbatim so it can be shown in search results.
        doc.add(new Field("filename", f.getName(),
            Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);
    }
}
14. API Sample II (Searching)
public void search(String indexDir, String q) throws IOException, ParseException {
    Directory dir = FSDirectory.open(new File(indexDir));
    // true: open the index read-only.
    IndexSearcher is = new IndexSearcher(dir, true);
    // Parse the query string against the "contents" field.
    QueryParser parser = new QueryParser("contents",
        new StandardAnalyzer(Version.LUCENE_CURRENT));
    Query query = parser.parse(q);
    // Fetch the top 10 hits.
    TopDocs hits = is.search(query, 10);
    System.err.println("Found " + hits.totalHits + " document(s)");
    for (ScoreDoc scoreDoc : hits.scoreDocs) {
        Document doc = is.doc(scoreDoc.doc);
        System.out.println(doc.get("filename"));
    }
    is.close();
}
15. Index Update
• Lucene has no in-place update mechanism. So?
• Incremental Indexing (Index Merging)
• Delete + Add = Update
• Index Optimization
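The “Delete + Add = Update” contract can be sketched on a toy index in plain Java (class and field names invented; real Lucene works on immutable segments, but the contract is the same: no in-place edit, only delete followed by add):

```java
import java.util.*;

class UpdatableIndex {
    // Key term (e.g. a filename field) -> stored contents.
    private final Map<String, String> docs = new HashMap<>();
    // Inverted postings: term -> keys of documents containing it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    void add(String key, String contents) {
        docs.put(key, contents);
        for (String t : contents.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(t, k -> new TreeSet<>()).add(key);
        }
    }

    void delete(String key) {
        docs.remove(key);
        // Remove the document from every posting list it appears in.
        for (Set<String> keys : postings.values()) keys.remove(key);
        postings.values().removeIf(Set::isEmpty);
    }

    // Lucene-style update: there is no in-place edit, only delete + add.
    void update(String key, String newContents) {
        delete(key);
        add(key, newContents);
    }

    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }
}
```

Lucene exposes this pattern directly: IndexWriter.updateDocument(Term, Document) atomically deletes the documents matching the term and adds the new one.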
16. API Sample III (Deleting)
Via IndexReader:
• void deleteDocument(int docNum): deletes the document numbered docNum
• int deleteDocuments(Term term): deletes all documents that have the given term indexed
Via IndexWriter:
• void deleteAll(): deletes all documents in the index
• void deleteDocuments(Query query): deletes the document(s) matching the provided query
• void deleteDocuments(Query[] queries): deletes the document(s) matching any of the provided queries
• void deleteDocuments(Term term): deletes the document(s) containing term
• void deleteDocuments(Term[] terms): deletes the document(s) containing any of the terms
17. Some Statistics
• Measured with Lucene.NET (a .NET port of Lucene)

Local testing (index and search on the same device):

Dataset   Indexing (min), index size   Retrieval (ms)   Opt. (min)
4.3 GB    ~32, 180 MB                  ~50 -> 300       0.2
40 GB     ~360, 2.6 GB                 ~100 -> 3000     3.2

Over-network testing (file server holds the index; standalone searching workstations):

Dataset   Indexing (min), index size   Retrieval (ms)   Opt. (min)
4.3 GB    X, 180 MB                    ~300 -> 700      X
40 GB     X, 2.6 GB                    ~400 -> 4500     X
18. Lucene Wrappers (Apache Solr)
• A Java wrapper over Lucene
• A web application that can be deployed on any
servlet container (Apache Tomcat, Jetty)
• A REST service
• It has an administration interface
• Built-in integration with Apache Tika (a toolkit of document parsers)
• Scalable
• Integration with Apache Hadoop, Apache Cassandra
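Because Solr is a REST service, a search is just an HTTP GET against its /select handler. A minimal pure-Java sketch that builds such a request URL (host, port, and core path are placeholders; wt=json asks Solr for a JSON response):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SolrUrlBuilder {
    // Build a GET URL for Solr's /select handler,
    // e.g. q = "*:*", sort = "firstName asc".
    static String selectUrl(String base, String q, String sort) {
        try {
            return base + "/select?q=" + URLEncoder.encode(q, "UTF-8")
                    + "&sort=" + URLEncoder.encode(sort, "UTF-8")
                    + "&wt=json";
        } catch (UnsupportedEncodingException e) {
            throw new AssertionError("UTF-8 is always available", e);
        }
    }
}
```

Client libraries such as SolrJ wrap exactly this kind of HTTP traffic, which is also the source of the overhead measured in the statistics below.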
21. Communication with Solr (Sending Docs)
• Direct connection OR through client APIs (SolrJ, SolrNet)
// make a connection to Solr server
SolrServer server = new HttpSolrServer("http://localhost:8080/solr/");
// prepare a doc
final SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", 1);
doc1.addField("firstName", "First Name");
doc1.addField("lastName", "Last Name");
final Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
docs.add(doc1);
// add docs to Solr
server.add(docs);
server.commit();
22. Communication with Solr (Searching)
final SolrQuery query = new SolrQuery();
query.setQuery("*:*");
query.addSortField("firstName", SolrQuery.ORDER.asc);
final QueryResponse rsp = server.query(query);
final SolrDocumentList solrDocumentList = rsp.getResults();
for (final SolrDocument doc : solrDocumentList) {
final String firstName = (String) doc.getFieldValue("firstName");
final String id = (String) doc.getFieldValue("id");
}
23. Some Statistics
Note 1: Documents are sent to the Solr server as HTTP POST requests, which adds
noticeable overhead compared with the pure Lucene.NET model.
Note 2: On a server receiving requests from everywhere, OS-level request queuing
can add delay, depending on the queuing strategy.

Dataset   Indexing (min), index size   Retrieval (ms)   Opt. (min)
4.3 GB    ~39.5, 169 MB                ~300 -> 3000     0.203
40 GB     ~400 (not accurate), 40 GB   ~300 -> 10000    ~7 (not accurate)