Building a Search Engine
Using Apache Lucene/Solr
Road Map
• Problem Definition
• A Basic Search Engine Pipeline
• Meet Lucene
• Lucene API Examples
• Lucene Wrappers (Apache Solr, ElasticSearch, Regain, etc.)
• Applied Lucene (Real Examples)
Problem Definition
You have a farm of data, and you want it to be searchable.
Analogy: Searching for a needle in a haystack while more hay keeps being added to
the stack!
- Drawbacks of SQL databases ( > 500,000,000 records …)
- Scalability
- Decentralization
A Basic Search Engine Pipeline
• Crawling: Grabbing the data
• Parsing [Optional]: Understanding the data
• Indexing: Build the holding structure
• Ranking: Sort the data
• Searching: Read that holding structure
Behind The Scenes: Analysis, Tokenization, Query Parsing, Boosting,
Calculating Term Vectors, Token Filtration,
Index Inversion, etc…
What is Lucene?
• Doug Cutting (Lucene 1999, Nutch 2003, Hadoop 2006)
• Free, Java information retrieval library
• Application related: Indexing, Searching
• High performance, A decade of research
• Heavily supported, simply customized
• No dependencies
What Lucene Ain’t
• A complete search engine
• An application
• A crawler
• A document filter/recognizer
Lucene Roles
[Figure: indexing side: Rich Document -> Gather -> Parse -> Make Doc -> Index. Search side: Search UI -> Search App (e.g. webapp) -> Search -> Index.]
Lucene Strength Points
• Simple API
• Speed
• Concurrency
• Smart indexing (Incremental)
• Near Real Time Search
• Vector Space Search
• Heavily Used, Supported
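Vector space search treats the query and every document as vectors of term weights and ranks documents by the cosine of the angle between them. The sketch below shows the idea in plain Java, assuming a weight of term frequency times a smoothed inverse document frequency. The class VectorSpaceDemo and its methods are hypothetical, and Lucene's own scoring uses more factors than this.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class VectorSpaceDemo {
    /** Counts how often each lowercased, whitespace-separated term occurs. */
    static Map<String, Integer> termFreq(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    /** Smoothed inverse document frequency: 1 + ln(N / (1 + df)). */
    static double idf(String term, List<Map<String, Integer>> docs) {
        int df = 0;
        for (Map<String, Integer> d : docs) {
            if (d.containsKey(term)) df++;
        }
        return 1.0 + Math.log((double) docs.size() / (1 + df));
    }

    /** Cosine similarity between the tf-idf vectors of a query and a document. */
    static double cosine(Map<String, Integer> q, Map<String, Integer> d,
                         List<Map<String, Integer>> docs) {
        double dot = 0, qNorm = 0, dNorm = 0;
        for (Map.Entry<String, Integer> e : q.entrySet()) {
            double w = e.getValue() * idf(e.getKey(), docs);
            qNorm += w * w;
            Integer f = d.get(e.getKey());
            if (f != null) dot += w * f * idf(e.getKey(), docs);
        }
        for (Map.Entry<String, Integer> e : d.entrySet()) {
            double w = e.getValue() * idf(e.getKey(), docs);
            dNorm += w * w;
        }
        return (qNorm == 0 || dNorm == 0) ? 0 : dot / Math.sqrt(qNorm * dNorm);
    }
}
```

A document that shares no term with the query scores 0, and a document compared with itself scores 1.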
Lucene Query Types
• Single Term vs. Multi-Term “+name:camel +type:animal”
• Wildcard Queries “text:wonder*”
• Fuzzy Queries “room~0.8”
• Range Queries “date:[25/5/2000 TO *]”
• Grouped Queries “text:(animal AND small)”
• Proximity Queries “hamlet macbeth”~10
• Boosted Queries “hamlet^5.0 AND macbeth”
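A fuzzy query such as room~0.8 matches terms whose similarity to "room" is at least 0.8. The sketch below shows one way such a similarity could be computed, assuming Levenshtein edit distance normalized by the length of the shorter term. The class FuzzyDemo is hypothetical, and this is not claimed to be Lucene's exact formula.

```java
final class FuzzyDemo {
    /** Classic dynamic-programming Levenshtein edit distance (two rolling rows). */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    /** 1.0 means identical; the value drops (and can go negative) as terms diverge. */
    static double similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) return a.length() == b.length() ? 1.0 : 0.0;
        return 1.0 - (double) editDistance(a, b) / shorter;
    }

    /** True when the candidate is similar enough to the query term. */
    static boolean matches(String term, String candidate, double minSimilarity) {
        return similarity(term, candidate) >= minSimilarity;
    }
}
```

Under this normalization "roam" is one edit away from "room" and has similarity 0.75, so it would not match at a 0.8 threshold but would at 0.7.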
API Sample I (Indexing)
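// Targets the Lucene 3.0-era API; these constructors changed in later releases.
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
public class Indexer {
    private IndexWriter writer;
    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        // true = create a new index, overwriting any existing one
        writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_CURRENT), true,
                IndexWriter.MaxFieldLength.UNLIMITED);
    }
    public void close() throws IOException {
        writer.close();
    }
    public void index(String dataDir, FileFilter filter) throws Exception {
        // listFiles(filter) applies the filter; a null filter accepts every file
        File[] files = new File(dataDir).listFiles(filter);
        if (files == null) return; // dataDir is not a readable directory
        for (File f : files) {
            Document doc = new Document();
            doc.add(new Field("contents", new FileReader(f))); // tokenized, not stored
            doc.add(new Field("filename", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
    }
}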
Indexing Pipeline (Simplified)
[Figure: a Document is added and analyzed (Tokenizer -> TokenFilter); the Document Writer then writes it to the Inverted Index.]
Analysis Basic Types
"The quick brown fox jumped over the lazy dogs"
WhitespaceAnalyzer :
[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
SimpleAnalyzer :
[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
StopAnalyzer :
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
StandardAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
"XY&Z Corporation - xyz@example.com"
WhitespaceAnalyzer:
[XY&Z] [Corporation] [-] [xyz@example.com]
SimpleAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StandardAnalyzer:
[xy&z] [corporation] [xyz@example.com]
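The differences above come down to how each analyzer splits and normalizes the text. The sketch below is a rough plain-Java imitation of the three simplest analyzers. It does not use Lucene; the class AnalyzerDemo and its one-word stop list are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class AnalyzerDemo {
    // Tiny stop list for the demo; real stop lists contain many more words.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the"));

    /** Like WhitespaceAnalyzer: split on whitespace, keep case and punctuation. */
    static List<String> whitespace(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    /** Like SimpleAnalyzer: split on anything that is not a letter, then lowercase. */
    static List<String> simple(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) out.add(t.toLowerCase());
        }
        return out;
    }

    /** Like StopAnalyzer: the simple() tokens with stop words removed. */
    static List<String> stop(String text) {
        List<String> out = simple(text);
        out.removeAll(STOP_WORDS);
        return out;
    }
}
```

StandardAnalyzer is left out because its grammar-based tokenizer, which keeps tokens like xy&z and e-mail addresses whole, is not reproducible in a few lines.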
The Inverted Index (In a nutshell)
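An inverted index maps each term to the documents that contain it, so a search looks up terms instead of scanning every document. The sketch below is a minimal in-memory version. Its class and method names are hypothetical, and a real Lucene index also stores frequencies and positions and lives in on-disk segments.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

final class InvertedIndexDemo {
    // term -> sorted ids of the documents containing it (the "postings")
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Splits on non-letters, lowercases, records term -> doc id; returns the doc id. */
    int add(String text) {
        int docId = nextDocId++;
        for (String t : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                postings.computeIfAbsent(t, k -> new TreeSet<Integer>()).add(docId);
            }
        }
        return docId;
    }

    /** Ids of the documents containing the term (empty if there are none). */
    SortedSet<Integer> lookup(String term) {
        SortedSet<Integer> ids = postings.get(term.toLowerCase());
        return ids == null
                ? Collections.<Integer>emptySortedSet()
                : Collections.unmodifiableSortedSet(ids);
    }

    /** AND query: ids of the documents containing every given term. */
    SortedSet<Integer> lookupAll(String... terms) {
        SortedSet<Integer> result = null;
        for (String t : terms) {
            if (result == null) {
                result = new TreeSet<Integer>(lookup(t));
            } else {
                result.retainAll(lookup(t));
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }
}
```

Indexing "The quick brown fox" and then "The lazy dogs" gives the term "the" the postings [0, 1] and "fox" the postings [0], so an AND query for "the" and "lazy" intersects down to document 1.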
API Sample II (Searching)
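// Targets the Lucene 3.0-era API; these constructors changed in later releases.
import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
public class Searcher {
    public void search(String indexDir, String q) throws IOException, ParseException {
        Directory dir = FSDirectory.open(new File(indexDir));
        IndexSearcher is = new IndexSearcher(dir, true); // true = open read-only
        // QueryParser in Lucene 3.0 takes the Version as its first argument
        QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
                new StandardAnalyzer(Version.LUCENE_CURRENT));
        Query query = parser.parse(q);
        TopDocs hits = is.search(query, 10); // the 10 best-scoring hits
        System.err.println("Found " + hits.totalHits + " document(s)");
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            Document doc = is.doc(scoreDoc.doc);
            System.out.println(doc.get("filename"));
        }
        is.close();
    }
}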
Index Update
• Lucene can’t modify an indexed document in place. So?
• Incremental Indexing (Index Merging)
• Delete + Add = Update (IndexWriter.updateDocument(Term, Document) does both in one call)
• Index Optimization
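Because an indexed document can't be edited in place, an update is expressed as deleting the old version by a unique key and adding the new version under a fresh internal number. The sketch below shows that pattern with a toy in-memory store. The class UpdateDemo and its field names are hypothetical, and no Lucene code is involved.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

final class UpdateDemo {
    // internal document number -> stored fields
    private final Map<Integer, Map<String, String>> docs = new LinkedHashMap<>();
    private int nextDocNum = 0;

    /** Stores a copy of the fields and returns the new document number. */
    int add(Map<String, String> fields) {
        docs.put(nextDocNum, new LinkedHashMap<>(fields));
        return nextDocNum++;
    }

    /** Deletes every document whose field equals the value; returns the count removed. */
    int delete(String field, String value) {
        int removed = 0;
        for (Iterator<Map<String, String>> it = docs.values().iterator(); it.hasNext();) {
            if (value.equals(it.next().get(field))) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    /** Update = delete by key, then add; the new version gets a new document number. */
    int update(String keyField, String keyValue, Map<String, String> newFields) {
        delete(keyField, keyValue);
        return add(newFields);
    }

    int size() {
        return docs.size();
    }

    Map<String, String> get(int docNum) {
        return docs.get(docNum);
    }
}
```

The document number changes on every update, so a stable key field such as an id, rather than the internal number, has to identify the document.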
API Sample III (Deleting)
Via IndexReader
void deleteDocument(int docNum)
Deletes the document numbered docNum
int deleteDocuments(Term term)
Deletes all documents that have a given term indexed.
Via IndexWriter
void deleteAll()
Delete all documents in the index.
void deleteDocuments(Query query)
Deletes the document(s) matching the provided query.
void deleteDocuments(Query[] queries)
Deletes the document(s) matching any of the provided queries.
void deleteDocuments(Term term)
Deletes the document(s) containing term.
void deleteDocuments(Term[] terms)
Deletes the document(s) containing any of the terms.
Some Statistics
• Based on Lucene.NET (a .NET port of Lucene)
Local Testing (index and search on the same device)
Dataset Size | Indexing (Min.) | Retrieval (Ms.) | Opt. (Min.)
4.3 GB | ~32, 180 MB | ~50 -> 300 | 0.2
40 GB | ~360, 2.6 GB | ~100 -> 3000 | 3.2
Over Network Testing (file server for the index files, standalone searching workstations)
Dataset Size | Indexing (Min.) | Retrieval (Ms.) | Opt. (Min.)
4.3 GB | X, 180 MB | ~300 -> 700 | X
40 GB | X, 2.6 GB | ~400 -> 4500 | X
Lucene Wrappers (Apache Solr)
• A Java wrapper over Lucene
• A web application that can be deployed on any
servlet container (Apache Tomcat, Jetty)
• A REST service
• It has an administration interface
• Built-in integration with Apache Tika (a toolkit of document parsers)
• Scalable
• Integration with Apache Hadoop, Apache Cassandra
Solr Administration Interface
Solr Architecture (The Big Picture)
Note: It includes JSON, PHP, Python, and more, not only XML.
Communication with Solr (Sending Docs)
• Direct Connection OR Through APIs (SolrJ, SolrNET)
// make a connection to Solr server
SolrServer server = new HttpSolrServer("http://localhost:8080/solr/");
// prepare a doc
final SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", 1);
doc1.addField("firstName", "First Name");
doc1.addField("lastName", "Last Name");
final Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
docs.add(doc1);
// add docs to Solr
server.add(docs);
server.commit();
Communication with Solr (Searching)
final SolrQuery query = new SolrQuery();
query.setQuery("*:*");
query.addSortField("firstName", SolrQuery.ORDER.asc);
final QueryResponse rsp = server.query(query);
final SolrDocumentList solrDocumentList = rsp.getResults();
for (final SolrDocument doc : solrDocumentList) {
final String firstName = (String) doc.getFieldValue("firstName");
final String id = (String) doc.getFieldValue("id");
}
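For the direct-connection route, the same search can be written as an HTTP GET against Solr's select handler. The sketch below only builds the URL, assuming the base URL from the slide and the standard q, sort, and wt request parameters. The class SolrUrlDemo is hypothetical, and the exact handler path depends on the Solr version and core setup.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

final class SolrUrlDemo {
    /** URL-encodes one parameter name or value as UTF-8. */
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Builds baseUrl + "select?..." from the parameters, in iteration order. */
    static String selectUrl(String baseUrl, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(baseUrl);
        if (!baseUrl.endsWith("/")) sb.append('/');
        sb.append("select");
        char sep = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(sep).append(enc(e.getKey())).append('=').append(enc(e.getValue()));
            sep = '&';
        }
        return sb.toString();
    }
}
```

With a LinkedHashMap holding q=*:*, sort=firstName asc, and wt=json, selectUrl("http://localhost:8080/solr/", params) returns http://localhost:8080/solr/select?q=*%3A*&sort=firstName+asc&wt=json, which mirrors the SolrJ query above and asks for a JSON response.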
Some Statistics
Note 1: We’re sending HTTP POST requests to the Solr server, which can take much longer than
the pure Lucene.NET model.
Note 2: Consider a server receiving requests from everywhere; OS-level queuing issues
can cause some delay, depending on the queuing strategy.
Dataset Size | Indexing (Min.) | Retrieval (Ms.) | Opt. (Min.)
4.3 GB | ~39.5, 169 MB | ~300 -> 3000 | 0.203
40 GB | ~400 (Not accurate), 40 GB | ~300 -> 10000 | ~7 (Not accurate)
Lucene/Solr Users
• Instagram (geo-search API)
• Netflix (Generic search feature)
• SourceForge (Generic search feature)
• Eclipse (Documentation search)
• LinkedIn (Recently, Job Search)
• Krugle (Source Code Search)
• Wikipedia (Recently, Generic Content Search)
References
• Manning Lucene in Action (2nd Edition)
• Lucene Main Website
• Another Presentation on SlideShare
Thank You

More Related Content

What's hot

Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBertrand Delacretaz
 
Tutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component pluginTutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component pluginsearchbox-com
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrRahul Jain
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Integrating the Solr search engine
Integrating the Solr search engineIntegrating the Solr search engine
Integrating the Solr search engineth0masr
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes WorkshopErik Hatcher
 
Introduction to apache lucene
Introduction to apache luceneIntroduction to apache lucene
Introduction to apache luceneShrikrishna Parab
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development TutorialErik Hatcher
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrAndy Jackson
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Retrieving Information From Solr
Retrieving Information From SolrRetrieving Information From Solr
Retrieving Information From SolrRamzi Alqrainy
 
20130310 solr tuorial
20130310 solr tuorial20130310 solr tuorial
20130310 solr tuorialChris Huang
 

What's hot (20)

Introduction To Apache Lucene
Introduction To Apache LuceneIntroduction To Apache Lucene
Introduction To Apache Lucene
 
Lucene and MySQL
Lucene and MySQLLucene and MySQL
Lucene and MySQL
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and Solr
 
Apache lucene
Apache luceneApache lucene
Apache lucene
 
Tutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component pluginTutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component plugin
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Lucene indexing
Lucene indexingLucene indexing
Lucene indexing
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Integrating the Solr search engine
Integrating the Solr search engineIntegrating the Solr search engine
Integrating the Solr search engine
 
Lucene
LuceneLucene
Lucene
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes Workshop
 
Introduction to apache lucene
Introduction to apache luceneIntroduction to apache lucene
Introduction to apache lucene
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development Tutorial
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Retrieving Information From Solr
Retrieving Information From SolrRetrieving Information From Solr
Retrieving Information From Solr
 
20130310 solr tuorial
20130310 solr tuorial20130310 solr tuorial
20130310 solr tuorial
 

Viewers also liked

Web scraping with nutch solr
Web scraping with nutch solrWeb scraping with nutch solr
Web scraping with nutch solrMike Frampton
 
The Future of Library Cataloguing
The Future of Library CataloguingThe Future of Library Cataloguing
The Future of Library CataloguingKathryne Dunlap
 
Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin
Datech2014 - Cataloguing for a Billion Word Library of Greek and LatinDatech2014 - Cataloguing for a Billion Word Library of Greek and Latin
Datech2014 - Cataloguing for a Billion Word Library of Greek and LatinIMPACT Centre of Competence
 
Cataloguing in the Real World
Cataloguing in the Real WorldCataloguing in the Real World
Cataloguing in the Real WorldEmily Porta
 
Day in the life of a data librarian [presentation for ANU 23Things group]
Day in the life of a data librarian [presentation for ANU 23Things group]Day in the life of a data librarian [presentation for ANU 23Things group]
Day in the life of a data librarian [presentation for ANU 23Things group]Jane Frazier
 
Library of Congress New Bibliographic Framework - What is it?
Library of Congress New Bibliographic Framework - What is it?Library of Congress New Bibliographic Framework - What is it?
Library of Congress New Bibliographic Framework - What is it?Lukas Koster
 
Censorship by Omission: Closing off fiction in cataloguing
Censorship by Omission: Closing off fiction in cataloguingCensorship by Omission: Closing off fiction in cataloguing
Censorship by Omission: Closing off fiction in cataloguingNational Library of Australia
 
Library Carpentry: software skills training for library professionals, Chart...
 Library Carpentry: software skills training for library professionals, Chart... Library Carpentry: software skills training for library professionals, Chart...
Library Carpentry: software skills training for library professionals, Chart...James Baker
 
Microdata cataloging tool (nada)
Microdata cataloging tool (nada)Microdata cataloging tool (nada)
Microdata cataloging tool (nada)Divya Vyas
 
Computer Science Library Training
Computer Science Library TrainingComputer Science Library Training
Computer Science Library Trainingpvhead123
 
Presentacion mineria
Presentacion mineriaPresentacion mineria
Presentacion mineriaviktor93
 
Taller de catalogación Linked Open Data y RDA: posibilidades y desafíos. Prim...
Taller de catalogación Linked Open Data y RDA: posibilidades y desafíos. Prim...Taller de catalogación Linked Open Data y RDA: posibilidades y desafíos. Prim...
Taller de catalogación Linked Open Data y RDA: posibilidades y desafíos. Prim...DIGIBIS
 
Library of Congress Subject Headings
Library of Congress Subject HeadingsLibrary of Congress Subject Headings
Library of Congress Subject Headingsroycekitts
 
Post coordinate indexing .. Library and information science
Post coordinate indexing .. Library and information sciencePost coordinate indexing .. Library and information science
Post coordinate indexing .. Library and information scienceharshaec
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrLucidworks (Archived)
 
Indexing or dividing_head
Indexing or dividing_headIndexing or dividing_head
Indexing or dividing_headJavaria Chiragh
 

Viewers also liked (20)

Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Web scraping with nutch solr
Web scraping with nutch solrWeb scraping with nutch solr
Web scraping with nutch solr
 
The Future of Library Cataloguing
The Future of Library CataloguingThe Future of Library Cataloguing
The Future of Library Cataloguing
 
Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin
Datech2014 - Cataloguing for a Billion Word Library of Greek and LatinDatech2014 - Cataloguing for a Billion Word Library of Greek and Latin
Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin
 
Cataloguing in the Real World
Cataloguing in the Real WorldCataloguing in the Real World
Cataloguing in the Real World
 
Day in the life of a data librarian [presentation for ANU 23Things group]
Day in the life of a data librarian [presentation for ANU 23Things group]Day in the life of a data librarian [presentation for ANU 23Things group]
Day in the life of a data librarian [presentation for ANU 23Things group]
 
Library of Congress New Bibliographic Framework - What is it?
Library of Congress New Bibliographic Framework - What is it?Library of Congress New Bibliographic Framework - What is it?
Library of Congress New Bibliographic Framework - What is it?
 
Censorship by Omission: Closing off fiction in cataloguing
Censorship by Omission: Closing off fiction in cataloguingCensorship by Omission: Closing off fiction in cataloguing
Censorship by Omission: Closing off fiction in cataloguing
 
Library Carpentry: software skills training for library professionals, Chart...
 Library Carpentry: software skills training for library professionals, Chart... Library Carpentry: software skills training for library professionals, Chart...
Library Carpentry: software skills training for library professionals, Chart...
 
Microdata cataloging tool (nada)
Microdata cataloging tool (nada)Microdata cataloging tool (nada)
Microdata cataloging tool (nada)
 
Computer Science Library Training
Computer Science Library TrainingComputer Science Library Training
Computer Science Library Training
 
Presentacion mineria
Presentacion mineriaPresentacion mineria
Presentacion mineria
 
Laravel and SOLR
Laravel and SOLRLaravel and SOLR
Laravel and SOLR
 
Taller de catalogación Linked Open Data y RDA: posibilidades y desafíos. Prim...
Taller de catalogación Linked Open Data y RDA: posibilidades y desafíos. Prim...Taller de catalogación Linked Open Data y RDA: posibilidades y desafíos. Prim...
Taller de catalogación Linked Open Data y RDA: posibilidades y desafíos. Prim...
 
RDA y el proceso de catalogación
RDA y el proceso de catalogaciónRDA y el proceso de catalogación
RDA y el proceso de catalogación
 
Library of Congress Subject Headings
Library of Congress Subject HeadingsLibrary of Congress Subject Headings
Library of Congress Subject Headings
 
POPSI
POPSIPOPSI
POPSI
 
Post coordinate indexing .. Library and information science
Post coordinate indexing .. Library and information sciencePost coordinate indexing .. Library and information science
Post coordinate indexing .. Library and information science
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with Solr
 
Indexing or dividing_head
Indexing or dividing_headIndexing or dividing_head
Indexing or dividing_head
 

Similar to Building a Search Engine Using Lucene

Tutorial 5 (lucene)
Tutorial 5 (lucene)Tutorial 5 (lucene)
Tutorial 5 (lucene)Kira
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with LuceneWO Community
 
Examiness hints and tips from the trenches
Examiness hints and tips from the trenchesExaminess hints and tips from the trenches
Examiness hints and tips from the trenchesIsmail Mayat
 
Java Search Engine Framework
Java Search Engine FrameworkJava Search Engine Framework
Java Search Engine FrameworkAppsterdam Milan
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAsad Abbas
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr WorkshopJSGB
 
Ingesting and Manipulating Data with JavaScript
Ingesting and Manipulating Data with JavaScriptIngesting and Manipulating Data with JavaScript
Ingesting and Manipulating Data with JavaScriptLucidworks
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Solr introduction
Solr introductionSolr introduction
Solr introductionLap Tran
 
DIY Percolator
DIY PercolatorDIY Percolator
DIY Percolatorjdhok
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Hands On Spring Data
Hands On Spring DataHands On Spring Data
Hands On Spring DataEric Bottard
 
Get docs from sp doc library
Get docs from sp doc libraryGet docs from sp doc library
Get docs from sp doc librarySudip Sengupta
 
Building Search & Recommendation Engines
Building Search & Recommendation EnginesBuilding Search & Recommendation Engines
Building Search & Recommendation EnginesTrey Grainger
 
Data Access Options in SharePoint 2010
Data Access Options in SharePoint 2010Data Access Options in SharePoint 2010
Data Access Options in SharePoint 2010Rob Windsor
 

Similar to Building a Search Engine Using Lucene (20)

IR with lucene
IR with luceneIR with lucene
IR with lucene
 
Fast track to lucene
Fast track to luceneFast track to lucene
Fast track to lucene
 
Tutorial 5 (lucene)
Tutorial 5 (lucene)Tutorial 5 (lucene)
Tutorial 5 (lucene)
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with Lucene
 
Examiness hints and tips from the trenches
Examiness hints and tips from the trenchesExaminess hints and tips from the trenches
Examiness hints and tips from the trenches
 
Lucene in Action
Lucene in ActionLucene in Action
Lucene in Action
 
Java Search Engine Framework
Java Search Engine FrameworkJava Search Engine Framework
Java Search Engine Framework
 
Full Text Search In PostgreSQL
Full Text Search In PostgreSQLFull Text Search In PostgreSQL
Full Text Search In PostgreSQL
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 
Ingesting and Manipulating Data with JavaScript
Ingesting and Manipulating Data with JavaScriptIngesting and Manipulating Data with JavaScript
Ingesting and Manipulating Data with JavaScript
 
Dapper
DapperDapper
Dapper
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Solr introduction
Solr introductionSolr introduction
Solr introduction
 
DIY Percolator
DIY PercolatorDIY Percolator
DIY Percolator
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Hands On Spring Data
Hands On Spring DataHands On Spring Data
Hands On Spring Data
 
Get docs from sp doc library
Get docs from sp doc libraryGet docs from sp doc library
Get docs from sp doc library
 
Building Search & Recommendation Engines
Building Search & Recommendation EnginesBuilding Search & Recommendation Engines
Building Search & Recommendation Engines
 
Data Access Options in SharePoint 2010
Data Access Options in SharePoint 2010Data Access Options in SharePoint 2010
Data Access Options in SharePoint 2010
 

Recently uploaded

1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 

Recently uploaded (20)

Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 

Building a Search Engine Using Lucene

  • 1. Building a Search Engine Using Apache Lucene/Solr
  • 2. Road Map • Problem Definition • A Basic Search Engine Pipeline • Meet Lucene • Lucene API Examples • Lucene Wrappers (Apache Solr, ElasticSearch, Regain, etc.) • Applied Lucene (Real Examples)
  • 3. Problem Definition You have a farm of data, and you want it to be searchable. Analogy: searching for a needle in a haystack while more hay keeps being added to the stack! - SQL Databases Cons (> 500,000,000 records …) - Scalability - Decentralization
  • 4. A Basic Search Engine Pipeline • Crawling: Grabbing the data • Parsing [Optional]: Understanding the data • Indexing: Building the holding structure • Ranking: Sorting the data • Searching: Reading that holding structure Behind The Scenes: Analysis, Tokenization, Query Parsing, Boosting, Calculating Term Vectors, Token Filtration, Index Inversion, etc.
  • 5. What is Lucene? • Doug Cutting (Lucene 1999, Nutch 2003, Hadoop 2006) • Free, Java information retrieval library • Application related: Indexing, Searching • High performance, A decade of research • Heavily supported, simply customized • No dependencies
  • 6. What Lucene Ain’t • A complete search engine • An application • A crawler • A document filter/recognizer
  • 7. Lucene Roles [Diagram: rich documents are gathered, parsed, and made into Docs that are written to the index; a search app (e.g. a webapp) with a search UI searches that index.]
  • 8. Lucene Strength Points • Simple API • Speed • Concurrency • Smart indexing (Incremental) • Near Real Time Search • Vector Space Search • Heavily Used, Supported
  • 9. Lucene Query Types • Single Term vs. Multi-Term “+name:camel +type:animal” • Wildcard Queries “text:wonder*” • Fuzzy Queries “room~0.8” • Range Queries “date:[25/5/2000 TO *]” • Grouped Queries “text:(animal AND small)” • Proximity Queries “hamlet macbeth”~10 • Boosted Queries “hamlet^5.0 AND macbeth”
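  • 10. API Sample I (Indexing)

        // Written against the Lucene 3.0-era API used throughout these slides.
        import java.io.File;
        import java.io.FileFilter;
        import java.io.FileReader;
        import java.io.IOException;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.FSDirectory;
        import org.apache.lucene.util.Version;

        public class Indexer {
            private final IndexWriter writer;

            public Indexer(String indexDir) throws IOException {
                Directory dir = FSDirectory.open(new File(indexDir));
                // true = create a fresh index, replacing any existing one
                writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_CURRENT), true,
                        IndexWriter.MaxFieldLength.UNLIMITED);
            }

            public void close() throws IOException {
                writer.close();
            }

            public void index(String dataDir, FileFilter filter) throws Exception {
                // Only files accepted by the filter are indexed; listFiles returns null for a bad path
                File[] files = new File(dataDir).listFiles(filter);
                if (files == null) return;
                for (File f : files) {
                    Document doc = new Document();
                    doc.add(new Field("contents", new FileReader(f)));  // analyzed, not stored
                    doc.add(new Field("filename", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
                    writer.addDocument(doc);
                }
            }
        }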
  • 11. Indexing Pipeline (Simplified) [Diagram: a Document is added to the Document Writer, passes through a Tokenizer and a TokenFilter, and ends up in the Inverted Index.]
  • 12. Analysis Basic Types

        "The quick brown fox jumped over the lazy dogs"
          WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
          SimpleAnalyzer:     [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
          StopAnalyzer:       [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
          StandardAnalyzer:   [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

        "XY&Z Corporation - xyz@example.com"
          WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [xyz@example.com]
          SimpleAnalyzer:     [xy] [z] [corporation] [xyz] [example] [com]
          StopAnalyzer:       [xy] [z] [corporation] [xyz] [example] [com]
          StandardAnalyzer:   [xy&z] [corporation] [xyz@example.com]
  • 13. The Inverted Index (In a nutshell)
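  • 14. API Sample II (Searching)

        import java.io.File;
        import java.io.IOException;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.queryParser.ParseException;
        import org.apache.lucene.queryParser.QueryParser;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.search.ScoreDoc;
        import org.apache.lucene.search.TopDocs;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.FSDirectory;
        import org.apache.lucene.util.Version;

        public void search(String indexDir, String q) throws IOException, ParseException {
            Directory dir = FSDirectory.open(new File(indexDir));
            IndexSearcher is = new IndexSearcher(dir, true);  // true = open read-only
            try {
                // Lucene 3.x's QueryParser takes a Version as its first argument;
                // use the same analyzer here as at indexing time
                QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
                        new StandardAnalyzer(Version.LUCENE_CURRENT));
                Query query = parser.parse(q);
                TopDocs hits = is.search(query, 10);  // top 10 hits by score
                System.err.println("Found " + hits.totalHits + " document(s)");
                for (ScoreDoc scoreDoc : hits.scoreDocs) {
                    Document doc = is.doc(scoreDoc.doc);
                    System.out.println(doc.get("filename"));
                }
            } finally {
                is.close();  // release the index even if parsing or searching fails
            }
        }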
  • 15. Index Update • Lucene has no in-place update mechanism. So? • Incremental Indexing (Index Merging) • Delete + Add = Update (this is what IndexWriter.updateDocument does) • Index Optimization
  • 16. API Sample III (Deleting)

        Via IndexReader
          void deleteDocument(int docNum)        Deletes the document numbered docNum.
          int deleteDocuments(Term term)         Deletes all documents that have a given term indexed.

        Via IndexWriter
          void deleteAll()                       Deletes all documents in the index.
          void deleteDocuments(Query query)      Deletes the document(s) matching the provided query.
          void deleteDocuments(Query[] queries)  Deletes the document(s) matching any of the provided queries.
          void deleteDocuments(Term term)        Deletes the document(s) containing term.
          void deleteDocuments(Term[] terms)     Deletes the document(s) containing any of the terms.
  • 17. Some Statistics • Dependent on Lucene.NET (a .NET port of Lucene)

        Local Testing (index and search are on the same device)

        Dataset Size | Indexing (Min.) | Retrieval (Ms.) | Opt. (Min.)
        4.3 GB       | ~32, 180 MB     | ~50 -> 300      | 0.2
        40 GB        | ~360, 2.6 GB    | ~100 -> 3000    | 3.2

        Over Network Testing (file server for the index files, standalone searching workstations)

        Dataset Size | Indexing (Min.) | Retrieval (Ms.) | Opt. (Min.)
        4.3 GB       | X, 180 MB       | ~300 -> 700     | X
        40 GB        | X, 2.6 GB       | ~400 -> 4500    | X
  • 18. Lucene Wrappers (Apache Solr) • A Java wrapper over Lucene • A web application that can be deployed on any servlet container (Apache Tomcat, Jetty) • A REST service • It has an administration interface • Built-in configuration with Apache Tika (a repository of parsers) • Scalable • Integration with Apache Hadoop, Apache Cassandra
  • 20. Solr Architecture (The Big Picture) Note: Solr's output formats include JSON, PHP, Python, …; not only XML.
  • 21. Communication with Solr (Sending Docs) • Direct Connection OR Through APIs (SolrJ, SolrNET)

        import java.util.ArrayList;
        import java.util.Collection;
        import org.apache.solr.client.solrj.SolrServer;
        import org.apache.solr.client.solrj.impl.HttpSolrServer;
        import org.apache.solr.common.SolrInputDocument;

        // make a connection to the Solr server
        SolrServer server = new HttpSolrServer("http://localhost:8080/solr/");

        // prepare a doc
        final SolrInputDocument doc1 = new SolrInputDocument();
        doc1.addField("id", 1);
        doc1.addField("firstName", "First Name");
        doc1.addField("lastName", "Last Name");
        final Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
        docs.add(doc1);

        // add docs to Solr, then commit so they become searchable
        server.add(docs);
        server.commit();
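  • 22. Communication with Solr (Searching)

        import org.apache.solr.client.solrj.SolrQuery;
        import org.apache.solr.client.solrj.response.QueryResponse;
        import org.apache.solr.common.SolrDocument;
        import org.apache.solr.common.SolrDocumentList;

        // "server" is the SolrServer connection opened on the previous slide
        final SolrQuery query = new SolrQuery();
        query.setQuery("*:*");  // match every document
        query.addSortField("firstName", SolrQuery.ORDER.asc);
        final QueryResponse rsp = server.query(query);
        final SolrDocumentList solrDocumentList = rsp.getResults();
        for (final SolrDocument doc : solrDocumentList) {
            final String firstName = (String) doc.getFieldValue("firstName");
            // the returned id type depends on the schema, so convert rather than cast
            final String id = String.valueOf(doc.getFieldValue("id"));
        }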
  • 23. Some Statistics Note 1: We are sending HTTP POST requests to the Solr server, which can take much longer than the pure Lucene.NET model. Note 2: Consider a server with incoming requests from everywhere; OS-related queuing issues can cause some delay, depending on the queuing strategy.

        Dataset Size | Indexing (Min.)             | Retrieval (Ms.) | Opt. (Min.)
        4.3 GB       | ~39.5, 169 MB               | ~300 -> 3000    | 0.203
        40 GB        | ~400 (Not accurate), 40 GB  | ~300 -> 10000   | ~7 (Not accurate)

  • 24. Lucene/Solr Users • Instagram (geo-search API) • Netflix (Generic search feature) • SourceForge (Generic search feature) • Eclipse (Documentation search) • LinkedIn (Recently, Job Search) • Krugle (Source Code Search) • Wikipedia (Recently, Generic Content Search)
  • 25. References • Manning Lucene in Action (2nd Edition) • Lucene Main Website • Another Presentation on SlideShare
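
Slide 13 names the inverted index without showing it. As a rough illustration only, the idea can be sketched in plain Java with no Lucene dependency: each term maps to the sorted set of document ids that contain it, so a search is a map lookup rather than a scan of every document. The class `InvertedIndexSketch` and its methods are hypothetical names chosen for this sketch; Lucene's real index is segment-based and far more elaborate.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index: term -> sorted set of document ids. Not Lucene's implementation. */
final class InvertedIndexSketch {
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<String, TreeSet<Integer>>();
    private int nextDocId = 0;

    /** Adds a document and returns the id assigned to it (0, 1, 2, ...). */
    int add(String text) {
        int docId = nextDocId++;
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<Integer>()).add(docId);
        }
        return docId;
    }

    /** Returns the ids of documents containing the term, in ascending order. */
    List<Integer> search(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        if (ids == null) {
            return Collections.<Integer>emptyList();
        }
        return new ArrayList<Integer>(ids);
    }

    /** Returns the ids of documents containing both terms (a two-term AND query). */
    List<Integer> searchAll(String first, String second) {
        List<Integer> result = new ArrayList<Integer>(search(first));
        result.retainAll(search(second));
        return result;
    }

    /** Assumed analysis step: lowercase, then split on anything that is not a-z or 0-9. */
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<String>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty()) {
                terms.add(raw);
            }
        }
        return terms;
    }
}
```

Adding "The quick brown fox", "The lazy dog" and "Quick dog" in that order assigns ids 0, 1 and 2; `search("quick")` then returns ids 0 and 2, and `searchAll("quick", "dog")` returns only id 2. Ranking, which the pipeline on slide 4 lists as a separate step, is deliberately left out of this sketch.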

Editor's Notes

  1. Don't forget the concept of documents.
  2. Note that you can even write your own custom analyzers, depending on the application's needs.
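
To make the second note concrete, here is a minimal plain-Java sketch of what an analyzer does conceptually: a tokenizer step followed by a chain of token filters, mirroring the Tokenizer and TokenFilter stages on slide 11. It deliberately avoids the Lucene API; `AnalyzerSketch`, its method names and the three-word stop list are assumptions made for illustration, not Lucene classes or defaults.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Conceptual analyzer: tokenizer -> lowercase filter -> stop-word filter. Not the Lucene API. */
final class AnalyzerSketch {
    /** Tokenizer step: split on whitespace and leave each token untouched. */
    static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Filter step: lowercase every token. */
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Filter step: drop tokens that appear in the stop-word set. */
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    /** The custom chain; the stop list here is an assumed, tiny example. */
    static List<String> analyze(String text) {
        Set<String> stopWords = new HashSet<String>(Arrays.asList("a", "an", "the"));
        return removeStopWords(lowerCase(whitespaceTokenize(text)), stopWords);
    }
}
```

On the sentence from slide 12, `whitespaceTokenize` alone keeps "The" and "the" as distinct tokens, while `analyze` lowercases them and then drops both. Swapping or reordering the filter steps is how the chain would be adapted to a particular application.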