
Ads final project


  1. Information Retrieval Theory: A case study involving Apache Lucene + Solr: A distributed Search Engine
     By Alok Dhamanaskar and Manuel Correa
  2. Outline
     ● Problem Description
     ● About Lucene, Solr, and Hadoop HDFS
     ● Solution: Implementation
     ● Data tested
     ● Demo
     ● Conclusions
     ● Questions
  3. Problem Description
     ● Search large data sets spanning thousands of documents, databases, etc. with a simple query
     ● Standard SQL does not provide full-text search across multiple fields with relevance ranking and other data-mining features
     ● Data might contain geospatial information and require operations such as searching by distance and buffers
     ● Most data-intensive applications also demand high availability and persistence
  4. About Lucene, Solr, and Hadoop HDFS
     ● Lucene: Java indexing engine
       o Ranked searching
       o Query types: phrase, wildcard, proximity
       o Document field searching
       o Sorting
     ● Solr:
       o Web application that interacts with the Lucene engine
       o RESTful interfaces for searching, indexing, deleting, etc.
       o Extends Lucene: geospatial search, schema integration, monitoring, index sharding
     ● Hadoop HDFS
       o Distributed file system
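
The query types listed on this slide can be illustrated with a small sketch that builds Solr `/select` request URLs from Lucene query syntax. The endpoint URL and the field names (`name`, `description`) are assumptions made up for illustration; the slides do not show the actual schema or host.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; not taken from the slides.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_select_url(query, sort=None, rows=10):
    """Return a Solr /select URL for a Lucene-syntax query string."""
    params = {"q": query, "rows": rows, "wt": "json"}
    if sort:
        params["sort"] = sort
    return SOLR_SELECT + "?" + urlencode(params)


# One example per query type named on the slide (field names are hypothetical).
EXAMPLES = {
    "phrase": 'name:"fire station"',               # exact phrase in one field
    "wildcard": "name:hosp*",                      # prefix wildcard
    "proximity": 'description:"child support"~3',  # terms within 3 positions
}

if __name__ == "__main__":
    for kind, q in EXAMPLES.items():
        # "score desc" asks for ranked results, best match first.
        print(kind, build_select_url(q, sort="score desc"))
```

Each URL can be fetched with any HTTP client, which is what makes the interface easy to call from a web application.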
  5. Project Implementation
     ● Hadoop installation (Cloudera distribution)
     ● Integration with the file system through FUSE
     ● Setup of Solr instances
     ● Data manipulation in the databases (Oracle and SQL Server)
     ● Data indexing from the databases into Solr
     ● Distributed search implementation in Solr
     ● Solr client web application development
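
As a rough sketch of the distributed search step, Solr's `shards` request parameter lets one instance fan a query out to several instances and merge the results. The host names and ports below are hypothetical, and the slides do not say which distribution mechanism the project actually used.

```python
from urllib.parse import urlencode

# Hypothetical shard addresses; a real deployment lists its own Solr instances.
SHARDS = ["localhost:8983/solr", "localhost:7574/solr"]


def build_distributed_url(coordinator, query, shards, rows=10):
    """Build a /select URL that asks one Solr instance to query every shard."""
    params = {
        "q": query,
        "rows": rows,
        # Comma-separated host:port/path entries, written without "http://".
        "shards": ",".join(shards),
    }
    return "http://" + coordinator + "/select?" + urlencode(params)


if __name__ == "__main__":
    print(build_distributed_url(SHARDS[0], "name:hospital", SHARDS))
```

The instance that receives the request acts as coordinator, so the client web application only needs to know one address.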
  6. Solution: Implementation
  7. Data tested
     ● Public data from the Carl Vinson Institute of Government - ITOS
       o 10 schemas indexed
       o More than 500 columns indexed, including location information
       o Approximately 200,000 documents created
       o More than 15,000,000 data items indexed for each document
       o Information related to: government buildings, clinics, hospitals, fire stations, teen centers, service facilities, shelters, child support offices, historical resources, and archaeological sites
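
To make the row-to-document step concrete, here is a minimal sketch that turns a database row into a Solr document and serializes a batch as a JSON update payload. The column names, the `schema_s` field, the `location` field, and the sample row are all invented for illustration; the slides do not describe the actual mapping or the indexing tool used.

```python
import json


def row_to_doc(schema, row):
    """Map one database row (a dict) to a flat Solr document."""
    # Prefix the id with the schema name so ids stay unique across schemas.
    doc = {"id": f"{schema}-{row['id']}", "schema_s": schema}
    for column, value in row.items():
        if column in ("id", "lat", "lon") or value is None:
            continue  # skip the key, raw coordinates, and NULL columns
        doc[column] = value
    if row.get("lat") is not None and row.get("lon") is not None:
        # Solr point fields accept a "lat,lon" string.
        doc["location"] = f"{row['lat']},{row['lon']}"
    return doc


def update_payload(schema, rows):
    """Serialize rows as a JSON array of documents for Solr's update handler."""
    return json.dumps([row_to_doc(schema, r) for r in rows])


if __name__ == "__main__":
    # Made-up sample row, not taken from the ITOS data.
    sample = {"id": 7, "name": "Example Clinic", "lat": 33.95, "lon": -83.37}
    print(update_payload("hospitals", [sample]))
```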
  8. DEMO
     ● Hadoop HDFS pseudo-distributed implementation
     ● HDFS mountable with FUSE
     ● Solr instance configuration
     ● Solr client web application
  9. Conclusions
     ● HDFS offers highly available storage for the index documents
     ● Solr offers a lightweight way to implement a powerful search engine
     ● Solr is a "cheap" way to implement a basic geospatial search engine
     ● Solr's RESTful API makes it easy to integrate with any enterprise system
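
The basic geospatial search mentioned in the conclusions can be sketched with Solr's `geofilt` filter, which keeps documents within a given distance of a point. The field name `location` and the coordinates in the usage example are assumptions for illustration only.

```python
from urllib.parse import urlencode


def build_geo_params(lat, lon, radius_km, field="location"):
    """Request parameters for 'everything within radius_km of (lat, lon)'."""
    return {
        "q": "*:*",               # match all documents, then filter by distance
        "fq": "{!geofilt}",       # circular distance filter
        "sfield": field,          # spatial field holding "lat,lon" values
        "pt": f"{lat},{lon}",     # center point
        "d": radius_km,           # radius in kilometers
        "sort": "geodist() asc",  # nearest results first
    }


if __name__ == "__main__":
    # Hypothetical center point and radius.
    print(urlencode(build_geo_params(33.95, -83.37, 10)))
```

Because the whole request is a handful of query parameters, the same search can be issued from any system that can make an HTTP call.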
  10. Questions?
