Ads final project

  1. Information Retrieval Theory: A case study involving Apache Lucene + Solr: A distributed Search Engine. By Alok Dhamanaskar and Manuel Correa
  2. Outline
     ● Problem Description
     ● About Lucene, Solr, and Hadoop HDFS
     ● Solution: Implementation
     ● Data tested
     ● Demo
     ● Conclusions
     ● Questions
  3. Problem Description
     ● Search large data sets across thousands of documents, databases, etc. with a simple query
     ● SQL does not support full-text search across multiple fields with ranking and other data-mining features
     ● Data may include geospatial data and require operations such as searching by distance and buffers
     ● Most data-intensive applications also demand high availability and persistence
  4. About Lucene, Solr, and Hadoop HDFS
     ● Lucene: Java index engine
       o Ranked searching
       o Query types: phrase, wildcard, proximity
       o Document field searching
       o Sorting
     ● Solr:
       o Web application that interacts with the Lucene engine
       o RESTful interfaces for searching, indexing, deleting, etc.
       o Extends Lucene: geospatial search, schema integration, monitoring, index sharding
     ● Hadoop HDFS
       o Distributed file system
  5. Project Implementation
     ● Implementation of Hadoop (Cloudera version)
     ● Integration with the file system through FUSE
     ● Setup of Solr instances
     ● Data manipulation in the databases (Oracle and SQL Server)
     ● Data indexing from the databases into Solr
     ● Distributed search implementation in Solr
     ● Solr client web application development
  6. Solution: Implementation
  7. Data tested
     ● Public data from the Carl Vinson Institute of Government - ITOS
       o Indexed 10 schemas
       o More than 500 columns indexed, including location information
       o Approximately 200,000 documents created
       o More than 15,000,000 data items indexed for each document
       o Information related to: government buildings, clinics, hospitals, fire stations, teen centers, service facilities, shelters, child support offices, historical resources, and archaeological sites
  8. Demo
     ● Hadoop HDFS pseudo-distributed implementation
     ● HDFS mountable with FUSE
     ● Solr instances configuration
     ● Solr client web application
  9. Conclusions
     ● HDFS offers highly available storage for index documents
     ● Solr offers a lightweight solution for implementing a powerful search engine
     ● Solr is a "cheap" way to implement a basic geospatial search engine
     ● Solr's RESTful API makes it easy to integrate with any enterprise system
  10. Questions?
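
The slides mention Solr's RESTful search interface, distributed search across Solr instances, and basic geospatial search, but do not show a request. The sketch below is one way such a query might be assembled, not the authors' actual client code. It only builds a URL and sends nothing. The host names, the core name `facilities`, and the field names `name` and `location` are hypothetical, and the coordinates are illustrative. The `shards` parameter and the `{!geofilt}` filter are standard Solr request features, though whether the project used them in this form is an assumption.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_select_url(base_url, query, shards=(), center=None, radius_km=None,
                     location_field="location"):
    """Build a Solr /select URL. No request is sent."""
    params = {"q": query, "wt": "json"}
    if shards:
        # Distributed search: Solr forwards the query to every listed shard
        # and merges the ranked results.
        params["shards"] = ",".join(shards)
    if center is not None and radius_km is not None:
        lat, lon = center
        # geofilt keeps only documents within radius_km of the given point.
        params["fq"] = "{!geofilt sfield=%s pt=%s,%s d=%s}" % (
            location_field, lat, lon, radius_km)
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical two-instance setup; core and field names are assumptions.
url = build_select_url(
    "http://localhost:8983/solr/facilities",
    'name:"fire station"',  # Lucene phrase query on a hypothetical field
    shards=["localhost:8983/solr/facilities",
            "localhost:7574/solr/facilities"],
    center=(33.95, -83.37),
    radius_km=10,
)
print(url)
```

A web client like the one listed in the demo could issue an HTTP GET against a URL of this shape and render the JSON response.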