Solr 4.1
Abhey Gupta
Software Engineer (Java)
Value First Digital Pvt Ltd
Outline
This presentation works through a series of questions to assess the usability of Solr in MIS 3.
– What is Solr?
– Why use Solr?
– Current Scenario
– Scope of Improvement
– Indexing Data
– Import MIS Data
– Challenges
– Demo
– Query Example
What is Solr?
Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project.
• Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search.
• Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.
What is Lucene?
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
– An open source Java-based IR library with best-practice indexing and query capabilities, fast and lightweight search and indexing.
– 100% Java (.NET, Perl and other versions exist too).
– Stable, mature API.
– Continuously improved and tuned over more than 10 years.
– Cleanly implemented, easy to embed in an application.
– Compact, portable index representation.
– Programmable text analyzers, spell checking and highlighting.
– Not a crawler or a text extraction tool.
Who uses Lucene/Solr?
Here are five noteworthy public sites that use Solr to handle search:
– WhiteHouse.gov – The Obama administration's keystone web site runs on Drupal and Solr.
– Netflix – Solr powers basic movie searching on this extremely busy site.
– Internet Archive – Search this vast repository of music, documents and video using Solr.
– StubHub.com – This ticket reseller uses Solr to help visitors search for concerts and sporting events.
– The Smithsonian Institution – Search the Smithsonian’s collection of over 4 million items.
Why use Solr?
Assuming the user has a relational DB, why use Solr? If your use case requires a person to type words into a search box, you want a text search engine like Solr.
Databases and Solr have complementary strengths and weaknesses.
SQL supports very simple wildcard-based text search with some simple normalization, like matching upper case to lower case. The problem is that these are full table scans. In Solr all searchable words are stored in an "inverted index", which searches orders of magnitude faster.
For details, see: http://wiki.apache.org/solr/WhyUseSolr
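The difference between a full scan and an index lookup can be sketched in a few lines of Python. This is only an illustration of the idea, not how Lucene stores its index: the whitespace tokenizer, the sample messages and all function names are assumptions made for the example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive analyzer)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def scan_search(docs, word):
    """Scan-style search: every document is inspected on every query."""
    word = word.lower()
    return {doc_id for doc_id, text in docs.items() if word in tokenize(text)}


def index_search(index, word):
    """Index-style search: one dictionary lookup, independent of corpus size."""
    return set(index.get(word.lower(), set()))


# Hypothetical sample data, not taken from MIS.
docs = {
    1: "Delivery report received",
    2: "Message delivery failed",
    3: "Promo message sent",
}
index = build_inverted_index(docs)
```

Both functions return the same ids; the scan does work proportional to the number of documents per query, while the index pays that cost once at build time.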
Current Scenario
Currently, MIS 3 uses MySQL FULLTEXT search for text-based search, which lags behind Solr in query speed and text search.
Diagram (components: USER (80), MIS UI, MYSQL):
1. Full text search (USER to MIS UI)
2. Full text search query (MIS UI to MYSQL)
3. Full table scan for text search (MYSQL)
4. Result
Scope of Improvement
Instead of querying MySQL for text search, we can deploy Solr in between to return the results. Because Solr searches an inverted index, this querying is fast and efficient.
Diagram (components: USER (80), MIS UI, SOLR, MYSQL):
1. Full text search (USER to MIS UI)
2. Full text search query (MIS UI to SOLR)
3. Scan for tokenized text search in the inverted index (SOLR)
4. Result
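As a rough sketch of step 2, the MIS UI could build a request against Solr's select handler and read the matching ids from the JSON response. The base URL, the core name `mis`, and the field names `text` and `messageid` are assumptions for illustration; the real names depend on the schema chosen for MIS.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url, core, field, phrase, rows=10):
    """Build a Solr select URL for a phrase query on one field."""
    # Escape backslashes and double quotes so the phrase stays one quoted term.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    params = {"q": f'{field}:"{escaped}"', "wt": "json", "rows": rows}
    return f"{base_url}/{core}/select?{urlencode(params)}"


def message_ids_from_response(body, id_field="messageid"):
    """Pull the id field out of each document in a Solr JSON response."""
    docs = json.loads(body)["response"]["docs"]
    return [doc[id_field] for doc in docs if id_field in doc]


# Hypothetical host and core; nothing is sent over the network here.
url = build_select_url("http://localhost:8983/solr", "mis", "text", "otp code")
```

The returned ids could then be shown directly or used to fetch full rows from MySQL.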
Indexing Data in Solr
A Solr index can accept data from many different sources, including XML files, comma-separated value (CSV) files, data extracted from tables in a database, and files in common file formats such as Microsoft Word or PDF.
Here are the most common ways of loading data into a Solr index:
– Uploading data from structured data stores with the Data Import Handler.
– Using the Solr Cell framework, built on Apache Tika, for ingesting binary files or structured files such as Office, Word, PDF, and other proprietary formats.
– Uploading XML files by sending HTTP requests to the Solr server from any environment where such requests can be generated.
– Writing a custom Java application to ingest data through Solr's Java
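The XML-over-HTTP option can be sketched as follows: build an `<add>` message and wrap it in a POST request for the update handler. The update URL, the core name and the field names are assumptions for illustration, and the request is only constructed here, not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize a list of dicts into a Solr XML <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # ElementTree escapes <, > and & on output
    return ET.tostring(add, encoding="unicode")


def build_update_request(update_url, xml_body):
    """Wrap the XML in a POST request; pass it to urllib.request.urlopen to send."""
    return urllib.request.Request(
        update_url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


# Hypothetical document and URL.
xml_body = build_add_xml([{"id": "m1", "text": "a < b"}])
request = build_update_request(
    "http://localhost:8983/solr/mis/update?commit=true", xml_body
)
```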
Indexing MIS Data
MIS has structured data on the MIS server and structured files on the services servers, so we can index data in two ways:
– Data Import Handler on the MIS database
• This has the benefit of manageability, as it needs to be deployed only on the MIS servers, which are very few.
• We can import data incrementally (delta import).
– Script to import CSV files from services
• This increases the effort of managing and deploying scripts on the services servers.
• We need to implement partial import for DLRLOG data.
Indexing Bean
Solr can also import bean types for indexing. In Services we build a bean for every MT and DLR, and we can import them directly into Solr.
This could cause unnecessary load, as the API would index a bean per message.
Diagram (components: API 15, SOLR): index data call per MT and DLR, from API to SOLR.
Data Import Handler
We can import data into the Solr index from MySQL in two ways: distributed or central.
Diagram: two layouts of SOLR and MYSQL nodes, one distributed and one central.
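A Data Import Handler run is triggered over HTTP with a `command` parameter such as `full-import` or `delta-import`. The sketch below builds those URLs for one central Solr and for several distributed ones. It assumes the handler is registered at `/dataimport` in solrconfig.xml, and the host names and core name are hypothetical.

```python
from urllib.parse import urlencode

# Commands accepted by the Data Import Handler.
DIH_COMMANDS = {"full-import", "delta-import", "status", "reload-config", "abort"}


def build_dih_url(base_url, core, command, clean=False, commit=True):
    """Build the URL that triggers a Data Import Handler command."""
    if command not in DIH_COMMANDS:
        raise ValueError(f"unknown DIH command: {command}")
    params = {"command": command}
    if command in ("full-import", "delta-import"):
        # clean=false keeps existing documents; commit=true makes new ones searchable.
        params["clean"] = str(clean).lower()
        params["commit"] = str(commit).lower()
    return f"{base_url}/{core}/dataimport?{urlencode(params)}"


# Central layout: a single Solr imports from the MySQL servers.
central = build_dih_url("http://solr-central:8983/solr", "mis", "delta-import")

# Distributed layout: one Solr per MySQL server, each triggered separately.
hosts = ["solr-a", "solr-b"]
distributed = [
    build_dih_url(f"http://{host}:8983/solr", "mis", "delta-import") for host in hosts
]
```

A scheduler such as cron could request the delta-import URL periodically to keep the index close to the database.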
Import CSV
We can import data into the Solr index from each service, also in two ways: distributed or central.
Diagram: each service runs a SCRIPT that feeds SOLR, in either a distributed or a central layout.
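One way such a per-service script might keep track of which CSV files it has already imported is a plain-text ledger of file names. This is a minimal sketch under that assumption; the ledger format, the directory layout and the `send` callback (which would POST the bytes to Solr's CSV update handler) are all hypothetical.

```python
from pathlib import Path


def load_ledger(ledger_path):
    """Return the set of file names already recorded as processed."""
    path = Path(ledger_path)
    if not path.exists():
        return set()
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def pending_csv_files(csv_dir, ledger_path):
    """List CSV files in csv_dir that are not yet in the ledger, in name order."""
    done = load_ledger(ledger_path)
    return sorted(p for p in Path(csv_dir).glob("*.csv") if p.name not in done)


def import_pending(csv_dir, ledger_path, send):
    """Send each pending file, recording it only after send() returns."""
    imported = []
    for path in pending_csv_files(csv_dir, ledger_path):
        send(path.read_bytes())  # an exception here leaves the file pending
        with open(ledger_path, "a") as ledger:
            ledger.write(path.name + "\n")
        imported.append(path.name)
    return imported
```

Recording a file only after a successful send means a failed upload is retried on the next run, at the cost of a possible duplicate send if the script dies between the two steps.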
Challenges
Every import scenario trades its advantages against some disadvantages and challenges. For example:
– DIH: The Data Import Handler requires joins in the SQL query to import data from mttextlog, mtlog and dlrlog.
• Alternatively, we can get the message ids from Solr and query MySQL again for the complete data, using an IN clause on the message ids.
– CSV Import: It requires scripts to be deployed on every service server and a lot of bookkeeping of which files have or have not been processed.
– BEAN Import: It requires changes at the API level and could result into