Database indexing framework

This document outlines the design of a framework for indexing a database into a Solr index.

Published in: Technology

  1. Database Indexing Framework (Version 1.0)
  2. Overview
     Objective
       • To index database tables using Solr
     Requirements
       • To create search-index-relevant views (like database views) by collating data from multiple database tables
       • To convert data from the database into XML that can be posted to Solr
       • To enable incremental indexing
  3. Overview
     Possible approaches to creating search-index-relevant views
       • At the database level (creating database views)
         This involves creating database views based on search result requirements. For example, the message module and the shopping module have different search result requirements, so one view could probably cater to one module. These views contain only the columns relevant to search. The application layer gets prepared data directly from the database and only has the job of posting it to Solr.
       • At the database level (using procedures)
         This involves creating procedures to fetch the index-relevant data.
       • At the application layer
         In this approach the work of collating data from the various database tables is given to the application layer. It queries the relevant DB tables, collects the data, and posts it to Solr.
  4. Overview
     Possible approaches for incremental indexing
       • In real time (the push approach)
         The data is indexed as soon as it is entered into the database. Database listeners listen for changes in the database and queue the new and updated records to be indexed in a JMS queue. This queue is consumed by an indexing program that queries the database again, based on the primary keys in the JMS queue, to get the data, convert it to Solr XML, and post it to Solr.
       • As a batch process at regular intervals (the pull approach)
         Instead of indexing immediately, we fetch data from the database at a configurable regular interval. Unlike the real-time approach, the chances of failure here are minimal.
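The pull approach can be sketched as a query that picks up only the rows changed since the previous batch run. This is a minimal illustration, not the framework's actual mechanism: the slides do not say how new and updated records are detected, so the `last_modified` column, the `message_search` name, and the `fetch_changed_rows` helper are all assumptions, and an in-memory SQLite table stands in for the real database view.

```python
import sqlite3


def fetch_changed_rows(conn, view_name, since):
    """Return rows changed after the `since` watermark, oldest first.

    Assumes (hypothetically) that the view exposes a `last_modified`
    column holding ISO-8601 timestamps, which sort correctly as text.
    `view_name` must come from trusted configuration, not user input.
    """
    cur = conn.execute(
        f'SELECT * FROM "{view_name}" WHERE last_modified > ? '
        "ORDER BY last_modified",
        (since,),
    )
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


# A plain table stands in for the database view in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE message_search (
        id INTEGER PRIMARY KEY, subject TEXT, last_modified TEXT);
    INSERT INTO message_search VALUES (1, 'old', '2010-01-01T00:00:00');
    INSERT INTO message_search VALUES (2, 'new', '2010-01-02T12:00:00');
""")

last_run = "2010-01-02T00:00:00"  # watermark saved by the previous batch
changed = fetch_changed_rows(conn, "message_search", last_run)

# Advance the watermark so the next batch skips what was just fetched.
if changed:
    last_run = changed[-1]["last_modified"]
```

A timestamp watermark like this only sees rows that still exist, so records removed outright would need a separate deletion marker.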
  5. Incremental Indexing Process (the need for Database Views)
     The following slides discuss an incremental indexing approach that we thought would work well for our requirements. In this approach the search-index-relevant views are created using database views, and the indexing is done as a batch process rather than in real time.
     First we need to understand the need for the database views. When a term is searched for in the index, the results page shows some details and a summary of each result. For instant results, these details need to be stored in the index itself so that we don't have to hit the database just to display collated results on the results page.
     When creating the Solr index, it therefore doesn't make much sense to index all the tables individually, because each table has its own dependencies on child and parent tables. We would either have to recreate similar dependencies in the index or else design our indexes intelligently, keeping the search needs in mind. The latter involves creating appropriate joins across tables to fetch all the data relevant to a search result in one shot.
     The database view can do this job of collating data from the parent and child tables into a representation that exactly matches the requirements of the search index. This makes the job of the application layer hassle-free: it just picks everything from the view and indexes it as is.
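As an illustration of a view that collates parent and child tables into one search-ready row, here is a small sketch using an in-memory SQLite database. The `users` and `messages` tables, their columns, and the `message_search_view` name are hypothetical; the slides do not describe the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical parent and child tables.
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE messages (
        id INTEGER PRIMARY KEY,
        sender_id INTEGER REFERENCES users(id),
        subject TEXT,
        body TEXT
    );
    INSERT INTO users VALUES (1, 'alice');
    INSERT INTO messages VALUES (10, 1, 'Hello', 'First message body');

    -- The view joins the tables and exposes only search-relevant columns,
    -- so one row carries everything a search result needs to display.
    CREATE VIEW message_search_view AS
    SELECT m.id      AS message_id,
           m.subject AS subject,
           m.body    AS body,
           u.name    AS sender_name
    FROM messages m
    JOIN users u ON u.id = m.sender_id;
""")

# The application layer reads the view as is, with no joins of its own.
rows = conn.execute("SELECT * FROM message_search_view").fetchall()
```

Because the join lives in the view, a change to what a search result should display needs only a new view definition, not a change to the indexing code.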
  6. Incremental Indexing Process (the need for Batch indexing)
     Next we need to understand why the batch indexing process can work well for us. Most of our search requirements involve searching historic data. Only rarely would we search for data that has just been entered, and even these cases can be handled by setting the batch process interval to a very short time.
     The real-time indexing process can become quite expensive when a large amount of data is entered at short intervals. The batch process also gives us the flexibility of working on a copy of the database, which makes the whole indexing process an offline one.
  7. Incremental Indexing Batch Process (the flow)
     [Flow diagram. Components: Indexing Job Scheduler, Database Indexer (the controller class), Data Fetcher, Result Set to XML Converter, Index Manager, Database, SOLR. Components in green are explained in detail in the next slide.]
     Labelled arrows: (1) Indexing Job Name, (2) Database View Name, (3) Query, (4) Result Set, (5) Result Set, (6) Solr XML, (7) Solr XML, (8) Solr XML, (9) Solr XML
     Indexing Job - Trigger Config file (Indexing Job Schedules)
       • Trigger Time 1 - Indexing Job 1
       • Trigger Time 2 - Indexing Job 2
       • Trigger Time 3 - Indexing Job 3
     Indexing Job – Database View Mapping file
       More than one DB view might need to be indexed at the same time, so these can be grouped as an Indexing Job.
       • Indexing Job 1 – Database View1, Database View2, Database View3, Database View4
       • Indexing Job 2 – Database View5, Database View6
     DB View Column name to Solr field mapping
       • Database View 1: Column 1 - Solr Field 1, Column 2 - Solr Field 2, Column 3 - Solr Field 3
       • Database View 2: Column 1 - Solr Field 3, Column 2 - Solr Field 2
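The three configuration files in this slide could be represented in memory roughly as follows. This is only a sketch: the slides do not specify a file format, so the dictionary layout, the identifier spellings, the example start times and intervals, and the helpers `views_for_job`, `solr_field` and `next_run` are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Indexing Job - Trigger Config: each job points at a trigger that holds
# a start time and a repeat interval (example values are made up).
TRIGGERS = {
    "trigger_1": {"start": datetime(2010, 1, 1, 0, 0),
                  "interval": timedelta(minutes=5)},
    "trigger_2": {"start": datetime(2010, 1, 1, 0, 0),
                  "interval": timedelta(hours=1)},
}
JOB_TRIGGERS = {"indexing_job_1": "trigger_1", "indexing_job_2": "trigger_2"}

# Indexing Job - Database View Mapping: views that are indexed together.
JOB_VIEWS = {
    "indexing_job_1": ["database_view_1", "database_view_2",
                       "database_view_3", "database_view_4"],
    "indexing_job_2": ["database_view_5", "database_view_6"],
}

# DB View Column name to Solr field mapping, per view.
FIELD_MAPPING = {
    "database_view_1": {"column_1": "solr_field_1",
                        "column_2": "solr_field_2",
                        "column_3": "solr_field_3"},
    "database_view_2": {"column_1": "solr_field_3",
                        "column_2": "solr_field_2"},
}


def views_for_job(job_name):
    """Database views that belong to one indexing job."""
    return JOB_VIEWS[job_name]


def solr_field(view_name, column_name):
    """Solr field that a view column is indexed into."""
    return FIELD_MAPPING[view_name][column_name]


def next_run(job_name, now):
    """First scheduled run of the job at or after `now`."""
    trigger = TRIGGERS[JOB_TRIGGERS[job_name]]
    start, interval = trigger["start"], trigger["interval"]
    if now <= start:
        return start
    # Ceiling division: number of whole intervals needed to reach `now`.
    periods = -((start - now) // interval)
    return start + periods * interval
```

Keeping the job-to-view grouping separate from the trigger settings means a view can be moved to a faster or slower schedule by editing one mapping entry.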
  8. Incremental Indexing Batch Process (the components)
     An Indexing Job is defined as the indexing of the set of Database Views that need to be indexed at the same time and at the same interval.
     Triggers hold the timing information: the start time, the time interval, and other time-related details. When an Indexing Job is associated with a trigger, the job runs according to the start time and interval given in the trigger.
     The Indexing Job - Trigger Config file holds all Indexing Job schedules. It maps triggers to indexing jobs.
     The Indexing Job – Database View Mapping file defines the Indexing Jobs. It associates Database Views with each Indexing Job. If a database view such as the one for the messages module needs to be picked up at a shorter interval than the one for the shopping module, the two views will be part of different indexing jobs with different triggers.
     The Database Indexer acts as the controller of the database indexing process. It calls the Data Fetcher to get database records in XML format, which it then sends to the Index Manager for posting to Solr.
     The Data Fetcher communicates with the database to get all the new and updated records for a given database view, along with the records that have been marked for deletion. It then feeds this data to the Result Set to XML Converter to get it converted to the XML format that Solr recognizes.
     The Result Set to XML Converter is a utility class that converts database records to XML format. If a record is new or updated, it is put in the <add> tag. If it is marked for deletion, it is put in the <delete> tag. The converter picks up the Solr field names corresponding to the DB view column names from the DB View Column name to Solr field mapping file.
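A converter along the lines described above might look like the following sketch. It emits Solr's XML update messages, `<add><doc><field name="...">` for new or updated records and `<delete><id>` for deletions. How a record is "marked for deletion" is not stated in the slides, so the `is_deleted` flag column, the `id` key column, and the function name `rows_to_solr_xml` are assumptions.

```python
import xml.etree.ElementTree as ET


def rows_to_solr_xml(rows, field_mapping, id_column="id",
                     deleted_column="is_deleted"):
    """Convert row dicts into Solr XML update messages.

    `field_mapping` maps view column names to Solr field names.
    `deleted_column` is a hypothetical flag marking rows for deletion;
    `id_column` is assumed to hold the Solr unique key.
    Returns a list with an <add> message and/or a <delete> message.
    """
    add = ET.Element("add")
    delete = ET.Element("delete")
    for row in rows:
        if row.get(deleted_column):
            ET.SubElement(delete, "id").text = str(row[id_column])
            continue
        doc = ET.SubElement(add, "doc")
        for column, solr_name in field_mapping.items():
            value = row.get(column)
            if value is None:
                continue  # leave NULL columns out of the document
            ET.SubElement(doc, "field", name=solr_name).text = str(value)
    messages = []
    if len(add):
        messages.append(ET.tostring(add, encoding="unicode"))
    if len(delete):
        messages.append(ET.tostring(delete, encoding="unicode"))
    return messages


rows = [
    {"id": 1, "subject": "Hi & bye", "is_deleted": 0},
    {"id": 2, "subject": "Gone", "is_deleted": 1},
]
messages = rows_to_solr_xml(rows, {"id": "id", "subject": "title"})
```

Building the XML with ElementTree rather than string concatenation means characters such as `&` in the data are escaped automatically.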
  9. Incremental Indexing Batch Process (the flow)
     The indexing process is started by the Indexing Job Scheduler. An indexing job is triggered by the scheduler based on the settings of the trigger it is associated with in the Indexing Job - Trigger Config file. The Indexing Job Scheduler calls the Database Indexer, passing the name of the job to be done as an argument.
     The Database Indexer acts as the controller for the whole process. From the Indexing Job – Database View Mapping file, it picks up the names of the Database Views to be indexed for the Indexing Job sent by the scheduler. It then loops over the set of Database Views and calls the Data Fetcher for each view.
     The Data Fetcher hits the database with a query to get all the latest records from the view. The result set is sent to the Result Set to XML Converter, which returns the Solr XML. This Solr XML is sent back to the Database Indexer, which in turn sends it to the Index Manager for posting to Solr.
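The controller role of the Database Indexer in this flow can be sketched as a small class that only wires the other components together. The class shape, the method name `run_job`, and the stand-in fetcher and index manager below are assumptions for illustration; a real Data Fetcher would query the database, and a real Index Manager would post to Solr.

```python
class DatabaseIndexer:
    """Controller: resolves a job to its views and drives the pipeline."""

    def __init__(self, job_views, fetch_solr_xml, post_to_solr):
        self.job_views = job_views            # job name -> list of view names
        self.fetch_solr_xml = fetch_solr_xml  # Data Fetcher entry point
        self.post_to_solr = post_to_solr      # Index Manager entry point

    def run_job(self, job_name):
        """Called by the scheduler with the name of the job to run."""
        for view_name in self.job_views[job_name]:
            solr_xml = self.fetch_solr_xml(view_name)  # fetch and convert
            self.post_to_solr(solr_xml)                # hand over for posting


# Stand-ins so the sketch runs without a database or a Solr server.
def fake_fetch(view_name):
    return f'<add><doc><field name="view">{view_name}</field></doc></add>'


posted = []  # records what would have been posted to Solr
indexer = DatabaseIndexer(
    {"indexing_job_1": ["database_view_1", "database_view_2"]},
    fake_fetch,
    posted.append,
)
indexer.run_job("indexing_job_1")
```

Passing the fetcher and the index manager in as callables keeps the controller free of database and HTTP details, which also makes it easy to exercise in isolation.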
  10. [Flow diagram, repeated with component details. Labelled arrows: (1) Indexing Job Name, (2) Database View Name, (3) View Query, (4) Result Set, (5) Result Set, (6) Solr XML, (7) Solr XML, (8) Solr XML, (9) Solr XML. Other elements: Database, SOLR, Index Manager, Job - Trigger Config file (Indexing Job Schedules), Indexing Job to Database Views mapping file, DB View Column name to Solr field mapping.]
      Indexing Job Scheduler
        • Triggers the Database Indexer with an Indexing Job based on the trigger times in the Job - Trigger Config file
      Database Indexer (@parameter Indexing Job)
        • Picks up the list of Database Views corresponding to an Indexing Job from the Indexing Job – Database View Mapping file
        • Loops over the Database View names and calls the Data Fetcher (@parameter Database View Name) for each view to get back the corresponding result set
        • Sends the Solr XML to the Index Manager
      Data Fetcher (@parameter Database View Name)
        • Fires a generic "Select * from View Name" query to get the relevant data
        • Calls the Result Set to XML Converter (@parameter Result Set) for each result set to get the Solr XML for that result set
      Result Set to XML Converter (@parameter Result Set)
        • Loops over the Result Set
        • Based on the Result Set metadata, gets the corresponding Solr field names from the DB View Column name to Solr field mapping
        • Creates the Solr XML file for the result set
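The Data Fetcher steps listed above, a generic "select everything" query followed by a metadata-driven lookup of Solr field names, could be sketched like this. SQLite's `cursor.description` plays the role of the result set metadata here; the schema, the `message_search_view` name, the mapping contents, and the function name are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical column-to-field mapping for one view.
FIELD_MAPPING = {
    "message_search_view": {"message_id": "id", "subject": "title"},
}


def fetch_view_as_solr_xml(conn, view_name, field_mapping):
    """Run a generic SELECT * on the view and build a Solr <add> message.

    Column names are read from the cursor metadata, so the same code
    works for any view that has an entry in `field_mapping`.
    `view_name` must come from trusted configuration.
    """
    cur = conn.execute(f'SELECT * FROM "{view_name}"')
    columns = [d[0] for d in cur.description]  # result set metadata
    mapping = field_mapping[view_name]
    add = ET.Element("add")
    for row in cur:
        doc = ET.SubElement(add, "doc")
        for column, value in zip(columns, row):
            # Unmapped columns and NULL values are not indexed.
            if column in mapping and value is not None:
                field = ET.SubElement(doc, "field", name=mapping[column])
                field.text = str(value)
    return ET.tostring(add, encoding="unicode")


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE messages (id INTEGER PRIMARY KEY, subject TEXT, internal TEXT);
    INSERT INTO messages VALUES (10, 'Hello', 'not for search');
    CREATE VIEW message_search_view AS
    SELECT id AS message_id, subject, internal FROM messages;
""")
solr_xml = fetch_view_as_solr_xml(conn, "message_search_view", FIELD_MAPPING)
```

Since nothing in the function is specific to one view, adding a new view to the index would only require a new mapping entry.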
