Integrating Lucene into a Transactional XML Database
Presented by Petr Pleshachkov, EMC - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

In this talk we present an integration of the Lucene search engine with the EMC Documentum xDB database (a native XML database). We introduce a new approach, implemented in xDB 10.3, which integrates the Lucene index (used to optimize XQuery queries) into the transactional xDB engine at the storage level. That is, Lucene files are stored in xDB data pages instead of the file system as in earlier releases, and Lucene accesses all of its files through the xDB buffer pool instead of just the operating system buffer cache. This approach simplifies the implementation of traditional database features for Lucene within xDB, such as transaction isolation, rollbacks, recovery after database crashes, snapshot construction, replication, hot backups, and buffer management. We cover a performance analysis of the new approach for queries and ingest operations, performance tuning tips, and future optimization techniques in this area. The presentation is intended as a description of an implementation and its performance analysis.

Transcript

  • 1. Integrating Lucene search engine into a transactional XML database
      Boston, May 7-10, 2012
      Petr Pleshachkov, EMC, petr.pleshachkov@emc.com, May 9, 2012
      © Copyright 2012 EMC Corporation. All rights reserved.
  • 2. My Background
      • Petr Pleshachkov, Principal Software Engineer
      • xDB/xPlore team in Rotterdam
          - My site: EMC Netherlands
          - Other xPlore/xDB sites: Pleasanton (California), Shanghai (China), and Grenoble (France)
      • Areas of expertise:
          - Semistructured data management
          - Databases: transaction management, query optimization, full-text search
      • Academia & Research:
          - PhD in Computer Science, ISP RAS
  • 3. Agenda
      • Overview of EMC Documentum xDB/xPlore
      • Integration of Lucene into xDB
      • xDB transaction model & Lucene transaction management
      • Performance analysis
      • Future optimizations
  • 4. Introducing Documentum xPlore
      • EMC Documentum is a leading supplier of Enterprise Content Management software
      • xPlore provides "Integrated Search" for Documentum
          - but is built as a standalone search engine to replace FAST Instream
          - Widely deployed across Documentum environments worldwide (70+ countries)
      • The xPlore search engine is built over EMC xDB, Lucene, and leading content extraction and linguistic analysis software
  • 5. Key values that xDB brings to xPlore
      Why build a search engine over an XML database?
      • Flexible, hierarchical query & data models
      • Joins
      • High-throughput, low-latency indexing: see documents within seconds after saving
      • Leverage B-tree indexes when appropriate: Lucene doesn't fit all uses
      • Rich, innovative query language
      • Enterprise, single unified database
  • 6. Documentum xDB
      • Formerly the X-Hive database
          - 100% Java
          - XML stored in persistent DOM format
              ▪ Each XML node can be located through a 64-bit identifier
              ▪ Structure mapped to pages
              ▪ Easy to operate on GB-sized XML files
      • Fully transactional database
      • Query language: XQuery
      • Indexing & optimization
          - Palette of index options the optimizer can pick from
          - At its simplest: indexLookup(key) -> node id
      • Backup/restore, scalability, multi-node architecture
  • 7. xDB Data Storage Model
      An XML document can be thought of as a collection of elements and attributes (or "XML nodes"). This node structure can be represented as a tree: the DOM model.
      [Diagram: a tree of nodes A, B, C, D, E, and the same nodes A B C D E laid out on a database page]
  • 8. Libraries & Indexes
      The scope of an index covers all XML files in all sub-libraries.
      [Diagram: libraries labeled A, B, C. Legend: xDB library (X-Hive library), xDB index (X-Hive index), xDB XML file (X-Hive XML file)]
  • 9. Lucene Integration
      • Both value and full-text queries supported
          - XML SubPaths mapped into Lucene fields
          - Tokenized and value-based indexes available
      • Composite key queries supported
          - A Lucene index is much more flexible than B-tree composite indexes
  • 10. Multipath Index Definition
      <PLAY>
        <ACT>
          <SCENE>
            <SPEECH>
              <SPEAKER>BRUTUS</SPEAKER>
              <LINE>I am not gamesome: I do lack some part</LINE>
            </SPEECH>
            <SPEECH>
              <SPEAKER>CASSIUS</SPEAKER>
              <LINE>Then, Brutus, I have much mistook your passion;</LINE>
              <LINE>By means whereof this breast of mine hath buried</LINE>
              <LINE>Thoughts of great value, worthy cogitations.</LINE>
            </SPEECH>
          </SCENE>
        </ACT>
      </PLAY>
      INDEX ROOT PATH: //SPEECH
      SubPath1: (/SPEAKER, VALUE_COMPARISON)
      SubPath2: (//LINE, FULL_TEXT_SEARCH)
  • 11. Lucene Query Mapping
      XQuery:
          for $SPEECH score $s in collection('col1')//SPEECH[SPEAKER='BRUTUS' and //LINE contains text 'lack']
          order by $s
          return $SPEECH
      Lucene query:
          BooleanQuery(TermQuery1, TermQuery2, BooleanClause.Occur.MUST)
          TermQuery1 = TermQuery(new Term('/speaker_field', 'BRUTUS'))
          TermQuery2 = TermQuery(new Term('//line_field', 'lack'))
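The mapping on slides 10 and 11 can be illustrated with a small, dependency-free model. This is only a sketch, not the xDB or Lucene implementation: `build_index` and `must_all` are hypothetical helpers, the inverted index is a plain dictionary, and the lowercasing tokenizer is an assumption. Only the field names and the two SubPath options come from the slides.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

PLAY = """<PLAY><ACT><SCENE>
<SPEECH><SPEAKER>BRUTUS</SPEAKER>
<LINE>I am not gamesome: I do lack some part</LINE></SPEECH>
<SPEECH><SPEAKER>CASSIUS</SPEAKER>
<LINE>Then, Brutus, I have much mistook your passion;</LINE></SPEECH>
</SCENE></ACT></PLAY>"""

def build_index(xml_text):
    """Index root path //SPEECH: each SPEECH becomes one indexed document."""
    postings = defaultdict(set)  # (field, term) -> set of document ids
    for doc_id, speech in enumerate(ET.fromstring(xml_text).iter("SPEECH")):
        # SubPath1 (/SPEAKER, VALUE_COMPARISON): the whole value is one term.
        for speaker in speech.findall("SPEAKER"):
            postings[("/speaker_field", speaker.text)].add(doc_id)
        # SubPath2 (//LINE, FULL_TEXT_SEARCH): tokenized (assumed lowercased).
        for line in speech.iter("LINE"):
            for token in re.findall(r"\w+", line.text.lower()):
                postings[("//line_field", token)].add(doc_id)
    return postings

def must_all(postings, *terms):
    """Conjunction of term queries, i.e. every clause has Occur.MUST."""
    result = None
    for term in terms:
        hits = postings.get(term, set())
        result = hits if result is None else result & hits
    return result or set()

index = build_index(PLAY)
hits = must_all(index, ("/speaker_field", "BRUTUS"), ("//line_field", "lack"))
print(hits)  # the BRUTUS speech is document 0
```

The value field keeps `BRUTUS` as a single exact term, while the full-text field is tokenized, which is why the word "Brutus" inside a CASSIUS line does not match the speaker condition.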
  • 12. Lucene SubIndexes
      • Each user transaction creates a separate Lucene subIndex
      • A transaction performs all of its updates in its own subIndex
      • The delete operation does not physically touch subIndexes created by other transactions
      • A pair (minLSN, maxLSN) is associated with each subIndex and is used to construct a global index snapshot
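One way to picture snapshot construction from the (minLSN, maxLSN) pairs is sketched below. The slides do not state the visibility rule, so the one used here is an assumption: a subIndex belongs to a snapshot if its maxLSN is not later than the LSN at which the snapshot was taken. The `SubIndex` class and `snapshot` function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubIndex:
    name: str
    min_lsn: int  # assumed: LSN of the oldest change contained in the subIndex
    max_lsn: int  # assumed: LSN of the newest change contained in the subIndex

def snapshot(sub_indexes, snapshot_lsn):
    """Return the subIndexes that make up the index view at snapshot_lsn.

    Assumed rule (not from the slides): a subIndex is visible when all of
    its changes were made at or before the snapshot LSN.
    """
    visible = [s for s in sub_indexes if s.max_lsn <= snapshot_lsn]
    return sorted(visible, key=lambda s: s.min_lsn)

sub_indexes = [
    SubIndex("si_1", 10, 20),
    SubIndex("si_2", 25, 40),
    SubIndex("si_3", 30, 35),
]
view = [s.name for s in snapshot(sub_indexes, 36)]
print(view)  # si_2 is excluded because it extends past LSN 36
```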
  • 13. Blacklists
      • The delete operation of a transaction:
          - Physically deletes the document from the transaction's own subIndex
          - Adds a pair (subIndexMinLSN, NODE_ID) to the blacklist structure
      • The persistent blacklist structure is represented as an xDB B-tree index with key = subIndexMinLSN, value = NODE_ID
      • Periodically, a merge operation merges small subIndexes into a bigger one and physically deletes documents
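The blacklist mechanism can be sketched as follows. This is a simplified model under stated assumptions: a Python set stands in for the persistent B-tree, documents are plain strings, and the `Index`, `SubIndex`, `delete`, `search`, and `merge` names are hypothetical. It reads the slide as "delete physically in the transaction's own subIndex, blacklist in everyone else's".

```python
class SubIndex:
    def __init__(self, min_lsn, docs):
        self.min_lsn = min_lsn
        self.docs = dict(docs)  # node_id -> text

class Index:
    def __init__(self):
        self.sub_indexes = []
        self.blacklist = set()  # stand-in for the B-tree: (min_lsn, node_id)

    def delete(self, own, node_id):
        """Delete node_id on behalf of the transaction owning subIndex `own`."""
        for sub in self.sub_indexes:
            if node_id not in sub.docs:
                continue
            if sub is own:
                del sub.docs[node_id]  # physical delete in own subIndex
            else:
                self.blacklist.add((sub.min_lsn, node_id))  # logical delete

    def search(self, word):
        """Return matching node ids, filtering out blacklisted entries."""
        return sorted(
            node_id
            for sub in self.sub_indexes
            for node_id, text in sub.docs.items()
            if word in text.split()
            and (sub.min_lsn, node_id) not in self.blacklist
        )

    def merge(self):
        """Merge all subIndexes into one and drop blacklisted documents."""
        if not self.sub_indexes:
            return
        merged = {}
        for sub in self.sub_indexes:
            for node_id, text in sub.docs.items():
                if (sub.min_lsn, node_id) not in self.blacklist:
                    merged[node_id] = text
        min_lsn = min(s.min_lsn for s in self.sub_indexes)
        self.sub_indexes = [SubIndex(min_lsn, merged)]
        self.blacklist.clear()

idx = Index()
older = SubIndex(10, {1: "lack some part", 2: "much mistook"})
newer = SubIndex(20, {3: "lack of passion"})
idx.sub_indexes = [older, newer]

idx.delete(newer, 1)            # node 1 lives in another subIndex: blacklisted
after_delete = idx.search("lack")
still_stored = 1 in older.docs  # the other subIndex was not touched
idx.merge()                     # merge removes node 1 for real
```

Deferring the physical delete to the merge keeps each transaction's writes confined to its own subIndex, which is the property slide 12 relies on.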
  • 14. xDB transaction management
      • ARIES-based ACID transactions
          - Every page has a Log Sequence Number (pageLSN)
          - The buffer manager tracks dirty pages using RecLSNs
          - ALL updates are logged on a per-page basis, including updates performed during rollbacks
          - An asynchronous thread periodically runs the checkpoint procedure
          - The recovery procedure:
              ▪ Repeat history: redo all the updates since the last successful checkpoint
              ▪ Undo incomplete transactions
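The two recovery steps named on this slide can be sketched in a few lines. This is a heavily simplified illustration of the ARIES idea, not xDB's recovery code: the record layout and the `recover` function are hypothetical, the whole log is scanned instead of starting at a checkpoint, and the compensation log records that a real undo pass writes are omitted.

```python
def recover(pages, log):
    """Bring `pages` (the on-disk state after a crash) to a consistent state.

    pages: page_id -> {"lsn": pageLSN, "value": content}
    log:   records in LSN order; updates carry before/after images.
    """
    committed = {r["txn"] for r in log if r["type"] == "commit"}

    # Redo pass: repeat history. The pageLSN tells us whether the page
    # already contains a given update, so it is applied at most once.
    for r in log:
        if r["type"] == "update" and r["lsn"] > pages[r["page"]]["lsn"]:
            pages[r["page"]] = {"lsn": r["lsn"], "value": r["after"]}

    # Undo pass: roll back transactions that never committed, newest first.
    # (Simplification: no compensation records are logged here.)
    for r in reversed(log):
        if r["type"] == "update" and r["txn"] not in committed:
            pages[r["page"]]["value"] = r["before"]
    return pages

log = [
    {"lsn": 1, "type": "update", "txn": "T1", "page": "A", "before": "a0", "after": "a1"},
    {"lsn": 2, "type": "update", "txn": "T2", "page": "B", "before": "b0", "after": "b1"},
    {"lsn": 3, "type": "commit", "txn": "T1"},
    {"lsn": 4, "type": "update", "txn": "T2", "page": "A", "before": "a1", "after": "a2"},
]
# Page A was flushed after LSN 1; page B never reached disk.
disk = {"A": {"lsn": 1, "value": "a1"}, "B": {"lsn": 0, "value": "b0"}}
result = recover(disk, log)
```

After recovery, page A holds the committed value written by T1 and page B is back to its original content, because T2 never committed.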
  • 15. xDB transaction isolation
      • READ_WRITE transactions follow the two-phase-locking rule:
          - Expanding phase: locks are acquired and no locks are released
          - Shrinking phase: locks are released and no locks are acquired
      • READ_ONLY transactions do not acquire any locks!
          - The data snapshot at the moment the transaction starts is used
          - Using log records, recent changes are undone at the page level
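Both isolation modes can be illustrated with a minimal sketch. The `TwoPhaseTxn` class only enforces the ordering rule of two-phase locking and ignores conflicts between transactions, and `read_snapshot` assumes a per-page list of (lsn, before, after) update records. Both names and the record layout are hypothetical.

```python
class TwoPhaseTxn:
    """READ_WRITE transaction that enforces the two-phase rule."""

    def __init__(self):
        self.held = set()
        self.shrinking = False

    def acquire(self, resource):
        if self.shrinking:
            raise RuntimeError("2PL violation: acquire after a release")
        self.held.add(resource)

    def release(self, resource):
        self.shrinking = True  # the first release starts the shrinking phase
        self.held.discard(resource)

def read_snapshot(page, page_log, snapshot_lsn):
    """READ_ONLY access: return the page content as of snapshot_lsn, lock-free.

    page_log holds this page's updates in LSN order as (lsn, before, after).
    Changes newer than the snapshot are undone using their before-images.
    """
    value = page["value"]
    for lsn, before, _after in reversed(page_log):
        if lsn <= snapshot_lsn:
            break
        value = before
    return value

page = {"lsn": 30, "value": "v3"}
page_log = [(10, "v0", "v1"), (20, "v1", "v2"), (30, "v2", "v3")]
old_view = read_snapshot(page, page_log, 15)      # reader started at LSN 15
current_view = read_snapshot(page, page_log, 30)  # reader started at LSN 30
```

The reader works on a private reconstructed value, so writers holding locks on the page are never blocked by it.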
  • 16. How to integrate Lucene into a transactional xDB database?
      • Old solution (xDB 10.1/10.2 releases)
          - All Lucene files are stored in a separate directory
          - A new transaction model for Lucene indexes is implemented
          - Lucene does not use the xDB buffer pool
          - Backup/restore and replication do not use xDB mechanisms
      • New solution (xDB 10.3)
          - All Lucene files are stored in an xDB data segment
          - The xDB transaction model is used, since all the updates go through xDB data pages
          - Backup/restore and replication are supported automatically
  • 17. Lucene Index Access Model
      • New LIDirectoryImpl class is implemented (extends Directory class)
      • LIDirectory class stores all files in xDB blob objects
      • LIIndexInput class extends BufferedIndexInput
          - void readInternal(byte[] b, int offset, int len)
              ▪ Reads data from the blob
              ▪ The blob object is buffered on the xDB buffer management level
      • LIIndexOutput class extends BufferedIndexOutput
          - void flushBuffer(byte[] b, int offset, int len)
              ▪ Writes Lucene data to the blob object
              ▪ The operation is logged automatically on the buffer manager level
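The access model can be pictured with a small stand-in for the buffer pool and a blob-backed directory. This is a sketch under assumptions, not the LIDirectoryImpl code: `BufferPool`, `BlobDirectory`, and the 8-byte page size are hypothetical, and the method signatures are simplified compared to Lucene's `readInternal(byte[] b, int offset, int len)` and `flushBuffer(byte[] b, int offset, int len)`.

```python
PAGE_SIZE = 8  # tiny pages so that the example spans more than one page

class BufferPool:
    """Stand-in for the xDB buffer pool: caches pages and logs every write."""

    def __init__(self):
        self.disk = {}   # (blob, page_no) -> bytes
        self.cache = {}
        self.log = []    # (blob, page_no, new_bytes)
        self.misses = 0

    def read_page(self, blob, page_no):
        key = (blob, page_no)
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.disk.get(key, b"")
        return self.cache[key]

    def write_page(self, blob, page_no, data):
        self.log.append((blob, page_no, data))  # logged at buffer level
        self.cache[(blob, page_no)] = data

class BlobDirectory:
    """Stores each index file as a blob made of buffer-pool pages."""

    def __init__(self, pool):
        self.pool = pool
        self.lengths = {}  # file name -> length in bytes

    def flush_buffer(self, name, data):
        """Append data to the file (the role of LIIndexOutput.flushBuffer)."""
        pos = self.lengths.get(name, 0)
        while data:
            page_no, offset = divmod(pos, PAGE_SIZE)
            chunk = data[:PAGE_SIZE - offset]
            existing = self.pool.read_page(name, page_no)[:offset]
            self.pool.write_page(name, page_no, existing + chunk)
            pos += len(chunk)
            data = data[len(chunk):]
        self.lengths[name] = pos

    def read_internal(self, name, offset, length):
        """Read bytes from the file (the role of LIIndexInput.readInternal)."""
        if offset + length > self.lengths.get(name, 0):
            raise EOFError("read past end of file")
        out = bytearray()
        while length > 0:
            page_no, in_page = divmod(offset, PAGE_SIZE)
            chunk = self.pool.read_page(name, page_no)[in_page:in_page + length]
            out += chunk
            offset += len(chunk)
            length -= len(chunk)
        return bytes(out)

pool = BufferPool()
directory = BlobDirectory(pool)
directory.flush_buffer("_0.fdt", b"hello lucene")
directory.flush_buffer("_0.fdt", b"!")
word = directory.read_internal("_0.fdt", 6, 6)
```

Because every write goes through `write_page`, the directory itself needs no logging or recovery code, which mirrors the slide's point that logging happens automatically at the buffer manager level.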
  • 18. Lucene Index Access Model (cont'd)
      [Diagram labels: Queries, Indexer; IndexReader, IndexWriter; LIDirectoryImpl; LIIndexInput (readInternal), LIIndexOutput (flushBuffer); Lucene caches; buffered data pages; Lucene blob objects]
  • 19. Lucene SubIndex Storage Model
      [Diagram labels: Directory page; LIDirectoryStore; LiFileEntryStore (two); BlobStore page (two); Blob Tail (two); Blob pages (six)]
  • 20. Lucene Index Master Record (MIR)
      • Tracks information about all subIndexes and their state
      • Represented as a concurrent B-tree index
      • Used to construct the Lucene index view
      • Updated concurrently by ingest transactions and merging/cleaning tasks
      • Asynchronous tasks periodically merge subIndexes into a bigger one
      [Diagram labels: SI_1, SI_2, SI_3, …, SI_N; Directory object; Blob objects]
  • 21. Ingest performance analysis (in seconds)
      [Bar chart: ingest of 10000, 50000 and 100000 docs; series: xDB 10.3 (pre-release) and xDB 10.2. Data labels per group: 180.956 and 205.068; 1009.459 and 1015.937; 2526.601 and 2149.636]
  • 22. Query performance analysis (response time in ms)
      [Bar chart: Q1 series (queries with range and 3 value comparison conditions) and Q2 series (queries with full-text and 2 value-comparison conditions); series: xDB 10.3 (pre-release) and xDB 10.2. Data labels: 14.013, 10.08, 7.713, 7.088]
  • 23. Future optimizations
      • Reduce the number of separate subIndexes
      • Final/NonFinal merge optimizations
      • Advanced buffer management techniques
      • Concurrent Lucene MultiPath Index