I/O & virtualization performance with a search engine based on an xml database & lucene - By Ed Bueche

See conference video - http://www.lucidimagination.com/devzone/events/conferences/revolution/2011

Published in: Technology
  • Prior to VMs, the CPU and memory resources of a server were dedicated to a single application, so resource planners didn’t worry much about inter-system contention. Resource planning was simpler, but expensive in unused capacity. VMware achieves significant cost reduction by allowing applications to share unused capacity. Averaged over the day, the Lucene CPU consumption might be low. However, if the capacity is not available when Lucene needs it, response delays will result.

    1. 1. A Study of I/O and Virtualization Performance with a Search Engine based on an XML database and Lucene Ed Bueché, EMC edward.bueche@emc.com, May 25, 2011
    2. 2. Agenda <ul><li>My Background </li></ul><ul><li>Documentum xPlore Context and History </li></ul><ul><li>Overview of Documentum xPlore </li></ul><ul><li>Tips and Observations on IO and Host Virtualization </li></ul>
    3. 3. My Background <ul><li>Ed Bueché </li></ul><ul><li>Information Intelligence Group within EMC </li></ul><ul><li>EMC Distinguished Engineer & xPlore Architect </li></ul><ul><li>Areas of expertise </li></ul><ul><ul><li>Content Management (especially performance & scalability) </li></ul></ul><ul><ul><li>Database (SQL and XML) and Full text search </li></ul></ul><ul><ul><li>Previous experience: Sybase and Bell Labs </li></ul></ul><ul><li>Part of the EMC Documentum xPlore development team </li></ul><ul><ul><li>Pleasanton (CA), Grenoble (France), Shanghai, and Rotterdam (Netherlands) </li></ul></ul>
    4. 4. Documentum search 101 <ul><li>Documentum Content Server provides an “object/relational” data model and query language </li></ul><ul><ul><li>Object metadata called “attributes” (sample: title, subject, author) </li></ul></ul><ul><ul><li>Sub-types can be created with customer-defined attributes </li></ul></ul><ul><ul><li>Documentum Query Language (DQL) </li></ul></ul><ul><ul><li>Example: </li></ul></ul><ul><ul><ul><li>SELECT object_name FROM foo </li></ul></ul></ul><ul><ul><ul><li>WHERE subject = ‘bar’ AND customer_id = ‘ID1234’ </li></ul></ul></ul><ul><li>DQL also supports full text extensions </li></ul><ul><ul><li>Example: </li></ul></ul><ul><ul><ul><li>SELECT object_name FROM foo </li></ul></ul></ul><ul><ul><ul><li>SEARCH DOCUMENT CONTAINS ‘hello world’ </li></ul></ul></ul><ul><ul><ul><li>WHERE subject = ‘bar’ AND customer_id = ‘ID1234’ </li></ul></ul></ul>
    5. 5. Introducing Documentum xPlore <ul><li>Provides ‘Integrated Search’ for Documentum </li></ul><ul><ul><li>but is built as a standalone search engine to replace FAST Instream </li></ul></ul><ul><li>Built over EMC xDB, Lucene, and leading content extraction and linguistic analysis software </li></ul>
    6. 6. <ul><li>Documentum Search </li></ul><ul><li>History-at-a-glance </li></ul><ul><li>almost 15 years of Structured/Unstructured integrated search </li></ul>1996 2005 2010 <ul><li>Verity Integration 1996 – 2005 </li></ul><ul><li>Basic full text search through DQL </li></ul><ul><li>Basic attribute search </li></ul><ul><li>1 day → 1 hour latency </li></ul><ul><li>Embedded implementation </li></ul><ul><li>FAST Integration 2005 – 2011 </li></ul><ul><li>Combined structured / unstructured search </li></ul><ul><li>2 – 5 min latency </li></ul><ul><li>Score ordered results </li></ul><ul><li>xPlore Integration 2010 - ??? </li></ul><ul><li>Replaces FAST in DCTM </li></ul><ul><li>Integrated security </li></ul><ul><li>Deep facet computation </li></ul><ul><li>HA/DR improvements </li></ul><ul><li>Latency: typically seconds </li></ul><ul><li>Improved Administration </li></ul><ul><li>Virtualization Support </li></ul>
    7. 7. Enhancing Documentum Deployments with Search <ul><li>Without Full Text in a Documentum deployment a DQL query will be directed to the RDBMS </li></ul><ul><ul><li>DQL is translated into SQL </li></ul></ul><ul><li>However, relational querying has many limitations…. </li></ul>Content Server DCTM client DQL SQL RDBMS search
    8. 8. Enhancing Documentum Deployments with Search <ul><li>DQL for search can be directed to the full text engine instead of RDBMS (FTDQL) </li></ul><ul><li>This allows query to be serviced by xPlore </li></ul><ul><li>In this case DQL is translated into xQuery (the query language of xPlore / xDB) </li></ul>Content Server Documentum client DQL SQL xQuery RDBMS Metadata + content search
    9. 9. Some Basic Design Concepts behind Documentum xPlore <ul><li>Inverted Indexes are not optimized for all use-cases </li></ul><ul><ul><li>B+-tree indexes can be far more efficient for simple, low-latency/highly dynamic scenarios </li></ul></ul><ul><li>De-normalization can’t efficiently solve all problems </li></ul><ul><ul><li>Update propagation problem can be deadly </li></ul></ul><ul><ul><li>Joins are a necessary part of most applications </li></ul></ul><ul><li>Applications need fine control over not only search criteria, but also result sets </li></ul>
    10. 10. Design concepts (cont’d) <ul><li>Applications need fluid, changing metadata schemas that can be efficiently queried </li></ul><ul><ul><li>Adding metadata through joins with side-tables can be inefficient to query </li></ul></ul><ul><li>Users want the power of Information Retrieval on their structured queries </li></ul><ul><li>Data Management, HA, DR shouldn’t be an afterthought </li></ul><ul><li>When possible, operate within standards </li></ul><ul><li>Lucene is not a database. Most Lucene applications deploy with databases. </li></ul>
    11. 11. Lessons Learned… Structured Query use-cases Unstructured Query use-cases Fit to use-case
    12. 12. Indexes, DB, and IR Structured Query use-cases Unstructured Query use-cases Fit to use-case Scoring, Relevance, Entities Hierarchical data representations (XML) Full Text searches Constantly changing schemas Relational DB technology
    13. 13. Indexes, DB, and IR Structured Query use-cases Unstructured Query use-cases Fit to use-case Meta data query Transactions Advanced data management (partitions) JOINs Full Text index technology
    14. 14. Indexes, DB, and IR Structured Query use-cases Unstructured Query use-cases Fit to use-case Relational DB technology Full Text index technology
    15. 15. Documentum xPlore <ul><li>Brings together a best-of-breed XML database and the powerful Apache Lucene full-text engine </li></ul><ul><li>Provides structured and unstructured search leveraging XML and XQuery standards </li></ul><ul><li>Designed for enterprise readiness, scalability, and ingestion </li></ul><ul><li>Advanced Data Management functionality necessary for large scale systems </li></ul><ul><li>Industry leading linguistic technology and comprehensive format filters </li></ul><ul><li>Metrics and Analytics </li></ul>xDB Transaction, Index & Page Management xDB Query Processing & Optimization xDB API xPlore API Search Services Node & Data Management Services Indexing Services Admin Services Content Processing Services Analytics
    16. 16. EMC xDB: Native XML database <ul><li>Formerly XHive database </li></ul><ul><ul><li>100% Java </li></ul></ul><ul><ul><li>XML stored in “persistent DOM” format </li></ul></ul><ul><ul><ul><li>Each XML node can be located through a 64 bit identifier </li></ul></ul></ul><ul><ul><ul><li>Structure mapped to pages </li></ul></ul></ul><ul><ul><ul><li>Easy to operate on GB XML files </li></ul></ul></ul><ul><ul><li>Full Transactional Database </li></ul></ul><ul><ul><li>Query Language: XQuery with full text extensions </li></ul></ul><ul><li>Indexing & Optimization </li></ul><ul><ul><li>Palette of index options the optimizer can pick from </li></ul></ul><ul><ul><li>At its simplest: indexLookup(key) → node id </li></ul></ul>
    17. 17. Libraries / Collections & Indexes = xDB segment = xDB Library / xPlore collection = xDB Index = xDB xml file ( dftxml , tracking xml, status, metrics, audit)
    18. 18. Lucene Integration <ul><li>Transactional </li></ul><ul><ul><li>Non-committed index updates in separate (typically in memory) lucene indexes </li></ul></ul><ul><ul><li>Recently committed (but dirty) indexes backed by xDB log </li></ul></ul><ul><ul><li>Query to “index” leverages Lucene multi-searcher with filter to apply update/delete blacklisting </li></ul></ul><ul><li>Lucene indexes managed to fit into xDB’s ARIES-based recovery mechanism </li></ul><ul><li>No changes to Lucene </li></ul><ul><ul><li>Goal: no obstacles to be as current as possible </li></ul></ul>
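The transactional query path on this slide (committed index plus a separate index of uncommitted updates, with a filter that blacklists updated or deleted documents) can be illustrated with a toy model. This is only a sketch of the idea in plain Python: it does not use Lucene or xDB, and the names `ToyIndex`, `ToyMultiSearcher`, `update`, and `delete` are hypothetical, not xPlore APIs.

```python
from collections import defaultdict


class ToyIndex:
    """A tiny inverted index: term -> set of doc ids (stand-in for a Lucene index)."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def remove(self, doc_id):
        for ids in self.postings.values():
            ids.discard(doc_id)

    def search(self, term):
        return set(self.postings.get(term.lower(), set()))


class ToyMultiSearcher:
    """Searches a committed index plus an in-memory delta of uncommitted changes.

    Documents that were updated or deleted in the current transaction are
    blacklisted so that their stale committed versions never match.
    """

    def __init__(self, committed):
        self.committed = committed
        self.delta = ToyIndex()   # uncommitted updates (typically in memory)
        self.blacklist = set()    # doc ids whose committed version is stale

    def update(self, doc_id, text):
        self.blacklist.add(doc_id)
        self.delta.remove(doc_id)  # drop any earlier uncommitted version
        self.delta.add(doc_id, text)

    def delete(self, doc_id):
        self.blacklist.add(doc_id)
        self.delta.remove(doc_id)

    def search(self, term):
        visible_committed = self.committed.search(term) - self.blacklist
        return visible_committed | self.delta.search(term)
```

In this model the committed index is never modified by a transaction, which is what allows the underlying engine to be used unchanged.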
    19. 19. Lucene Integration (cont’d) <ul><li>Both value and full text queries supported </li></ul><ul><ul><li>XML elements mapped to Lucene fields </li></ul></ul><ul><ul><li>Tokenized and value-based fields available </li></ul></ul><ul><li>Composite key queries supported </li></ul><ul><ul><li>Lucene much more flexible than traditional B-tree composite indexes </li></ul></ul><ul><li>ACL and Facet information stored in Lucene field array </li></ul><ul><ul><li>Documentum’s ACL security model is highly complex and potentially dynamic </li></ul></ul><ul><ul><li>Enables “secure facet” computation </li></ul></ul>
    20. 20. xPlore has lucene search engine capabilities plus…. <ul><li>XQuery provides powerful query & data manipulation language </li></ul><ul><ul><li>A typical search engine can’t even express a join </li></ul></ul><ul><ul><li>Creation of arbitrary structure for result set </li></ul></ul><ul><ul><li>Ability to call to language-based functions or java-based methods </li></ul></ul><ul><li>Ability to use B-tree based indexes when needed </li></ul><ul><ul><li>xDB optimizer decides this </li></ul></ul><ul><li>Transactional update and recovery of data/index </li></ul><ul><li>Hierarchical data modeling capability </li></ul>
    21. 21. Tips and Observations on IO and Host Virtualization <ul><li>Virtualization offers huge savings for companies through consolidation and automation </li></ul><ul><li>Both Disk and Host virtualization available </li></ul><ul><li>However, there are pitfalls to avoid </li></ul><ul><ul><li>One-size-fits-all </li></ul></ul><ul><ul><li>Consolidation contention </li></ul></ul><ul><ul><li>Availability of resources </li></ul></ul>
    22. 22. Tip #1: Don’t assume that one size fits all <ul><li>Most IT shops will create “VM or SAN templates” that have a fixed resource consumption </li></ul><ul><ul><li>Reduces admin costs </li></ul></ul><ul><ul><li>Example: Two CPU VM with 2 GB of memory </li></ul></ul><ul><ul><li>Deviations from this must be made in a special request </li></ul></ul><ul><li>Recommendations: </li></ul><ul><ul><li>Size correctly, don’t accept insufficient resources </li></ul></ul><ul><ul><li>Test pre-production environments </li></ul></ul>
    23. 23. Same concept applies for disk virtualization <ul><li>Disk capacity is typically expressed in terms of two metrics: space and I/O capacity </li></ul><ul><ul><li>Space defined in terms of GBytes </li></ul></ul><ul><ul><li>I/O capacity defined in terms of I/Os per sec </li></ul></ul><ul><li>NAS and SAN are forms of disk virtualization </li></ul><ul><ul><li>The space associated with a SAN volume (for example) could be striped over multiple disks </li></ul></ul><ul><ul><li>The more disks allocated, the higher the I/O capacity </li></ul></ul>50GB and 100 I/Os per sec capacity 50GB and 200 I/Os per sec capacity 50GB and 400 I/Os per sec capacity
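The three volumes pictured on this slide have the same space (50 GB) but different I/O capacity. A minimal sketch of that relationship follows, assuming (hypothetically) that each physical disk sustains about 100 random I/Os per second and that striping scales I/O capacity linearly with disk count. Real arrays deviate from this because of caching, RAID overhead, and sharing.

```python
def volume_iops(num_disks: int, iops_per_disk: float = 100.0) -> float:
    """Rough I/O capacity of a volume striped evenly over num_disks drives.

    iops_per_disk is an assumed figure, not a measured one.
    """
    if num_disks < 1:
        raise ValueError("a volume needs at least one disk")
    return num_disks * iops_per_disk


def disks_needed(required_iops: float, iops_per_disk: float = 100.0) -> int:
    """Smallest disk count whose combined capacity meets required_iops."""
    disks = 1
    while volume_iops(disks, iops_per_disk) < required_iops:
        disks += 1
    return disks


# The same 50 GB of space striped over 1, 2, or 4 drives
for n in (1, 2, 4):
    print(f"50 GB over {n} disk(s): ~{volume_iops(n):.0f} I/Os per sec")
```

Sizing by space alone would treat all three volumes as equivalent, which is the trap the slide warns about.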
    24. 24. Linear mappings and LUNs <ul><li>When a LUN is mapped linearly onto physical disks, I/O could be concentrated on fewer drives than desired. </li></ul><ul><li>High-end SANs like Symmetrix can handle this situation with virtual LUNs </li></ul>
    25. 25. EMC Symmetrix: Nondisruptive Mobility <ul><li>Virtual LUN VP Mobility </li></ul><ul><li>Fast, efficient mobility </li></ul><ul><li>Maintains replication and quality of service during relocations </li></ul><ul><li>Supports up to thousands of concurrent VP LUN migrations </li></ul><ul><li>Recommendation: work with storage technicians to ensure backend storage has sufficient I/O </li></ul>Virtual Pools Flash 400 GB RAID 5 Fibre Channel 600 GB 15K RAID 1 SATA 2 TB RAID 6
    26. 26. Tip #2: Consolidation Contention <ul><li>Virtualization provides benefit from consolidation </li></ul><ul><li>Consolidation provides resources to the ‘active’ </li></ul><ul><ul><li>Your resources can be consumed by other VMs, other apps </li></ul></ul><ul><ul><li>Physical resources can be over-stretched </li></ul></ul><ul><li>Recommendations: </li></ul><ul><ul><li>Track actual capacity vs. planned </li></ul></ul><ul><ul><ul><li>VMware: track number of times your VM is denied CPU </li></ul></ul></ul><ul><ul><ul><li>SANs: track % I/O utilization vs. number of I/Os </li></ul></ul></ul><ul><ul><li>For VMware, leverage guaranteed minimum resource allocations and/or allocate to non-overloaded HW </li></ul></ul>
    27. 27. Some VMware statistics <ul><li>Ready metric </li></ul><ul><ul><li>Generated by vCenter and represents the number of cycles (across all CPUs) in which the VM was denied CPU </li></ul></ul><ul><ul><li>Reported in milliseconds; the “real-time” sample happens at best every 20 secs </li></ul></ul><ul><ul><li>For interactive apps: as a percentage of offered capacity, > 10% is considered worrisome </li></ul></ul><ul><li>Pages-in, Pages-out </li></ul><ul><ul><li>Can indicate oversubscription of memory </li></ul></ul>
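Since the Ready value is reported in milliseconds per sample interval, it has to be converted to a percentage before it can be compared with the 10% guideline above. The sketch below uses the commonly cited conversion (ready milliseconds divided by the interval length in milliseconds). Dividing by the vCPU count to get a per-vCPU average is an assumption made here because the slide describes the metric as summed across all CPUs. The function names are hypothetical.

```python
def ready_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """Convert a Ready summation (ms) into a per-vCPU percentage of the interval.

    interval_s defaults to the 20 second real-time sample mentioned on the slide.
    """
    if interval_s <= 0 or vcpus < 1:
        raise ValueError("interval_s must be positive and vcpus at least 1")
    return ready_ms / (interval_s * 1000.0 * vcpus) * 100.0


def is_worrisome(pct: float, threshold: float = 10.0) -> bool:
    """Apply the slide's rule of thumb for interactive applications."""
    return pct > threshold


# A 2-vCPU VM that accumulated 6,000 ms of Ready time in one 20 s sample
pct = ready_percent(6000, interval_s=20, vcpus=2)
print(f"{pct:.1f}% ready, worrisome: {is_worrisome(pct)}")
```

Longer rollup intervals (for example daily or weekly charts) use a larger `interval_s`, so the same millisecond value means a much smaller percentage there.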
    28. 28. Sample %Ready for a production VM with an xPlore deployment over an entire week. Chart annotations: “official” area that indicates pain. In this case average response time doubled and max response time grew by 5x.
    29. 29. Actual Ready samples during several hour period
    30. 30. Some Subtleties with Interactive CPU denial <ul><li>The Ready metric represents denial upon demand </li></ul><ul><ul><li>Interactive workloads can be bursty </li></ul></ul><ul><ul><li>If no demand, then Ready counter will be low </li></ul></ul><ul><li>Poor user response encourages less usage </li></ul><ul><ul><li>Like walking on a broken leg </li></ul></ul><ul><ul><li>Causing fewer Ready samples </li></ul></ul>Chart labels: 20 sec interval, denial spike
    31. 31. Sharing I/O capacity <ul><li>If multiple VMs (or servers) are sharing the same underlying physical volumes and the capacity is not managed properly </li></ul><ul><ul><li>then the available I/O capacity of the volume could be less than the theoretical capacity </li></ul></ul><ul><li>This can be seen if the OS tools show that the disk is very busy (high utilization) while the number of I/Os is lower than expected </li></ul>Volume for Lucene application Both volumes spread over the same set of drives and effectively sharing the I/O capacity Volume for other application
    32. 32. Recommendations on diagnosing disk I/O related issues <ul><li>On Linux/UNIX </li></ul><ul><ul><li>Have IT group install SAR and IOSTAT </li></ul></ul><ul><ul><ul><li>Also install a disk I/O testing tool (like ‘Bonnie’) </li></ul></ul></ul><ul><ul><li>Compare ‘Bonnie’ output with SAR & IOSTAT data </li></ul></ul><ul><ul><ul><li>High disk Utilization at much lower achieved rates could indicate contention from other applications </li></ul></ul></ul><ul><ul><li>Also, High SAR I/O wait time might be an indication of slow disks </li></ul></ul><ul><li>On Windows </li></ul><ul><ul><li>Leverage the Windows Performance Monitor </li></ul></ul><ul><ul><li>Objects: Processor, Physical Disk, Memory </li></ul></ul>
    33. 33. Sample output from the Bonnie tool ¹
        ¹ Bonnie is an open source disk I/O driver tool for Linux that can be useful for pretesting Linux disk environments prior to an xPlore/Lucene install.
        bonnie -s 1024 -y -u -o_direct -v 10 -p 10
        This will increase the size of the file to 2 GB. Examine the output. Focus on the random I/O area:
                         ---Sequential Output (sync)----- ---Sequential Input-- --Rnd Seek-
                         -CharUnlk- -DIOBlock- -DRewrite- -CharUnlk- -DIOBlock- --04k (10)-
        Machine       MB K/sec %CPU  K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
        Mach2    10*2024 73928   97 104142  5.3 26246  2.9  8872 22.5 43794  1.9 735.7 15.2
        -s 1024 means that 2 GB files will be created
        -o_direct means that direct I/O (by-passing buffer cache) will be done
        -v 10 means that 10 different 2 GB files will be created
        -p 10 means that 10 different threads will query those files
        This output means that the random read test saw 735 random I/Os per sec at 15% CPU busy
    34. 34. Linux indicators compared to bonnie output
        See https://community.emc.com/docs/DOC-9179 for additional example
        iostat output:
        Device:    tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
        sde     206.10    2402.40       0.80    24024        8
        SAR -d output:
        09:29:17  DEV        tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await  svctm  %util
        09:29:27  dev8-65 209.24   4877.97      1.62     23.32      1.62   7.75   3.80  79.59
        SAR -u output (note the high I/O wait):
        09:29:17 PM  CPU  %user  %nice  %system  %iowait  %steal  %idle
        09:29:27 PM  all  41.37   0.00     5.56    29.86    0.00  23.21
        09:29:27 PM    0  62.44   0.00    10.56    25.38    0.00   1.62
        09:29:27 PM    1  30.90   0.00     4.26    35.56    0.00  29.28
        09:29:27 PM    2  36.35   0.00     3.96    30.76    0.00  28.93
        09:29:27 PM    3  35.77   0.00     3.46    27.64    0.00  33.13
        Notice that at 200+ I/Os per sec the underlying volume is 80% busy. Although there could be multiple causes, one could be that some other VM is consuming the remaining I/O capacity (735 - 209 = 500+).
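The comparison on this slide can be turned into a small calculation. The sketch below takes the `tps` and `%util` figures from the SAR -d sample and the random seek rate from the Bonnie run, then estimates how much I/O capacity the volume appears to deliver. Extrapolating linearly from utilization to capacity is a simplifying assumption, and the function names are hypothetical.

```python
def implied_capacity(tps: float, util_pct: float) -> float:
    """Estimate the I/O rate at 100% utilization by linear extrapolation."""
    if not 0 < util_pct <= 100:
        raise ValueError("util_pct must be in (0, 100]")
    return tps / (util_pct / 100.0)


def contention_report(tps: float, util_pct: float, benchmark_iops: float) -> dict:
    """Compare the observed I/O rate against a pre-measured benchmark rate."""
    capacity = implied_capacity(tps, util_pct)
    return {
        "implied_capacity": capacity,
        "fraction_of_benchmark": capacity / benchmark_iops,
        "unaccounted_iops": benchmark_iops - tps,
    }


# Figures from the slide: 209.24 tps at 79.59 %util, Bonnie measured 735.7 seeks/sec
report = contention_report(tps=209.24, util_pct=79.59, benchmark_iops=735.7)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

A low `fraction_of_benchmark` is only a hint. As the slide says, there could be multiple causes, and another VM consuming the shared drives is just one of them.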
    35. 35. Tip #3: Try to ensure availability of resources <ul><li>Similar to the previous issue, but </li></ul><ul><ul><li>resource displacement not caused by overload, </li></ul></ul><ul><ul><li>Inactivity can cause Lucene resources to be displaced </li></ul></ul><ul><ul><li>Not different from running on large shared native OS host </li></ul></ul><ul><li>Recommendation: </li></ul><ul><ul><li>Periodic warmup </li></ul></ul><ul><ul><ul><li>non-intrusive </li></ul></ul></ul><ul><ul><li>See next example </li></ul></ul>
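A periodic warmup can be as simple as reading the relevant index files sequentially so that the operating system pulls them back into its file buffer cache. The sketch below is one way to do that under stated assumptions: the index directory path is hypothetical, and the stored-field data is assumed to live in separate `.fdt` and `.fdx` files. If the index uses compound (CFS) files, those parts sit inside the compound file and this suffix filter would not find them.

```python
from pathlib import Path


def warm_files(index_dir, suffixes=(".fdt", ".fdx"), chunk_size=1024 * 1024):
    """Sequentially read matching files so the OS caches their pages.

    Returns the total number of bytes read. The data itself is discarded;
    the read is done only for its caching side effect.
    """
    total = 0
    for path in sorted(Path(index_dir).iterdir()):
        if path.is_file() and path.suffix in suffixes:
            with open(path, "rb") as handle:
                while True:
                    chunk = handle.read(chunk_size)
                    if not chunk:
                        break
                    total += len(chunk)
    return total


if __name__ == "__main__":
    import sys

    # Example: python warmup.py /path/to/lucene/index  (path is hypothetical)
    if len(sys.argv) > 1:
        print(f"read {warm_files(sys.argv[1])} bytes")
```

Run from a scheduler at a modest interval, large sequential reads like these stay non-intrusive compared with the random I/O a cold query would trigger.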
    36. 36. IO / caching test use-case <ul><li>Unselective Term search </li></ul><ul><ul><li>100 sample queries </li></ul></ul><ul><ul><li>Avg( hits per term) = 4,300+, max ~ 60,000 </li></ul></ul><ul><ul><li>Searching over 100’s of DCTM object attributes + content </li></ul></ul><ul><li>Medium result window </li></ul><ul><ul><li>Avg( results returned per query) = 350 (max: 800) </li></ul></ul><ul><li>Stored Fields Utilized </li></ul><ul><ul><li>Some security & facet info </li></ul></ul><ul><li>Goal: </li></ul><ul><ul><li>Pre-cache portions of the index to improve response time in scenarios </li></ul></ul><ul><ul><li>Reboot, buffer cache contention, & vm memory contention </li></ul></ul>
    37. 37. Some xPlore Structures for Search ¹ Dictionary of terms Posting list (doc-id’s for term) Stored fields (facets and node-ids) Security indexes (b-tree based) xDB XML store (contains text for summary) 1 st doc N-th doc Facet decompression map ¹ Frequency and position structures ignored for simplicity
    38. 38. IO model for search in xPlore Dictionary Posting list (doc-id’s for term) Stored fields Xdb node-id plus facet / security info Security lookup (b-tree based) xDB XML store (contains text for summary) Facet decompression map Search Term: ‘ term1 term2 ’ Result set
    39. 39. Separation of “covering values” in stored fields and summary Facet Calc FinalFacet calc values over thousands of results Res-1 - sum Res-2 - sum Res-3 - sum : : Res-350-sum Xdb docs with text for summary Small number for result window Small structure Potentially thousands of results Stored fields (Random access) Potentially thousands of hits Security lookup
    40. 40. xPlore Memory Pool areas at-a-glance xPlore Instance (fixed size) memory xDB Buffer Cache Lucene Caches & working memory xPlore caches Other vm working mem Operating System File Buffer cache ( dynamically sized ) Native code content extraction & linguistic processing memory
    41. 41. Lucene data resides primarily in OS buffer cache Potential for many things to sweep lucene from that cache
    42. 42. Test Env <ul><li>32 GB memory </li></ul><ul><li>Direct attached storage (no SAN) </li></ul><ul><li>1.4 million documents </li></ul><ul><li>Lucene index size = 10 GB </li></ul><ul><li>Size of internal parts of Lucene CFS file </li></ul><ul><ul><li>Stored fields (fdt, fdx): 230 MB (2% of index) </li></ul></ul><ul><ul><li>Term Dictionary (tis,tii): 537 MB (5% of index) </li></ul></ul><ul><ul><li>Positions (prx): 8.78 GB (80% of index) </li></ul></ul><ul><ul><li>Frequencies (frq) : 1.4 GB (13 % of index) </li></ul></ul><ul><li>Text in xDB stored compressed separately </li></ul>
    43. 43. Some results of the query suite <ul><li>Linux buffer cache cleared completely before each run </li></ul><ul><li>Resp as seen by final user in Documentum </li></ul><ul><li>Facets not computed in this example. Just a result set returned. With Facets response time difference more pronounced. </li></ul><ul><li>Mileage will vary depending on a series of factors that include query complexity, compositions of the index, and number of results consumed </li></ul>
        Test                   Avg resp to consume all results (sec)   MB pre-cached   I/O per result   Total MB loaded into memory (cached + test)
        Nothing cached         1.89                                    0               0.89             77
        Stored fields cached   0.95                                    241             0.38             272
        Term dict cached       1.73                                    537             0.79             604
        Positions cached       1.58                                    8,789           0.74             8,800
        Frequencies cached     1.65                                    1,406           0.63             1,436
        Entire index cached    0.59                                    10,970          < 0.05           10,970
    44. 44. Other Notes <ul><li>Caching 2% of the index (the stored fields) yields a response time that is only 60% greater than if the entire index were cached. </li></ul><ul><ul><li>Caching cost only 9 secs on a mirrored drive pair </li></ul></ul><ul><ul><li>Caching cost 6800 large sequential I/Os vs. potentially 58,000 random I/Os </li></ul></ul><ul><li>Mileage will vary, factors include </li></ul><ul><ul><li>Phrase search </li></ul></ul><ul><ul><li>Wildcard search </li></ul></ul><ul><ul><li>Multi-term search </li></ul></ul><ul><li>SANs can grow I/O capacity as search complexity increases </li></ul>
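The "2% of index" and "60% greater" figures can be checked against the measurements on the previous slide. The snippet below only redoes that arithmetic; the dictionary layout and function names are illustrative.

```python
# (avg response in sec, MB pre-cached) taken from the query suite results
results = {
    "nothing": (1.89, 0),
    "stored fields": (0.95, 241),
    "term dict": (1.73, 537),
    "positions": (1.58, 8789),
    "frequencies": (1.65, 1406),
    "entire index": (0.59, 10970),
}


def slowdown_vs_full_cache(test: str) -> float:
    """Fractional increase in response time relative to the fully cached run."""
    best = results["entire index"][0]
    return results[test][0] / best - 1.0


def fraction_of_index_cached(test: str) -> float:
    """Share of the full index size that was pre-cached for this test."""
    return results[test][1] / results["entire index"][1]


for name in results:
    print(
        f"{name:14s} cached {fraction_of_index_cached(name):6.1%} of index, "
        f"{slowdown_vs_full_cache(name):+.0%} vs fully cached"
    )
```

Pre-caching the stored fields gives by far the best return per megabyte read, while pre-caching the much larger positions data barely improves on a cold cache.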
    45. 45. Contact <ul><li>Ed Bueché </li></ul><ul><ul><li>[email_address] </li></ul></ul><ul><ul><li>http://community.emc.com/people/Ed_Bueche/blog </li></ul></ul><ul><ul><li>http://community.emc.com/docs/DOC-8945 </li></ul></ul>
