Search Lucene


Zoe Slattery's slides from PHPNW08:

The ability to store large quantities of local data means that many applications require some form of text search and retrieval facility. From the point of view of the application developer there are a number of choices to make; the first is whether to use a complete packaged solution or to use one of the available information retrieval (IR) libraries to build a custom IR solution. In this talk I’ll look at the options for PHP programmers who choose to embed IR facilities within their applications.

For Java programmers there is clearly a good range of options for text retrieval libraries, but options for PHP programmers are more limited. At first sight, for a PHP programmer wishing to embed indexing and search facilities in their application, the choice seems obvious: the PHP implementation of Lucene (Zend Search Lucene). There is no requirement to support another language, the code is PHP and therefore easy for PHP programmers to work with, and the license is commercially friendly. However, whilst ease of integration and support are key factors in the choice of technology, performance can also be important, and the performance of the PHP implementation of Lucene is poor compared to the Java implementation.

In this talk I’ll explain the differences in performance between the PHP implementation of Lucene and the Java implementation, and examine the other options available to PHP programmers for whom performance is a critical factor.

Published in: Technology

  1. Can you be dynamic and fast? “Miss Marple and the case of the Missing MIPS”. Zoë Slattery
  2. Agenda
     - Index and search applications
     - The problem for PHP programmers
     - Understanding execution times
     - Conclusions
  3. Index and search
     - The problem of finding relevant information is not new.
       - 3000 years BC [1]
       - Vannevar Bush, As We May Think, 1945.
     - Today, applications that search the Web must be able to provide instant access to > 10 billion documents.
     - Many applications need some form of search, e.g. searching your hard drive, email...

     [1] Lagoze, C., Singhal, A. Information Discovery: Needles and Haystacks. IEEE Internet Computing. Volume 9(3), 16-18, 2005.
  4. Options for information retrieval
     - Search engines
       - Nutch, SearchBlox...
     - Information retrieval libraries
       - Three with broadly similar features

     |                         | Egothor  | Xapian                       | Lucene             |
     |-------------------------|----------|------------------------------|--------------------|
     | Implementation language | Java     | C++                          | Java               |
     | Language bindings       | None     | Perl, Python, PHP, Java, TCL | None               |
     | Language ports          | None     | None                         | C++, Perl, PHP, C# |
     | License                 | BSD like | GPL                          | Apache 2           |
  5. Lucene [2]
     [Diagram. Data sources: DB, Web, File system. Application: Gather data, Get user query, Present search results. Lucene: Index documents, Index, Search index. User.]

     [2] Gospodnetic, O., Hatcher, E. Lucene in Action. Manning Publications Co., Greenwich. 2005.
  6. Lucene indexing
     1. Documents: "Oh for a muse of fire that would ascend the brightest heaven of invention....."
     2. Token stream (after analysis): [fire] [ascend] [bright] [heaven]
     3. Inverted index (after index creation), mapping terms to documents:
        - fire: Henry V, Scouting for boys...
        - ascend: Aerospace, Henry V...
     4. Optimised inverted index (after optimise)
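The indexing pipeline on this slide (documents, token stream, inverted index) can be illustrated with a minimal sketch. It is not Zend Search Lucene code: the function names `analyse`, `buildInvertedIndex` and `search` are hypothetical, the second document is invented for the example, and unlike the slide no stemming is applied (so "brightest" is not reduced to "bright").

```php
<?php
// Hypothetical sketch of an inverted index; none of these names are
// part of Zend Search Lucene or Java Lucene.

/** Analysis: turn text into a stream of lower-cased word tokens. */
function analyse(string $text): array
{
    preg_match_all('/[a-z]+/', strtolower($text), $matches);
    return $matches[0];
}

/** Index creation: map each term to the documents that contain it. */
function buildInvertedIndex(array $documents): array
{
    $index = [];
    foreach ($documents as $name => $text) {
        // array_unique() so a document is listed once per term.
        foreach (array_unique(analyse($text)) as $term) {
            $index[$term][] = $name;
        }
    }
    ksort($index); // keep terms in sorted order
    return $index;
}

/** Search: look up a single term in the index. */
function search(array $index, string $term): array
{
    return $index[strtolower($term)] ?? [];
}

$documents = [
    'Henry V'    => 'Oh for a muse of fire that would ascend the brightest heaven of invention',
    'Camp notes' => 'Light a fire before dark', // invented example text
];

$index = buildInvertedIndex($documents);
echo implode(', ', search($index, 'fire')), "\n";
```

The point of the structure is that a search becomes a single array lookup per term instead of a scan over every document.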
  7. Agenda
     - Index and search applications
     - The problem for PHP programmers
     - Understanding execution times
     - Conclusions
  8. Indexing speed
     - Benchmark: 17.4 MB, 814 files of PHP source code
     - Linux/Thinkpad T60

     |                           | Java + JIT | Java | PHP |
     |---------------------------|------------|------|-----|
     | Time to index /seconds    | 4          | 32   | 167 |
     | Time to optimise /seconds | 0.3        | 3    | 43  |
     | Total time /seconds       | 4.3        | 35   | 210 |

     Ouch! Nearly 50 times as fast in Java.
  9. Why is the performance so bad?
     - First make sure we are comparing the same thing:
       - Compare indexes using Luke
       - Limits on terms
         - Java stops looking at 10,000 terms
       - Scoring
         - Java rounds down, PHP rounds to closest
       - Analyser
         - Java Lucene has many analysers
  10. Analysis - Java
      Analyzing "A Quick Brown Fox jumped over the Lazy Dog"
      - StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
      - SimpleAnalyzer: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
      - StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]

      Analyzing "XY&Z Corporation -;
      - StandardAnalyzer: [xy&z] [corporation] []
      - SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
      - StopAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
  11. Analysis - PHP
      Analysing "A Quick Brown Fox jumped over the Lazy Dog"
      - Default (lower case) filter: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
      - Stop words filter: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
      - Short words filter: [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]

      Analysing "XY&Z Corporation -;
      - Default (lower case) filter: [xy] [z] [corporation] [xyz] [example] [com]
      - Stop words filter: [xy] [z] [corporation] [xyz] [example] [com]
      - Short words filter: [xy] [corporation] [xyz] [example] [com]
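The three PHP filters listed on this slide can be approximated in a few lines. This is a sketch under assumptions, not the Zend Search Lucene analyser: the function names are hypothetical, the tokenizer simply keeps runs of ASCII letters, the stop-word list (`a`, `the`) and the minimum length of 2 were chosen only because they reproduce the token lists shown above.

```php
<?php
// Hypothetical filters; names and parameters are illustrative only.

/** Split text into runs of ASCII letters. */
function tokenize(string $text): array
{
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    return $matches[0];
}

/** Default filter: lower-case every token. */
function lowerCaseFilter(array $tokens): array
{
    return array_map('strtolower', $tokens);
}

/** Drop tokens that appear in the stop-word list. */
function stopWordsFilter(array $tokens, array $stopWords): array
{
    return array_values(array_filter(
        $tokens,
        fn(string $t): bool => !in_array($t, $stopWords, true)
    ));
}

/** Drop tokens shorter than $minLength characters. */
function shortWordsFilter(array $tokens, int $minLength): array
{
    return array_values(array_filter(
        $tokens,
        fn(string $t): bool => strlen($t) >= $minLength
    ));
}

$tokens            = lowerCaseFilter(tokenize('A Quick Brown Fox jumped over the Lazy Dog'));
$withoutStopWords  = stopWordsFilter($tokens, ['a', 'the']); // assumed stop list
$withoutShortWords = shortWordsFilter($tokens, 2);           // assumed minimum length

echo '[' . implode('] [', $withoutStopWords) . "]\n";
```

Each filter takes and returns a plain token array, so filters can be chained in any order, which mirrors how the analysers on the slides differ only in which filters they apply.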
  12. Compare indexes: the Java and PHP indexes contain the same 663 terms.
  13. Agenda
      - Index and search applications
      - The problem for PHP programmers
      - Understanding execution times
        - Part one
        - Part two
      - Conclusions
  14. Execution profiles
      - Now that we are definitely comparing the same thing, look at execution profiles for the Java and PHP implementations
      - Profiling tools (all open source)
        - Java
          - Eclipse TPTP
        - PHP
          - Xdebug
          - KCachegrind
        - System
          - Sysprof
          - vmstat, iostat
  15. Java profile
  16. Small problems with TPTP...
      - Invasive and slow. Takes 600,000 times as long to execute
      - Some problems getting it to run on Ubuntu (missing C++ libraries, ksh-specific scripts)
      - Output file is machine readable only
      - But it's free, open source and it works well enough.
      - Benchmark data:
        - 39 files of PHP source code (php/Zend), 1.2 MB

      |                           | Java | Java + profile |
      |---------------------------|------|----------------|
      | Time to index /seconds    | 2.3  | 687258         |
      | Time to optimise /seconds | 0.3  | 673851         |
      | % time in indexing        | 88   | 50             |
  17. PHP profile
  18. No problems with this tool
      - Not so invasive as the Java tool but still adds to time and distorts slightly
      - Results easy to display with KCachegrind
      - Output file is readable
      - Benchmark data:
        - 39 files of PHP source code (php/Zend), 1.2 MB

      |                           | PHP | PHP + profile |
      |---------------------------|-----|---------------|
      | Time to index /seconds    | 5   | 70            |
      | Time to optimise /seconds | 3   | 55            |
      | % time in indexing        | 63  | 56            |
  19. Look at the normalize() code

          public function normalize(Token $srcToken)
          {
              $newToken = new Token(
                  strtolower($srcToken->getTermText()),
                  $srcToken->getStartOffset(),
                  $srcToken->getEndOffset());
              $newToken->setPositionIncrement($srcToken->getPositionIncrement());
              return $newToken;
          }
  20. The normalize() function: Sum( ) = 2.92; 18.99 - 2.92 = 16.07
  21. Micro benchmark

          <?php
          require_once "Token.php";
          require_once "LowerCase.php";

          $token = new Token("GO", 105, 107);
          $filter = new LowerCase();

          // Call normalize() ten million times on the same token.
          for ($i = 0; $i < 10000000; $i++) {
              $norm_token = $filter->normalize($token);
          }
          ?>
  22. normalize() opcodes

          compiled vars: !0 = $srcToken, !1 = $newToken
          line  #   op  ext  return  operands
          ----------------------------------------------------------------
          11    0   RECV 1
          13    1   ZEND_FETCH_CLASS :0 'Token'
                2   NEW $1 :0
                3   ZEND_INIT_METHOD_CALL !0, 'getTermText'
                4   DO_FCALL_BY_NAME 0
                5   SEND_VAR_NO_REF $3
                6   DO_FCALL 1 'strtolower'
                7   SEND_VAR_NO_REF $4
          14    8   ZEND_INIT_METHOD_CALL !0, 'getStartOffset'
                9   DO_FCALL_BY_NAME 0
                10  SEND_VAR_NO_REF $6
          15    11  ZEND_INIT_METHOD_CALL !0, 'getEndOffset'
                12  DO_FCALL_BY_NAME 0
                13  SEND_VAR_NO_REF $8
                14  DO_FCALL_BY_NAME 3
                15  ASSIGN !1, $1
                16  ......
  23. System profile
      1. Convert to lower case
      2. Look up opcodes
  24. How Xdebug works
      - Script execution:
        - Convert function name to lower case
        - Look up function in function table
        - Call out to profiler (start time)
        - Execute function
        - Call out to profiler (end time)
      - Opcodes shown: ZEND_INIT_METHOD_CALL, DO_FCALL_BY_NAME
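One way to read this slide is that the profiler only times the body of each function, while the work of resolving the call (lower-casing the name and looking it up) happens outside the timed region. The toy dispatcher below models that reading. It is an assumption-laden illustration, not how the Zend Engine or Xdebug is implemented; `dispatch`, `$functionTable` and `$profile` are hypothetical names.

```php
<?php
// Toy model only: this does not reproduce Zend Engine or Xdebug internals.

// Function table keyed by lower-cased function name.
$functionTable = [
    'strtolower' => 'strtolower',
    'strrev'     => 'strrev',
];

// Accumulated seconds spent inside each function body.
$profile = [];

function dispatch(array $functionTable, array &$profile, string $name, ...$args)
{
    // Set-up work: performed on every call, but outside the timed region.
    $key = strtolower($name);
    if (!isset($functionTable[$key])) {
        throw new InvalidArgumentException("Unknown function: $name");
    }
    $callable = $functionTable[$key];

    $start  = microtime(true);      // call out to profiler: start time
    $result = $callable(...$args);  // execute function
    $profile[$key] = ($profile[$key] ?? 0.0) + (microtime(true) - $start); // end time

    return $result;
}

$result = dispatch($functionTable, $profile, 'StrToLower', 'GO');
echo $result, "\n";
```

In a model like this, the per-function times recorded in `$profile` cannot add up to the total run time, because the lookup cost is never attributed to any function.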
  25. The normalize() function: Sum( ) = 2.92; 18.99 - 2.92 = 16.07, which is consumed in setting up functions to be run
  26. Why is function calling faster in Java?
      - Java is a static language. VM structures are known at start-up: code can't be added on the fly, and types are known at compile time.
      - The first time a function is called, Java caches a reference to it in a virtual dispatch table. After that, function calls are fast.
      - In PHP, code can be added during execution (for example, with create_function()) and types are not known until the code is executed. This makes keeping virtual dispatch tables much more difficult.
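The cost of calling a method, as opposed to the cost of the work inside it, can be measured with a small experiment in the spirit of the micro benchmark earlier in the talk. The sketch below compares reading a value through an accessor with reading the property directly. The class and function names are hypothetical, and the timings depend heavily on the PHP version and machine, so no particular ratio should be expected.

```php
<?php
// Hypothetical micro benchmark; Counter, sumViaMethod and sumViaProperty
// are illustrative names, not code from the talk.

final class Counter
{
    public int $value = 1;

    public function getValue(): int
    {
        return $this->value;
    }
}

/** Sum the value $n times, paying for a method call on each iteration. */
function sumViaMethod(Counter $c, int $n): int
{
    $sum = 0;
    for ($i = 0; $i < $n; $i++) {
        $sum += $c->getValue();
    }
    return $sum;
}

/** Sum the value $n times, reading the property directly. */
function sumViaProperty(Counter $c, int $n): int
{
    $sum = 0;
    for ($i = 0; $i < $n; $i++) {
        $sum += $c->value;
    }
    return $sum;
}

/** Run $fn once and return [result, elapsed seconds]. */
function timeIt(callable $fn): array
{
    $start  = hrtime(true); // nanoseconds
    $result = $fn();
    return [$result, (hrtime(true) - $start) / 1e9];
}

$counter = new Counter();
$n = 200000;

[$viaMethod, $methodSeconds]     = timeIt(fn() => sumViaMethod($counter, $n));
[$viaProperty, $propertySeconds] = timeIt(fn() => sumViaProperty($counter, $n));

printf("method: %.4f s, property: %.4f s\n", $methodSeconds, $propertySeconds);
```

Both loops do the same arithmetic, so any consistent difference between the two timings is attributable to call set-up rather than to the work itself.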
  27. Agenda
      - Index and search applications
      - The problem for PHP programmers
      - Understanding execution times
        - Part one
        - Part two
      - Conclusions
  28. PHP profile
  29. Look at the call to normalize()

          $token = $this->normalize(
              new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));

          public function normalize(Token $srcToken)
          {
              $newToken = new Token(
                  strtolower($srcToken->getTermText()),
                  $srcToken->getStartOffset(),
                  $srcToken->getEndOffset());
              $newToken->setPositionIncrement($srcToken->getPositionIncrement());
              return $newToken;
          }
  30. Look at the call to normalize(): normalize() recoded....

          $token = $this->normalize(
              new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));

          public function normalize(Token $srcToken)
          {
              // Reuse the token that was just created instead of building a second one.
              $srcToken->setTermText(strtolower($srcToken->getTermText()));
              return $srcToken;
          }
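The difference between the original and the recoded normalize() can be tried out in isolation. The sketch below uses a hypothetical stand-in `Token` class containing only the members the slides call, and two free functions, `normalizeCopy` and `normalizeInPlace`, that follow the two versions. It is not the Zend Search Lucene class.

```php
<?php
// Hypothetical stand-in for the Token class used on the slides.
class Token
{
    private string $termText;
    private int $startOffset;
    private int $endOffset;
    private int $positionIncrement = 1;

    public function __construct(string $termText, int $startOffset, int $endOffset)
    {
        $this->termText    = $termText;
        $this->startOffset = $startOffset;
        $this->endOffset   = $endOffset;
    }

    public function getTermText(): string { return $this->termText; }
    public function setTermText(string $text): void { $this->termText = $text; }
    public function getStartOffset(): int { return $this->startOffset; }
    public function getEndOffset(): int { return $this->endOffset; }
    public function getPositionIncrement(): int { return $this->positionIncrement; }
    public function setPositionIncrement(int $inc): void { $this->positionIncrement = $inc; }
}

/** Original style: build a second Token (six method calls plus a constructor). */
function normalizeCopy(Token $srcToken): Token
{
    $newToken = new Token(
        strtolower($srcToken->getTermText()),
        $srcToken->getStartOffset(),
        $srcToken->getEndOffset()
    );
    $newToken->setPositionIncrement($srcToken->getPositionIncrement());
    return $newToken;
}

/** Recoded style: mutate and return the same Token (two method calls). */
function normalizeInPlace(Token $srcToken): Token
{
    $srcToken->setTermText(strtolower($srcToken->getTermText()));
    return $srcToken;
}

$original = new Token('GO', 105, 107);

$copy = normalizeCopy($original);
$sourceUntouchedByCopy = ($original->getTermText() === 'GO');

$same = normalizeInPlace($original); // $original itself is now lower-cased
```

The in-place version is only safe when the caller does not need the untouched source token afterwards, which holds for the call shown on the slide because the token is created inline and passed straight to normalize().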
  31. After fix
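  32. Performance improvement?

      |                           | PHP + fix | PHP | Java | Java + JIT |
      |---------------------------|-----------|-----|------|------------|
      | Time to index /seconds    | 151       | 167 | 32   | 4          |
      | Time to optimise /seconds | 43        | 43  | 3    | 0.3        |
      | Total time /seconds       | 194       | 210 | 35   | 4.3        |

      9.5 % improvement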
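The figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the timings quoted in the talk; `percentImprovement` is a hypothetical helper. On these numbers the quoted improvement appears to refer to index time (about 9.6%) rather than total time (about 7.6%).

```php
<?php
// Timings in seconds, taken from the benchmark slides.
$phpIndex    = 167.0;
$phpFixIndex = 151.0;
$phpTotal    = 210.0;
$phpFixTotal = 194.0;
$javaJitTotal = 4.3;

/** Percentage reduction from $before to $after. */
function percentImprovement(float $before, float $after): float
{
    return ($before - $after) / $before * 100.0;
}

$indexImprovement = percentImprovement($phpIndex, $phpFixIndex); // about 9.58
$totalImprovement = percentImprovement($phpTotal, $phpFixTotal); // about 7.62
$phpVsJavaJit     = $phpTotal / $javaJitTotal;                   // about 48.8

printf("index: %.2f%%, total: %.2f%%, PHP vs Java + JIT: %.1fx\n",
    $indexImprovement, $totalImprovement, $phpVsJavaJit);
```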
  33. Agenda
      - Index and search applications
      - The problem for PHP programmers
      - Understanding execution times
        - Part one
        - Part two
      - Conclusions
  34. Conclusions
      - Two reasons why the PHP implementation of Lucene is slow:
        - Function calling overhead in PHP
        - Inefficient code in the analyser [3]
        - These are the main two; there are others....
      - Dynamic and fast?
        - Hard to get to the same execution speed as Java, but possible to get closer.
        - But development speed is much better [4]. What speed do you care about?
        - Better not to use Java coding style (lots of methods that do nothing)
      - So which implementation of Lucene should I use?
        - It depends.....

      [3]
      [4] Prechelt, L. An empirical comparison of seven programming languages. Computer. Volume 33(10), 23-29, 2000.
  35. Options for PHP
      [Decision flowchart with Y/N branches.]
      - Questions: Do you care about speed? Only need basic features? Can support Java environment? Use a Web Service?
      - Outcomes: Use Zend Search Lucene; Use Lucene via a Java bridge; Use SOLR as web service; No Lucene solution today [5]

      [5]
  36. Other useful links
      - (Publication due 24/09/08)