Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

  1. Can you be dynamic and fast? “Miss Marple and the case of the Missing MIPS” Zoë Slattery
  2. Agenda
     ● Index and search applications
     ● The problem for PHP programmers
     ● Understanding execution times
     ● Conclusions
  3. Index and search
     ● The problem of finding relevant information is not new.
       – 3000 years BC [1]
       – Vannevar Bush, As We May Think, 1945.
     ● Today, applications that search the Web must be able to provide instant access to > 10 billion documents.
     ● Many applications need some form of search, e.g. searching your hard drive or email.
     [1] Lagoze, C., Singhal, A. Information Discovery: Needles and Haystacks. IEEE Internet Computing, 9(3), 16-18, 2005.
  4. Options for information retrieval
     ● Search engines
       – Nutch, SearchBlox.....
     ● Information Retrieval libraries
       – Three with broadly similar features

                Implementation   Language                       Language             License
                language         bindings                       ports
     Egothor    Java             None                           None                 BSD like
     Xapian     C++              Perl, Python, PHP, Java, TCL   None                 GPL
     Lucene     Java             None                           C++, Perl, PHP, C#   Apache 2
  5. Lucene [2]
     [Diagram labels. Data sources: DB, Web, File system. Application: Gather data, Index documents, Get user query, Search index, Present search results. Other boxes: Lucene, Index, User.]
     [2] Gospodnetic, O., Hatcher, E. Lucene in Action. Manning Publications Co., Greenwich, 2005.
  6. Lucene indexing
     1. Documents: "Oh for a muse of fire that would ascend the brightest heaven of invention....."
        (Analysis)
     2. Token stream: [fire] [ascend] [bright] [heaven] (diagram labels: start, end)
        (Index creation)
     3. Inverted index:
          Terms    Documents
          fire     Henry V, Scouting for boys...
          ascend   Aerospace, Henry V...
          ...
        (Optimise)
     4. Optimised inverted index
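
The analysis and index-creation steps on the Lucene indexing slide can be sketched in a few lines of PHP. This is only an illustration, not Lucene's or Zend_Search_Lucene's implementation: `buildInvertedIndex()` is a hypothetical helper, the sample documents are invented, and the "analysis" here is just lower-casing and splitting on non-letters (no stop words, no stemming, no offsets).

```php
<?php
// Illustrative sketch only: not Lucene's or Zend_Search_Lucene's implementation.
// The document titles and texts below are made-up sample data.

/**
 * Build a term => list-of-document-titles map (an inverted index).
 *
 * @param array $documents map of document title => document text
 * @return array map of term => titles of the documents containing it
 */
function buildInvertedIndex(array $documents): array
{
    $index = [];
    foreach ($documents as $title => $text) {
        // "Analysis": lower-case the text and split it into runs of letters.
        preg_match_all('/[a-z]+/', strtolower($text), $matches);
        // Record each distinct term once per document.
        foreach (array_unique($matches[0]) as $term) {
            $index[$term][] = $title;
        }
    }
    ksort($index); // keep the terms in sorted order
    return $index;
}

$index = buildInvertedIndex([
    'Henry V'   => 'O for a muse of fire that would ascend',
    'Aerospace' => 'Rockets ascend quickly',
]);

echo 'fire: ', implode(', ', $index['fire']), "\n";
echo 'ascend: ', implode(', ', $index['ascend']), "\n";
```

A search for a term then becomes a single array lookup, which is why the index is built once up front.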
  7. Agenda
     ● Index and search applications
     ● The problem for PHP programmers
     ● Understanding execution times
     ● Conclusions
  8. Indexing speed
     Benchmark:
     ● 17.4 MB, 814 files of PHP source code
     ● Linux/Thinkpad T60

                  Time to index   Time to optimise   Total time
                  /seconds        /seconds
     PHP          167             43                 210
     Java         32              3                  35
     Java + JIT   4               0.3                4.3

     Ouch! Nearly 50 times as fast in Java
  9. Why is the performance so bad?
     First make sure we are comparing the same thing:
     ➢ Analyser
       ➢ Java Lucene has many analysers
     ➢ Limits on terms
       ➢ Java stops looking at 10,000 terms
     ➢ Scoring
       ➢ Java rounds down, PHP rounds to closest
     ➢ Compare indexes using Luke
  10. Analysis - Java
      Analyzing "A Quick Brown Fox jumped over the Lazy Dog"
        StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
        SimpleAnalyzer:   [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
        StopAnalyzer:     [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
      Analyzing "XY&Z Corporation - xyz@example.com"
        StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]
        SimpleAnalyzer:   [xy] [z] [corporation] [xyz] [example] [com]
        StopAnalyzer:     [xy] [z] [corporation] [xyz] [example] [com]
  11. Analysis - PHP
      Analysing "A Quick Brown Fox jumped over the Lazy Dog"
        Default (lower case) filter: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
        Stop words filter:           [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
        Short words filter:          [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
      Analysing "XY&Z Corporation - xyz@example.com"
        Default (lower case) filter: [xy] [z] [corporation] [xyz] [example] [com]
        Stop words filter:           [xy] [z] [corporation] [xyz] [example] [com]
        Short words filter:          [xy] [corporation] [xyz] [example] [com]
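
The three PHP filters on the Analysis - PHP slide can be mimicked with plain functions. This is a rough sketch, not the Zend_Search_Lucene API: the function names are hypothetical, the stop-word list (`a`, `the`) is an assumption chosen to match the slide's output, and the short-word threshold of two characters is likewise inferred from the slide.

```php
<?php
// Hypothetical helpers for illustration; these are not Zend_Search_Lucene classes.

/** Split text into lower-cased runs of ASCII letters. */
function lowerCaseTokens(string $text): array
{
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    return array_map('strtolower', $matches[0]);
}

/** Drop every token that appears in the stop-word list. */
function removeStopWords(array $tokens, array $stopWords): array
{
    return array_values(array_diff($tokens, $stopWords));
}

/** Drop tokens shorter than $minLength characters. */
function removeShortWords(array $tokens, int $minLength = 2): array
{
    return array_values(array_filter(
        $tokens,
        function (string $token) use ($minLength): bool {
            return strlen($token) >= $minLength;
        }
    ));
}

$stopWords = ['a', 'the']; // assumed list, chosen to reproduce the slide

$fox = lowerCaseTokens('A Quick Brown Fox jumped over the Lazy Dog');
$foxNoStops = removeStopWords($fox, $stopWords);
$foxNoShort = removeShortWords($fox);

$corp = lowerCaseTokens('XY&Z Corporation - xyz@example.com');
$corpNoShort = removeShortWords($corp);

echo implode(' ', $foxNoStops), "\n";
echo implode(' ', $corpNoShort), "\n";
```

Each filter is one more pass over the token stream, so in a real analyser every extra filter adds at least one function call per token.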
  12. Compare indexes
      [Screenshots of the java and php indexes: same 663 terms]
  13. Agenda
      ● Index and search applications
      ● The problem for PHP programmers
      ● Understanding execution times
        – Part one
        – Part two
      ● Conclusions
  14. Execution profiles
      ● Now that we are definitely comparing the same thing, look at execution profiles for Java and PHP implementations
      ● Profiling tools (all open source)
        – Java
          ● Eclipse TPTP
        – PHP
          ● Xdebug
          ● KCachegrind
        – System
          ● Sysprof
          ● vmstat, iostat
  15. Java profile
  16. Small problems with TPTP...
      Benchmark data:
      ● 39 files of PHP source code (php/Zend), 1.2 MB

                       Time to index   Time to optimise   % time in indexing
                       /seconds        /seconds
      Java + profile   687258          673851             50
      Java             2.3             0.3                88

      ● Invasive and slow. Takes 600,000 times as long to execute
      ● Some problems getting it to run on Ubuntu (missing C++ libraries, ksh-specific scripts)
      ● Output file is machine readable only
      But it's free, open source and it works well enough.
  17. PHP profile
  18. No problems with this tool
      Benchmark data:
      ● 39 files of PHP source code (php/Zend), 1.2 MB

                      Time to index   Time to optimise   % time in indexing
                      /seconds        /seconds
      PHP + profile   70              55                 56
      PHP             5               3                  63

      ● Not so invasive as the Java tool but still adds to time and distorts slightly
      ● Results easy to display with KCachegrind
      ● Output file is readable
  19. The normalize() function
      Sum( ) = 2.92;
      18.99 – 2.92 = 16.07
  20. Micro benchmark
      <?php
          // Token.php and LowerCase.php are the class files under test.
          require_once "Token.php";
          require_once "LowerCase.php";

          $token = new Token("GO", 105, 107);
          $filter = new LowerCase();

          // Call normalize() ten million times on the same token.
          for ($i = 0; $i < 10000000; $i++) {
              $norm_token = $filter->normalize($token);
          }
      ?>
  21. normalize() opcodes
      compiled vars:  !0 = $srcToken, !1 = $newToken
      line   #  op  ext  return  operands
      ------------------------------------------------------------
       11    0  RECV  1
       13    1  ZEND_FETCH_CLASS  :0  'Token'
             2  NEW  $1  :0
             3  ZEND_INIT_METHOD_CALL  !0, 'getTermText'
             4  DO_FCALL_BY_NAME  0
             5  SEND_VAR_NO_REF  $3
             6  DO_FCALL  1  'strtolower'
             7  SEND_VAR_NO_REF  $4
       14    8  ZEND_INIT_METHOD_CALL  !0, 'getStartOffset'
             9  DO_FCALL_BY_NAME  0
            10  SEND_VAR_NO_REF  $6
       15   11  ZEND_INIT_METHOD_CALL  !0, 'getEndOffset'
            12  DO_FCALL_BY_NAME  0
            13  SEND_VAR_NO_REF  $8
            14  DO_FCALL_BY_NAME  3
            15  ASSIGN  !1, $1
       16   ......
  22. System profile
      1. Convert to lower case
      2. Look up opcodes
  23. How Xdebug works
      Script execution:
      ZEND_INIT_METHOD_CALL
        ● Convert function name to lower case
        ● Look up function in function table
      DO_FCALL_BY_NAME
        ● Call out to profiler – start time
        ● Execute function
        ● Call out to profiler – end time
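
The lower-casing step exists because function and method names are case-insensitive in PHP, so a name has to be normalised before it can be looked up. The short demonstration below shows only that language-level behaviour; it does not inspect the engine's function table, and `normalizeText()` is a hypothetical function defined just for the example.

```php
<?php
// Demonstration of language-level behaviour only; engine internals are not inspected.
// normalizeText() is a hypothetical function defined just for this example.

function normalizeText(string $text): string
{
    return strtolower($text);
}

// The same function is reachable under any capitalisation of its name.
$direct   = normalizetext('GO');
$shouting = NORMALIZETEXT('GO');
$byName   = call_user_func('NormalizeTEXT', 'GO');

var_dump($direct === $shouting && $shouting === $byName); // bool(true)
var_dump(function_exists('NORMALIZETEXT'));               // bool(true)
```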
  24. The normalize() function
      Sum( ) = 2.92;
      18.99 – 2.92 = 16.07 is consumed in setting up functions to be run
  25. Why is function calling faster in Java?
      ● Java is a static language. VM structures are known at start-up: you can't add code on the fly, and types are known at compile time.
      ● The first time a function is called, Java caches a reference to it in a virtual dispatch table. After that, function calls are fast.
      ● In PHP, code can be added during execution (for example, with create_function()) and types are not known until the code is executed. This makes keeping virtual dispatch tables much more difficult.
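
Both dynamic features named on the "Why is function calling faster in Java?" slide are easy to see from a script. The sketch below uses `eval()` rather than `create_function()`, because `create_function()` was deprecated in PHP 7.2 and removed in PHP 8.0; `addedAtRuntime()` and `describe()` are hypothetical names used only for illustration.

```php
<?php
// Sketch of two dynamic features; addedAtRuntime() and describe() are hypothetical names.

// 1. Code can be added while the script is running.
$existedBefore = function_exists('addedAtRuntime'); // not defined yet
eval('function addedAtRuntime(int $x): int { return $x * 2; }');
$existsAfter = function_exists('addedAtRuntime');   // defined now
$result = addedAtRuntime(21);

// 2. The type of an argument is only known when the call executes.
function describe($value): string
{
    return gettype($value);
}

var_dump($existedBefore, $existsAfter, $result);
echo describe(1), ' ', describe('1'), ' ', describe(1.0), "\n";
```

Because the set of functions can change at any point, the engine cannot resolve every call site once at start-up the way a static language can.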
  26. Agenda
      ● Index and search applications
      ● The problem for PHP programmers
      ● Understanding execution times
        – Part one
        – Part two
      ● Conclusions
  27. PHP profile
  28. look at the call to normalize()
      $token = $this->normalize(
          new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));

      public function normalize(Token $srcToken)
      {
          // Builds a second Token for every token normalised.
          $newToken = new Token(strtolower($srcToken->getTermText()),
                                $srcToken->getStartOffset(),
                                $srcToken->getEndOffset());
          $newToken->setPositionIncrement($srcToken->getPositionIncrement());
          return $newToken;
      }
  29. look at the call to normalize()
      $token = $this->normalize(
          new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));

      normalize() recoded....
      public function normalize(Token $srcToken)
      {
          // Reuses the token that was passed in instead of creating a new one.
          $srcToken->setTermText(strtolower($srcToken->getTermText()));
          return $srcToken;
      }
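
To try the two versions of normalize() side by side without the Zend Framework, a self-contained sketch along the following lines may help. The `Token` class here is a simplified stand-in (an assumption, containing only the members the two functions touch), the function names `normalizeWithCopy()` and `normalizeInPlace()` are hypothetical, and the iteration count is arbitrary. Timings will vary by machine and PHP version, so none are quoted.

```php
<?php
// Simplified stand-in for the real token class; only the members used below are modelled.
class Token
{
    private $termText;
    private $startOffset;
    private $endOffset;
    private $positionIncrement = 1;

    public function __construct(string $text, int $start, int $end)
    {
        $this->termText = $text;
        $this->startOffset = $start;
        $this->endOffset = $end;
    }

    public function getTermText(): string { return $this->termText; }
    public function setTermText(string $text): void { $this->termText = $text; }
    public function getStartOffset(): int { return $this->startOffset; }
    public function getEndOffset(): int { return $this->endOffset; }
    public function getPositionIncrement(): int { return $this->positionIncrement; }
    public function setPositionIncrement(int $inc): void { $this->positionIncrement = $inc; }
}

/** Original style: build a new Token, which costs one constructor and five method calls. */
function normalizeWithCopy(Token $src): Token
{
    $new = new Token(strtolower($src->getTermText()),
                     $src->getStartOffset(),
                     $src->getEndOffset());
    $new->setPositionIncrement($src->getPositionIncrement());
    return $new;
}

/** Recoded style: lower-case the existing Token in place, with two method calls. */
function normalizeInPlace(Token $src): Token
{
    $src->setTermText(strtolower($src->getTermText()));
    return $src;
}

/** Time $iterations calls of $normalize, each on a fresh Token as in the analyser. */
function timeIt(callable $normalize, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $normalize(new Token('GO', 105, 107));
    }
    return microtime(true) - $start;
}

$iterations = 200000; // arbitrary; raise it for steadier numbers
printf("copy:     %.3f s\n", timeIt('normalizeWithCopy', $iterations));
printf("in place: %.3f s\n", timeIt('normalizeInPlace', $iterations));
```

The in-place version is only safe because the caller creates the token solely to pass it to normalize(), so nothing else holds a reference to the original.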
  30. After fix
  31. Performance improvement?

                   Time to index   Time to optimise   Total time
                   /seconds        /seconds
      PHP          167             43                 210
      PHP + fix    151             43                 194
      Java         32              3                  35
      Java + JIT   4               0.3                4.3

      9.5 % improvement
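
The headline ratios can be rechecked directly from the timing table. The sketch below assumes the quoted 9.5 % improvement refers to index time (16 s saved out of 167 s, about 9.6 %); measured on total time the saving is about 7.6 %. The variable names are chosen only for this example.

```php
<?php
// Figures copied from the timing table, in seconds.
$indexTime = ['PHP' => 167, 'PHP + fix' => 151];
$totalTime = ['PHP' => 210, 'PHP + fix' => 194, 'Java' => 35, 'Java + JIT' => 4.3];

// How many times faster Java is than the unfixed PHP run, on total time.
$speedupJava = $totalTime['PHP'] / $totalTime['Java'];
$speedupJit  = $totalTime['PHP'] / $totalTime['Java + JIT'];

// Percentage saved by the normalize() fix, on index time and on total time.
$indexGain = 100 * ($indexTime['PHP'] - $indexTime['PHP + fix']) / $indexTime['PHP'];
$totalGain = 100 * ($totalTime['PHP'] - $totalTime['PHP + fix']) / $totalTime['PHP'];

printf("Java vs PHP:        %.1fx\n", $speedupJava);  // 6.0x
printf("Java + JIT vs PHP:  %.1fx\n", $speedupJit);   // 48.8x
printf("Index-time saving:  %.1f %%\n", $indexGain);  // 9.6 %
printf("Total-time saving:  %.1f %%\n", $totalGain);  // 7.6 %
```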
  32. Agenda
      ● Index and search applications
      ● The problem for PHP programmers
      ● Understanding execution times
        – Part one
        – Part two
      ● Conclusions
  33. Conclusions
      ● Two reasons why the PHP implementation of Lucene is slow:
        – Function calling overhead in PHP
        – Inefficient code in the analyser [3]
        – These are the main two, there are others....
      ● Dynamic and fast?
        – Hard to get to the same execution speed as Java, but possible to get closer.
        – But development speed is much better [4]. What speed do you care about?
        – Better not to use Java coding style (lots of methods that do nothing)
      ● So which implementation of Lucene should I use?
        – It depends.....
      [3] http://framework.zend.com/issues/browse/ZF-3683
      [4] Prechelt, L. An empirical comparison of seven programming languages. Computer, 33(10), 23-29, 2000.
  34. Options for PHP
      [Flowchart with Y/N branches. Decision nodes: Do you care about speed? Can support Java environment? Use a Web Service? Only need basic features? Outcomes: Use SOLR as web service; Use Lucene via a Java bridge; No Lucene solution today [5]; Use Zend Search Lucene.]
      [5] http://pecl.php.net/package/clucene
  35. Acknowledgements
      ● Rob Young's presentation [6] to the London PHP user group.
      ● Members of the PHP internals community, in particular Scott MacVicar, Derick Rethans and Dmitry Stogov.
      [6] http://www.phplondon.org/wiki/Search_tools_in_PHP_(Rob_Young)
  36. Other useful links
      ● http://www.egothor.org/
      ● http://xapian.org/
      ● http://lucene.apache.org/
      ● http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
      ● http://www.derickrethans.nl/vld.php
      ● http://lucene.apache.org/nutch/
      ● http://www.searchblox.com/
      ● http://www.xdebug.org/
      ● http://www.eclipse.org/tptp/
      ● http://www.getopt.org/luke/
      ● http://www.projectzero.org
      ● http://www.ibm.com/developerworks/ (Publication due 24/09/08)
      ● http://php-java-bridge.sourceforge.net/doc/
      ● http://www.zend.com/en/products/platform/product-comparison/java-bridge
      ● http://lucene.apache.org/solr/
      ● http://www.ibm.com/developerworks/websphere/library/techarticles/0809_phillips/0809_phillips.html
