Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008
Presentation Transcript

  • 1. Can you be dynamic and fast? “Miss Marple and the case of the Missing MIPS” Zoë Slattery
  • 2. Agenda ● Index and search applications ● The problem for PHP programmers ● Understanding execution times ● Conclusions
  • 3. Index and search
    ● The problem of finding relevant information is not new.
      – 3000 years BC [1]
      – Vannevar Bush, As We May Think, 1945.
    ● Today, applications that search the Web must be able to provide instant access to > 10 billion documents.
    ● Many applications need some form of search, e.g. searching your hard drive, email....
    1. Lagoze, C., Singhal, A. Information Discovery: Needles and Haystacks. IEEE Internet Computing. Volume 9(3), 16-18, 2005.
  • 4. Options for information retrieval
    ● Search engines
      – Nutch, SearchBlox.....
    ● Information Retrieval libraries
      – Three with broadly similar features
    Library | Implementation language | Language bindings | Language ports | License
    Egothor | Java | None | None | BSD-like
    Xapian | C++ | Perl, Python, PHP, Java, TCL | None | GPL
    Lucene | Java | None | C++, Perl, PHP, C# | Apache 2
  • 5. Lucene [2]
    (Diagram) The application gathers data from a DB, the Web or the file system; Lucene indexes the documents into an index. The application gets the user query; Lucene searches the index; the application presents the search results to the user.
    2. Gospodnetic, O., Hatcher, E. Lucene in Action. Manning Publications Co., Greenwich. 2005.
  • 6. Lucene indexing
    1. Documents, e.g. "Oh for a muse of fire that would ascend the brightest heaven of invention....."
    → Analysis →
    2. Token stream: [fire] [ascend] [bright] [heaven], each token with a start and end offset
    → Index creation →
    3. Inverted index (terms → documents)
       fire → Henry V, Scouting for boys...
       ascend → Aerospace, Henry V...
       ...
    → Optimise →
    4. Optimised inverted index
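The analysis and index-creation steps on slide 6 can be sketched in a few lines of PHP. This is a minimal illustration, not how Lucene or Zend Search Lucene is implemented: `tokenize()` and `buildInvertedIndex()` are hypothetical helpers, the tokeniser simply lower-cases runs of letters (no stemming or stop words), and the "Aerospace" document text is made up for the example.

```php
<?php
// Hypothetical helper: split text into lower-case, letter-only tokens.
function tokenize(string $text): array
{
    preg_match_all('/[a-z]+/', strtolower($text), $matches);
    return $matches[0];
}

// Build an inverted index: term => list of document names containing it.
function buildInvertedIndex(array $documents): array
{
    $index = [];
    foreach ($documents as $name => $text) {
        // array_unique() so each document is listed once per term.
        foreach (array_unique(tokenize($text)) as $term) {
            $index[$term][] = $name;
        }
    }
    ksort($index); // keep terms in sorted order
    return $index;
}

// Made-up sample documents, keyed by document name.
$documents = [
    'Henry V'   => 'Oh for a muse of fire that would ascend the brightest heaven of invention',
    'Aerospace' => 'Rockets ascend through the atmosphere',
];

$index = buildInvertedIndex($documents);
print_r($index['ascend']); // documents that contain the term "ascend"
```

Looking up a term is then a single array access, which is why search over an inverted index is fast compared with scanning every document.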
  • 7. Agenda ● Index and search applications ● The problem for PHP programmers ● Understanding execution times ● Conclusions
  • 8. Indexing speed
    Benchmark:
    ● 17.4 MB, 814 files of PHP source code
    ● Linux/Thinkpad T60
    Implementation | Time to index (s) | Time to optimise (s) | Total time (s)
    PHP | 167 | 43 | 210
    Java | 32 | 3 | 35
    Java + JIT | 4 | 0.3 | 4.3
    Ouch! Nearly 50 times as fast in Java.
  • 9. Why is the performance so bad?
    First make sure we are comparing the same thing:
    ➢ Analyser
      ➢ Java Lucene has many analysers
    ➢ Limits on terms
      ➢ Java stops looking at 10,000 terms
    ➢ Scoring
      ➢ Java rounds down, PHP rounds to closest
    ➢ Compare indexes using Luke
  • 10. Analysis - Java
    Analyzing "A Quick Brown Fox jumped over the Lazy Dog"
      StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
      SimpleAnalyzer: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
      StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
    Analyzing "XY&Z Corporation - xyz@example.com"
      StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]
      SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
      StopAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
  • 11. Analysis - PHP
    Analysing "A Quick Brown Fox jumped over the Lazy Dog"
      Default (lower case) filter: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
      Stop words filter: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
      Short words filter: [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
    Analysing "XY&Z Corporation - xyz@example.com"
      Default (lower case) filter: [xy] [z] [corporation] [xyz] [example] [com]
      Stop words filter: [xy] [z] [corporation] [xyz] [example] [com]
      Short words filter: [xy] [corporation] [xyz] [example] [com]
  • 12. Compare indexes: the Java and PHP indexes contain the same 663 terms.
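The three PHP filters on slide 11 can be approximated with plain PHP functions. This is a sketch under assumptions, not the Zend Search Lucene analyser API: the function names are hypothetical, tokens are taken to be runs of ASCII letters, the stop-word list is just "a" and "the", and the short-words filter is assumed to drop tokens shorter than two characters, which is consistent with the output shown on the slide.

```php
<?php
// Assumption: a token is a run of ASCII letters, lower-cased.
function tokenizeLowerCase(string $text): array
{
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    return array_map('strtolower', $matches[0]);
}

// Drop every token that appears in the stop-word list.
function filterStopWords(array $tokens, array $stopWords): array
{
    return array_values(array_diff($tokens, $stopWords));
}

// Drop tokens shorter than $minLength characters.
function filterShortWords(array $tokens, int $minLength = 2): array
{
    return array_values(array_filter(
        $tokens,
        function (string $token) use ($minLength): bool {
            return strlen($token) >= $minLength;
        }
    ));
}

$tokens = tokenizeLowerCase('XY&Z Corporation - xyz@example.com');
echo implode(' ', $tokens), "\n";                   // default (lower case) filter
echo implode(' ', filterShortWords($tokens)), "\n"; // short words filter
```

Because the tokeniser only keeps letters, "XY&Z" splits into "xy" and "z", which is the difference from Java's StandardAnalyzer that the slides point out.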
  • 13. Agenda ● Index and search applications ● The problem for PHP programmers ● Understanding execution times – Part one – Part two ● Conclusions
  • 14. Execution profiles
    ● Now that we are definitely comparing the same thing, look at execution profiles for the Java and PHP implementations
    ● Profiling tools (all open source)
      – Java: Eclipse TPTP
      – PHP: Xdebug, KCachegrind
      – System: Sysprof, vmstat, iostat
  • 15. Java profile
  • 16. Small problems with TPTP...
    Benchmark data:
    ● 39 files of PHP source code (php/Zend), 1.2 MB
    Run | Time to index (s) | Time to optimise (s) | % time in indexing
    Java + profile | 687258 | 673851 | 50
    Java | 2.3 | 0.3 | 88
    ● Invasive and slow. Takes 600,000 times as long to execute
    ● Some problems getting it to run on Ubuntu (missing C++ libraries, ksh-specific scripts)
    ● Output file is machine-readable only
    But – it's free, open source and it works well enough.
  • 17. PHP profile
  • 18. No problems with this tool
    Benchmark data:
    ● 39 files of PHP source code (php/Zend), 1.2 MB
    Run | Time to index (s) | Time to optimise (s) | % time in indexing
    PHP + profile | 70 | 55 | 56
    PHP | 5 | 3 | 63
    ● Not as invasive as the Java tool, but it still adds to the time and distorts the results slightly
    ● Results easy to display with KCachegrind
    ● Output file is readable
  • 19. The normalize() function Sum( ) = 2.92;   18.99 – 2.92 = 16.07 
  • 20. Micro benchmark
    <?php
        require_once "Token.php";
        require_once "LowerCase.php";
        $token = new Token("GO", 105, 107);
        $filter = new LowerCase();
        // Call normalize() ten million times on the same token
        for ($i = 0; $i < 10000000; $i++) {
            $norm_token = $filter->normalize($token);
        }
    ?>
  • 21. normalize() opcodes
    compiled vars:  !0 = $srcToken, !1 = $newToken
    line  #   op  ext  return  operands
    ----------------------------------------------------------------------------
    11    0   RECV 1
    13    1   ZEND_FETCH_CLASS :0 'Token'
          2   NEW $1 :0
          3   ZEND_INIT_METHOD_CALL !0, 'getTermText'
          4   DO_FCALL_BY_NAME 0
          5   SEND_VAR_NO_REF $3
          6   DO_FCALL 1 'strtolower'
          7   SEND_VAR_NO_REF $4
    14    8   ZEND_INIT_METHOD_CALL !0, 'getStartOffset'
          9   DO_FCALL_BY_NAME 0
         10   SEND_VAR_NO_REF $6
    15   11   ZEND_INIT_METHOD_CALL !0, 'getEndOffset'
         12   DO_FCALL_BY_NAME 0
         13   SEND_VAR_NO_REF $8
         14   DO_FCALL_BY_NAME 3
         15   ASSIGN !1, $1
    16   ......
  • 22. System profile 1. Convert to lower case 2. Look up opcodes
  • 23. How Xdebug works
    Script execution:
    ● ZEND_INIT_METHOD_CALL
      – Convert function name to lower case
      – Look up function in function table
    ● DO_FCALL_BY_NAME
      – Call out to profiler – start time
      – Execute function
      – Call out to profiler – end time
  • 24. The normalize() function Sum( ) = 2.92; 18.99 – 2.92 = 16.07 is consumed in setting up functions to be run
  • 25. Why is function calling faster in Java?
    ● Java is a static language. VM structures are known at start-up: code can't be added on the fly, and types are known at compile time.
    ● The first time a function is called, Java caches a reference to it in a virtual dispatch table. After that, function calls are fast.
    ● In PHP, code can be added during execution (for example, with create_function()) and types are not known until the code is executed. This makes keeping virtual dispatch tables much more difficult.
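The two PHP behaviours that slides 23 and 25 rely on (case-insensitive function names, and code that only comes into existence while the script runs) can be seen in a short sketch. The function names `greet()` and `farewell()` are made up for illustration. The example uses a conditional function declaration and a string callee rather than create_function(), because create_function() was deprecated in PHP 7.2 and removed in PHP 8.0.

```php
<?php
// Function names are case-insensitive in PHP, so the engine has to
// normalise the name before it can look it up in the function table.
function greet(string $name): string
{
    return "Hello, $name";
}

echo GREET('Miss Marple'), "\n"; // resolves to greet()

// Code can be added while the script is running: farewell() does not
// exist until execution reaches this if-block.
if (!function_exists('farewell')) {
    function farewell(string $name): string
    {
        return "Goodbye, $name";
    }
}

// The callee can also be chosen from a string at run time, so it
// cannot be resolved when the script is compiled.
$callee = 'FareWell';
echo $callee('Miss Marple'), "\n";
```

Neither call site can be bound to its target ahead of time, which is the difficulty with dispatch tables that the slide describes.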
  • 26. Agenda ● Index and search applications ● The problem for PHP programmers ● Understanding execution times – Part one – Part two ● Conclusions
  • 27. PHP profile
  • 28. Look at the call to normalize()
    $token = $this->normalize(
        new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));
    // Original version: builds a second Token for every token analysed
    public function normalize(Token $srcToken)
    {
        $newToken = new Token(strtolower($srcToken->getTermText()),
                              $srcToken->getStartOffset(),
                              $srcToken->getEndOffset());
        $newToken->setPositionIncrement($srcToken->getPositionIncrement());
        return $newToken;
    }
  • 29. Look at the call to normalize()
    $token = $this->normalize(
        new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));
    normalize() recoded....
    // Recoded version: lower-cases the existing Token in place
    public function normalize(Token $srcToken)
    {
        $srcToken->setTermText(strtolower($srcToken->getTermText()));
        return $srcToken;
    }
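The difference between the two versions of normalize() on slides 28 and 29 can be tried in isolation with a small timing harness. This is only a sketch: the `Token` class below is a hypothetical stand-in with just the accessors the slides use, not the Zend Search Lucene class, and `normalizeCopy()`, `normalizeInPlace()` and `timeIt()` are names invented for the example. Timings will vary by machine and PHP version, so no figures are claimed here.

```php
<?php
// Hypothetical stand-in for the Token class used in the slides.
class Token
{
    private $text;
    private $start;
    private $end;

    public function __construct(string $text, int $start, int $end)
    {
        $this->text = $text;
        $this->start = $start;
        $this->end = $end;
    }

    public function getTermText(): string { return $this->text; }
    public function setTermText(string $text): void { $this->text = $text; }
    public function getStartOffset(): int { return $this->start; }
    public function getEndOffset(): int { return $this->end; }
}

// Original style: allocate a new Token and copy the offsets on every call.
function normalizeCopy(Token $src): Token
{
    return new Token(
        strtolower($src->getTermText()),
        $src->getStartOffset(),
        $src->getEndOffset()
    );
}

// Recoded style: lower-case the existing Token in place.
function normalizeInPlace(Token $src): Token
{
    $src->setTermText(strtolower($src->getTermText()));
    return $src;
}

// Run $fn on a fresh Token $iterations times; return elapsed seconds.
function timeIt(callable $fn, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn(new Token('GO', 105, 107));
    }
    return microtime(true) - $start;
}

$iterations = 100000;
printf("copy:     %.4f s\n", timeIt('normalizeCopy', $iterations));
printf("in place: %.4f s\n", timeIt('normalizeInPlace', $iterations));
```

The in-place version mutates the token it is given. That is safe at the call site shown on the slides because the token is created immediately before being passed in, but it would not be safe if the caller kept using the original text.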
  • 30. After fix
  • 31. Performance improvement?
    Implementation | Time to index (s) | Time to optimise (s) | Total time (s)
    PHP | 167 | 43 | 210
    PHP + fix | 151 | 43 | 194
    Java | 32 | 3 | 35
    Java + JIT | 4 | 0.3 | 4.3
    9.5 % improvement
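The figures on slide 31 can be checked with a few lines of arithmetic. The helper name `percentImprovement()` is made up for this sketch. The slide's 9.5 % appears to refer to indexing time, where 16 of 167 seconds is about 9.6 %; measured against total time, the same 16 seconds is about 7.6 % of 210.

```php
<?php
// Percentage reduction going from $before to $after.
function percentImprovement(float $before, float $after): float
{
    return ($before - $after) / $before * 100;
}

// Timings in seconds, taken from slide 31.
printf("indexing time: %.1f%%\n", percentImprovement(167, 151));
printf("total time:    %.1f%%\n", percentImprovement(210, 194));

// Ratio behind the "nearly 50 times as fast" remark on slide 8.
printf("PHP total / Java + JIT total: %.1f\n", 210 / 4.3);
```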
  • 32. Agenda ● Index and search applications ● The problem for PHP programmers ● Understanding execution times – Part one – Part two ● Conclusions
  • 33. Conclusions
    ● Two reasons why the PHP implementation of Lucene is slow:
      – Function calling overhead in PHP
      – Inefficient code in the analyser [3]
      – These are the main two, there are others....
    ● Dynamic and fast?
      – Hard to get to the same execution speed as Java, but possible to get closer.
      – But development speed is much better [4]. What speed do you care about?
      – Better not to use Java coding style (lots of methods that do nothing)
    ● So which implementation of Lucene should I use?
      – It depends.....
    3. http://framework.zend.com/issues/browse/ZF-3683
    4. Prechelt, L. An empirical comparison of seven programming languages. Computer. Volume 33(10), 23-29, 2000.
  • 34. Options for PHP (decision flowchart)
    ● Do you care about speed?
      – Y: Can you support a Java environment?
        · Y: Use a Web Service?
          Y: Use SOLR as a web service
          N: Use Lucene via a Java bridge
        · N: No Lucene solution today [5]
      – N: Only need basic features?
        · Y: Use Zend Search Lucene
    5. http://pecl.php.net/package/clucene
  • 35. Acknowledgements ● Rob Young's presentation [6] to the London PHP user group. ● Members of the PHP internals community, in particular Scott MacVicar,  Derick Rethans and Dmitry Stogov. 6. http://www.phplondon.org/wiki/Search_tools_in_PHP_(Rob_Young)
  • 36. Other useful links
    ● http://www.egothor.org/
    ● http://xapian.org/
    ● http://lucene.apache.org/
    ● http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
    ● http://www.derickrethans.nl/vld.php
    ● http://lucene.apache.org/nutch/
    ● http://www.searchblox.com/
    ● http://www.xdebug.org/
    ● http://www.eclipse.org/tptp/
    ● http://www.getopt.org/luke/
    ● http://www.projectzero.org
    ● http://www.ibm.com/developerworks/ (Publication due 24/09/08)
    ● http://php-java-bridge.sourceforge.net/doc/
    ● http://www.zend.com/en/products/platform/product-comparison/java-bridge
    ● http://lucene.apache.org/solr/
    ● http://www.ibm.com/developerworks/websphere/library/techarticles/0809_phillips/0809_phillips.html