Content Indexing with Zend_Search_Lucene

Zend_Search_Lucene is the first PHP port of the Lucene search and indexing library. A component of the Zend Framework, it allows you to easily index and search full-text indexes in better performance than many other solutions.

The slides are a technical introduction and basic tutorial to Zend_Search_Lucene. I presented them at several PHP conferences, including the Zend/PHP Conference 2007 in San Francisco, CA, and the International PHP Conference Spring Edition 2007 in Stuttgart, Germany.

Content Indexing with Zend_Search_Lucene: Presentation Transcript

  • Content Indexing With Zend_Search_Lucene
    Shahar Evron, Zend Technologies
    Zend PHP Conference, October 8th - 11th 2007
    Copyright © 2007, Zend Technologies Inc.
  • Who?
    Me
    ● Programming in PHP for 5 years
    ● Working in Zend for 2½ years
    ● A Zend Framework Contributor
    You
    ● Using PHP5?
    ● Using Zend Framework?
    ● Doing any full-text indexing / search?
    Web Site Indexing with Zend_Search_Lucene, Oct 16, 2007
  • Introduction
    Lucene and Zend_Search_Lucene
  • What is Lucene?
    ● Information indexing and retrieval library, originally developed in Java
    ● Supported by the Apache Software Foundation
    ● Ported to many languages, including Perl, C/C++, Python, Ruby and now PHP
    ● Specifically designed for full-text indexing and search, very popular for indexing web content
    ● Some users: Wikipedia, Nabble, SourceForge
    ● Free Software, Apache Software License
    http://lucene.apache.org/java/docs/index.html
  • Lucene – Main Features
    ● Fast (compared to, for example, MySQL full-text search)
    ● Relatively low RAM utilization
    ● Plain file system storage; index size is about 20% of the data size
    ● Powerful query language
    ● Phrase queries, term queries, booleans, ...
    ● Field searching
    ● Result ranking
    ● Result sorting
    ● Allows simultaneous updating and searching
  • What Is Zend_Search_Lucene?
    ● A component of the Zend Framework
    ● Use-at-will, independent of other components
    ● Currently the only* PHP implementation of the Lucene API and library
    ● Binary compatible with other Lucene implementations (currently ver. 1.9 - 2.0)
    This means you can index with a C++ or Java backend and search with PHP, and vice versa
    * Stable and maintained
  • Tutorial
    Indexing Existing Content with Zend_Search_Lucene
  • Tutorial Overview
    Slides with a red title bar are “Tutorial” slides, showing the code of our web-site crawler and search application. The application is a simplified Zend Framework style application: it uses a Zend Framework directory structure and some of the framework's components.
    There are two important parts in the application:
    ● crawler.php, our indexing spider, designed to be executed from the CLI
    ● SearchController.php, the search controller, designed to be called from a web environment; this is the part in charge of searching
  • Zend Framework Project Setup

        ZFCrawler
        + application
        | + controllers
        | | + SearchController.php
        | + views
        |   + scripts
        |     + search
        |       + index.phtml
        + scripts
        | + crawler.php
        + var
        | + index
        | + log
        + www
          + .htaccess
          + index.php
          + css
            + default.css
  • Zend Framework Project Setup
    www/.htaccess

        RewriteEngine On
        RewriteRule !\.(jpg|jpeg|png|gif|css|js|ico|html)$ index.php

    www/index.php

        <?php
        require_once 'Zend/Controller/Front.php';

        // Application root; SearchController uses it to locate the index
        define('APP_ROOT', realpath(dirname(dirname(__FILE__))));

        // Set up the front controller
        $front = Zend_Controller_Front::getInstance();
        $front->setControllerDirectory('../application/controllers');
        $front->setDefaultControllerName('search');
        $front->throwExceptions(true);

        // Dispatch the request!
        $front->dispatch();
  • Zend Framework Project Setup
    scripts/crawler.php

        <?php
        // Load some components
        require_once 'Zend/Search/Lucene.php';
        require_once 'Zend/Http/Client.php';
        require_once 'Zend/Log.php';
        require_once 'Zend/Log/Writer/Stream.php';

        // Define some constants
        define('APP_ROOT',  realpath(dirname(dirname(__FILE__))));
        define('START_URI', 'http://example.com/blog');
        define('MATCH_URI', 'http://example.com/blog');

        // Set up log
        $log = new Zend_Log(new Zend_Log_Writer_Stream(APP_ROOT . DIRECTORY_SEPARATOR .
            'var' . DIRECTORY_SEPARATOR . 'log' . DIRECTORY_SEPARATOR . 'crawler.log'));
        $log->info('Crawler starting up');

        // Set up Zend_Http_Client
        $client = new Zend_Http_Client();
        $client->setConfig(array('timeout' => 30));
  • Indexes, Documents and Fields
    ● An index can contain as many documents as your system allows*
    ● Documents contain fields and are indexed by those fields
    ● But not all fields are used for indexing, and not all fields are stored entirely
    * Limited by the ~2 GB maximum file size on 32-bit systems
  • Creating and Opening Indexes
    ● Zend_Search_Lucene::create($path) will create a new index
    ● Zend_Search_Lucene::open($path) will open an existing index
    ● Both will throw an exception on failure, so:

        // Open a Lucene index, or create it if it does not exist
        // First, try opening
        try {
            $index = Zend_Search_Lucene::open('/data/index');

        // If can't open, try creating
        } catch (Zend_Search_Lucene_Exception $e) {
            try {
                $index = Zend_Search_Lucene::create('/data/index');
            } catch (Zend_Search_Lucene_Exception $e) {
                // If both fail, give up and show error message
                echo "Unable to open or create index: {$e->getMessage()}";
            }
        }
  • 1. Create or Open an Index
    scripts/crawler.php (continued)

        // Open index
        $indexpath = APP_ROOT . DIRECTORY_SEPARATOR . 'var' . DIRECTORY_SEPARATOR . 'index';
        try {
            $index = Zend_Search_Lucene::open($indexpath);
            $log->info("Opened existing index in $indexpath");

        // If can't open, try creating
        } catch (Zend_Search_Lucene_Exception $e) {
            try {
                $index = Zend_Search_Lucene::create($indexpath);
                $log->info("Created new index in $indexpath");

            // If both fail, give up and show error message
            } catch (Zend_Search_Lucene_Exception $e) {
                // Zend_Log's priority methods are named err() and warn()
                $log->err("Failed opening or creating index in $indexpath");
                $log->err($e->getMessage());
                echo "Unable to open or create index: {$e->getMessage()}";
                exit(1);
            }
        }
  • Document Objects
    ● Represent a single document that needs to be indexed
    ● When added to an index, internally identified by a unique ID

        $contents = file_get_contents($file);

        // Create a new document object
        $doc = new Zend_Search_Lucene_Document();
        $doc->addField(Zend_Search_Lucene_Field::UnStored('body', $contents));
        $doc->addField(Zend_Search_Lucene_Field::UnIndexed('file', $file));

        // Add document to index
        $index->addDocument($doc);

    Document objects contain Field objects – not the actual data
  • Document Objects - HTML
    Zend_Search_Lucene provides a subclass of the Document class which “understands” HTML

        // Index an HTML file
        $doc = Zend_Search_Lucene_Document_Html::loadHTMLFile($filename);
        $index->addDocument($doc);

        // Index an HTML string, storing the body content in the index
        $doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString, true);
        $index->addDocument($doc);

    Makes the task of parsing and indexing HTML files even simpler
    ● Skips tags, indexes only the document <body> without <script> contents, comments, etc.
    ● Encoding is set according to the <meta http-equiv> tag
    ● Automatically adds a 'title' field (the <title> tag) and a 'body' field, plus an additional field for each <meta> tag that has 'name' and 'content' attributes
    ● Links can be fetched using the getLinks() and getHeaderLinks() methods
    ● You can always add additional fields before indexing the document
  • 2. Fetch and Index Documents
    scripts/crawler.php (continued)

        // Set up the targets array
        $targets = array(START_URI);

        // Start iterating
        for ($i = 0; $i < count($targets); $i++) {

            // Fetch content with HTTP Client
            $client->setUri($targets[$i]);
            $response = $client->request();

            if ($response->isSuccessful()) {
                $body = $response->getBody();
                $log->info("Fetched " . strlen($body) . " bytes from {$targets[$i]}");

                // Create document
                $doc = Zend_Search_Lucene_Document_Html::loadHTML($body);

                // Index
                $index->addDocument($doc);
                $log->info("Indexed {$targets[$i]}");

                // ... contd in next slide ...
  • 3. Add Links to Target List
    scripts/crawler.php (continued)

                // ... contd from previous slide ...

                // Fetch new links
                $links = $doc->getLinks();
                foreach ($links as $link) {
                    if ((strpos($link, MATCH_URI) !== false) &&
                        (! in_array($link, $targets))) $targets[] = $link;
                }

            } else {
                // Zend_Log's priority methods are named warn() and err()
                $log->warn("Requesting {$targets[$i]} returned HTTP " .
                    $response->getStatus());
            }
        }

        $log->info("Iterated over " . count($targets) . " documents");
        $log->info("Optimizing index...");
        $index->optimize();
        $index->commit();
        $log->info("Done. Index now contains " . $index->numDocs() . " documents");
        $log->info("Crawler shutting down");
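    The crawl loop above relies on count($targets) being re-evaluated on every pass, so links appended inside the loop are visited later in the same run. A minimal sketch of that traversal without any HTTP or Lucene calls, assuming a hypothetical $pages array that stands in for fetching a page and extracting its links (crawlTargets() is an illustrative name, not part of the crawler):

```php
<?php
// Hypothetical stand-in for HTTP fetching plus getLinks():
// maps each URL to the links found on that page.
$pages = array(
    'http://example.com/blog'   => array('http://example.com/blog/a', 'http://other.example.org/'),
    'http://example.com/blog/a' => array('http://example.com/blog', 'http://example.com/blog/b'),
    'http://example.com/blog/b' => array(),
);

// Collect every URL reachable from $start whose address contains $match.
function crawlTargets(array $pages, $start, $match)
{
    $targets = array($start);

    // count() is checked on each iteration, so the list may grow while we walk it
    for ($i = 0; $i < count($targets); $i++) {
        $links = isset($pages[$targets[$i]]) ? $pages[$targets[$i]] : array();
        foreach ($links as $link) {
            // Same filter as the crawler: stay on the site, skip known URLs
            if (strpos($link, $match) !== false && !in_array($link, $targets)) {
                $targets[] = $link;
            }
        }
    }

    return $targets;
}

$targets = crawlTargets($pages, 'http://example.com/blog', 'http://example.com/blog');
print_r($targets);
```

    Because in_array() scans the whole list, this is quadratic in the number of URLs; that is probably fine for a small site, and a larger crawl might key an array by URL instead.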
  • Field Objects
    Field objects contain the actual indexed (or stored) data
    In the Lucene world, fields have several properties:
    ● Indexed (or not)
    ● Stored (or not)
    ● Tokenized (or not)
    ● Binary (or not)
    This means you can:
    ● Index the body of an article but not store it
    ● Store the URL of an article or the author's image, but not index it
  • Field Object Types
    Text - Field is tokenized, indexed and stored
        $field = Zend_Search_Lucene_Field::Text('title', $title);
    UnStored - Field is tokenized and indexed, but is not stored in the index
        $field = Zend_Search_Lucene_Field::UnStored('content', $content);
    Keyword - Stored and indexed, but not tokenized
        $field = Zend_Search_Lucene_Field::Keyword('authorid', $authorid);
    UnIndexed - Not indexed or tokenized, but stored and returned with the hits
        $field = Zend_Search_Lucene_Field::UnIndexed('path', $path);
    Binary - Not indexed or tokenized, but stored and retrieved with hits
        $field = Zend_Search_Lucene_Field::Binary('icon', $icon);
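    The five field types differ only in the three properties listed above, and only indexed fields can be matched by a query. As a rough illustration, the table can be written down as plain PHP data; the $fieldTypes array and the searchableTypes() helper are hypothetical names used for this sketch and are not part of Zend_Search_Lucene:

```php
<?php
// Properties of each field type, as described on the slide above.
$fieldTypes = array(
    'Text'      => array('indexed' => true,  'stored' => true,  'tokenized' => true),
    'UnStored'  => array('indexed' => true,  'stored' => false, 'tokenized' => true),
    'Keyword'   => array('indexed' => true,  'stored' => true,  'tokenized' => false),
    'UnIndexed' => array('indexed' => false, 'stored' => true,  'tokenized' => false),
    'Binary'    => array('indexed' => false, 'stored' => true,  'tokenized' => false),
);

// Names of the field types whose values can be matched by a search.
function searchableTypes(array $fieldTypes)
{
    $names = array();
    foreach ($fieldTypes as $name => $props) {
        if ($props['indexed']) {
            $names[] = $name;
        }
    }
    return $names;
}

echo implode(', ', searchableTypes($fieldTypes)), "\n";
```

    In practice this means a field you later want to look up with find(), such as a URL or a path, has to use one of the indexed types.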
  • 4. Add Necessary Fields
    scripts/crawler.php (patching...)

        // ... patching after fetching content

        // Create document
        $body_checksum = md5($body);
        $doc = Zend_Search_Lucene_Document_Html::loadHTML($body);
        // Keyword (indexed, not tokenized) so the URL can be looked up again later;
        // an UnIndexed field is stored but cannot be searched
        $doc->addField(Zend_Search_Lucene_Field::Keyword('url', $targets[$i]));
        $doc->addField(Zend_Search_Lucene_Field::UnIndexed('md5', $body_checksum));

        // Index
        $index->addDocument($doc);
        $log->info("Indexed {$targets[$i]}");

        // ... continue to fetch new links
  • Updating Documents
    Lucene doesn't support updating a document. Instead, you must delete the old document and reindex it
    ● Use the $index->find() method to find the document you want to reindex
    ● Use the $index->delete($hit->id) method to delete it

        // $path is the path of a file we want to reindex
        $hits = $index->find('path:' . $path);
        foreach ($hits as $hit) {
            $index->delete($hit->id);
        }

    All indexed documents can be identified by their internal unique ID, which is the $hit->id property
  • 5. Refresh Outdated Documents
    scripts/crawler.php (patching...)

        // ... patching before creating the document

        // See if document exists and needs reindexing
        $body_checksum = md5($body);
        // Exact-match term query on 'url' (the field must be indexed); a URL
        // contains ':' and '/', which the query parser would not treat literally
        $urlTerm = new Zend_Search_Lucene_Index_Term($targets[$i], 'url');
        $hits = $index->find(new Zend_Search_Lucene_Search_Query_Term($urlTerm));
        $matched = false;
        foreach ($hits as $hit) {
            if ($hit->md5 == $body_checksum) {
                // Keep the first up-to-date copy, delete any duplicates
                if ($matched == true) $index->delete($hit->id);
                $matched = true;
            } else {
                $log->info($targets[$i] . " is out of date and needs reindexing");
                $index->delete($hit->id);
            }
        }
        if ($matched) {
            $log->info($targets[$i] . " is up to date, skipping");
            continue;
        }

        // Create document...
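    The refresh step boils down to a small decision: keep one copy whose checksum matches the freshly fetched body, delete every other copy, and reindex only if no copy matched. A sketch of that decision on plain arrays, where planRefresh() and the array-shaped hits are hypothetical stand-ins for the index and its hit objects:

```php
<?php
/**
 * Decide what to do with the existing hits for one URL.
 *
 * $hits is a list of array('id' => int, 'md5' => string); this plain-array
 * shape is an assumption made for the sketch, not the library's hit class.
 */
function planRefresh(array $hits, $checksum)
{
    $delete  = array();
    $matched = false;

    foreach ($hits as $hit) {
        if ($hit['md5'] === $checksum) {
            if ($matched) {
                $delete[] = $hit['id']; // duplicate of an up-to-date copy
            }
            $matched = true;
        } else {
            $delete[] = $hit['id'];     // stale copy
        }
    }

    // Reindex only when no stored copy matches the current content
    return array('delete' => $delete, 'reindex' => !$matched);
}

$body = '<html><body>new content</body></html>';
$hits = array(
    array('id' => 3, 'md5' => md5('old content')),
    array('id' => 7, 'md5' => md5($body)),
    array('id' => 9, 'md5' => md5($body)),
);

$plan = planRefresh($hits, md5($body));
var_dump($plan);
```

    Separating the decision from the deletion like this is only one possible design; it makes the rule easy to check without a real index.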
  • 6. Create Search UI
    application/views/scripts/search/index.phtml

        <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
        <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
        <head>
          <title>Zend Framework Search Engine</title>
          <link rel="stylesheet" type="text/css" href="/css/default.css" />
        </head>
        <body>
        <div class="searchbox">
            <h1>Zend Framework Search Engine</h1>
            <form>
                <input type="text" name="q" value="<?= isset($this->query) ? $this->escape($this->query) : '' ?>" /><br />
                <input type="submit" value=" Search ZF " /> &nbsp;
                <input type="submit" name="wacky" value="I'm Feeling Wacky" />
            </form>
        </div>
  • Building Queries
    ● There are two ways to build a Lucene query:
      ● By feeding a string into the Query Parser
      ● By constructing Query Objects through the provided API
    ● The Query Parser should be used to parse human-generated queries, while the query construction API is best for program-generated queries
    ● Both generate query objects - this makes them interoperable
    ● A query may contain several sub-queries
    ● This means a query can combine user input (for example the text to search) with code-generated criteria (the category to search in)
  • Building Queries – The Query Parser
    To search using a query string as is, you can simply pass the string to $index->find()

        // Build a query from a string
        $index->find('force AND strong');

    If you need to add some criteria to the query, you should generate an object using the Query Parser

        // Same as above
        $userQuery = Zend_Search_Lucene_Search_QueryParser::parse('force AND strong');

        // Add a category criterion
        $catTerm   = new Zend_Search_Lucene_Index_Term('YodaQuotes', 'category');
        $catQuery  = new Zend_Search_Lucene_Search_Query_Term($catTerm);

        // Merge queries and search
        $query = new Zend_Search_Lucene_Search_Query_Boolean();
        $query->addSubquery($catQuery,  true);
        $query->addSubquery($userQuery, true);
        $index->find($query);
  • Building Queries – the Query API
    Term Queries – Search for a single term

        // Match documents that have '66' in the 'order' field
        $t = new Zend_Search_Lucene_Index_Term('66', 'order');
        $q1 = new Zend_Search_Lucene_Search_Query_Term($t);

    Multi-Term Queries – Search for multiple terms

        // Match 'force' with possibly 'strong' but not 'dark'
        $q2 = new Zend_Search_Lucene_Search_Query_MultiTerm();
        $q2->addTerm(new Zend_Search_Lucene_Index_Term('force'), true);
        $q2->addTerm(new Zend_Search_Lucene_Index_Term('strong'), null);
        $q2->addTerm(new Zend_Search_Lucene_Index_Term('dark'), false);

    Phrase Queries – Search exact or sloppy phrases

        // Match 'use the force luke' and 'use the cheese luke'
        $q3 = new Zend_Search_Lucene_Search_Query_Phrase();
        $q3->addTerm(new Zend_Search_Lucene_Index_Term('use'));
        $q3->addTerm(new Zend_Search_Lucene_Index_Term('the'));
        $q3->addTerm(new Zend_Search_Lucene_Index_Term('luke'), 3);

    Boolean Queries – Merge two or more queries with boolean logic

        // Match documents that match query #2 but do not match query #3
        $q = new Zend_Search_Lucene_Search_Query_Boolean();
        $q->addSubquery($q2, true);
        $q->addSubquery($q3, false);
  • 7. Searching the Index
    application/controllers/SearchController.php

        <?php
        require_once 'Zend/Controller/Action.php';

        class SearchController extends Zend_Controller_Action
        {
            public function indexAction()
            {
                $query = $this->getRequest()->getParam('q');

                if ($query) {
                    require_once 'Zend/Search/Lucene.php';
                    $index = Zend_Search_Lucene::open(APP_ROOT . '/var/index');
                    $hits = $index->find($query);

                    $view = $this->initView();
                    $view->query = $query;
                    if (! empty($hits)) {
                        $view->results = $hits;
                    }
                }

                $this->render();
            }
        }
  • Working with search results
    ● The $index->find($query) method will return an array of Zend_Search_Lucene_Search_QueryHit objects
    ● Each object exposes the fields of the matched document as properties

        // Execute query
        $hits = $index->find($query);

        // Print results
        foreach ($hits as $hit) {
            echo $hit->score . " " .
                 $hit->title . " " .
                 $hit->author . "\n";
        }

    ● Hits also contain special properties:
      ● $hit->id – The internal ID of the document
      ● $hit->score – The search score of the hit
  • 8. Displaying Search Results
    application/views/scripts/search/index.phtml (continued)

        <?php if (isset($this->results) && ! empty($this->results)): ?>
        <div class="results">
        <ul>
        <?php foreach ($this->results as $hit): ?>
            <li><a href="<?= $this->escape($hit->url) ?>"><?= $this->escape($hit->title) ?></a>
                <?= sprintf("%0.2f", $hit->score * 100) ?>%</li>
        <?php endforeach; ?>
        </ul>
        </div>
        <?php endif; ?>
        </body>
        </html>
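    The result template prints stored field values straight into HTML, so both the link target and the title are worth escaping, and the score is scaled to a percentage with two decimals. A self-contained sketch of one list item using only core PHP functions; renderHit() and the array-shaped hit are hypothetical and stand in for the view helper and hit object used in the template:

```php
<?php
// Render one search result as a list item.
// $hit is a plain array standing in for a hit object (an assumption of this sketch).
function renderHit(array $hit)
{
    return sprintf(
        '<li><a href="%s">%s</a> %0.2f%%</li>',
        htmlspecialchars($hit['url'], ENT_QUOTES, 'UTF-8'),
        htmlspecialchars($hit['title'], ENT_QUOTES, 'UTF-8'),
        $hit['score'] * 100 // score is a fraction; show it as a percentage
    );
}

$html = renderHit(array(
    'url'   => 'http://example.com/blog?a=1&b=2',
    'title' => 'Tips & <Tricks>',
    'score' => 0.875,
));
echo $html, "\n";
```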
  • What's Next?
  • Debugging and useful tools
    Luke - Lucene Index Toolbox
    http://www.getopt.org/luke/
    ● Cross-platform Java tool
    ● See stored terms, ranking, etc.
    ● Browse your indexed documents
    ● Perform search queries
    ● Analyze results
    ● Optimize indexes
    ● More ...
  • Further Reading
    There's lots more...
    ● Index Optimization Parameters
    ● The Lucene Query Language
    ● Query API In Depth
    ● Sorting, Ranking
    ● Character Sets
    ● Extending Zend_Search_Lucene
    ● Analyzers, Token Filters, Storage...
    http://framework.zend.com/manual/en/zend.search.html
  • Any Questions?
    Even more questions? fw-formats@lists.zend.com
  • Thank You